Mining Big Data: An Overview

“Big Data” has become one of the hottest topics in both information science and corporate settings. As the internet and its number of users have grown exponentially in a short time, big data is coming at us from all sides, with enormous opportunities to use it. Information professionals are trying to wrangle it while companies want to leverage it. Opportunities abound, and big players like Google and Amazon are already all over it.

So how do we go about “mining” big data? How do we avoid missing the opportunities and golden insights hidden within the patterns, trends, and flows of all the data we are collecting? While mining big data is primarily the purview of IT, we as information professionals, business analysts, and executive stakeholders need a good working knowledge of the tools available to glean the patterns, trends, and tendencies of user groups and to create solutions accordingly.

The first thing to consider is that the data we are dealing with is big. It’s VERY big. No single system or process can handle it all at once, so developers have created methods that spread the work across multiple systems and multiple passes, along with methods for compressing the data, to get results as fast as possible. In addition, distributed processing models like MapReduce greatly reduce the load on the individual machines processing the data by allowing multiple partitions to be worked on simultaneously. Starting with a high-level “map,” or apply-to-all, function that runs on every record, these processes then “reduce” the mapped output by combining it into a much smaller result for faster analysis. Another example is locality-sensitive hashing, aka LSH, which reduces the work on large datasets by hashing items so that similar items are likely to land in the same data “buckets.” Because only items that share a bucket need to be compared, the time spent analyzing big datasets is greatly reduced, which also reduces the cost. A technique worth noting is “min-hashing,” a form of LSH that looks specifically for similar sets by applying many independent random permutations (in practice, hash functions) to each set and keeping the minimum value from each; the fraction of minimums that two sets share estimates how much they overlap.
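
To make the bucketing idea concrete, here is a minimal min-hashing sketch in Python. It is an illustration under stated assumptions, not a production implementation: the function names, the three-word “shingle” size, the 128 hash functions, and the 32 bands are all choices made for this example, and salted hashes stand in for true random permutations.

```python
import hashlib
from collections import defaultdict

def shingles(text, k=3):
    """Break a text into the set of its k-word sequences (k=3 is an assumption)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def _hash(seed, item):
    # A salted 64-bit hash stands in for one random permutation of all items.
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash_signature(items, num_hashes=128):
    """Keep the minimum hash value under each of num_hashes salted hashes."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]

def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates the overlap of the two sets."""
    return sum(1 for a, b in zip(sig_a, sig_b) if a == b) / len(sig_a)

def lsh_candidate_pairs(signatures, bands=32):
    """Bucket documents band by band; only documents sharing a bucket are compared."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        ordered = sorted(ids)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                pairs.add((ordered[i], ordered[j]))
    return pairs
```

Signatures are far smaller than the documents they summarize, and the banding step means most dissimilar pairs are never compared at all, which is where the time savings come from.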

Two other reduction techniques that are similar to LSH are data clustering and nearest neighbor learning. Both of these techniques also make use of similarity groupings. Instead of hashing the data and using dimension reduction, these techniques take a more direct approach by grouping similar items together. By clustering the data and graphing it, patterns can be seen and generalizations made about the data in a relatively short period of time. Nearest neighbor learning works similarly, but instead of grouping the data up front (pre-coordinate), it queries the data at lookup time (post-coordinate) to find the best possible matches. During this process the machine “learns” along the way, weighting neighbors based on “data distance,” i.e., how similar or different they are.
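
The nearest neighbor half of this can be sketched in a few lines of Python. This is a minimal, unoptimized illustration: the function names, the choice of Euclidean distance as the “data distance,” and the inverse-distance weighting are assumptions made for the example, and real systems use indexes rather than scanning every record.

```python
import math
from collections import defaultdict

def euclidean(a, b):
    """Straight-line distance between two numeric records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, query, k=3):
    """Label a query by a distance-weighted vote of its k nearest neighbors.

    train is a list of (vector, label) pairs; nothing is grouped ahead of
    time, the neighbors are found when the query arrives.
    """
    neighbors = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = defaultdict(float)
    for vector, label in neighbors:
        # Closer neighbors get a larger say; the small constant avoids
        # division by zero when the query matches a record exactly.
        votes[label] += 1.0 / (euclidean(vector, query) + 1e-9)
    return max(votes, key=votes.get)
```

Given a handful of labeled records, `knn_predict(train, (1.5, 1.5))` returns the label that dominates among the three closest records, with nearer records counting for more.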

Another technique used to analyze massive datasets is the decision tree. A decision tree applies a series of hierarchical steps that split and group data based on defined or learned characteristics until terminal nodes are reached. Like a taxonomy built on the fly, the data is quickly reduced and analyzed through a succession of decisions. This method is used to predict future patterns within the data based on past ones.
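
As a rough illustration, the sketch below learns a small tree from numeric records in Python. It is one common way to build such a tree, not the only one: the Gini impurity measure, the function names, and the depth limit are assumptions made for this example.

```python
from collections import Counter

def gini(labels):
    """Impurity of a group: 0.0 when every record has the same label."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def best_split(rows, labels):
    """Try every feature and threshold; keep the split with the purest halves."""
    best = None  # (weighted impurity, feature index, threshold)
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [l for r, l in zip(rows, labels) if r[feature] <= threshold]
            right = [l for r, l in zip(rows, labels) if r[feature] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, feature, threshold)
    return best

def build_tree(rows, labels, max_depth=3):
    """Split recursively until a group is pure or the depth limit is reached."""
    split = best_split(rows, labels) if max_depth > 0 and len(set(labels)) > 1 else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # terminal node: majority label
    _, feature, threshold = split
    left = [(r, l) for r, l in zip(rows, labels) if r[feature] <= threshold]
    right = [(r, l) for r, l in zip(rows, labels) if r[feature] > threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree([r for r, _ in left], [l for _, l in left], max_depth - 1),
        "right": build_tree([r for r, _ in right], [l for _, l in right], max_depth - 1),
    }

def predict(tree, row):
    """Follow the decisions from the root down to a terminal node."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree
```

Each internal node is one decision, and each terminal node is a prediction, so a new record is classified by answering a short succession of questions learned from past records.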

A popular data mining technology worth noting, which has emerged in recent years, is Hadoop. Building on the methods described above, Hadoop uses MapReduce to parse and group the data to reduce its volume. Using a distributed file system, aka DFS (in Hadoop’s case, the Hadoop Distributed File System, or HDFS), Hadoop can quickly process large amounts of data while simultaneously accounting for the inherent and inevitable failures that occur during these kinds of massive processes. Running on clusters of nodes, Hadoop splits the data into blocks and accumulates results as data is aggregated from multiple sources. While Hadoop is more of a technical implementation than a theoretical method, it is worth noting because many major data-consuming companies have adopted it as a technical solution to the inherent problems of mining big data.
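
The map and reduce steps can be sketched on a single machine in plain Python. This is only a toy model of the pattern, not Hadoop’s API: the function names are invented for the example, the “partitions” are ordinary lists, and a real cluster would run each partition on a different node and rerun any piece that fails.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(mapped_pairs):
    """Shuffle: gather all values that share the same key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)

def map_reduce(partitions):
    """Count words across partitions (each partition is a list of documents)."""
    mapped = []
    for partition in partitions:  # on a cluster, partitions run in parallel
        for document in partition:
            mapped.extend(map_phase(document))
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
```

For instance, `map_reduce([["big data is big"], ["data mining"]])` returns `{"big": 2, "data": 2, "is": 1, "mining": 1}`: six words of input collapse into four counts, and the same pattern scales to inputs far too large for one machine.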

This brings us to mining big data on the World Wide Web. The Web is by far the largest dataset of them all, and specific techniques have been in development since its beginning to effectively analyze everything being published there. A primary technique used to wrangle and group large amounts of Web content is called Hubs and Authorities. This technique looks not only at a page’s overall popularity and its inbound and outbound links, but also at where it ranks within a given topic: good “authorities” are pages that many good “hubs” link to, and good hubs are pages that link to many good authorities. This way relevant pages can be presented to specific users based on their interests and search history, helping to disambiguate the increasing amount of content available on the Web.
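
A minimal sketch of the Hubs and Authorities calculation is shown below in Python. The function name, the fixed number of iterations, and the dictionary-based link graph are assumptions made for this example; it leaves out how a search engine would first pick the set of pages for a given topic.

```python
import math

def hits(graph, iterations=50):
    """Score pages as hubs and authorities.

    graph maps each page to the list of pages it links to.
    Returns two dicts: hub scores and authority scores.
    """
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    hubs = {page: 1.0 for page in pages}
    auths = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        auths = {page: 0.0 for page in pages}
        for source, targets in graph.items():
            for target in targets:
                auths[target] += hubs[source]
        norm = math.sqrt(sum(v * v for v in auths.values())) or 1.0
        auths = {page: v / norm for page, v in auths.items()}
        # A page's hub score is the sum of the authority scores of pages it links to.
        hubs = {page: sum(auths[t] for t in graph.get(page, [])) for page in pages}
        norm = math.sqrt(sum(v * v for v in hubs.values())) or 1.0
        hubs = {page: v / norm for page, v in hubs.items()}
    return hubs, auths
```

With a hypothetical graph such as `{"portal": ["guide", "reference"], "blog": ["guide"]}`, the page called guide ends up as the strongest authority because two hubs point to it, and portal ends up as the strongest hub because it points to both authorities.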

And finally, in the real world we must always consider spam when mining big data. The techniques described before the Web mining section deal primarily with internal data, such as documents, articles, emails, and user data. To properly mine data on the Web, however, one must account for spam. Fortunately, developers have created techniques that do so.

“PageRank” is an idea we are all somewhat familiar with, but in today’s Web environment search engines (which are also page indexers) have learned what techniques spammers use to boost their pages artificially and have adapted their data mining techniques accordingly. Before Web pages are ranked, they can be analyzed and given a “TrustRank” score. In its published form, TrustRank starts from a small set of manually vetted seed pages and propagates trust along their outbound links, so pages that reputable sites never reach receive little or no trust. In addition, similar to email spam filters, Web crawlers look for specific properties and structures commonly found in spam pages and score pages accordingly.
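
One link-based way to compute a trust score is a PageRank variant in which the random “jumps” land only on trusted seed pages, so trust flows outward from pages a human has vetted. The Python sketch below illustrates that idea under stated assumptions: the function name, the damping value of 0.85, the fixed iteration count, and the toy graph in the usage note are all invented for this example, and it does not model the property-based spam checks that crawlers also apply.

```python
def trust_rank(graph, trusted_seeds, damping=0.85, iterations=50):
    """Propagate trust from vetted seed pages along outbound links.

    graph maps each page to the list of pages it links to.
    trusted_seeds is a non-empty set of pages assumed to be reputable.
    """
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    seed_weight = {
        page: (1.0 / len(trusted_seeds) if page in trusted_seeds else 0.0)
        for page in pages
    }
    scores = dict(seed_weight)
    for _ in range(iterations):
        # Every round, a fixed share of trust is handed back to the seeds.
        new_scores = {page: (1 - damping) * seed_weight[page] for page in pages}
        for page in pages:
            targets = graph.get(page, [])
            if targets:
                # A page splits its trust evenly among the pages it links to.
                share = damping * scores[page] / len(targets)
                for target in targets:
                    new_scores[target] += share
            else:
                # Pages with no outbound links return their trust to the seeds.
                for seed in pages:
                    new_scores[seed] += damping * scores[page] * seed_weight[seed]
        scores = new_scores
    return scores
```

In a toy graph where a university page links to a library and a news site, while two spam pages link only to each other, seeding trust at the university leaves both spam pages with a score of zero because no trusted page ever links to them.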

So while there is a lot of heavy-duty mathematics behind these data mining techniques that was not discussed here, we can still understand the basics well enough to know what’s available to us as stakeholders of data. Using these techniques, we can discover patterns and design our solutions accordingly.

Seth Earley
Seth Earley is the Founder & CEO of Earley Information Science and the author of the award-winning book The AI-Powered Enterprise: Harness the Power of Ontologies to Make Your Business Smarter, Faster, and More Profitable. He is an expert with 20+ years of experience in Knowledge Strategy, Data and Information Architecture, Search-based Applications, and Information Findability solutions. He has worked with a diverse roster of Fortune 1000 companies, helping them achieve higher levels of operating performance.
