Mining Big Data: An Overview

“Big Data” has become one of the hottest topics in both information science and the corporate world. As the internet and its number of users have grown exponentially in a short time, data now comes at us from all sides, along with enormous opportunities to use it. Information professionals are trying to wrangle it while companies want to leverage it. The big players like Google and Amazon are already all over it.

So how do we go about “mining” big data? How do we avoid losing the opportunities and golden insights hidden within the patterns, trends, and flows of all the data we are collecting? While mining big data is primarily the purview of IT, we as information professionals, business analysts, and executive stakeholders need a good working knowledge of the tools available to glean the patterns, trends, and tendencies of user groups and create solutions accordingly.

Managing the size

The first thing to consider is that the data we are dealing with is big. It’s VERY big. No single system or process can handle it all at once, so smart developers have created methods that spread the work across multiple systems and multiple passes, along with methods of compressing the data to get results as fast as possible. Processes like MapReduce greatly reduce the load on individual machines by letting multiple partitions of the data be processed simultaneously. The “map” step applies the same function to every record in a partition, and the “reduce” step then combines the mapped results for each key into a smaller summary for faster analysis. Another example is locality-sensitive hashing, aka LSH, which shrinks the comparison problem by hashing items so that similar ones are likely to land in the same data “buckets”. Only items that share a bucket need to be compared in detail, so the time spent analyzing big datasets is greatly reduced, which also reduces the cost. A technique worth noting here is “min-hashing”, a form of LSH that compresses each set into a short signature built from the minimum values of several independent hash functions (each standing in for a random permutation); the fraction of positions where two signatures agree estimates how similar the sets are.
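To make min-hashing and bucketing a little more concrete, here is a minimal sketch in Python. It is an illustration under assumptions, not a production method: the function names (shingles, minhash_signature, estimated_similarity, lsh_buckets), the use of three-word shingles, the 64 seeded SHA-1 hashes, and the band size of 4 are all choices made for this example.

```python
import hashlib
from collections import defaultdict

def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word runs) in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(items, num_hashes=64):
    """Keep the minimum hash value per seed; each seed stands in for one permutation."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{item}".encode()).digest()[:8], "big")
            for item in items
        ))
    return signature

def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def lsh_buckets(signatures, rows_per_band=4):
    """Group ids whose signatures agree on at least one whole band of rows."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for start in range(0, len(sig), rows_per_band):
            band_key = (start, tuple(sig[start:start + rows_per_band]))
            buckets[band_key].add(doc_id)
    # Only buckets holding more than one id give us candidate pairs to compare.
    return [ids for ids in buckets.values() if len(ids) > 1]

if __name__ == "__main__":
    docs = {
        "a": "big data mining helps us find hidden patterns in large datasets",
        "b": "big data mining helps us find hidden trends in large datasets",
        "c": "the quarterly budget meeting moved to friday afternoon",
    }
    sigs = {doc_id: minhash_signature(shingles(text)) for doc_id, text in docs.items()}
    print(estimated_similarity(sigs["a"], sigs["b"]))
    print(lsh_buckets(sigs))
```

The point of the design is that signatures are short and fixed in length no matter how long the documents are, and only documents that end up sharing a bucket ever need a detailed comparison.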

Two other reduction techniques that are similar to LSH are data clustering and nearest neighbor learning. Both of these techniques also make use of similarity groupings. Instead of hashing the data and reducing its dimensions, they group similar items together directly. By clustering the data and graphing it, patterns can be seen and generalizations made about the data in a relatively short period of time. Nearest neighbor learning works similarly, but instead of grouping the data in advance (pre-coordinate), it queries the data at search time (post-coordinate) to find the best possible matches. During this process the machine “learns” along the way, weighing neighbors based on “data distance”, i.e. how similar or different they are.
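As a rough illustration of nearest neighbor learning with distance weighting, here is a small Python sketch. The function names (euclidean, knn_predict), the choice of Euclidean distance as the “data distance”, the inverse-distance weighting, and the sample user data are all assumptions made for this example.

```python
import math
from collections import defaultdict

def euclidean(a, b):
    """Straight-line distance between two points with the same number of features."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(training, query, k=3):
    """Label a query point by a distance-weighted vote of its k nearest neighbors.

    training is a list of (point, label) pairs. Nothing is grouped in advance;
    all the work happens when the query arrives.
    """
    nearest = sorted(training, key=lambda pair: euclidean(pair[0], query))[:k]
    votes = defaultdict(float)
    for point, label in nearest:
        # Closer neighbors count for more; the small constant avoids division by zero.
        votes[label] += 1.0 / (euclidean(point, query) + 1e-9)
    return max(votes, key=votes.get)

if __name__ == "__main__":
    # Hypothetical users described by (visits per week, minutes per visit).
    training = [
        ((1, 2), "casual"), ((2, 3), "casual"), ((1, 4), "casual"),
        ((9, 30), "power"), ((8, 25), "power"), ((10, 28), "power"),
    ]
    print(knn_predict(training, (2, 2)))
    print(knn_predict(training, (9, 27)))
```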

Analyzing with decision trees

Another technique used to analyze massive datasets is the decision tree. A decision tree applies a series of hierarchical steps that split and group the data based on defined or learned characteristics until terminal nodes are reached. Like a taxonomy built on the fly, the data is quickly reduced and analyzed by way of a succession of decisions that form a hierarchy. This method is used to predict future patterns within the data based on past ones.
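The splitting process can be sketched in a few lines of Python. This is a simplified illustration, assuming numeric features, Gini impurity as the splitting criterion, and a fixed maximum depth; the names (gini, best_split, build_tree, predict) and the sample data are invented for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 means every label in the group is the same."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def best_split(rows, labels):
    """Return the (feature index, threshold) that lowers impurity most, or None."""
    best, best_score = None, gini(labels)
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[feature] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (feature, threshold), score
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Split recursively; a terminal node is simply the most common label."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    feature, threshold = split
    left = [(r, l) for r, l in zip(rows, labels) if r[feature] <= threshold]
    right = [(r, l) for r, l in zip(rows, labels) if r[feature] > threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree([r for r, _ in left], [l for _, l in left], depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [l for _, l in right], depth + 1, max_depth),
    }

def predict(tree, row):
    """Follow the decisions from the root down to a terminal node."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree

if __name__ == "__main__":
    # Hypothetical past sessions: (pages viewed, minutes on site) and the outcome.
    rows = [(1, 2), (2, 3), (8, 20), (9, 25), (3, 1), (10, 30)]
    labels = ["browser", "browser", "buyer", "buyer", "browser", "buyer"]
    tree = build_tree(rows, labels)
    print(tree)
    print(predict(tree, (9, 22)))
```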

Hadoop

A popular data mining technology that has emerged in recent years is Hadoop. Building on the methods described above, Hadoop uses MapReduce to parse and group the data and reduce it to smaller summaries. Using a distributed file system (HDFS, the Hadoop Distributed File System), Hadoop can quickly process large amounts of data while accounting for the inevitable failures that occur during these kinds of massive processes. Hadoop splits the data into blocks, stores copies of those blocks across clustered nodes, and accumulates the results as data is aggregated from multiple sources. While Hadoop is more of a technical implementation than a theoretical method, it is worth noting because it has been adopted by many major data-consuming companies as a solution to the inherent problems of mining big data.
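To show the shape of a MapReduce job without a cluster, here is a single-process Python sketch of the classic word count. It does not use Hadoop’s actual API; the function names (map_phase, shuffle, reduce_phase, word_count) are invented for this example, and the list of strings merely stands in for blocks that a real cluster would store on different nodes.

```python
from collections import defaultdict
from itertools import chain

def map_phase(block):
    """Map: emit a (word, 1) pair for every word in one block of text."""
    return [(word.lower(), 1) for word in block.split()]

def shuffle(pairs):
    """Shuffle: gather all values emitted for the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single total."""
    return key, sum(values)

def word_count(blocks):
    """Run map over every block, then shuffle and reduce the combined output."""
    mapped = chain.from_iterable(map_phase(block) for block in blocks)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())

if __name__ == "__main__":
    blocks = ["big data big insights", "data mining finds insights"]
    print(word_count(blocks))
```

Because each call to map_phase depends only on its own block, the map work could run on many machines at once, and a failed block could simply be mapped again elsewhere. That independence is what lets a system like Hadoop tolerate failures.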

Hubs and Authorities

This brings us to mining big data on the World Wide Web. The Web is by far the largest dataset of them all, and techniques for analyzing everything published on it have been in development since its early days. One well-known technique for wrangling and grouping large amounts of Web content is called Hubs and Authorities (also known as HITS). It looks not only at a page’s overall popularity and its inbound and outbound links, but also at where the page ranks within a given topic: an “authority” is a page that many good hubs link to, and a “hub” is a page that links to many good authorities. This way relevant pages can be presented to specific users based on their interests and search history, helping to disambiguate the increasing amount of content available on the Web.
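The mutual reinforcement between hubs and authorities can be sketched as a short iterative calculation. The Python below is an illustration under assumptions: the function name hits, the fixed 50 iterations, and the tiny link graph with its page names are all invented for the example, and a real system would first select a topic-specific set of pages to run on.

```python
import math

def hits(links, iterations=50):
    """Compute hub and authority scores for a link graph.

    links maps each page to the pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    outgoing = {page: set(links.get(page, ())) for page in pages}
    hub = dict.fromkeys(pages, 1.0)
    auth = dict.fromkeys(pages, 1.0)
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        auth = dict.fromkeys(pages, 0.0)
        for source, targets in outgoing.items():
            for target in targets:
                auth[target] += hub[source]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {page: v / norm for page, v in auth.items()}
        # A page's hub score is the sum of the authority scores of pages it links to.
        hub = {page: sum(auth[t] for t in outgoing[page]) for page in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {page: v / norm for page, v in hub.items()}
    return hub, auth

if __name__ == "__main__":
    # Hypothetical pages on one topic.
    links = {
        "portal": ["docs", "blog"],
        "directory": ["docs"],
        "docs": [],
        "blog": [],
    }
    hub, auth = hits(links)
    print(max(hub, key=hub.get), max(auth, key=auth.get))
```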

Bad data in the real world

And finally, in the real world we must always consider spam when mining big data. The techniques described before the Web mining section primarily deal with internal data, such as documents, articles, emails, and user data. But to properly mine data on the Web, one must account for spam. Fortunately, developers have created techniques to do just that.

“PageRank” is an idea we are all somewhat familiar with, but in today’s Web environment search engines (which are also page indexers) have learned what techniques spammers use to boost their pages artificially and have adapted their data mining techniques accordingly. Before Web pages are ranked, they can be analyzed and given a trust score (“TrustRank” is one well-known example) that reflects how likely they are to be spam. Similar to email spam filters, Web crawlers look for specific properties and structures commonly found in spam pages and score the page accordingly.
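One way to picture a trust score is as a variant of PageRank in which trust starts at a few hand-picked reputable pages and flows outward along links, so pages that no trusted page leads to end up with little or none. The Python sketch below follows that seed-based idea; it is an assumption-laden illustration, not how any particular search engine works. The function name trust_rank, the damping value of 0.85, and the example pages are all invented here, and the property-based checks that crawlers run on page content are a separate step that this sketch does not cover.

```python
def trust_rank(links, trusted_seeds, damping=0.85, iterations=50):
    """Spread trust from a set of trusted seed pages along outgoing links.

    links maps each page to the pages it links to.
    """
    seeds = set(trusted_seeds)
    pages = set(links) | {t for targets in links.values() for t in targets} | seeds
    seed_share = {page: (1.0 / len(seeds) if page in seeds else 0.0) for page in pages}
    trust = dict(seed_share)
    for _ in range(iterations):
        # Part of the trust is always handed back to the seeds.
        new_trust = {page: (1 - damping) * seed_share[page] for page in pages}
        for source in pages:
            targets = set(links.get(source, ()))
            if targets:
                # A page splits the trust it passes on evenly among its links.
                share = damping * trust[source] / len(targets)
                for target in targets:
                    new_trust[target] += share
            else:
                # A page with no links returns its trust to the seeds.
                for page in seeds:
                    new_trust[page] += damping * trust[source] * seed_share[page]
        trust = new_trust
    return trust

if __name__ == "__main__":
    # Hypothetical graph: two spam pages link only to each other.
    links = {
        "university": ["library"],
        "library": ["university", "news"],
        "news": [],
        "spam1": ["spam2"],
        "spam2": ["spam1"],
    }
    scores = trust_rank(links, trusted_seeds={"university"})
    for page in sorted(scores, key=scores.get, reverse=True):
        print(page, round(scores[page], 3))
```

In this toy graph the spam pages boost each other with links, which would help them under a purely popularity-based score, but no trust ever reaches them because nothing reachable from the seed links to them.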

So while there is a lot of heavy-duty mathematics behind these data mining techniques that was not discussed here, we can still understand the basics well enough to know what’s available to us as stakeholders of data. Using these techniques, we can discover patterns and design our solutions accordingly.