Data mining research themes keep evolving. Today’s topics are driven by two main issues. The first is the integration of new technical possibilities for distributed computing, such as cloud computing and multiprocessor systems. The second is the analysis of new objects of study, such as social networks. The first requires improving current tools so that they benefit from advances in computing capacity; the second introduces objects of study that current research themes only partially cover.
This article is the third in a trilogy on data mining:
1- Data mining: From data to knowledge
2- Data visualization, machine learning… Data mining on every front
3- The stakes of data mining in the age of cloud computing and social networks
Paul Erdős: both a subject of study and a leading figure in the study of social graphs. (Source: Wikipedia)
Unlike in the previous decade, and because of physical limits, gains in processor computing power now come less and less from higher clock frequencies and more from multiplying the units that execute the calculations (CPU cores). It has become quite common to see machines with 8-core processors, or many more in the case of servers. Until the first multi-core processors appeared, a little less than ten years ago, an algorithm could stay the same and simply run faster as the frequency increased. But once machines were equipped not with one faster brain but with several “brains”, speeding up software execution required rethinking the underlying algorithms and their implementation. The same holds when a computation is distributed in the cloud over a network of machines. How can we benefit from these technological evolutions?
Existing algorithms have to be adapted to parallel computing, or new ones must be created. In other words, the task has to be divided into substantial, independent parts that can be assigned to several processing units, taking advantage of their number. When an algorithm performs a series of calculations in which each one depends on the previous one, parallel computing reaches its limits: no two calculations can be carried out at the same time, and the algorithm is restricted to a single processing unit. The challenge, then, is to break the sequential chain of steps in an algorithm so that it can run in parallel.
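To make the contrast concrete, here is a minimal Python sketch. All function names are hypothetical and chosen for illustration only. The first computation (a sum of squares) splits into independent chunks that workers can process separately; the second is a recurrence in which each step needs the previous result. The sketch uses a thread pool from the standard library so that it runs anywhere; for CPU-bound work in CPython, one would typically swap in `ProcessPoolExecutor` to actually occupy several cores.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(values, n_chunks):
    """Split a list into at most n_chunks contiguous, roughly equal parts."""
    size = max(1, -(-len(values) // n_chunks))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def partial_sum_of_squares(chunk):
    """Independent unit of work: it needs nothing from the other chunks."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(values, n_workers=4):
    """Map independent chunks onto workers, then combine the partial results."""
    chunks = split_into_chunks(values, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum_of_squares, chunks))
    return sum(partials)


def sequential_recurrence(x0, n_steps):
    """Each step needs the previous value, so the loop cannot be split up."""
    x = x0
    for _ in range(n_steps):
        x = (3 * x + 1) % 1000  # arbitrary update rule, for illustration only
    return x
```

For example, `parallel_sum_of_squares(list(range(10)), n_workers=3)` returns 285, the same value as the plain sequential sum. No such decomposition exists for `sequential_recurrence` as written, because step k cannot start before step k-1 has finished.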
Social networks provide new research themes that also require rethinking existing problems. Traditionally, data mining deals with relatively linear data of finite dimensions. In a network, people are connected to one another. The object of interest is therefore not only the person but also that person’s links to others, which together form the network topology. The underlying idea is that, to understand the dynamics, the links between people matter as much as the people themselves.
The study of social networks isn’t limited to Facebook and the like. Relationships between coauthors of scientific publications, telecommunications, purchases, and web pages all have similar topological properties. Consequently, methods created to address one of these problems carry over to the others. The most famous algorithm in this area is probably PageRank, which is behind Google’s dominance over other search engines. Beyond measuring the popularity of web pages, data mining can also identify communities of interest within a group of people.
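As an illustration of how popularity can be derived from links alone, here is a simplified PageRank sketch using power iteration. It is a textbook-style version, not Google's production algorithm. The function name, the damping factor of 0.85, the fixed number of iterations, and the handling of pages without outgoing links are all assumptions made for this example.

```python
def pagerank(links, damping=0.85, n_iter=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Returns a dict mapping each page to a score; the scores sum to 1.
    """
    # Collect every page, including those that only appear as link targets.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    rank = {page: 1.0 / n for page in pages}
    for _ in range(n_iter):
        # Every page receives a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outgoing links spreads its
                # score evenly over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

With `links = {"a": ["c"], "b": ["c"], "c": []}`, page `c` ends up with the highest score because both other pages point to it. The same function applies unchanged to a coauthorship or telephone-call graph, which is what makes these methods transferable.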
Because of their complexity and the huge amount of data involved, these issues usually call for one of the new data mining techniques. Sampling the data is indeed questionable, since extracting a representative subgraph from a graph isn’t necessarily a simple matter.
Finally, another current issue is the study of dynamics within these very large graphs. For instance, a telecom operator would want to anticipate a client’s departure (attrition) by looking at his or her phone calls. In scientific publishing, we want to anticipate new dynamics that could arise within a community of authors, in order to predict intersecting themes. More generally, we want to predict the emergence of new connections in a very large graph, so as to anticipate how the topology of the studied space will evolve. One of the most widely used approaches is to enrich the nodes with topological properties that characterize their place within the network. We can then proceed to a more standard analysis, limiting ourselves to analyzing the nodes independently.
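The enrichment step can be sketched as follows: each node of the graph is turned into a row of topological features, which a standard classifier could then consume as ordinary tabular data. The two features chosen here (degree and local clustering coefficient) and the function names are assumptions for illustration; the article does not specify which properties are used in practice.

```python
from collections import defaultdict
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map from (u, v) pairs, ignoring self-loops."""
    adjacency = defaultdict(set)
    for u, v in edges:
        if u != v:
            adjacency[u].add(v)
            adjacency[v].add(u)
    return dict(adjacency)


def node_features(edges):
    """Describe each node by its degree and local clustering coefficient."""
    adjacency = build_adjacency(edges)
    features = {}
    for node, neighbours in adjacency.items():
        degree = len(neighbours)
        # Number of neighbour pairs that could be linked to each other.
        possible = degree * (degree - 1) // 2
        # Number of neighbour pairs that actually are linked.
        closed = sum(
            1 for a, b in combinations(neighbours, 2) if b in adjacency[a]
        )
        clustering = closed / possible if possible else 0.0
        features[node] = {"degree": degree, "clustering": clustering}
    return features
```

For the edges `[("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]`, node `a` gets degree 2 and clustering 1.0 (its two contacts know each other), while node `d` gets degree 1 and clustering 0.0. In the telecom example, such rows could be joined with ordinary customer attributes before training an attrition model.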
Not only are social networks a new subject of study that creates new data mining applications, but they also answer the “why” of data mining. The emergence of parallel computing, on the other hand, responds to the “how”.
About the author:
Fräntz Miccoli studied engineering at EISTI (International School of Science and Information Processing) and management at the Grenoble School of Management. As an entrepreneur, he has created IT and communication firms: KenaGard and more recently izzijob. He is interested in innovation and, more specifically, in developing information processing sciences.
Find out more:
A few details on limitations regarding CPU power
PageRank Algorithm – The Mathematics of Google Search
Gephi, “an interactive visualization and exploration platform for all kinds of networks and complex systems, dynamic and hierarchical graphs”