Data mining is a melting pot of several related disciplines. It is often reduced to machine learning, the area in which the most progress has been made, but it is not restricted to it. Data mining also calls on methods for manipulating data, visualizing them and preprocessing them; machine learning only comes in later, even if it is the most time-consuming step.
This article is the second part of a trilogy on data mining:
2- Data visualization, machine learning… Data mining on every front
3- The stakes of data mining in the age of cloud computing and social networks (Coming soon in English)
Among the various stages of a data mining process, the first usually consists of data visualization. Although, regrettably, little emphasis is put on it in university curricula, which may be too attached to the rationality of numbers, visualization is an interesting subject, because it is one of the first tools used in the analysis chain. As the saying goes, “a picture is worth a thousand words”, and visualization is also one of the most accessible aspects for laymen. To understand the challenge of data visualization, it is important to realize that data often live in spaces of around a hundred dimensions, and sometimes reach thousands or even millions of dimensions in the most extreme cases. While we can represent a great deal of data faithfully in a two-dimensional space, we already have to resort to tricks for three dimensions. A fourth dimension is usually rendered as time, by simple analogy with our perception of the physical world. Representing more than four dimensions is a complex matter: the eye must still be able to draw analogies with the physical world. For instance, the statement “A is close to B” must be true both in the representation and in the data. One of the most common solutions is to plot the data on every possible two-dimensional projection, that is, one scatter plot per pair of attributes. (See illustration: the data represented relate to the classification of three plant species according to physiological characteristics.)
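The idea of “every possible projection” can be sketched in a few lines of Python. The data below are made up for the example (four plant-style measurements per observation, echoing the illustration); enumerating all pairs of attributes gives one two-dimensional view per pair, which is exactly what a scatter-plot matrix draws:

```python
from itertools import combinations

# Hypothetical 4-dimensional observations: each row is one plant,
# each column one physiological measurement (names are illustrative).
attributes = ["sepal_len", "sepal_wid", "petal_len", "petal_wid"]
points = [
    (5.1, 3.5, 1.4, 0.2),
    (7.0, 3.2, 4.7, 1.4),
    (6.3, 3.3, 6.0, 2.5),
]

# Every possible 2-D projection is simply every pair of attribute indices.
projections = {}
for i, j in combinations(range(len(attributes)), 2):
    name = f"{attributes[i]} vs {attributes[j]}"
    projections[name] = [(p[i], p[j]) for p in points]

# 4 dimensions yield C(4, 2) = 6 scatter plots.
print(len(projections))                       # 6
print(projections["sepal_len vs petal_len"])  # [(5.1, 1.4), (7.0, 4.7), (6.3, 6.0)]
```

The number of views grows quadratically with the dimension, which is why this trick stops being practical long before the hundred-dimension spaces mentioned above.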
Data preprocessing is also one of the neglected themes of data mining. It aims to compensate for the shortcomings of the data: incorrect values, extreme values, missing values. Some algorithms are not affected by missing values, whereas others are. What should we do in that case? One idea could be to delete the troublesome rows. Out of several million or billion records, the operation could be harmless. However, if you face a problem with 100 attributes, each of which is missing in 1% of cases, you could end up eliminating the majority of your rows (roughly 63% of them, if the gaps are independent), simply because one cell is missing in each line. Almost all data mining algorithms work by treating their data as rows of a database table, so one of the preprocessing stages can also consist of transforming the data to reduce them to such rows. This preprocessing, widely used in social network analysis, can become a critical and time-consuming stage when the object of study is complex: a sound, a text or a picture. In the latter case, image processing tools such as binarization, dilation and edge detection are generally used, and a picture will usually be reduced to its signature before being manipulated further.
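The scale of that loss is easy to check with a back-of-the-envelope computation. Assuming each of the 100 cells in a row is missing independently with probability 1%, a row survives listwise deletion only if all 100 cells are present:

```python
# Probability that a row of 100 attributes has no missing cell,
# assuming each cell is missing independently with probability 1%.
p_missing = 0.01
n_attributes = 100

p_row_complete = (1 - p_missing) ** n_attributes  # 0.99 ** 100
p_row_dropped = 1 - p_row_complete

print(f"rows kept:    {p_row_complete:.1%}")   # ~36.6%
print(f"rows dropped: {p_row_dropped:.1%}")    # ~63.4%
```

In real data the gaps are rarely independent, so the actual loss can be worse still, which is why deleting rows is usually the last resort.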
One of the main stakes of data mining is the automation of knowledge discovery. This is the central theme of machine learning. Many people even reduce data mining to this single field of study, as we mentioned in our previous article. There are two broad families of models here: supervised (or semi-supervised) learning and unsupervised learning. Supervised learning addresses classification problems. In this case, there is a way to check what the algorithm has learned and, like a teacher pointing out a pupil’s mistakes, we allow the algorithm to “understand” how to correct itself. Among these methods we find artificial neural networks, decision trees, Bayesian approaches and regression analyses. On other problems, though, we would like structure to emerge about which we have no prior knowledge; this is clustering. For instance, if we ask an algorithm to divide a corpus of texts into groups, the result could be a thematic or a stylistic classification, depending on the data initially given. This type of method has a very concrete application in trying to identify the authors of anonymous texts. Since many of these models are parametric, we observe a certain similarity to operational research problems, more commonly called “optimization” by laymen.
As data are generally scattered across heterogeneous systems or organized in an unusable way, data mining is never far removed from database and data warehouse1 manipulation. These tasks are usually gathered under the acronym ETL (Extract, Transform, Load), whose aim is to find common ground between heterogeneous data, structure them and facilitate their analysis. For instance, a data warehouse will often aggregate data by grouping them together (e.g. by month and by date for invoices) and precalculate averages. This makes what would be very long calculations on the unit-level data almost instantaneous. This, along with data visualization and dashboard creation, represents the work of half the business intelligence consultants out there.
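That precalculation idea can be sketched in a few lines of Python (the invoice records and field names are invented for the example): one pass over the unit-level rows builds the per-month aggregates a warehouse would store, and later queries read them directly instead of rescanning every invoice.

```python
from collections import defaultdict

# Hypothetical unit-level records: (month, amount) of individual invoices.
invoices = [
    ("2011-01", 120.0),
    ("2011-01", 80.0),
    ("2011-02", 200.0),
    ("2011-02", 100.0),
    ("2011-02", 60.0),
]

# ETL-style precalculation: one pass over the raw rows builds the
# aggregates (total, count and average per month).
totals = defaultdict(lambda: [0.0, 0])
for month, amount in invoices:
    totals[month][0] += amount
    totals[month][1] += 1

summary = {m: {"total": t, "count": n, "average": t / n}
           for m, (t, n) in totals.items()}

print(summary["2011-01"]["average"])  # 100.0
print(summary["2011-02"]["total"])    # 360.0
```

On billions of invoices the single aggregation pass is the expensive part; every dashboard query afterwards is a cheap dictionary lookup, which is the whole point of the warehouse.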
1 A database structured so as to optimize data analysis rather than data manipulation.
About the author:
Fräntz Miccoli studied engineering at EISTI (International School of Science and Information Processing) and management at the Grenoble School of Management. As an entrepreneur, he has created IT and communication firms: KenaGard and more recently izzijob. He is interested in innovation and, more specifically, in developing information processing sciences.
Find out more:
Visualize your social network and its communities on LinkedIn:
A leading publication on the subject:
A tutorial on Weka, one of the most widely used solutions among universities: