Information technology has allowed us to accumulate enormous quantities of data on a wide variety of subjects, from the human genome to everyday sales transactions and textual data. Raw data, whether digital or printed, is of little use in itself. Data mining (also called "knowledge discovery from data") is what gives it value. How can we transform data into sources of knowledge? This is one of the fundamental questions data experts ask themselves. Data mining seeks answers at the crossroads of numerous other domains, whether in the form of tools, like statistics and operations research, or of fields of application, like sociology, marketing and biology.
This article is the first part of a trilogy on data mining; translations of parts 2 & 3 to follow.
1- Data mining: From data to knowledge
2- Data visualization, machine learning… Data mining on every front
3- The stakes of data mining in the age of cloud computing and social networks
This article is a translation of Le data mining : des données au savoir. It was translated from French by Julia Troufflard.
Mining out of Silverton (adambarhan/flickr)
The expression "data mining" is inadequate. It began as a pejorative term that earned a better reputation over time and eventually came to name a discipline. It refers to raw data, while the real object of interest is the resulting knowledge. Some authors consider the phrase "data mining" about as apt as calling a gold or diamond mine an "earth mine": data is not the end product but the raw material. Today, it is accepted that the expression "knowledge discovery from data" (KDD) is more precise and less debatable, though it is not used as often. Dating the emergence of this science is far from easy, because a legitimate question must first be answered: should we count from the creation of the first databases (several thousand years ago)? From the development of the first algorithms that, like neural networks, form its basis today (going back to McCulloch and Pitts's work in 1943)? Or from the algorithms developed specifically for the discipline in the 1980s?
Without a precise date of birth, the rise of data mining is nevertheless linked to an origin myth almost always mentioned in introductions to the subject. It is said that one day Walmart, the huge American retail group, decided to analyze its vast sales database and uncovered a significant correlation between sales of diapers and beer on Saturdays. Why? It turned out that Saturday was a favorite day for last-minute diaper purchases, and fathers who came to shop for the family were tempted to buy beer for the evening's sporting events. Walmart supposedly used this insight to reorganize its shelves to encourage this purchasing behavior, and, it is said, sales increased as a result. Today, the anecdote appears to be more legend than reality.
How can we extract knowledge from data? This question is usually divided into two sub-problems. The first is predicting a phenomenon (classification). For instance, say we have a set of information from a database of banking operations: amount, date, account balance, place of transaction, average transaction amount, average transaction frequency by location, and whether the operation was fraudulent. The goal is to identify the characteristics of a fraudulent transaction. We must keep in mind that this is not about testing a hypothesis, as we would in statistics, but about starting from the data alone and expecting the system to produce a hypothesis in the form of a model.
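The idea of letting the data produce the hypothesis can be sketched with a toy classifier. Here is a minimal, purely illustrative example: the transactions, field choice (amount only) and threshold search are all hypothetical, not a real fraud-detection method. The "model" the system produces is a single learned rule, a so-called decision stump.

```python
# Toy transaction records: (amount, fraudulent?). Purely invented data.
transactions = [
    (12.50, False), (40.00, False), (75.00, False), (90.00, False),
    (850.00, True), (1200.00, True), (60.00, False), (950.00, True),
]

def learn_amount_threshold(data):
    """Search for the amount threshold that best separates fraudulent
    from legitimate operations (a one-feature 'decision stump')."""
    best_threshold, best_errors = None, len(data) + 1
    for amount, _ in data:
        # Candidate hypothesis: operations above this amount are fraud.
        errors = sum((a > amount) != fraud for a, fraud in data)
        if errors < best_errors:
            best_threshold, best_errors = amount, errors
    return best_threshold, best_errors

threshold, errors = learn_amount_threshold(transactions)
print(f"learned rule: amount > {threshold} => fraud ({errors} errors)")
```

No hypothesis was supplied in advance: the rule "amount > 90 means fraud" is read off the data itself, which is the shift in perspective the paragraph describes.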
The second aims at structuring a set of data so as to form coherent groups according to sometimes varied criteria (clustering). Here the obvious example that comes to mind is a browser, but we could also think of building a market segmentation, which mathematically comes down to determining subsets of the data space corresponding to zones of highly concentrated points, such that each point belongs to only one subset.
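The segmentation idea can be illustrated with a small sketch of k-means, a classic clustering algorithm (the data, the one-dimensional setting and the choice of k = 2 are all illustrative assumptions, not part of the article's example).

```python
def kmeans_1d(points, k=2, iterations=20):
    """Partition 1-D points into k disjoint groups around moving centers."""
    # Deterministic initialization: spread the centers across the sorted data.
    pts = sorted(points)
    centers = [pts[i * (len(pts) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Two visibly separated "segments" of, say, monthly customer spending:
spend = [10, 12, 11, 13, 95, 100, 98, 102]
centers, clusters = kmeans_1d(spend, k=2)
print(centers)    # one center per zone of concentrated points
```

Each point ends up in exactly one cluster, matching the mathematical description above: disjoint subsets around zones where the points concentrate.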
It is quite common to hear data mining accused of being nothing more than statistics, bastardized by a lack of formality. Although data mining does call on statistical methods, such as regression analysis, it cannot be reduced to them. The domain has its own tools, borrows others from neighboring disciplines, and offers new perspectives of its own, for instance through its close ties to information theory. Two amusing quotations illustrate this debate:
"Machine learning (Author's note: a field of artificial learning closely linked to data mining) is statistics minus any checking of models and assumptions." Brian D. Ripley
To which Andrew Gelman responded:
"In that case, maybe we should get rid of checking of models and assumptions more often. Then maybe we'd be able to solve some of the problems that the machine learning people can solve but we can't!"
Data mining applications are legion. In marketing, data are the basis for suggesting promotions that encourage previously identified purchasing behaviors. They can also drive product recommendations based on shoppers' previous purchases, as Amazon does. In banking, data mining is used for "credit scoring", that is, deciding whether or not to grant a loan based on the information available about the applicant, with the aim of reducing the number of defaulters. In quality control, it can help minimize breakdowns by shedding light on how they occur. Police forces also use it to identify and anticipate "risk zones". In biology, data mining is used to identify the causes of certain diseases; this is how, for instance, the sequencing of the human genome opens new doors. Even if the amount of data contained in the encoded amino acid sequences is far beyond current technologies, we can imagine that, in the future, analyzing these data will allow the study of genetic factors in certain illnesses.
About the author:
Fräntz Miccoli studied engineering at EISTI (International School of Science and Information Processing) and management at the Grenoble School of Management. As an entrepreneur, he has created IT and communication firms: KenaGard and more recently izzijob. He is interested in innovation and, more specifically, in developing information processing sciences.
Find out more:
Charleston police to use data mining to fight crime:
The collaboration graph with Paul Erdös, a benchmark in mathematics publication:
A leading publication on the subject:
A few words on McCulloch and Pitts’ Artificial Neuron model (1943): http://en.wikipedia.org/wiki/Artificial_neuron#History