Handling Large Data Files: A Deterministic Approach

Disciplines
  • Computer Science

Abstract

Induction systems have been successfully applied in a wide range of learning applications. However, they do not scale up to large scientific and business data sets. Applying a large training set (e.g., one million patterns) to a learning algorithm results in:

  • an excessive amount of training time;
  • the inability to address the training set.

This thesis presents a feasible solution to the problems caused by limited time (e.g., training time) and limited space (e.g., main memory). Both problems have a joint cause: a data set (e.g., a training set) that is too large is applied to an algorithm (e.g., a machine learning algorithm). One problem appears as a shortage of time, the other as a shortage of space. Generalizing both problems yields a single problem, and a deterministic approach to this problem is needed to provide a convenient premise. In other words, the joint cause of both problems implies a joint solution, which can be found by a deterministic approach to the matter.

The essence of the solution is a histogram of each dimension of the data space (the data space is defined by the data set). The histograms are equalized by an operation closely related to histogram equalization, namely bin (bar) equalizing. Combining all histograms into a single data structure yields a so-called mirror image of the data set. The mirror image provides information on the data set, and its resolution or accuracy depends on the number of bins in the histograms of which it is composed. An equalized histogram of a specific dimension can be interpreted as an intersection of the data space. This intersection provides information on the dimension at issue only; it does not provide information on the other dimensions, i.e., a single intersection is one-dimensional. The mirror image combines the intersections and, as a result, provides information on all dimensions of the data space.
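The abstract does not spell out how bin equalizing works, so the following is only a minimal sketch under a stated assumption: that an equalized histogram is one whose bin edges are placed so that every bin holds roughly the same number of records (equal-frequency bins). The names `equalized_edges`, `build_mirror_image`, and `locate` are hypothetical and do not come from the thesis.

```python
from bisect import bisect_right


def equalized_edges(values, n_bins):
    """Return the interior bin edges for one dimension.

    Assumption (not from the thesis): "bin equalizing" is modelled as
    equal-frequency binning, so edge k sits at the k/n_bins quantile.
    """
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(k * n) // n_bins] for k in range(1, n_bins)]


def build_mirror_image(records, n_bins):
    """Combine one equalized histogram per dimension into a single structure."""
    n_dims = len(records[0])
    return [equalized_edges([r[d] for r in records], n_bins)
            for d in range(n_dims)]


def locate(mirror, record):
    """Map a record to its location: one bin index per dimension."""
    return tuple(bisect_right(edges, x) for edges, x in zip(mirror, record))


# Hypothetical two-dimensional data set; the second dimension is skewed.
records = [(i, i * i) for i in range(100)]
mirror = build_mirror_image(records, n_bins=4)
print(mirror)                      # bin edges per dimension
print(locate(mirror, (60, 3600)))  # bin index per dimension
```

In this sketch the mirror image stores only `n_bins - 1` edges per dimension, so its size depends on the chosen resolution and the number of dimensions, not on the number of records. Each list of edges on its own describes a single dimension; only the combination locates a record in the full data space.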
The mirror image is a small structure that efficiently provides information on the data set. Each record in the data set defines a data point at a specific location in the data space. By verifying this location against the mirror image (one record at a time), a record is either copied into a reduced data set (i.e., the sample set) or rejected. In other words, a record is either suitable or not suitable (i.e., it can or cannot provide useful information to the sample set). This process is called:

  • Deterministic sampling.

If a record has to be retrieved from a data set, the same process can be used. The only difference is the source of the properties of a record: they are now supplied by, e.g., the learning algorithm and not by the record itself. Addressing by means of the mirror image closely resembles deterministic sampling, and it is therefore called:

  • Deterministic addressing.

Except for their premise, deterministic sampling and deterministic addressing do not differ. After all, both resource-related problems have a joint cause, and a joint cause implies a joint solution.
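The two processes can be illustrated with a short sketch. The abstract does not state the rule that decides whether a record is suitable, so the rule used here is an assumption: a record is accepted only while its cell in the mirror image holds fewer than `quota` sampled records. Equal-frequency bin edges stand in for the equalized histograms, and all function names are hypothetical.

```python
from bisect import bisect_right
from collections import Counter, defaultdict


def build_mirror_image(records, n_bins):
    """Equal-frequency bin edges per dimension (assumed form of the mirror image)."""
    mirror = []
    for d in range(len(records[0])):
        ordered = sorted(r[d] for r in records)
        n = len(ordered)
        mirror.append([ordered[(k * n) // n_bins] for k in range(1, n_bins)])
    return mirror


def locate(mirror, properties):
    """Location in the data space: one bin index per dimension."""
    return tuple(bisect_right(edges, x) for edges, x in zip(mirror, properties))


def deterministic_sample(records, mirror, quota=1):
    """Copy a record into the sample set or reject it, one record at a time.

    Assumed suitability rule: accept while the record's cell is below `quota`.
    """
    filled = Counter()
    sample = []
    for record in records:
        cell = locate(mirror, record)
        if filled[cell] < quota:
            filled[cell] += 1
            sample.append(record)
    return sample


def build_index(records, mirror):
    """Group records by cell so they can be retrieved by location."""
    index = defaultdict(list)
    for record in records:
        index[locate(mirror, record)].append(record)
    return index


def deterministic_address(index, mirror, properties):
    """Retrieve records using properties supplied by the caller, not by a record."""
    return index.get(locate(mirror, properties), [])


records = [(i, i) for i in range(100)]
mirror = build_mirror_image(records, n_bins=4)
sample = deterministic_sample(records, mirror)
index = build_index(records, mirror)
print(sample)
print(len(deterministic_address(index, mirror, (30, 30))))
```

Both functions rely on the same `locate` step. Sampling takes the properties from each record in turn, while addressing takes them from the caller, which mirrors the single difference described above. No random number generator is involved, so repeated runs over the same data in the same order return the same sample.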
