An Efficient Data Indexing Approach on Hadoop Using Java Persistence API
- Publication Date: Oct 13, 2010
- DOI: 10.1007/978-3-642-16327-2_27
- OAI: oai:HAL:hal-01055056v1
- Source: HAL-SHS
- Language: English
- License: Unknown
Abstract
Data indexing is common in data mining when working with high-dimensional, large-scale data sets. Hadoop, a cloud computing project that implements the MapReduce framework in Java, has attracted significant interest in distributed data mining. To address the problems of globalization, random writes, and duration in Hadoop, we present a data indexing approach that uses the Java Persistence API (JPA) and describe it through the implementation of a KD-tree algorithm on Hadoop. We also propose an improved intersection algorithm for distributed data indexing on Hadoop; it has O(M + log N) complexity and is suited to cases involving multiple intersections. We compare the data indexing algorithm on an open data set and a synthetic data set in a modest cloud environment. The results show that the algorithms are feasible for large-scale data mining.
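As a point of reference for the KD-tree mentioned in the abstract, the following is a minimal in-memory sketch in Java of the generic technique: median-split construction followed by an axis-aligned range query. It is not the authors' implementation and involves neither Hadoop nor JPA; the class name KdTreeSketch and the methods build and rangeQuery are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustrative in-memory KD-tree (assumption: not the paper's Hadoop/JPA code).
class KdTreeSketch {
    // One tree node: a point plus the subtrees on either side of the split.
    static final class Node {
        final double[] point;
        final Node left;
        final Node right;

        Node(double[] point, Node left, Node right) {
            this.point = point;
            this.left = left;
            this.right = right;
        }
    }

    // Builds a balanced tree by splitting on the median, cycling axes by depth.
    static Node build(double[][] points) {
        // Work on a shallow copy so the caller's array order is left untouched.
        return build(points.clone(), 0, points.length, 0);
    }

    private static Node build(double[][] pts, int from, int to, int depth) {
        if (from >= to) {
            return null;
        }
        int axis = depth % pts[from].length;
        Arrays.sort(pts, from, to, Comparator.comparingDouble((double[] p) -> p[axis]));
        int mid = from + (to - from) / 2;
        return new Node(pts[mid],
                build(pts, from, mid, depth + 1),
                build(pts, mid + 1, to, depth + 1));
    }

    // Collects every point inside the axis-aligned box [lo, hi], bounds inclusive.
    static List<double[]> rangeQuery(Node root, double[] lo, double[] hi) {
        List<double[]> out = new ArrayList<>();
        search(root, lo, hi, 0, out);
        return out;
    }

    private static void search(Node n, double[] lo, double[] hi, int depth, List<double[]> out) {
        if (n == null) {
            return;
        }
        int axis = depth % lo.length;
        if (contains(n.point, lo, hi)) {
            out.add(n.point);
        }
        // The left subtree holds values <= the split value on this axis,
        // the right subtree holds values >= it, so prune accordingly.
        if (lo[axis] <= n.point[axis]) {
            search(n.left, lo, hi, depth + 1, out);
        }
        if (hi[axis] >= n.point[axis]) {
            search(n.right, lo, hi, depth + 1, out);
        }
    }

    private static boolean contains(double[] p, double[] lo, double[] hi) {
        for (int i = 0; i < p.length; i++) {
            if (p[i] < lo[i] || p[i] > hi[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A caller would pass an array of equal-length points to KdTreeSketch.build and then query the returned root with lower and upper corner arrays of the same dimension. Re-sorting each slice during construction keeps the sketch short at the cost of extra build time; a persistent or distributed index of the kind the abstract describes would need a different storage layer.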