Multidimensional Feature Selection and High Performance ParalleX

Authors
  • Niedzielewski, Karol (1)
  • Marchwiany, Maciej E. (1)
  • Piliszek, Radoslaw (2)
  • Michalewicz, Marek (1)
  • Rudnicki, Witold (1, 2, 3)
  • (1) University of Warsaw, Interdisciplinary Centre for Mathematical and Computational Modelling (ICM), Warsaw, Poland
  • (2) University of Bialystok, Computational Centre, Białystok, Poland
  • (3) University of Bialystok, Institute of Informatics, Białystok, Poland
Type
Published Article
Journal
SN Computer Science
Publisher
Springer Singapore
Publication Date
Oct 24, 2019
Volume
1
Issue
1
Identifiers
DOI: 10.1007/s42979-019-0037-5
Source
Springer Nature
License
Green

Abstract

The large amount of stored information now available, combined with machine learning and statistical methods, enables high-quality analysis of data that leads to the design of high-precision predictive and classification systems. In this analysis, selecting the most informative features is crucial to the quality of the resulting system. In this report, we propose two implementations of the multidimensional feature selection (MDFS) algorithm (Piliszek et al. in Mdfs-multidimensional feature selection. arXiv preprint. arXiv:1811.00631, 2018) that can be used in distributed environments to detect all-relevant variables in data sets with a discrete decision variable. While most methods discard information about interactions between features, MDFS is designed to identify informative variables that are not relevant when considered alone but are relevant in groups. We developed the software using C++ and High Performance ParalleX (HPX) (Kaiser et al. in STEllAR-GROUP/hpx: HPX V1.3.0: the C++ Standards library for parallelism and concurrency. 2019. 10.5281/zenodo.3189323, 2019) to achieve high performance, scalability, and portability. HPX is a library that uses lightweight threads, asynchronous communication, and asynchronous task submission based on declarative criteria of work. These features enabled us to explore the granularity and parallelism of the MDFS algorithm in depth. The software is written entirely in C++; therefore, calculations can be performed using CPUs on desktops, distributed systems, and any system with C++ compiler support. In tests on a Cray XC40 (Okeanos) using artificially prepared data, we achieved a 196-fold speedup on 256 nodes compared to a single node. As a result, the ICM computing facility is now capable of massively parallel feature engineering.
The main purpose of the software is to enable researchers to analyze genomics data more accurately when searching for multiple correlations among potential sources of diseases.
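
The claim that a variable can be irrelevant alone yet relevant in a group can be illustrated with a small C++ sketch. It is not the authors' implementation and does not use HPX or the MDFS test statistic. It only computes plain mutual information, in bits, between a discrete decision variable and a tuple of discrete features. The function names `entropy_bits` and `mutual_information_bits` are hypothetical and chosen for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Shannon entropy (in bits) of an empirical distribution given as counts
// over n observations.
template <typename Key>
double entropy_bits(const std::map<Key, std::size_t>& counts, std::size_t n) {
    double h = 0.0;
    for (const auto& kv : counts) {
        const double p = static_cast<double>(kv.second) / static_cast<double>(n);
        h -= p * std::log2(p);
    }
    return h;
}

// Mutual information I(Y; X) = H(Y) + H(X) - H(X, Y), in bits.
// Each element of x_rows is the tuple of discrete feature values for one
// observation, so the same function handles one feature or a group.
// Illustrative helper only; not part of MDFS or HPX.
double mutual_information_bits(const std::vector<std::vector<int>>& x_rows,
                               const std::vector<int>& y) {
    assert(x_rows.size() == y.size() && !y.empty());
    std::map<int, std::size_t> count_y;
    std::map<std::vector<int>, std::size_t> count_x;
    std::map<std::pair<std::vector<int>, int>, std::size_t> count_xy;
    for (std::size_t i = 0; i < y.size(); ++i) {
        ++count_y[y[i]];
        ++count_x[x_rows[i]];
        ++count_xy[std::make_pair(x_rows[i], y[i])];
    }
    const std::size_t n = y.size();
    return entropy_bits(count_y, n) + entropy_bits(count_x, n) -
           entropy_bits(count_xy, n);
}
```

As a usage example, take the four-row XOR table, where the decision is 1 exactly when the two binary features differ. Calling `mutual_information_bits` with either feature alone gives 0 bits, while calling it with both features together gives 1 bit. A one-dimensional filter would therefore discard both features, which is the kind of interaction a multidimensional search is meant to recover.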
