A systematic study on latent semantic analysis model parameters for mining biomedical literature

BMC Bioinformatics
Springer (Biomed Central Ltd.)
DOI: 10.1186/1471-2105-10-s7-a6
  • Meeting Abstract
  • Computer Science
  • Linguistics
  • Medicine

Abstract

Mohammed Yeasin1, Haritha Malempati*1, Ramin Homayouni2 and Mohammad Shahed Sorower1

Address: 1Department of Electrical and Computer Engineering, University of Memphis, Memphis, TN 38111, USA and 2Bioinformatics Program, University of Memphis, Memphis, TN 38111, USA

Email: Haritha Malempati* - [email protected]

* Corresponding author

Background and rationale

Latent semantic analysis (LSA) is considered to be an efficient text mining technique [1], but most approaches developed on this paradigm are based on ad hoc principles. A systematic study of the parameters affecting the performance of LSA is expected to provide guidelines for objectively selecting the LSA model parameters in a way that is consistent with the data and the application. In this study, empirical analyses were conducted using a previously published 50-gene data set [2] to examine the effects of the following parameters (outlined in Figure 1): (i) stemming, stop-words and word counts (to discard abstracts without enough information), (ii) corpus content (e.g., abstracts with and without titles), (iii) inclusion or exclusion of the dc component or first eigenvector (which adds bias to the model), (iv) objective criteria to choose the number of factors (eigenvectors) used to create the model, and (v) information-theoretic criteria to select features (words in the corpus) instead of considering the complete set of features.

Methodology

Two datasets, one with titles and abstracts and the other with only abstracts, were used to conduct empirical analyses. Preprocessing steps included stemming, stop word removal, as well as removal of documents with less than [...] using the dataset. Singular value decomposition (SVD) on the TF-IDF matrix was used to compute the encoding of the dataset and only k components were retained bas
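The pipeline outlined in the Methodology (stop-word removal, a word-count filter, a TF-IDF matrix, and an SVD truncated to k components) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The toy corpus, the stop-word list, the `min_words` threshold, the particular TF-IDF weighting (raw counts times log(N/df)), and the function names `tokenize`, `tfidf_matrix`, `lsa_encode` and `cosine` are all assumptions made for illustration. Stemming is omitted, and the `drop_first` flag is only one possible reading of "exclusion of the dc component or first eigenvector".

```python
import re

import numpy as np

# Hypothetical toy corpus and stop-word list; not data from the study.
DOCS = [
    "Gene expression regulates neuronal development",
    "Neuronal development depends on gene expression in the brain",
    "Protein kinase signaling controls cell growth",
    "Cell growth and protein kinase activity in tumors",
]
STOP_WORDS = {"the", "in", "on", "and", "of", "a"}


def tokenize(text):
    """Lowercase, split on non-letters, and drop stop words (no stemming)."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf_matrix(docs, min_words=1):
    """Return a (terms x documents) TF-IDF matrix and its vocabulary.

    Documents with fewer than `min_words` tokens are discarded first.
    The weighting used here (count * log(N / df)) is an assumed variant.
    """
    tokenized = [toks for toks in map(tokenize, docs) if len(toks) >= min_words]
    vocab = sorted({t for toks in tokenized for t in toks})
    index = {t: i for i, t in enumerate(vocab)}
    tf = np.zeros((len(vocab), len(tokenized)))
    for j, toks in enumerate(tokenized):
        for t in toks:
            tf[index[t], j] += 1.0
    df = np.count_nonzero(tf, axis=1)   # number of documents containing each term
    idf = np.log(len(tokenized) / df)
    return tf * idf[:, None], vocab


def lsa_encode(matrix, k, drop_first=False):
    """Encode each document with k singular components of the matrix.

    With drop_first=True the leading component is skipped and the next k
    components are used instead.
    """
    _, s, vt = np.linalg.svd(matrix, full_matrices=False)
    start = 1 if drop_first else 0
    s_k = s[start:start + k]
    vt_k = vt[start:start + k, :]
    return (s_k[:, None] * vt_k).T      # shape: (n_documents, k)


def cosine(a, b):
    """Cosine similarity between two document encodings."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


if __name__ == "__main__":
    matrix, vocab = tfidf_matrix(DOCS)
    encoded = lsa_encode(matrix, k=2)
    print("same topic:     ", cosine(encoded[0], encoded[1]))
    print("different topic:", cosine(encoded[0], encoded[2]))
```

In this toy corpus the first two documents share no words with the last two, so with k = 2 each topic falls on its own component. Documents on the same topic then get nearly identical encodings, while documents on different topics get nearly orthogonal ones.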
