b-move: faster bidirectional character extensions in a run-length compressed index.

Authors
  • Depuydt, Lore (1)
  • Renders, Luca (1)
  • Van de Vyver, Simon (2)
  • Veys, Lennart (2)
  • Gagie, Travis (3)
  • Fostier, Jan (1)
  • (1) Ghent University - imec, Technologiepark 126, 9052 Ghent, Belgium
  • (2) Ghent University, Technologiepark 126, 9052 Ghent, Belgium
  • (3) Dalhousie University, 6050 University Avenue, PO BOX 15000, Halifax, NS B3H 4R2, Canada
Type
Published Article
Journal
bioRxiv : the preprint server for biology
Publication Date
Jun 02, 2024
Identifiers
DOI: 10.1101/2024.05.30.596587
PMID: 38854079
Source
Medline
Language
English
License
Unknown

Abstract

Due to the increasing availability of high-quality genome sequences, pan-genomes are gradually replacing single consensus reference genomes in many bioinformatics pipelines to better capture genetic diversity. Traditional bioinformatics tools using the FM-index face memory limitations with such large genome collections. Recent advancements in run-length compressed indices, such as Gagie et al.'s r-index and Nishimoto and Tabei's move structure, alleviate memory constraints but focus primarily on backward search for finding maximal exact matches (MEMs). Arakawa et al.'s br-index was the first to support complete approximate pattern matching using bidirectional search in run-length compressed space, but it incurs significant computational overhead due to complex memory access patterns. We introduce b-move, a novel bidirectional extension of the move structure, enabling fast, cache-efficient bidirectional character extensions in run-length compressed space. It achieves bidirectional character extensions up to 8 times faster than the br-index, closing the performance gap with FM-index-based alternatives, while maintaining the br-index's favorable memory characteristics. For example, all available complete E. coli genomes on NCBI's RefSeq collection can be compiled into a b-move index that fits into the RAM of a typical laptop. Thus, b-move proves practical and scalable for pan-genome indexing and querying. We provide a C++ implementation of b-move, supporting efficient lossless approximate pattern matching including locate functionality, available at https://github.com/biointec/b-move under the AGPL-3.0 license.
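
To illustrate why run-length compressed indexes suit pan-genomes, the following minimal sketch counts the runs in the Burrows-Wheeler transform (BWT) of a toy collection of near-identical sequences. It is not taken from the b-move code base: the function names (`bwt`, `count_runs`, `mutate`), the toy sequence, and the naive rotation-sorting construction are all assumptions chosen for illustration, and the construction is far too slow for real genomes.

```python
def bwt(text: str) -> str:
    """Naive BWT via sorted rotations (illustrative only, quadratic memory).

    Assumes the sentinel '$' does not occur in `text` and sorts before
    every other character, which holds for uppercase DNA letters in ASCII.
    """
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(s: str) -> int:
    """Number of maximal runs of equal characters in s."""
    if not s:
        return 0
    return 1 + sum(1 for a, b in zip(s, s[1:]) if a != b)


def mutate(seq: str, pos: int, base: str) -> str:
    """Return seq with the character at `pos` replaced by `base`."""
    return seq[:pos] + base + seq[pos + 1:]


# Hypothetical toy "genome" and a collection of 20 near-identical copies.
genome = "ACGTTGCAAGCTTAGC"
variants = [mutate(genome, 3, "A"), mutate(genome, 10, "G")]
collection = genome * 18 + "".join(variants)

single = bwt(genome)
pan = bwt(collection)

# Compare text length n with the number of BWT runs r for both inputs.
print("single genome: n =", len(single), " r =", count_runs(single))
print("collection:    n =", len(pan), " r =", count_runs(pan))
```

The memory footprint of run-length compressed indexes such as the r-index and the move structure grows with the number of runs r rather than with the text length n, so adding further near-identical genomes to a collection tends to enlarge the index only modestly.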
