Predicting discovery rates of genomic features.

Authors: Gravel, Simon
Type: Published Article
Journal: Genetics
Publisher: The Genetics Society of America
Publication Date: Jun 01, 2014
Volume: 197
Issue: 2
Pages: 601–610
DOI: 10.1534/genetics.114.162149
PMID: 24637199
Source: Medline
License: Unknown

Abstract

Successful sequencing experiments require judicious sample selection. However, this selection must often be performed on the basis of limited preliminary data. Predicting the statistical properties of the final sample based on preliminary data can be challenging, because numerous uncertain model assumptions may be involved. Here, we ask whether we can predict "omics" variation across many samples by sequencing only a fraction of them. In the infinite-genome limit, we find that a pilot study sequencing 5% of a population is sufficient to predict the number of genetic variants in the entire population within 6% of the correct value, using an estimator agnostic to demography, selection, or population structure. To reach similar accuracy in a finite genome with millions of polymorphisms, the pilot study would require ∼15% of the population. We present computationally efficient jackknife and linear programming methods that exhibit substantially less bias than the state of the art when applied to simulated data and subsampled 1000 Genomes Project data. Extrapolating based on the National Heart, Lung, and Blood Institute Exome Sequencing Project data, we predict that 7.2% of sites in the capture region would be variable in a sample of 50,000 African Americans and 8.8% in a European sample of equal size. Finally, we show how the linear programming method can also predict discovery rates of various genomic features, such as the number of transcription factor binding sites across different cell types.
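To make the notion of a discovery rate concrete, here is a minimal sketch of the subsampling (rarefaction) calculation that underlies comparisons against subsampled data. It is not the jackknife or linear programming estimator from the article, which extrapolate beyond the sequenced sample; it only interpolates within a sample that has already been sequenced. The function name `expected_variants` and the dictionary layout of the site frequency spectrum are assumptions made for this illustration.

```python
from math import comb


def expected_variants(sfs, n_total, n_sub):
    """Expected number of variants discovered in a random subsample.

    sfs maps j -> number of sites whose variant allele is carried by
    exactly j of the n_total sequenced chromosomes (hypothetical layout).
    A variant counts as discovered if at least one carrier is drawn.
    """
    if not 0 <= n_sub <= n_total:
        raise ValueError("n_sub must lie between 0 and n_total")
    denom = comb(n_total, n_sub)
    total = 0.0
    for j, count in sfs.items():
        # Hypergeometric probability that none of the j carriers
        # is among the n_sub chromosomes drawn without replacement.
        p_missed = comb(n_total - j, n_sub) / denom
        total += count * (1.0 - p_missed)
    return total


if __name__ == "__main__":
    # Toy spectrum for 4 chromosomes: 10 singletons and 5 doubletons.
    toy_sfs = {1: 10, 2: 5}
    for n in range(5):
        print(n, expected_variants(toy_sfs, 4, n))
```

Plotting this quantity against the subsample size gives the observed part of the discovery curve. Predicting its continuation to sample sizes larger than `n_total` is the harder problem the abstract addresses.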
