Unsupervised representation learning for single-cell transcriptomic and epigenomic data
- Publication Date: Dec 07, 2022
- Source: HAL-Descartes
- Language: English
- License: Unknown
Abstract
In recent years, single-cell transcriptomics and epigenomics have allowed biologists to observe tissues at a new resolution. Using these protocols, we can now observe the whole distribution of cell states within a tissue, instead of measuring only an aggregate cell state. These new types of measurements have created a need for new statistical methods to analyze them. Indeed, the previous generation of analysis tools was designed for a regime of few high-quality samples, while these new measurements are much higher in quantity but of significantly lower quality. This problem of low quality is even more pronounced for single-cell epigenomics protocols, because each cell has only two copies of the genome, compared to the hundreds of thousands of RNA molecules present in the cell. Since epigenomic and transcriptomic profiles are evaluated across a high number of variables, there has been great interest in methods for reducing the dimension of the data.

This explosion of interest has led to numerous new algorithms and a thriving community of method developers. Their work has, however, not yet been fully adopted by practicing bioinformaticians, either because the methods were not deemed reliable enough or because they failed to properly answer biological questions. In this thesis, we measured how reliable these new methods are, as well as how they are affected by the steps preceding them. We found that recent deep learning methods fail to outperform linear methods on current datasets, for most modalities. We further found that, for epigenetic assays, the feature engineering steps were more important than the dimension reduction algorithm for obtaining a good representation of cells. Finally, we attempted to develop a novel algorithm to learn embeddings of epigenomic measurements in an end-to-end fashion, learning at once both the low-dimensional representation of the cells and the epigenomic annotation.