Single-cell multi-omics data integration powered by PCA-like autoencoders
- Publication Date: Jun 25, 2024
- Source: HAL
- Language: English
- License: Unknown
Abstract
Single-cell methodologies are recognized for their capacity to elucidate tissue heterogeneity at a detailed level by profiling the characteristics of each individual cell. First applied to RNA sequencing, single-cell profiling provides critical information for understanding the behavior of cells and the cellular microenvironment. It has evolved rapidly and now offers new opportunities across a range of fields, from genomics to metabolomics, collectively known as 'omics'. Each omics layer provides different and complementary information about the cell. In transcriptomics, single-cell RNA sequencing (scRNAseq) gives access to the expression patterns of individual genes, while in genomics, single-cell Assay for Transposase-Accessible Chromatin sequencing (scATACseq) provides insights into chromatin accessibility, identifying regions of the DNA that can be expressed and co-expressed. Capturing all the characteristics of a tissue requires integrating multiple datasets from different omics. Horizontal integration of datasets, using methods such as ComBat, KNN, and Harmony, and vertical integration, using algorithms such as MOFA+ and Signac, have both proven effective in multi-omics analysis. To capture sufficient data for comprehensive multi-omics studies, horizontal and vertical integration must be employed concurrently. When some data are missing, this combined approach is referred to as mosaic integration. Our rationale is that features from different datasets share a common informational element: the cell type. This is why we focus our analysis on genes, whereas most single-cell studies concentrate on cells. We reduce the dimensionality of the features in each dataset while preserving as much information as possible, as PCA does. This reduced-dimension representation is referred to as a latent space. Our objective is to create a latent space shared by all omics using a neural network known as an autoencoder.
An autoencoder is a neural network consisting of two components: an encoder that reduces the dimensionality of the features and a decoder that attempts to reconstruct the original data. This architecture creates a latent space that can be constrained using loss functions. We first use a linear function to merge all cells from a dataset into a reduced number of coarse cells (currently 1024). The coarse cells allow us to retain as much information as possible from large datasets while keeping the number of features manageable for the autoencoder. A standard autoencoder was not sufficient to detect the involvement of an observation in a cell type. We therefore use the PCAAE (PCA-like autoencoder) approach, made of stacked encoders and one decoder, which mimics the behavior of PCA while keeping the advantages of non-linearity. Each encoder computes one dimension of the latent space. We add two objectives during training: the dimensions of the latent space must be independent of one another, and they must be trained one after the other, so that the first dimension learns the big picture of the dataset whereas the last dimensions focus on details.
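The two training objectives described above (one latent dimension per encoder, trained in sequence, with each new dimension kept decorrelated from the earlier ones) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not specify the linear function, the layers, or the losses. The random projection in `coarse_cells`, the single-layer tanh encoders, the linear decoder re-initialized at every stage, the squared-covariance penalty, and all names and hyperparameters (`train_pcaae`, `lam`, `lr`, 16 coarse cells instead of 1024) are assumptions made for illustration.

```python
import numpy as np

def coarse_cells(X, n_coarse, rng):
    """Merge the cell axis of X (genes x cells) into n_coarse columns.

    Assumption: a fixed random linear map stands in for the linear function.
    """
    P = rng.standard_normal((X.shape[1], n_coarse)) / np.sqrt(X.shape[1])
    return X @ P

def train_pcaae(X, n_latent=3, epochs=500, lr=0.05, lam=1.0, seed=0):
    """Train one tanh encoder per latent dimension, one stage at a time."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    encoders = []                      # frozen (w, b) pairs from earlier stages
    Z_prev = np.zeros((n, 0))          # latent codes from earlier stages
    for k in range(n_latent):
        w, b = 0.01 * rng.standard_normal(d), 0.0
        # Assumption: the linear decoder is restarted at each stage.
        D = 0.01 * rng.standard_normal((k + 1, d))
        c = np.zeros(d)
        Zc_prev = Z_prev - Z_prev.mean(axis=0)
        for _ in range(epochs):
            z = np.tanh(X @ w + b)                 # new latent dimension
            Z = np.column_stack([Z_prev, z])       # earlier codes stay frozen
            R = Z @ D + c - X                      # reconstruction residual
            gR = 2.0 * R / R.size                  # gradient of the mean squared error
            gz = gR @ D[k]
            if k > 0:
                # Penalty lam * sum_j cov(z, z_j)^2 discourages correlation
                # between the new dimension and the earlier ones.
                cov = Zc_prev.T @ (z - z.mean()) / n
                gz = gz + 2.0 * lam * (Zc_prev @ cov) / n
            ga = gz * (1.0 - z ** 2)               # backpropagate through tanh
            D -= lr * (Z.T @ gR)
            c -= lr * gR.sum(axis=0)
            w -= lr * (X.T @ ga)
            b -= lr * ga.sum()
        encoders.append((w, b))
        Z_prev = np.column_stack([Z_prev, np.tanh(X @ w + b)])
    return encoders, (D, c), Z_prev

# Toy usage: 200 genes observed in 500 cells (synthetic counts, not real data).
rng = np.random.default_rng(0)
genes_by_cells = rng.poisson(2.0, size=(200, 500)).astype(float)
X = coarse_cells(np.log1p(genes_by_cells), n_coarse=16, rng=rng)
X = X - X.mean(axis=0)
encoders, decoder, Z = train_pcaae(X, n_latent=3)
print(Z.shape)  # (200, 3)
```

Each row of `Z` is the latent representation of one gene. Because earlier encoders are frozen when a new one is trained, the first dimension is fitted without competition and tends to capture the dominant structure, while later dimensions are left with what remains.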