Computational method for single cell ATAC-seq imputation and dimensionality reduction

Li, Zhijian; Berlage, Thomas (Thesis advisor); Filho, Ivan Gesteira Costa (Thesis advisor); Schaub, Michael Thomas (Thesis advisor)

Aachen : RWTH Aachen University (2022)
Dissertation / PhD Thesis

Dissertation, RWTH Aachen University, 2022


Chromatin accessibility, or the physical access to chromatinized DNA, plays an essential role in controlling the temporal and spatial expression of genes in eukaryotic cells. Assay for transposase-accessible chromatin followed by high-throughput sequencing (ATAC-seq) is a sensitive and straightforward protocol for profiling chromatin accessibility genome-wide. Combined with single-cell sequencing technology, single-cell ATAC-seq (scATAC-seq) can map regulatory variation across hundreds to thousands of cells at single-cell resolution, further expanding its applications. However, a major drawback of scATAC-seq data is its inherent sparsity: many open chromatin regions are not detected due to low input or loss of DNA material in the scATAC-seq experiment, leaving a large number of missing values in the derived count matrix. This phenomenon is known as "drop-outs" and is also observed in other single-cell sequencing data, such as scRNA-seq. Although many computational methods based on data imputation or denoising have been proposed to address this issue for scRNA-seq, few efforts have assessed the usability of these methods on scATAC-seq data. Moreover, the development of algorithms specifically designed for imputing or denoising scATAC-seq data is still poorly explored.

Another critical issue when dealing with the scATAC-seq matrix is its high dimensionality. Because a gene is often regulated by multiple cis-regulatory elements (CREs), the number of features in scATAC-seq (i.e., peaks) is usually one order of magnitude higher than the number of features in scRNA-seq (i.e., genes). This high dimensionality poses a challenge for analyses of scATAC-seq data such as clustering and visualization. It is therefore common to perform dimensionality reduction before interpreting the data. However, the standard computational methods for scRNA-seq data are potentially unsuitable for this task due to the low-count information of scATAC-seq data, i.e., a maximum of two digestion events is expected for an individual cell in a specific open chromatin region.

In this thesis, we propose scOpen, a computational approach for simultaneous quantification of single-cell open chromatin status and reduction of dimensionality, to address the aforementioned issues in scATAC-seq data analysis. More formally, scOpen performs imputation and denoising of a scATAC-seq matrix via regularized non-negative matrix factorization (NMF) based on a term frequency-inverse document frequency (TF-IDF) transformation. We show that scOpen improves several crucial downstream analysis steps of scATAC-seq data, such as clustering, visualization, inference of cis-regulatory DNA interactions, and delineation of regulatory features. Moreover, we demonstrate its power to dissect chromatin accessibility dynamics on large-scale scATAC-seq data from intact mouse kidney tissue. Finally, we perform additional analyses to investigate the regulatory programs that drive the development of kidney fibrosis. Our analyses shed new light on the mechanisms of myofibroblast differentiation driving kidney fibrosis and chronic kidney disease (CKD). Altogether, these results demonstrate that scOpen is a useful computational approach for biological studies involving single-cell open chromatin data processing.
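
To make the idea of TF-IDF transformation followed by regularized NMF concrete, the following is a minimal sketch in Python using only NumPy. It is not the scOpen implementation: the function names `tfidf` and `regularized_nmf`, the particular TF-IDF variant, the L2 penalty with weight `lam`, the multiplicative update rule, and the simulated matrix are all assumptions chosen for illustration. The product `W @ H` plays the role of the imputed (dense) matrix, and the columns of `H` give a low-dimensional representation of the cells.

```python
import numpy as np


def tfidf(counts):
    """TF-IDF transform of a peaks x cells count matrix (one common variant)."""
    # Binarize: scATAC-seq counts per peak and cell are very low anyway.
    binary = (counts > 0).astype(float)
    # Term frequency: scale each cell (column) by its number of open peaks.
    tf = binary / np.maximum(binary.sum(axis=0, keepdims=True), 1.0)
    # Inverse document frequency: peaks open in few cells get more weight.
    n_cells = binary.shape[1]
    peak_totals = np.maximum(binary.sum(axis=1, keepdims=True), 1.0)
    idf = np.log(1.0 + n_cells / peak_totals)
    return tf * idf


def regularized_nmf(X, rank, lam=0.01, n_iter=200, seed=0):
    """Approximate X by W @ H with W, H >= 0 and an L2 penalty on both factors.

    Uses multiplicative updates for the objective
    ||X - W H||_F^2 + lam * (||W||_F^2 + ||H||_F^2).
    (Assumed solver for illustration; the thesis may use a different one.)
    """
    rng = np.random.default_rng(seed)
    n_peaks, n_cells = X.shape
    W = rng.random((n_peaks, rank))
    H = rng.random((rank, n_cells))
    eps = 1e-10  # avoids division by zero
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + lam * H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + lam * W + eps)
    return W, H


# Simulated sparse binary accessibility matrix: 200 peaks x 50 cells.
rng = np.random.default_rng(42)
counts = (rng.random((200, 50)) < 0.05).astype(int)

X = tfidf(counts)
W, H = regularized_nmf(X, rank=10)

imputed = W @ H      # dense peaks x cells matrix (imputation / denoising)
embedding = H.T      # cells x rank matrix (dimensionality reduction)
```

In this sketch the reduced matrix `embedding` could be passed to a clustering or visualization method, while `imputed` replaces the sparse input for peak-level analyses; the rank and `lam` would need to be tuned on real data.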


  • Department of Computer Science [120000]
  • Teaching and Research Area (vacant) of Life Science Informatics (Visual Knowledge Management) Teaching and Research Area [122620]