Toward a generic feature set defined by consensus peaks as a consistent reference for ATAC-seq data

Qiuchen Meng, Xinze Wu, Yubo Zhao, Wenchang Chen, Chen Li, Zheng Wei, Jiaqi Li, Xi Xi, Sijie Chen, Catherine Zhang, Shengquan Chen, Jiaqi Li, Xiaowo Wang, Rui Jiang, Lei Wei, Xuegong Zhang
DOI: https://doi.org/10.1101/2023.05.30.542889
2024-07-03
Abstract: The rapid advancement of transposase-accessible chromatin using sequencing (ATAC-seq) technology, particularly with the emergence of single-cell ATAC-seq (scATAC-seq), has accelerated the studies of regulatory element identification, demanding higher precision and uniformity in feature definition. Unlike gene expression data, no consistent feature reference is developed for ATAC-seq data, which hinders single-cell level data analysis and cell atlas creation. Based on a systematic analysis of 1,785 ATAC-seq and 231 scATAC-seq datasets, we found that cells share the same feature set represented by potential open regions (PORs) on the genome. We proposed a unified reference called consensus peaks (cPeaks) to represent PORs across all observed cell types, and developed a deep-learning model to predict cPeaks unseen in the collected data. The observed and predicted cPeaks defined a generic feature set in the human genome, which can be used as a reference for all ATAC-seq data to align to. Experiments showed that using this reference to integrate scATAC-seq data can improve cell annotation and facilitate the discovery of rare cell types. cPeaks also performed well in establishing cell atlas, and analyzing cells in dynamic or disease states.
Bioinformatics
What problem does this paper attempt to address?
This paper aims to establish a unified feature reference for ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing) data, in order to improve the consistency and precision of single-cell-level data analysis and cell atlas creation. Specifically, the researchers aim to:

1. **Define a consistent feature reference**: Unlike gene expression data, ATAC-seq data lack a unified feature reference, which hinders single-cell-level data analysis and cell atlas construction. To address this, the researchers propose a universal feature set based on consensus peaks (cPeaks).
2. **Improve the precision of single-cell ATAC-seq data analysis**: Using cPeaks as a reference allows single-cell ATAC-seq data to be integrated more consistently, which improves cell annotation and helps reveal rare cell types.
3. **Promote the study of complex biological states**: cPeaks performs well when analyzing complex biological states (such as cell differentiation, organ development, and disease states), giving researchers a more reliable reference for studying these processes in greater depth.
4. **Construct a large-scale human chromatin accessibility atlas**: As a universal feature set, cPeaks can be used to build a large-scale human chromatin accessibility atlas that serves as a basic resource for subsequent research.

### Main methods and techniques

- **Data collection and processing**: The researchers systematically analyzed 1,785 ATAC-seq and 231 single-cell ATAC-seq datasets, extracted potential open regions (PORs) from them, and defined consensus peaks (cPeaks).
- **Deep-learning model**: To predict cPeaks not found in the existing data, the researchers trained a deep-learning model that predicts new cPeaks from sequence features.
- **Verification and application**: Through a series of experiments, the researchers verified the performance of cPeaks in downstream analysis, including cell annotation, rare-cell-type identification, and analysis of complex biological states.

### Conclusion

Using these methods, the researchers defined a universal feature set containing approximately 1.4M observed cPeaks and approximately 0.2M predicted cPeaks. The feature set is intended to serve as a reference across tissues and cell types, and in the reported experiments it improved data integration and biological interpretation in single-cell ATAC-seq analysis.
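
To make the idea of a consensus peak more concrete, the following is a minimal Python sketch of how peaks called in different datasets might be combined into shared regions. It is only an illustration under simplifying assumptions: it takes the plain union of overlapping intervals, which is not necessarily the procedure the authors used to define cPeaks. The function name `merge_peaks`, the tuple layout, and the toy coordinates are all hypothetical.

```python
from collections import defaultdict


def merge_peaks(peak_sets):
    """Union overlapping peaks from several datasets into merged regions.

    peak_sets: iterable of lists of (chrom, start, end) half-open intervals.
    Returns a sorted list of (chrom, start, end, n_merged_peaks).
    NOTE: a simple interval union, assumed here for illustration only.
    """
    # Pool the peaks of all datasets, grouped by chromosome.
    by_chrom = defaultdict(list)
    for peaks in peak_sets:
        for chrom, start, end in peaks:
            by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        intervals = sorted(by_chrom[chrom])
        cur_start, cur_end = intervals[0]
        n_merged = 1
        for start, end in intervals[1:]:
            if start < cur_end:
                # Overlaps the current region: extend it.
                cur_end = max(cur_end, end)
                n_merged += 1
            else:
                # Gap found: close the current region and start a new one.
                merged.append((chrom, cur_start, cur_end, n_merged))
                cur_start, cur_end, n_merged = start, end, 1
        merged.append((chrom, cur_start, cur_end, n_merged))
    return merged


# Toy input: two hypothetical datasets with made-up coordinates.
dataset_a = [("chr1", 100, 200), ("chr1", 500, 600)]
dataset_b = [("chr1", 150, 250), ("chr2", 10, 50)]
consensus = merge_peaks([dataset_a, dataset_b])
print(consensus)
```

In this toy input, the two overlapping `chr1` peaks collapse into one region supported by two peaks, while the remaining peaks stay separate. A fixed set of such regions can then play the role that a gene list plays for expression data: every dataset is counted against the same features.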