Toward a generic feature set defined by consensus peaks as a consistent reference for ATAC-seq data

Qiuchen Meng, Xinze Wu, Yubo Zhao, Wenchang Chen, Chen Li, Zheng Wei, Jiaqi Li, Xi Xi, Sijie Chen, Catherine Zhang, Shengquan Chen, Jiaqi Li, Xiaowo Wang, Rui Jiang, Lei Wei, Xuegong Zhang
DOI: https://doi.org/10.1101/2023.05.30.542889
2024-07-03
Abstract: The rapid advancement of transposase-accessible chromatin using sequencing (ATAC-seq) technology, particularly with the emergence of single-cell ATAC-seq (scATAC-seq), has accelerated studies of regulatory element identification and demands higher precision and uniformity in feature definition. Unlike for gene expression data, no consistent feature reference has been developed for ATAC-seq data, which hinders single-cell-level data analysis and cell atlas creation. Based on a systematic analysis of 1,785 ATAC-seq and 231 scATAC-seq datasets, we found that cells share the same feature set, represented by potential open regions (PORs) on the genome. We proposed a unified reference called consensus peaks (cPeaks) to represent PORs across all observed cell types, and developed a deep-learning model to predict cPeaks unseen in the collected data. The observed and predicted cPeaks defined a generic feature set in the human genome, which can serve as a reference to which all ATAC-seq data can be aligned. Experiments showed that using this reference to integrate scATAC-seq data can improve cell annotation and facilitate the discovery of rare cell types. cPeaks also performed well in establishing cell atlases and in analyzing cells in dynamic or disease states.
Bioinformatics
What problem does this paper attempt to address?