Identifying topologically associating domains using differential kernels

Luka Maisuradze, Megan C. King, Ivan V. Surovtsev, Simon G. J. Mochrie, Mark D. Shattuck, Corey S. O'Hern
2023-12-22
Abstract: Chromatin is a polymer complex of DNA and proteins that regulates gene expression. The three-dimensional structure and organization of chromatin controls DNA transcription and replication. High-throughput chromatin conformation capture techniques generate Hi-C maps that can provide insight into the 3D structure of chromatin. Hi-C maps can be represented as a symmetric matrix where each element represents the average contact probability or number of contacts between two chromatin loci. Previous studies have detected topologically associating domains (TADs), or self-interacting regions in Hi-C maps within which the contact probability is greater than that outside the region. Many algorithms have been developed to identify TADs within Hi-C maps. However, most TAD identification algorithms are unable to identify nested or overlapping TADs and for a given Hi-C map there is significant variation in the location and number of TADs identified by different methods. We develop a novel method, KerTAD, using a kernel-based technique from computer vision and image processing that is able to accurately identify nested and overlapping TADs. We benchmark this method against state-of-the-art TAD identification methods on both synthetic and experimental data sets. We find that KerTAD consistently has higher true positive rates (TPR) and lower false discovery rates (FDR) than all tested methods for both synthetic and manually annotated experimental Hi-C maps. The TPR for KerTAD is also largely insensitive to increasing noise and sparsity, in contrast to the other methods. We also find that KerTAD is consistent in the number and size of TADs identified across replicate experimental Hi-C maps for several organisms. KerTAD will improve automated TAD identification and enable researchers to better correlate changes in TADs to biological phenomena, such as enhancer-promoter interactions and disease states.
Genomics
What problem does this paper attempt to address?
This paper addresses the problem of accurately identifying topologically associating domains (TADs) in the contact maps produced by high-throughput chromatin conformation capture techniques such as Hi-C. TADs are regions of the genome that interact with themselves more than with neighboring regions, and they play a crucial role in regulating gene expression and cell function. Most existing TAD identification algorithms have the following problems:

1. **Inability to identify nested or overlapping TADs**: Many existing algorithms can identify only non-nested, non-overlapping TADs, which limits their ability to analyze complex chromatin structures.
2. **Inconsistent results**: For the same Hi-C map, the locations and numbers of TADs identified by different methods vary significantly.
3. **Sensitivity to noise and sparsity**: Noise and sparsity in experimental data degrade the accuracy of TAD identification, and existing methods are not sufficiently robust to them.

To overcome these problems, the authors developed a new TAD identification algorithm, KerTAD. It uses a kernel-based technique from computer vision and image processing to identify nested and overlapping TADs more accurately, and it remains accurate under noisy and sparse conditions. The paper benchmarks KerTAD against existing methods on synthetic and experimental data and reports that it outperforms them.

### Main contributions

- **Proposed a new TAD identification algorithm, KerTAD**, which can identify nested and overlapping TADs.
- **Conducted extensive tests on synthetic data and manually annotated experimental data**, showing that KerTAD has a higher true positive rate (TPR) and a lower false discovery rate (FDR) than the other tested methods.
- **Showed the stability and consistency of KerTAD across biological replicate experiments**, which helps researchers relate changes in TADs to gene regulation and disease states.

### Method overview

The main steps of the KerTAD algorithm are:

1. **Preprocessing**: Normalize the input Hi-C matrix and apply total variation regularization to reduce noise and fluctuations while preserving edge features.
2. **Feature extraction**: Calculate the discrete partial derivatives and construct a similarity matrix to generate two masks, which extract point features and corner-area features, respectively.
3. **Final mask generation**: Multiply the two masks element-wise to obtain the final binary mask, in which each non-zero element represents a predicted TAD corner.
4. **Output TAD boundaries**: Convert the final mask into a two-column list in which each row gives the start and end indices of a TAD.

### Results

- **Performance on synthetic data**: KerTAD shows a high TPR and a low FDR on both simple and complex synthetic Hi-C maps.
- **Performance on experimental data**: KerTAD also shows a high TPR on the manually annotated experimental Hi-C maps, and its TPR is largely insensitive to increasing noise and sparsity.
- **Consistency across biological replicate experiments**: The number and size of TADs identified by KerTAD are consistent across replicate experimental Hi-C maps.

With these improvements, KerTAD is expected to improve automated TAD identification and help researchers better understand the role of TADs in biological processes.
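
To make the four steps in the method overview concrete, here is a minimal Python sketch of corner detection on a noise-free synthetic map. It is not the KerTAD implementation: it only normalizes the map (total variation regularization is omitted), and two thresholded finite-difference masks stand in for the point-feature and corner-area masks, with no similarity matrix. The function names, the `threshold` value, and the exact-match scoring in `tpr_fdr` are all assumptions made for illustration.

```python
import numpy as np


def make_synthetic_map(n, tads, background=0.1, boost=1.0):
    """Build a noise-free symmetric contact map; each (start, end) is inclusive."""
    hic = np.full((n, n), background, dtype=float)
    for start, end in tads:
        hic[start:end + 1, start:end + 1] += boost  # nested TADs stack up
    return hic


def find_tad_corners(hic, threshold=0.25):
    """Return (start, end) pairs where a block's top edge meets its right edge."""
    m = np.asarray(hic, dtype=float)
    if m.max() <= 0:
        return []
    m = m / m.max()                         # step 1, simplified: normalize only
    padded = np.pad(m, 1, mode="constant")  # zero border so edge bins have neighbors
    d_up = m - padded[:-2, 1:-1]            # m[i, j] - m[i-1, j]
    d_right = m - padded[1:-1, 2:]          # m[i, j] - m[i, j+1]
    # Step 2, simplified: two binary masks from the discrete partial derivatives.
    top_mask = (d_up > threshold).astype(np.uint8)
    right_mask = (d_right > threshold).astype(np.uint8)
    # Step 3: element-wise product, restricted to the upper triangle.
    final_mask = np.triu(top_mask * right_mask)
    # Step 4: two-column list of start and end indices.
    return [(int(i), int(j)) for i, j in np.argwhere(final_mask)]


def tpr_fdr(predicted, truth):
    """TPR = TP / (TP + FN), FDR = FP / (FP + TP), with exact-match scoring (assumed)."""
    pred, true = set(predicted), set(truth)
    tp = len(pred & true)
    tpr = tp / len(true) if true else 0.0
    fdr = (len(pred) - tp) / len(pred) if pred else 0.0
    return tpr, fdr


true_tads = [(2, 9), (4, 7), (12, 17)]      # (4, 7) is nested inside (2, 9)
hic = make_synthetic_map(20, true_tads)
predicted = find_tad_corners(hic)
print(predicted)                            # [(2, 9), (4, 7), (12, 17)]
print(tpr_fdr(predicted, true_tads))        # (1.0, 0.0)
```

Because the nested TAD produces its own step in contact intensity inside the outer block, its corner is picked up by the same two masks. On real Hi-C data, noise and sparsity would produce many spurious derivative peaks, which is why the preprocessing step described above matters and why a plain threshold like this one would not be enough.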