Comparison of computational methods for 3D genome analysis at single-cell Hi-C level

Xiao Li, Ziyang An, Zhihua Zhang
DOI: https://doi.org/10.1016/j.ymeth.2019.08.005
IF: 4.647
2020-10-01
Methods
Abstract: Hi-C is a high-throughput chromosome conformation capture technology that is becoming routine in the literature. Although the price of sequencing has been dropping dramatically, high-resolution Hi-C data are not always an option for many studies, such as in single cells. However, the performance of current computational methods based on Hi-C at the ultra-sparse data condition has yet to be fully assessed. Therefore, in this paper, after briefly surveying the primary computational methods for Hi-C data analysis, we assess the performance of representative methods on data normalization, identification of compartments, Topologically Associating Domains (TADs) and chromatin loops under the condition of ultra-low resolution. We showed that most state-of-the-art methods do not work properly for that condition. Then, we applied the three best-performing methods on real single-cell Hi-C data, and their performance indicates that compartments may be a statistical feature emerging from the cell population, while TADs and chromatin loops may dynamically exist in single cells.
biochemistry & molecular biology, biochemical research methods
What problem does this paper attempt to address?
The core problem this paper addresses is how to evaluate and compare the performance of computational methods on low-resolution Hi-C data, especially at the single-cell level. Specifically, the research focuses on the following aspects:

1. **Data Normalization**: The paper evaluates current mainstream Hi-C data normalization methods (such as ICE and HiCNorm) on ultra-sparse data. The results show that ICE is very sensitive to data resolution and may fail on single-cell Hi-C data, while HiCNorm is stable but computationally slow, which makes it hard to apply to large-scale single-cell studies.
2. **Compartment Identification**: The research evaluates three compartment identification methods (Juicer, CscoreTool and GeSICA) on down-sampled data. The results show that none of them works properly at the sparsity of single-cell Hi-C data, although CscoreTool is relatively more stable.
3. **Topologically Associating Domain (TAD) Detection**: The authors select several representative TAD detection algorithms (such as IS and deDoc) for evaluation. The results show that under ultra-low resolution, IS and deDoc perform better than the other methods, but at the single-cell level all methods may fail.
4. **Chromatin Loop Detection**: The paper tests six representative chromatin loop detection tools (such as HiCCUPS and diffHiC). Most methods show an exponential performance decline as the amount of data is reduced, and only fastHiC retains some functionality on single-cell Hi-C data.
5. **Actual Performance in Single-Cell Hi-C Data Analysis**: Finally, the authors apply the better-performing IS and fastHiC to real single-cell Hi-C data. The results show that:
   - Compartments are difficult to identify clearly in single-cell data.
   - TADs can be meaningfully identified in single cells, and they differ between cells.
   - Chromatin loops are almost invisible in single-cell data, and even the best-performing fastHiC does not significantly outperform a simple baseline predictor.

### Summary

The main objective of the paper is to reveal the limitations of existing computational methods on low-resolution (especially single-cell) Hi-C data, and to provide a reference for developing new methods suited to single-cell Hi-C data analysis. The research shows that although some methods retain a degree of performance under ultra-low resolution, their applicability at the single-cell level is still limited. This suggests that algorithms need to be improved further, or new methods developed, to better analyze three-dimensional genome structure at the single-cell level.

Formula Summary:

- **Adjusted Mutual Information (AMI)**:

  $$ AMI(T, K)=\frac{MI(T, K)-E\{MI(T, K)\}}{\max\{H(T), H(K)\}-E\{MI(T, K)\}} $$

  where $H(\cdot)$ is the entropy and $MI(T, K)$ is the mutual information, defined as:

  $$ MI(T, K)=\sum_{i = 1}^{n}\sum_{j = 1}^{m}P(i, j)\log\left(\frac{P(i, j)}{P(i)P'(j)}\right) $$

  with $P(i)=\frac{|T_i|}{N}$, $P'(j)=\frac{|K_j|}{N}$ and $P(i, j)=\frac{|T_i\cap K_j|}{N}$. Here $T_i$ and $K_j$ denote the $i$-th and $j$-th sets of the two partitions $T$ and $K$ being compared, and $N$ is the total number of elements.

- **Weight Similarity (WS)**:

  $$ WS(T, K)=\frac{\sum_{j = 1}^{m}S_{TK}(j)\cdot|K_j|}{\sum_{j = 1}^{m}|K_j|} $$

  where $S_{TK}(j)=\max_{i = 1}^{n}\left\{\frac{|T_i\cap K_j|}{|T_i|\cdot|K_j|}\right\}$.
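
To make the AMI definition above concrete, here is a minimal pure-Python sketch. It is not the paper's implementation. It assumes that two partitions are given as per-element label lists of equal length (for example, a TAD label per genomic bin), that logarithms are natural, and that $E\{MI\}$ is computed under the usual permutation (hypergeometric) model, which the summary does not spell out. The function names are hypothetical.

```python
from collections import Counter
from math import exp, lgamma, log


def entropy(labels):
    """Shannon entropy H of a labeling (natural log)."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(t, k):
    """MI(T, K) = sum_ij P(i,j) * log(P(i,j) / (P(i) * P'(j)))."""
    n = len(t)
    size_t, size_k = Counter(t), Counter(k)
    joint = Counter(zip(t, k))  # |T_i ∩ K_j| for every non-empty overlap
    return sum(
        (nij / n) * log(nij * n / (size_t[i] * size_k[j]))
        for (i, j), nij in joint.items()
    )


def expected_mi(t, k):
    """E{MI} under the permutation (hypergeometric) model.

    Assumption: the summary does not define E{MI}; this is the standard
    choice for AMI, not necessarily the one used in the paper.
    """
    n = len(t)
    total = 0.0
    for a in Counter(t).values():          # a = |T_i|
        for b in Counter(k).values():      # b = |K_j|
            for nij in range(max(1, a + b - n), min(a, b) + 1):
                # log of the hypergeometric probability of overlap nij
                log_prob = (
                    lgamma(a + 1) + lgamma(b + 1)
                    + lgamma(n - a + 1) + lgamma(n - b + 1)
                    - lgamma(n + 1) - lgamma(nij + 1)
                    - lgamma(a - nij + 1) - lgamma(b - nij + 1)
                    - lgamma(n - a - b + nij + 1)
                )
                total += (nij / n) * log(n * nij / (a * b)) * exp(log_prob)
    return total


def ami(t, k):
    """AMI(T, K) with max{H(T), H(K)} normalization, as in the formula."""
    mi, emi = mutual_information(t, k), expected_mi(t, k)
    denom = max(entropy(t), entropy(k)) - emi
    if abs(denom) < 1e-12:
        return 1.0  # degenerate case (e.g. one cluster each); convention assumed
    return (mi - emi) / denom
```

With this sketch, `ami(t, t)` is 1 for any non-trivial labeling `t`, and `ami([0, 0, 1, 1], [0, 1, 0, 1])` is about -0.5, since the two labelings share no information and the chance-corrected score falls below zero.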