Big data and deep learning for RNA biology

Hyeonseo Hwang, Hyeonseong Jeon, Nagyeong Yeo, Daehyun Baek
DOI: https://doi.org/10.1038/s12276-024-01243-w
2024-06-15
Experimental & Molecular Medicine
Abstract: The exponential growth of big data in RNA biology (RB) has led to the development of deep learning (DL) models that have driven crucial discoveries. As constantly evidenced by DL studies in other fields, the successful implementation of DL in RB depends heavily on the effective utilization of large-scale datasets from public databases. In achieving this goal, data encoding methods, learning algorithms, and techniques that align well with biological domain knowledge have played pivotal roles. In this review, we provide guiding principles for applying these DL concepts to various problems in RB by demonstrating successful examples and associated methodologies. We also discuss the remaining challenges in developing DL models for RB and suggest strategies to overcome these challenges. Overall, this review aims to illuminate the compelling potential of DL for RB and ways to apply this powerful technology to investigate the intriguing biology of RNA more effectively.
biochemistry & molecular biology; medicine, research & experimental
What problem does this paper attempt to address?
The main problem this paper addresses is how to use big data and deep learning techniques to advance research in RNA biology (RB). Specifically, the paper focuses on the following aspects:

1. **Data Acquisition and Processing**:
   - How to acquire and process large-scale RNA biology data from public databases to construct datasets suitable for training deep learning models.
   - Public databases such as GEO, SRA, and ENCODE provide abundant RNA-related data, but these data suffer from problems such as incomplete metadata and inconsistent formats, which raises the question of how to filter, label, and normalize them effectively.
2. **Application of Deep Learning Models**:
   - The application of deep learning in RNA biology, including supervised learning, self-supervised learning, domain adaptation, meta-learning, and data augmentation.
   - For example, supervised learning is used for tasks such as miRNA target prediction and gene expression prediction; self-supervised learning acquires biological background knowledge from large amounts of unlabeled data; domain adaptation is used for transfer learning across data domains.
3. **Choice of Encoding Methods**:
   - How to encode RNA biology data (such as nucleic acid sequences) into a form suitable as input to deep learning models.
   - Encoding methods include one-hot encoding, k-mer sliding-window encoding, and natural language processing techniques such as word2vec, which are used to capture complex features in the sequence.
4. **Challenges and Solutions**:
   - The challenges of applying deep learning to RNA biology, such as the lack of architectures optimized for biological data and tasks, and the difficulty of understanding the causal relationships behind model predictions.
   - Proposed coping strategies include choosing appropriate training methods, improving data quality, and incorporating biological domain knowledge.

In summary, this paper aims to explore how to make full use of big data and deep learning techniques, overcome existing challenges, and study RNA-related biological processes more effectively.
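To make the encoding methods named under item 3 more concrete, the following is a minimal Python sketch of one-hot encoding and k-mer sliding-window tokenization for an RNA sequence. It is an illustration only, not code from the paper: the function names (`one_hot_encode`, `kmer_tokens`, `kmer_vocabulary`, `kmer_ids`), the `ACGU` alphabet ordering, and the choice to encode unknown symbols as all-zero rows are assumptions made here.

```python
from itertools import product

# Assumed alphabet and ordering for this sketch; not prescribed by the paper.
RNA_ALPHABET = "ACGU"


def one_hot_encode(seq, alphabet=RNA_ALPHABET):
    """Return one row per nucleotide, with a single 1 at that base's index."""
    index = {base: i for i, base in enumerate(alphabet)}
    rows = []
    for base in seq.upper():
        row = [0] * len(alphabet)
        # Assumption: symbols outside the alphabet (e.g. N) stay all-zero.
        if base in index:
            row[index[base]] = 1
        rows.append(row)
    return rows


def kmer_tokens(seq, k=3, stride=1):
    """Slide a window of width k over the sequence and collect each k-mer."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def kmer_vocabulary(k=3, alphabet=RNA_ALPHABET):
    """Map every possible k-mer over the alphabet to an integer id."""
    return {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}


def kmer_ids(seq, k=3, stride=1):
    """Convert a sequence to k-mer ids, skipping k-mers with unknown symbols."""
    vocab = kmer_vocabulary(k)
    return [vocab[t] for t in kmer_tokens(seq, k, stride) if t in vocab]


if __name__ == "__main__":
    print(one_hot_encode("ACGU"))
    print(kmer_tokens("AUGGC", k=3))
    print(kmer_ids("AUGGC", k=3))
```

The integer ids produced by `kmer_ids` are the kind of token sequence that an embedding layer or a word2vec-style model would take as input; one-hot rows, by contrast, are typically fed directly to convolutional or recurrent layers.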