nach0: Multimodal Natural and Chemical Languages Foundation Model

Micha Livne, Zulfat Miftahutdinov, Elena Tutubalina, Maksim Kuznetsov, Daniil Polykovskiy, Annika Brundyn, Aastha Jhunjhunwala, Anthony Costa, Alex Aliper, Alán Aspuru-Guzik, Alex Zhavoronkov
DOI: https://doi.org/10.1039/D4SC00966E
2024-05-02
Abstract: Large Language Models (LLMs) have substantially driven scientific progress in various domains, and many papers have demonstrated their ability to tackle complex problems with creative solutions. Our paper introduces a new foundation model, nach0, capable of solving various chemical and biological tasks: biomedical question answering, named entity recognition, molecular generation, molecular synthesis, attributes prediction, and others. nach0 is a multi-domain and multi-task encoder-decoder LLM pre-trained on unlabeled text from scientific literature, patents, and molecule strings to incorporate a range of chemical and linguistic knowledge. We employed instruction tuning, where specific task-related instructions are utilized to fine-tune nach0 for the final set of tasks. To train nach0 effectively, we leverage the NeMo framework, enabling efficient parallel optimization of both base and large model versions. Extensive experiments demonstrate that our model outperforms state-of-the-art baselines on single-domain and cross-domain tasks. Furthermore, it can generate high-quality outputs in molecular and textual formats, showcasing its effectiveness in multi-domain setups.
Computation and Language, Artificial Intelligence, Machine Learning, Quantitative Methods
What problem does this paper attempt to address?
This paper aims to develop a multimodal foundation model that processes both natural language and chemical data in order to address a range of tasks in chemistry and biology. It introduces a new foundation model, nach0, with the following characteristics:

1. **Multimodal Capability**: nach0 processes both natural language and chemical language, including generating and describing molecular structures.
2. **Multi-task Processing**: nach0 is trained and optimized on multiple tasks, including biomedical question answering, named entity recognition, molecular generation, molecular synthesis, and property prediction.
3. **Cross-domain Application**: nach0 performs well within a single domain and also generates high-quality outputs in cross-domain tasks, such as generating SMILES strings from text or text descriptions from SMILES strings.

### Main Contributions of the Paper

1. **Introduction of the nach0 Model**: The paper proposes a new multimodal encoder-decoder Transformer model pre-trained on natural language and chemical data.
2. **Multi-task Fine-tuning**: nach0 is fine-tuned in a supervised, multi-task manner, with natural language instructions specifying each task.
3. **Experimental Verification**: Extensive experiments on benchmark datasets show that nach0 is competitive on single-domain and cross-domain tasks, with particularly strong results in molecular generation and chemical property prediction.

### Specific Problems Solved

- **Molecular Generation**: Generate molecular structures with specific properties.
- **Molecular Property Prediction**: Predict various chemical and physical properties of molecules.
- **Biomedical Question Answering**: Answer complex questions related to biomedicine.
- **Named Entity Recognition**: Identify chemical and biological entities in text.
- **Reaction Prediction**: Predict the products or reactants of chemical reactions.
- **Cross-domain Tasks**: Convert between natural language and chemical data, such as generating molecular structures from text descriptions or text descriptions from molecular structures.

Through these contributions, nach0 aims to provide a powerful tool for research in chemistry and biomedicine, accelerating drug discovery and materials design.
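The instruction-based, multi-task setup summarized above can be pictured as casting every task into a text-to-text pair: an instruction plus the input goes to the encoder, and the decoder produces the answer as text or as a SMILES string. The sketch below is only an illustration of that idea. The template wording, the task names, and the `build_example` helper are hypothetical and are not taken from the paper; only the Python standard library is used.

```python
from dataclasses import dataclass

# Hypothetical instruction templates, invented for illustration.
# The paper's actual prompts are not given in this summary.
TEMPLATES = {
    "description_to_smiles": "Generate a SMILES string for the molecule described as: {text}",
    "smiles_to_description": "Describe the molecule with SMILES {smiles}",
    "ner": "Extract the chemical entities mentioned in: {text}",
}


@dataclass(frozen=True)
class Example:
    task: str     # task name, used only for bookkeeping
    source: str   # instruction + input, fed to the encoder
    target: str   # text (or SMILES) the decoder should produce


def build_example(task: str, target: str, **fields: str) -> Example:
    """Fill the task's instruction template and pair it with its target."""
    template = TEMPLATES[task]
    return Example(task=task, source=template.format(**fields), target=target)


# One shared list mixes natural-language and chemical tasks,
# which is what makes the training set multi-task and cross-domain.
examples = [
    build_example("description_to_smiles", target="CCO",
                  text="ethanol, a two-carbon alcohol"),
    build_example("smiles_to_description", target="ethanol, a two-carbon alcohol",
                  smiles="CCO"),
    build_example("ner", target="ethanol",
                  text="The sample was dissolved in ethanol."),
]

for ex in examples:
    print(f"[{ex.task}] {ex.source} -> {ex.target}")
```

Because every task shares the same string-in, string-out shape, a single encoder-decoder model can in principle be fine-tuned on the mixed list without task-specific output heads.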