Abstract: Large Language Models (LLMs) have substantially driven scientific progress in various domains, and many papers have demonstrated their ability to tackle complex problems with creative solutions. Our paper introduces a new foundation model, nach0, capable of solving various chemical and biological tasks: biomedical question answering, named entity recognition, molecular generation, molecular synthesis, attributes prediction, and others. nach0 is a multi-domain and multi-task encoder-decoder LLM pre-trained on unlabeled text from scientific literature, patents, and molecule strings to incorporate a range of chemical and linguistic knowledge. We employed instruction tuning, where specific task-related instructions are utilized to fine-tune nach0 for the final set of tasks. To train nach0 effectively, we leverage the NeMo framework, enabling efficient parallel optimization of both base and large model versions. Extensive experiments demonstrate that our model outperforms state-of-the-art baselines on single-domain and cross-domain tasks. Furthermore, it can generate high-quality outputs in molecular and textual formats, showcasing its effectiveness in multi-domain setups.

What problem does this paper attempt to address?

The paper aims to develop a multimodal foundation model that processes both natural language and chemical data to address a range of tasks in chemistry and biology. Specifically, it introduces a new foundation model named nach0, which has the following characteristics:

1. **Multimodal Capability**: nach0 can process natural language and chemical language, including the generation and description of molecular structures.
2. **Multi-task Processing**: nach0 can be trained and optimized on multiple tasks, including biomedical question answering, named entity recognition, molecular generation, molecular synthesis, and property prediction.
3. **Cross-domain Application**: nach0 not only performs well in a single domain but can also generate high-quality outputs in cross-domain tasks, such as generating SMILES strings from text or generating text descriptions from SMILES strings.

### Main Contributions of the Paper

1. **Introduction of the nach0 Model**: Proposed a new multimodal encoder-decoder Transformer model that can be pre-trained on natural language and chemical data.
2. **Multi-task Fine-tuning**: Fine-tuned nach0 in a supervised, multi-task manner, using natural language instructions for multiple tasks to guide model training.
3. **Experimental Verification**: Conducted extensive experiments on benchmark datasets, demonstrating the competitiveness of nach0 in single-domain and cross-domain tasks, with particularly strong results in molecular generation and chemical property prediction.

### Specific Problems Solved

- **Molecular Generation**: Generate molecular structures with specific properties.
- **Molecular Property Prediction**: Predict various chemical and physical properties of molecules.
- **Biomedical Question Answering**: Answer complex questions related to biomedicine.
- **Named Entity Recognition**: Identify chemical and biological entities in text.
- **Reaction Prediction**: Predict the products and reactants of chemical reactions.
- **Cross-domain Tasks**: Convert between natural language and chemical data, such as generating molecular structures from text descriptions or generating text descriptions from molecular structures.

Through these contributions, nach0 aims to provide a powerful tool for research in chemistry and biomedicine, accelerating drug discovery and material design.
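
The instruction-based multi-task setup described above can be pictured as casting every task into a single text-to-text format, so that one encoder-decoder model sees an instruction plus input and is trained to emit the target string. The following is a minimal sketch under stated assumptions, not the authors' pipeline: the template wording, the task names, and the helpers `to_text_pair` and `build_mixture` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical instruction templates, one per task. The wording is
# illustrative only and is NOT taken from the nach0 paper.
TEMPLATES = {
    "ner": "Find all chemical and biological entities in the text: {input}",
    "molecule_captioning": "Describe the molecule: {input}",
    "text_to_molecule": "Generate a SMILES string for the description: {input}",
}


@dataclass(frozen=True)
class Example:
    task: str    # which task the pair came from
    source: str  # instruction + input, fed to the encoder
    target: str  # expected decoder output (text or SMILES)


def to_text_pair(task: str, input_text: str, target: str) -> Example:
    """Render one labeled record as an (instruction, target) text pair."""
    prompt = TEMPLATES[task].format(input=input_text)
    return Example(task=task, source=prompt, target=target)


def build_mixture(records: list[dict]) -> list[Example]:
    """Turn records from different tasks into one shared training list."""
    return [to_text_pair(**record) for record in records]


# Toy records: "CCO" is the SMILES string for ethanol.
RECORDS = [
    {"task": "molecule_captioning", "input_text": "CCO", "target": "ethanol"},
    {"task": "text_to_molecule", "input_text": "ethanol", "target": "CCO"},
    {"task": "ner", "input_text": "Aspirin inhibits COX-1.",
     "target": "Aspirin; COX-1"},
]

if __name__ == "__main__":
    for example in build_mixture(RECORDS):
        print(f"[{example.task}] {example.source} -> {example.target}")
```

Because the two cross-domain tasks differ only in which side holds the SMILES string, the same model and loss can cover both directions; a real training run would additionally tokenize these pairs and sample across tasks, which this sketch leaves out.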

nach0: Multimodal Natural and Chemical Languages Foundation Model

nach0-pc: Multi-task Language Model with Molecular Point Cloud Encoder

A Large Encoder-Decoder Family of Foundation Models For Chemical Language

MolX: Enhancing Large Language Models for Molecular Learning with A Multi-Modal Extension

ChemVLM: Exploring the Power of Multimodal Large Language Models in Chemistry Area

Are large language models superhuman chemists?

Less for More: Enhanced Feedback-aligned Mixed LLMs for Molecule Caption Generation and Fine-Grained NLI Evaluation

What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks

SynAsk: Unleashing the Power of Large Language Models in Organic Synthesis

Multimodal Large Language Models for Inverse Molecular Design with Retrosynthetic Planning

Fine-tuning Large Language Models for Chemical Text Mining

From Words to Molecules: A Survey of Large Language Models in Chemistry

Large Language Models as Molecular Design Engines

MolMetaLM: a Physicochemical Knowledge-Guided Molecular Meta Language Model

MolCA: Molecular Graph-Language Modeling with Cross-Modal Projector and Uni-Modal Adapter

Augmenting large language models with chemistry tools

ChemEval: A Comprehensive Multi-Level Chemical Evaluation for Large Language Models

LMM Chemical Research with Document Retrieval

ChemDFM: A Large Language Foundation Model for Chemistry