Abstract: Language-molecule models have emerged as an exciting direction for molecular discovery and understanding. However, training these models is challenging due to the scarcity of molecule-language pair datasets. At this point, datasets have been released which are 1) small and scraped from existing databases, 2) large but noisy and constructed by performing entity linking on the scientific literature, and 3) built by converting property prediction datasets to natural language using templates. In this document, we detail the $\textit{L+M-24}$ dataset, which has been created for the Language + Molecules Workshop shared task at ACL 2024. In particular, $\textit{L+M-24}$ is designed to focus on three key benefits of natural language in molecule design: compositionality, functionality, and abstraction.

What problem does this paper attempt to address?

The main goal of this paper is to introduce and release a new dataset, L+M-24, intended to support research at the intersection of language and molecules. The dataset is designed around two issues:

1. **Data Scarcity**: Existing molecule-language pair datasets are either small or noisy, which makes it hard to train high-quality language-molecule models.
2. **Key Characteristics of Natural Language**: Current methods often overlook three characteristics of natural language that matter in molecular design: compositionality, functionality, and abstraction. These characteristics are crucial for a model's ability to understand and generate complex molecules.

To address these issues, the authors created the L+M-24 dataset, which has the following features:

- **Four Major Application Domains**: The dataset covers four small-molecule application domains: biomedicine; light and electricity; human interaction and sensory experience (organoleptics); and agriculture and industry.
- **Compositionality Testing**: Molecules with specific combinations of attributes are intentionally held out, so that evaluation can test whether a model generalizes to attribute combinations it has not seen.
- **Task Settings**: The dataset is primarily used for two tasks: generating descriptions from molecules (captioning) and generating molecules from descriptions (generation).
- **Data Sources**: L+M-24 integrates information from multiple databases, including PubChem, CheF (Chemical Function), and ChemFOnt.

The paper also presents benchmark results on this dataset and discusses the challenges current models face on these tasks. L+M-24 serves as the dataset for a shared task at the Language + Molecules Workshop at ACL 2024.
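The held-out-combination idea behind compositionality testing can be illustrated with a minimal sketch. It is not the authors' actual splitting procedure: the function name `split_by_held_out_pairs`, the molecule IDs, and the property labels are all hypothetical and chosen only for illustration. The sketch moves every molecule that carries a held-out pair of properties into the evaluation set, while each individual property can still appear in training.

```python
from itertools import combinations


def split_by_held_out_pairs(records, held_out_pairs):
    """Split (molecule_id, properties) records into train and evaluation lists.

    A record goes to evaluation if it carries any held-out property pair;
    otherwise it stays in train. Hypothetical helper, not from the paper.
    """
    held_out = {frozenset(pair) for pair in held_out_pairs}
    train, evaluation = [], []
    for mol_id, props in records:
        # All unordered property pairs present on this molecule.
        pairs = {frozenset(c) for c in combinations(sorted(props), 2)}
        if pairs & held_out:
            evaluation.append((mol_id, props))
        else:
            train.append((mol_id, props))
    return train, evaluation


# Toy data: IDs and labels are made up and do not come from L+M-24.
records = [
    ("mol_1", {"solvent", "antiseptic"}),
    ("mol_2", {"solvent", "flavoring"}),
    ("mol_3", {"solvent"}),
    ("mol_4", {"antiseptic", "flavoring"}),
]
train, evaluation = split_by_held_out_pairs(
    records, held_out_pairs=[("antiseptic", "flavoring")]
)
print([m for m, _ in train])
print([m for m, _ in evaluation])
```

In this toy split, "antiseptic" and "flavoring" each occur in the training records, but never together, so a model evaluated on `mol_4` must compose two properties it has only seen separately.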

L+M-24: Building a Dataset for Language + Molecules @ ACL 2024

Benchmarking Large Language Models for Molecule Prediction Tasks

Towards 3D Molecule-Text Interpretation in Language Models

From Words to Molecules: A Survey of Large Language Models in Chemistry

Can Large Language Models Empower Molecular Property Prediction?

Large Language Models as Molecular Design Engines

MoleculeQA: A Dataset to Evaluate Factual Accuracy in Molecular Comprehension

MolX: Enhancing Large Language Models for Molecular Learning with A Multi-Modal Extension

From Artificially Real to Real: Leveraging Pseudo Data from Large Language Models for Low-Resource Molecule Discovery

LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark

DataComp-LM: In search of the next generation of training sets for language models

MoleculeCLA: Rethinking Molecular Benchmark via Computational Ligand-Target Binding Analysis

ChemVLM: Exploring the Power of Multimodal Large Language Models in Chemistry Area

MolCap-Arena: A Comprehensive Captioning Benchmark on Language-Enhanced Molecular Property Prediction

LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset

MolTC: Towards Molecular Relational Modeling In Language Models

Less for More: Enhanced Feedback-aligned Mixed LLMs for Molecule Caption Generation and Fine-Grained NLI Evaluation

Empowering Molecule Discovery for Molecule-Caption Translation with Large Language Models: A ChatGPT Perspective

MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding

MolLM: A Unified Language Model for Integrating Biomedical Text with 2D and 3D Molecular Representations