L+M-24: Building a Dataset for Language + Molecules @ ACL 2024

Carl Edwards, Qingyun Wang, Lawrence Zhao, Heng Ji
2024-07-05
Abstract: Language-molecule models have emerged as an exciting direction for molecular discovery and understanding. However, training these models is challenging due to the scarcity of molecule-language pair datasets. At this point, datasets have been released which are 1) small and scraped from existing databases, 2) large but noisy and constructed by performing entity linking on the scientific literature, and 3) built by converting property prediction datasets to natural language using templates. In this document, we detail the $\textit{L+M-24}$ dataset, which has been created for the Language + Molecules Workshop shared task at ACL 2024. In particular, $\textit{L+M-24}$ is designed to focus on three key benefits of natural language in molecule design: compositionality, functionality, and abstraction.
Computation and Language, Artificial Intelligence, Biomolecules, Quantitative Methods
What problem does this paper attempt to address?
The main goal of this paper is to introduce and release L+M-24, a new dataset intended to promote research in the field of Language + Molecules. The design of L+M-24 targets two key issues:

1. **Data Scarcity**: Existing molecule-language pair datasets are either small or noisy, which makes it difficult to train high-quality language-molecule models.
2. **Key Characteristics of Natural Language**: Current methods often overlook three important characteristics of natural language in molecular design: compositionality, functionality, and abstraction. These characteristics are crucial for improving a model's ability to understand and generate complex molecules.

To address these issues, the authors created the L+M-24 dataset, which has the following features:

- **Four Major Application Domains**: The dataset covers four important small-molecule application domains: biomedicine; light and electricity; human interaction and organoleptics (sensory experience); and agriculture and industry.
- **Compositionality Testing**: Molecules with specific combinations of attributes are intentionally excluded from training, so that evaluation can test whether a model generalizes to unseen attribute combinations.
- **Task Settings**: The dataset is primarily used for two tasks: generating descriptions from molecules (captioning) and generating molecules from descriptions (generation).
- **Data Sources**: The L+M-24 dataset integrates information from multiple databases, including PubChem, CheF (Chemical Function), and ChemFOnt.

Additionally, the paper presents a series of benchmark experimental results on this dataset and discusses the challenges current models face on these tasks. Overall, the L+M-24 dataset aims to advance the interdisciplinary field of language and molecules, and it forms part of a shared task at the Language + Molecules Workshop at ACL 2024.
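
The compositionality test described above can be illustrated with a minimal sketch. It is not the authors' actual split procedure: the function name `split_by_held_out_pairs`, the placeholder property labels (`P1`, `P2`, `P3`), and the toy records are all assumptions made for illustration. The idea shown is that any molecule annotated with an excluded attribute pair goes to the evaluation set, while each attribute on its own can still appear in training.

```python
from itertools import combinations


def split_by_held_out_pairs(records, held_out_pairs):
    """Split (smiles, properties) records into train and evaluation lists.

    A record goes to evaluation if it carries any held-out property pair;
    otherwise it stays in train. Hypothetical helper, not from the paper.
    """
    held_out = {frozenset(pair) for pair in held_out_pairs}
    train, evaluation = [], []
    for smiles, properties in records:
        # Enumerate every unordered pair of properties on this molecule.
        pairs = {frozenset(c) for c in combinations(sorted(set(properties)), 2)}
        if pairs & held_out:
            evaluation.append((smiles, properties))
        else:
            train.append((smiles, properties))
    return train, evaluation


# Toy records: SMILES strings with placeholder labels (not real annotations).
records = [
    ("CCO", ["P1", "P2"]),
    ("CC(=O)O", ["P1", "P3"]),
    ("c1ccccc1", ["P1"]),
    ("CC(C)O", ["P2", "P3"]),
]

# Hold out the combination (P2, P3); P2 and P3 each still occur in train.
train, evaluation = split_by_held_out_pairs(records, [("P2", "P3")])
print(len(train), len(evaluation))
```

In this toy setup a model sees `P2` and `P3` separately during training but never together, so performance on the evaluation records reflects generalization to an unseen combination.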