Abstract: Scientific language models drive research innovation but require extensive fine-tuning on large datasets. This work enhances such models by improving their inference and evaluation capabilities with minimal or no additional training. Focusing on molecule caption generation, we explore synergies between alignment fine-tuning and model merging in a cross-modal setup. We reveal intriguing insights into the behaviour and suitability of such methods while significantly surpassing state-of-the-art models. Moreover, we propose a novel atomic-level evaluation method leveraging off-the-shelf Natural Language Inference (NLI) models for use in the unseen chemical domain. Our experiments demonstrate that our evaluation operates at the right level of granularity, effectively handling multiple content units and subsentence reasoning, while widely adopted NLI methods consistently misalign with assessment criteria.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

The paper aims to address several key issues in molecular description generation tasks using chemical language models:

1. **Over-reliance on Large-scale Training**: Existing chemical language models typically require vast amounts of data and extensive fine-tuning to achieve good performance. This approach is not only costly but also prone to overfitting, resulting in poor performance on unseen data.
2. **Limitations of Alignment Fine-tuning Methods**: Traditional alignment fine-tuning methods (such as reinforcement learning based on human feedback) are limited in handling cross-modal tasks, especially when training data is scarce.
3. **Inadequacies in Evaluation Methods**: Current natural language inference (NLI) evaluation methods have limitations when applied in the chemical domain, failing to accurately capture the subtle differences and complexities of generated text.

### Solutions

To address the above issues, the authors propose the following solutions:

1. **Synergy of Model Fusion and Alignment Fine-tuning**:
   - **Model Fusion**: By merging pre-trained molecular generation and description generation models, a multifunctional cross-modal model is constructed. This approach can improve model performance without accessing the original training data.
   - **Alignment Fine-tuning**: Using a small amount of training data, techniques such as behavior cloning and supervised policy optimization are employed to fine-tune the fused model, making it better suited for specific tasks.
2. **Atomic-level Cross-modal NLI Evaluation**:
   - A new atomic-level cross-modal NLI evaluation method is proposed. By decomposing the reference text and generated text into atomic premises and hypotheses, and using an NLI model to calculate the probability distribution of contradictions and entailments, the quality of the generated text can be more accurately assessed.

### Main Contributions

1. **Reduction in Training Data Requirements**: Through model fusion and alignment fine-tuning, the model's performance can be significantly improved with a small amount of training data, reducing the reliance on large-scale annotated data.
2. **Improved Generalization Ability**: Experimental results show that the proposed method outperforms extensively trained models on unseen data, enhancing the model's generalization ability.
3. **Improved Evaluation Methods**: The proposed atomic-level cross-modal NLI evaluation method can more accurately capture the subtle differences in generated text, providing a more reliable evaluation standard.

### Experimental Results

- **Molecular Description Generation**: On 3000 unseen samples, the CPO method improved molecular description generation performance by 20% compared to the baseline model Meditron.
- **Molecular Generation**: On 3000 unseen samples, the CPO method improved molecular generation performance by 42% compared to the baseline model Meditron.
- **Comparison of Evaluation Methods**: The atomic-level NLI evaluation method excelled in coverage and hallucination detection, more accurately distinguishing the quality of generated text.

### Conclusion

Through the synergy of model fusion and alignment fine-tuning, the paper successfully addresses the key issues in molecular description generation tasks using chemical language models, providing a cost-effective and highly generalizable approach. Additionally, the proposed atomic-level cross-modal NLI evaluation method offers a new perspective for assessing the quality of generated text.
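### Illustrative Sketch: Atomic-level NLI Scoring

The atomic-level NLI evaluation summarized above (decompose reference and generated captions into atomic units, then score unit pairs with an NLI model) might be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `split_atomic`, `coverage`, and `contradiction_rate`, the sentence-level splitting rule, the max-then-mean aggregation, and the `toy_scorer` stand-in for a real off-the-shelf NLI model are all hypothetical choices made here for clarity.

```python
import re
from typing import Callable, Dict, List

# An NLI scorer maps (premise, hypothesis) to label probabilities.
# In practice this would wrap an off-the-shelf NLI model (assumption).
NliScorer = Callable[[str, str], Dict[str, float]]


def split_atomic(text: str) -> List[str]:
    """Naive decomposition into atomic units: split on '.' or ';'.

    A real system would likely need finer, sub-sentence splitting.
    """
    parts = re.split(r"[.;](?:\s+|$)", text)
    return [p.strip() for p in parts if p.strip()]


def coverage(reference: str, generated: str, scorer: NliScorer) -> float:
    """Mean over reference units of the best entailment probability
    obtained when any generated unit is used as the premise."""
    refs, gens = split_atomic(reference), split_atomic(generated)
    if not refs or not gens:
        return 0.0
    best = [max(scorer(g, r)["entailment"] for g in gens) for r in refs]
    return sum(best) / len(best)


def contradiction_rate(reference: str, generated: str, scorer: NliScorer) -> float:
    """Mean over generated units of the worst-case contradiction
    probability against any reference unit used as the premise."""
    refs, gens = split_atomic(reference), split_atomic(generated)
    if not refs or not gens:
        return 0.0
    worst = [max(scorer(r, g)["contradiction"] for r in refs) for g in gens]
    return sum(worst) / len(worst)


def toy_scorer(premise: str, hypothesis: str) -> Dict[str, float]:
    """Hypothetical stand-in for an NLI model, based on token sets only."""
    p, h = set(premise.lower().split()), set(hypothesis.lower().split())
    if p == h:
        return {"entailment": 0.9, "neutral": 0.05, "contradiction": 0.05}
    if p ^ h == {"not"}:  # the two units differ only by a negation
        return {"entailment": 0.05, "neutral": 0.05, "contradiction": 0.9}
    return {"entailment": 0.1, "neutral": 0.8, "contradiction": 0.1}


if __name__ == "__main__":
    ref = "The molecule is an alcohol. It is a metabolite."
    gen = "The molecule is an alcohol. It is not a metabolite."
    print("coverage:", coverage(ref, gen, toy_scorer))
    print("contradiction:", contradiction_rate(ref, gen, toy_scorer))
```

Scoring each atomic unit separately, rather than the whole caption at once, is what lets a single negated fact lower the score even when the rest of the caption matches. Swapping `toy_scorer` for a wrapper around a real NLI model would leave the aggregation code unchanged.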

Less for More: Enhanced Feedback-aligned Mixed LLMs for Molecule Caption Generation and Fine-Grained NLI Evaluation

Feedback-aligned Mixed LLMs for Machine Language-Molecule Translation

Large Language Models are In-Context Molecule Learners

MolX: Enhancing Large Language Models for Molecular Learning with A Multi-Modal Extension

MolReFlect: Towards In-Context Fine-grained Alignments between Molecules and Texts

ALMol: Aligned Language-Molecule Translation LLMs through Offline Preference Contrastive Optimisation

Empowering Molecule Discovery for Molecule-Caption Translation with Large Language Models: A ChatGPT Perspective

Benchmarking Large Language Models for Molecule Prediction Tasks

MolCap-Arena: A Comprehensive Captioning Benchmark on Language-Enhanced Molecular Property Prediction

Fine-grained LLM Agent: Pinpointing and Refining Large Language Models via Fine-Grained Actionable Feedback

Utilizing Large Language Models in an Iterative Paradigm with Domain Feedback for Zero-shot Molecule Optimization

Towards 3D Molecule-Text Interpretation in Language Models

Fine-tuning Large Language Models for Chemical Text Mining

Chemical Language Model Linker: blending text and molecules with modular adapters

nach0: Multimodal Natural and Chemical Languages Foundation Model

Crossing New Frontiers: Knowledge-Augmented Large Language Model Prompting for Zero-Shot Text-Based De Novo Molecule Design

Vision Language Model is NOT All You Need: Augmentation Strategies for Molecule Language Models

Leveraging Large Language Models for NLG Evaluation: Advances and Challenges