Abstract

Background: As large language models grow in size and diversity, their potential and the relevance of their applications are increasingly recognized. Their rapid advancement also has significant implications for the long-term design of stimulus-responsive materials used in drug delivery.

Methods: The large models were implemented with Hugging Face's Transformers package using the BigBird, Gemma, and GPT NeoX architectures. Pre-training used the PubChem dataset, and fine-tuning used QM7b. Chemist instruction training was based on Direct Preference Optimization. Molecules were filtered by Drug Likeness, Synthetic Accessibility, and PageRank scores. All computational chemistry simulations were performed in ORCA using Time-Dependent Density-Functional Theory.

Results: For large models to process extensive datasets and learn in a way that approaches a chemist's intuition, deeper chemical insight must be integrated into them. We first compared the performance of BigBird, Gemma, GPT NeoX, and other models on the design of photoresponsive drug delivery molecules. We gathered excitation energy data with computational chemistry tools and further investigated light-driven isomerization reactions as a critical mechanism in drug delivery. We also explored whether incorporating human feedback into reinforcement learning can give large models chemical intuition, improving their understanding of relationships involving -N=N- groups in the photoisomerization transitions of photoresponsive molecules.

Conclusions: We implemented an efficient design process, based on structural knowledge and data and driven by large language model technology, to obtain a candidate dataset of specific photoswitchable molecules. However, the lack of specialized domain datasets remains an obstacle to maximizing model performance.
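The abstract names Direct Preference Optimization as the basis for chemist instruction training but does not give its objective. As a rough illustration only, the standard per-pair DPO loss can be sketched as below. The function name, argument names, and the default `beta` are assumptions made for this sketch and are not taken from the paper; the inputs are summed log-probabilities of a chemist-preferred ("chosen") and a dispreferred ("rejected") output under the trained policy and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen margin - rejected margin))).

    All names and the default beta are illustrative assumptions,
    not values reported in the paper.
    """
    # How much more the policy favors each output than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) == log(1 + exp(-z)); branch to avoid overflow.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Hypothetical log-probabilities for one preference pair.
loss = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss equals log 2 when the policy and reference agree, and falls below that as the policy shifts probability toward the preferred output relative to the reference.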

LMM Spectrometric Determination of an Organic Compound

High Dimensional and Complex Spectrometric Data Analysis of an Organic Compound using Large Multimodal Models and Chained Outputs

Machine Learning Spectroscopy Using a 2-Stage, Generalized Constituent Contribution Protocol

LMM Chemical Research with Document Retrieval

Molecule Identification with Rotational Spectroscopy and Probabilistic Deep Learning

Near-infrared Spectroscopy and HPLC Combined with Chemometrics for Comprehensive Evaluation of Six Organic Acids in Ginkgo Biloba Leaf Extract

Benchmarking Large Language Models for Molecule Prediction Tasks

Enhancing Molecular Structure Elucidation: MultiModalTransformer for both simulated and experimental spectra

Accurate and efficient structure elucidation from routine one-dimensional NMR spectra using multitask machine learning

Leveraging Pre-Trained LMs for Rapid and Accurate Structure Elucidation from 2D NMR Data

Elucidating Structures of Complex Organic Compounds Using a Machine Learning Model Based on the 13C NMR Chemical Shifts

Genome-inspired molecular identification in organic matter via Raman spectroscopy

Extracting Structured Data from Organic Synthesis Procedures Using a Fine-Tuned Large Language Model

Chemical Structure Elucidation from Mass Spectrometry by Matching Substructures

MassSpecGym: A benchmark for the discovery and identification of molecules

Machine Learning in Complex Organic Mixtures: Applying Domain Knowledge Allows for Meaningful Performance with Small Data Sets

Discovering Photoswitchable Molecules for Drug Delivery with Large Language Models and Chemist Instruction Training

MolX: Enhancing Large Language Models for Molecular Learning with A Multi-Modal Extension

Chemical Space Localization for Unknown Metabolite Annotation via Semantic Similarity of Mass Spectral Language

Machine Learning for Screening Large Organic Molecules