SALSA: Semantically-Aware Latent Space Autoencoder

Kathryn E. Kirchoff, Travis Maxfield, Alexander Tropsha, Shawn M. Gomez

2023-10-04

Abstract: In deep learning for drug discovery, chemical data are often represented as simplified molecular-input line-entry system (SMILES) sequences which allow for straightforward implementation of natural language processing methodologies, one being the sequence-to-sequence autoencoder. However, we observe that training an autoencoder solely on SMILES is insufficient to learn molecular representations that are semantically meaningful, where semantics are defined by the structural (graph-to-graph) similarities between molecules. We demonstrate by example that autoencoders may map structurally similar molecules to distant codes, resulting in an incoherent latent space that does not respect the structural similarities between molecules. To address this shortcoming we propose Semantically-Aware Latent Space Autoencoder (SALSA), a transformer-autoencoder modified with a contrastive task, tailored specifically to learn graph-to-graph similarity between molecules. Formally, the contrastive objective is to map structurally similar molecules (separated by a single graph edit) to nearby codes in the latent space. To accomplish this, we generate a novel dataset comprised of sets of structurally similar molecules and opt for a supervised contrastive loss that is able to incorporate full sets of positive samples. We compare SALSA to its ablated counterparts, and show empirically that the composed training objective (reconstruction and contrastive task) leads to a higher quality latent space that is more 1) structurally-aware, 2) semantically continuous, and 3) property-aware.

Machine Learning

What problem does this paper attempt to address?

This paper addresses a representation-learning problem in deep learning for drug discovery: how to improve the sequence-to-sequence autoencoder so that it learns semantically meaningful molecular representations. SMILES (simplified molecular-input line-entry system) is a text format for encoding chemical structures. The authors observe that an autoencoder trained solely on SMILES fails to learn representations that reflect structural similarity between molecules, and may map structurally similar molecules to distant codes. They propose SALSA (Semantically-Aware Latent Space Autoencoder), a transformer autoencoder modified with a contrastive task. The contrastive objective is to map structurally similar molecules, defined as molecules separated by a single graph edit, to nearby codes in the latent space. To achieve this, the authors generate a new dataset composed of sets of structurally similar molecules and use a supervised contrastive loss, which can incorporate full sets of positive samples. SALSA is compared with two ablated models (a plain SMILES autoencoder and a contrastive-only encoder), and the results show that the combined objective of reconstruction and contrastive learning yields a higher-quality latent space that is more structurally aware, semantically continuous, and property-aware. A latent space with these qualities better captures the semantics of molecular data, which matters for drug discovery tasks such as property prediction and new molecule generation.
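The supervised contrastive loss mentioned above can be illustrated with a minimal sketch. This is the generic multi-positive form of supervised contrastive loss, not the authors' implementation: the function name `supervised_contrastive_loss`, the `set_ids` labels (one id per set of structurally similar molecules), the cosine similarity, and the temperature value are all assumptions made for illustration, and the summary does not say how the paper weights this term against the reconstruction loss.

```python
import math


def normalize(v):
    # Scale a code to unit length so dot products are cosine similarities.
    # Assumes the vector is non-zero.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def supervised_contrastive_loss(codes, set_ids, temperature=0.1):
    """Generic supervised contrastive loss over a batch of latent codes.

    codes:   list of latent vectors (lists of floats), one per molecule
    set_ids: hypothetical labels; molecules sharing an id are positives
    """
    z = [normalize(v) for v in codes]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        # Every other batch member from the same set is a positive.
        positives = [j for j in range(n) if j != i and set_ids[j] == set_ids[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Temperature-scaled similarity to every other batch member.
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        # Average negative log-probability over the full set of positives.
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy batch: two sets of two molecules each, with 2-D codes.
set_ids = [0, 0, 1, 1]
coherent = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
scrambled = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
loss_coherent = supervised_contrastive_loss(coherent, set_ids)
loss_scrambled = supervised_contrastive_loss(scrambled, set_ids)
```

In the toy batch, `coherent` places members of the same set next to each other and therefore receives a lower loss than `scrambled`, where set members sit far apart. Averaging over all positives of an anchor, rather than using a single positive pair, is what lets this kind of loss use full sets of similar molecules.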

Improving Chemical Autoencoder Latent Space and Molecular De Novo Generation Diversity with Heteroencoders

Small molecule autoencoders: architecture engineering to optimize latent space utility and sustainability

Conditional Latent Space Molecular Scaffold Optimization for Accelerated Molecular Design

Application of Generative Autoencoder in De Novo Molecular Design

Bridging the Semantic Latent Space Between Brain and Machine: Similarity is All You Need

Exploring Molecular Heteroencoders with Latent Space Arithmetic: Atomic Descriptors and Molecular Operators

Exploring Latent Space for Generating Peptide Analogs Using Protein Language Models

Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules

Exploring the Latent Space of Autoencoders with Interventional Assays

Deep learning to generate in silico chemical property libraries and candidate molecules for small molecule identification in complex samples

MOL-AE: Auto-Encoder Based Molecular Representation Learning With 3D Cloze Test Objective

Latent Space Bayesian Optimization with Latent Data Augmentation for Enhanced Exploration

Masked Molecule Modeling: A New Paradigm of Molecular Representation Learning for Chemistry Understanding

A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders

SMICLR: Contrastive Learning on Multiple Molecular Representations for Semisupervised and Unsupervised Representation Learning

Residual Stream Analysis with Multi-Layer SAEs

S-MolSearch: 3D Semi-supervised Contrastive Learning for Bioactive Molecule Search

CROMA: Remote Sensing Representations with Contrastive Radar-Optical Masked Autoencoders

Semantic Autoencoder for Zero-Shot Learning

Latents2Semantics: Leveraging the Latent Space of Generative Models for Localized Style Manipulation of Face Images