Generative Flows on Discrete State-Spaces: Enabling Multimodal Flows with Applications to Protein Co-Design

Andrew Campbell, Jason Yim, Regina Barzilay, Tom Rainforth, Tommi Jaakkola
2024-06-06
Abstract: Combining discrete and continuous data is an important capability for generative models. We present Discrete Flow Models (DFMs), a new flow-based model of discrete data that provides the missing link in enabling flow-based generative models to be applied to multimodal continuous and discrete data problems. Our key insight is that the discrete equivalent of continuous space flow matching can be realized using Continuous Time Markov Chains. DFMs benefit from a simple derivation that includes discrete diffusion models as a specific instance while allowing improved performance over existing diffusion-based approaches. We utilize our DFMs method to build a multimodal flow-based modeling framework. We apply this capability to the task of protein co-design, wherein we learn a model for jointly generating protein structure and sequence. Our approach achieves state-of-the-art co-design performance while allowing the same multimodal model to be used for flexible generation of the sequence or structure.
Machine Learning, Quantitative Methods
What problem does this paper attempt to address?
### Problems the paper attempts to solve

The paper addresses the difficulty of building multimodal generative models that handle discrete and continuous data together. It proposes **Discrete Flow Models (DFMs)**, a flow-based generative model for discrete data, and combines them with existing continuous flow models to form a multimodal generative framework. The framework is applied to protein co-design, that is, generating the structure and sequence of a protein jointly.

### Background and motivation

1. **Requirements for multimodal generative models**:
   - Many scientific applications need generative models that handle both discrete and continuous data. In protein co-design, for example, the model must generate a continuous protein structure and a discrete amino acid sequence at the same time.
2. **Limitations of existing methods**:
   - **Diffusion Models**: Although they perform well in many applications, they are inflexible at sample time, which makes them poorly suited to multimodal problems. Finding good sampling parameters requires repeated retraining and evaluation, which is very time-consuming.
   - **Flow Models**: Compared with diffusion models, flow models have a simpler framework and greater sampling flexibility, but the ability to define them on discrete spaces has been limited.

### Solutions

1. **Discrete Flow Models (DFMs)**:
   - The paper introduces DFMs, a generative model for discrete data based on Continuous Time Markov Chains (CTMCs).
   - DFMs generate new data points by simulating a probability flow. This allows distributional properties of the samples, such as secondary structure composition and diversity, to be adjusted at inference time.
2. **Multimodal generative framework**:
   - Combining DFMs with continuous flow models yields a multimodal generative framework.
   - Applying this framework to protein co-design gives a new generative model named **Multiflow**. Multiflow can generate the structure and sequence of a protein jointly, and can also generate one modality given the other.

### Experimental results

1. **Experiments on small-scale text data**:
   - DFMs outperform the discrete diffusion model D3PM (Austin et al., 2021), which the summary attributes to their sample-time flexibility.
2. **Protein co-design task**:
   - Multiflow achieves state-of-the-art performance on protein co-design and, with data distillation, also achieves state-of-the-art structure-generation performance.
   - Adjusting the stochasticity of the CTMC controls sample properties such as secondary structure composition and diversity.
   - Preliminary results suggest that Multiflow also has potential for inverse folding and forward folding, and could become a general-purpose protein generative model.

### Main contributions

1. **Proposing DFMs**: A generative model for discrete data based on simulating a probability flow with a CTMC.
2. **Constructing a multimodal generative framework**: Combining DFMs with continuous flow models to form a multimodal generative framework.
3. **Developing Multiflow**: A protein co-design generative model with the flexibility to generate either modality or both.

Through these contributions, the paper offers a new approach to multimodal generative modeling of discrete and continuous data, with protein co-design as the main demonstrated application.
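To make the CTMC-based sampling idea from the Solutions section more concrete, here is a minimal sketch of Euler-style simulation of a masking CTMC over a short token sequence. It is an illustration under stated assumptions, not the paper's implementation: the names `MASK`, `toy_denoiser`, `sample_masking_ctmc`, and `eta` are hypothetical, the denoiser is a fixed categorical distribution standing in for a learned network, and the rate form (unmask at rate `(1 + eta * t) / (1 - t)`, remask at rate `eta`) is an assumed masking construction in which `eta` plays the role of the adjustable CTMC stochasticity.

```python
import numpy as np

MASK = -1  # hypothetical sentinel for the "masked" state


def toy_denoiser(x_t, t, probs):
    """Stand-in for a learned p(x_1 | x_t): same categorical at every position."""
    return np.tile(probs, (x_t.shape[0], 1))


def sample_masking_ctmc(length, probs, n_steps=100, eta=0.0, seed=0):
    """Euler simulation of an assumed masking CTMC from t=0 (all masked) to t=1."""
    rng = np.random.default_rng(seed)
    n_states = len(probs)
    x = np.full(length, MASK)
    dt = 1.0 / n_steps

    def draw_tokens(p1):
        # One categorical draw per position from the denoiser output.
        return np.array([rng.choice(n_states, p=p1[i]) for i in range(length)])

    for k in range(n_steps):
        t = k * dt
        p1 = toy_denoiser(x, t, probs)
        masked = x == MASK
        # Per-step jump probabilities (rate * dt, clipped to 1).
        p_unmask = min(1.0, dt * (1.0 + eta * t) / (1.0 - t))
        p_remask = min(1.0, dt * eta)
        u = rng.random(length)
        unmask = masked & (u < p_unmask)
        remask = ~masked & (u < p_remask)
        x = np.where(unmask, draw_tokens(p1), x)
        x = np.where(remask, MASK, x)

    # Any position still masked at t=1 (e.g. remasked in the last step) is filled in.
    p1 = toy_denoiser(x, 1.0, probs)
    x = np.where(x == MASK, draw_tokens(p1), x)
    return x
```

Calling `sample_masking_ctmc(16, np.array([0.2, 0.5, 0.3]), eta=2.0)` returns an array of 16 token indices. With `eta=0.0` each position is unmasked exactly once, while larger `eta` lets positions be remasked and resampled along the way, which is one way to picture the sample-time control over stochasticity that the summary describes.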