Abstract: Generating molecular graphs is crucial in drug design and discovery but remains challenging due to the complex interdependencies between nodes and edges. While diffusion models have demonstrated their potential in molecular graph design, they often suffer from unstable training and inefficient sampling. To enhance generation performance and training stability, we propose GGFlow, a discrete flow matching generative model for molecular graphs that incorporates optimal transport and an edge-augmented graph transformer, which enables direct communication among chemical bonds. Additionally, GGFlow introduces a novel goal-guided generation framework to control the generative trajectory of our model, aiming to design novel molecular structures with the desired properties. GGFlow demonstrates superior performance on both unconditional and conditional molecule generation tasks, outperforming existing baselines and underscoring its effectiveness and potential for wider application.
What problem does this paper attempt to address?
The paper addresses the challenge of generating molecular graphs for drug design and discovery. Although diffusion models have shown potential in molecular graph design, they tend to suffer from unstable training and inefficient sampling. To improve generation performance and training stability, the authors propose GGFlow, a discrete flow matching generative model combined with optimal transport and designed specifically for molecular graphs. GGFlow also introduces a goal-guided generation framework that uses reinforcement learning to steer the model's generative trajectory toward new molecular structures with specific properties.
### Main contributions:
1. **GGFlow**: The paper proposes, for the first time, a generative model for molecular graph data that combines discrete flow matching with optimal transport, improving sampling efficiency and training stability. The model also integrates an edge-augmented graph transformer to improve generation quality.
2. **Goal-guided framework**: A new guidance framework based on reinforcement learning controls the probability flow during molecular graph generation so that the generated molecules meet specified property targets.
3. **Superior performance**: On unconditional and conditional molecular graph generation tasks, GGFlow achieves state-of-the-art performance, consistently outperforming existing methods across different graph types and complexities.
### Method overview:
- **Discrete flow matching**: Discrete flow matching replaces the stochastic differential equation (SDE) formulation of the generation process with an ordinary differential equation (ODE) formulation, which improves generation efficiency.
- **Optimal transport**: Optimal transport is used to reduce training variance and speed up sampling.
- **Edge-augmented graph transformer**: A triangular attention mechanism and additional graph features, such as the number of rings and connected components, help the model capture the joint distribution of the graph more effectively.
- **Goal-guided generation**: A reinforcement learning method guides the generation process with a reward function, moving the generated samples closer to the target distribution.
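
The first two ingredients above can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes that a graph is flattened into a list of categorical node and edge labels, that the transport cost between a noise sample and a data sample is their Hamming distance, and that the noisy state at time `t` is drawn from a simple linear mixture of the two endpoints. The function names `hamming`, `ot_pairing`, and `interpolate` are hypothetical, and the brute-force assignment is only practical for very small batches.

```python
import itertools
import random


def hamming(a, b):
    """Count the positions at which two equal-length label lists differ."""
    return sum(x != y for x, y in zip(a, b))


def ot_pairing(noise_batch, data_batch):
    """Pair each noise sample with one data sample so that the total
    Hamming cost is minimal. Brute force over all permutations, so this
    is an illustration for tiny batches only (assumed cost function)."""
    n = len(data_batch)
    best_perm, best_cost = None, None
    for perm in itertools.permutations(range(n)):
        cost = sum(hamming(noise_batch[i], data_batch[j])
                   for i, j in enumerate(perm))
        if best_cost is None or cost < best_cost:
            best_perm, best_cost = perm, cost
    pairs = [(noise_batch[i], data_batch[j]) for i, j in enumerate(best_perm)]
    return pairs, best_cost


def interpolate(x0, x1, t, rng):
    """Sample a noisy state x_t: each position independently takes the
    data label x1 with probability t, otherwise keeps the noise label x0
    (an assumed linear-mixture interpolant)."""
    return [b if rng.random() < t else a for a, b in zip(x0, x1)]


# Toy batch: each list stands for the flattened labels of one small graph.
noise = [[0, 1, 0, 2], [1, 1, 2, 0]]
data = [[1, 1, 2, 0], [0, 1, 0, 1]]

pairs, total_cost = ot_pairing(noise, data)
rng = random.Random(0)
for x0, x1 in pairs:
    # The denoising network would be trained to predict x1 from (x_t, t).
    x_t = interpolate(x0, x1, t=0.5, rng=rng)
    print(x0, "->", x_t, "->", x1)
print("total Hamming cost of the pairing:", total_cost)
```

Pairing each noise sample with a nearby data sample shortens the path the model has to learn, which is the intuition behind the variance reduction described above. For realistic batch sizes, an assignment solver such as `scipy.optimize.linear_sum_assignment` would replace the brute-force search.
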
### Experimental results:
- **Molecular graph generation**: Experiments on the QM9 and ZINC250k datasets show that GGFlow significantly outperforms the baseline models on the validity, NSPDK, and FCD metrics.
- **General graph generation**: Experiments on general graph generation benchmarks such as Ego-small, Community-small, Grid, and Planar also show superior performance for GGFlow.
- **Conditional molecular generation**: In conditional generation tasks, guidance with reinforcement learning significantly outperforms supervised training and supervised fine-tuning, showing stronger generation ability and higher effectiveness.
### Conclusion:
GGFlow addresses the problems of unstable training and inefficient sampling in molecular graph generation by combining discrete flow matching with optimal transport. Its goal-guided generation framework also lets it generate molecules with specific properties in conditional generation tasks, which indicates strong potential for drug design and discovery. Future work will focus on improving the scalability of the model to larger graphs, such as protein graphs.