Generating Highly Designable Proteins with Geometric Algebra Flow Matching

Simon Wagner, Leif Seute, Vsevolod Viliuga, Nicolas Wolf, Frauke Gräter, Jan Stühmer
2024-11-08
Abstract: We introduce a generative model for protein backbone design utilizing geometric products and higher order message passing. In particular, we propose Clifford Frame Attention (CFA), an extension of the invariant point attention (IPA) architecture from AlphaFold2, in which the backbone residue frames and geometric features are represented in the projective geometric algebra. This enables the construction of geometrically expressive messages between residues, including higher order terms, using the bilinear operations of the algebra. We evaluate our architecture by incorporating it into the framework of FrameFlow, a state-of-the-art flow matching model for protein backbone generation. The proposed model achieves high designability, diversity and novelty, while also sampling protein backbones that follow the statistical distribution of secondary structure elements found in naturally occurring proteins, a property so far only insufficiently achieved by many state-of-the-art generative models.
Machine Learning
What problem does this paper attempt to address?
The paper aims to generate protein backbone structures with high designability, diversity, and novelty whose secondary-structure elements (such as α-helices and β-sheets) follow the statistical distribution found in natural proteins. Existing generative models often over-represent α-helices when generating small proteins and fail to capture the broad secondary-structure distribution of natural proteins. The paper therefore proposes a new method, Geometric Algebra Flow Matching (GAFL), to overcome these limitations.

### Main problems and goals of the paper

1. **Increase the diversity of protein design**: Generate proteins with diverse structures and avoid over-representing specific structures (such as α-helices), so that a broader structural space can be explored.
2. **Enhance the novelty of generated proteins**: Generate previously unseen protein structures, which helps to discover new functional proteins.
3. **Maintain a secondary-structure distribution similar to that of natural proteins**: Ensure that the generated backbones conform to the statistical distribution of common secondary-structure elements in natural proteins, which is crucial for designing proteins with a wide range of functions.
4. **Improve the designability of generated proteins**: Ensure that amino-acid sequences designed for a generated backbone by an inverse-folding model (such as ProteinMPNN) refold, according to a structure-prediction model (such as ESMFold), into structures consistent with the original backbone.

### Solutions

To achieve these goals, the paper introduces the Geometric Algebra Flow Matching (GAFL) framework and proposes the Clifford Frame Attention (CFA) mechanism as an extension of the Invariant Point Attention (IPA) architecture from AlphaFold2.
Specifically:

- **Geometric Algebra (GA)**: Use Projective Geometric Algebra (PGA) to represent the local coordinate frames and geometric features of protein backbone residues. PGA can express geometric objects such as points, lines, and planes, and computes geometric relationships (distances, angles, projections, and so on) through its bilinear operations.
- **Clifford Frame Attention (CFA)**: Represent geometric features as PGA multivectors and build a higher-order message-passing mechanism from geometric bilinear layers. This makes the messages exchanged between residues geometrically more expressive.
- **Flow Matching**: Build on a flow-matching framework (FrameFlow) and generate protein backbones with Continuous Normalizing Flows (CNFs). The framework learns a probability path from the prior distribution to the target distribution, so that sampled backbones follow the distribution of natural proteins.

### Experimental results

The reported results show that GAFL achieves high designability, diversity, and novelty, and that it captures the statistical distribution of secondary-structure elements in natural proteins more faithfully than many existing models. In particular, for small proteins with fewer than 150 residues, GAFL generates highly designable backbones whose proportion of β-sheets is close to that of natural proteins, whereas other models tend to over-represent α-helices.

In conclusion, the paper addresses these shortcomings of existing protein-generation models by introducing geometric algebra and an improved message-passing mechanism, providing new tools for the field of protein design.
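To make the "bilinear operations" of PGA mentioned under Solutions more concrete, the following is a minimal sketch of the geometric product in the algebra with one degenerate basis vector `e0` (squaring to 0) and three Euclidean basis vectors `e1`, `e2`, `e3` (squaring to 1). The bitmask encoding of basis blades and all function names are illustrative assumptions; this is not the paper's implementation and does not include the learned layers of CFA.

```python
import numpy as np

N_BASIS = 4               # basis vectors e0, e1, e2, e3
N_BLADES = 1 << N_BASIS   # 16 basis blades, one per bitmask
E0_BIT = 0b0001           # e0 is the degenerate direction: e0 * e0 = 0


def reorder_sign(a: int, b: int) -> int:
    """Sign picked up when sorting the basis vectors of blade a followed by blade b."""
    a >>= 1
    swaps = 0
    while a:
        swaps += bin(a & b).count("1")
        a >>= 1
    return -1 if swaps % 2 else 1


def geometric_product(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Geometric product of two multivectors stored as length-16 coefficient arrays."""
    out = np.zeros(N_BLADES)
    for a in range(N_BLADES):
        if x[a] == 0.0:
            continue
        for b in range(N_BLADES):
            if y[b] == 0.0 or (a & b & E0_BIT):
                continue  # a shared e0 factor squares to zero
            # Shared Euclidean vectors square to +1, so only the reorder sign remains.
            out[a ^ b] += reorder_sign(a, b) * x[a] * y[b]
    return out


def basis_blade(mask: int) -> np.ndarray:
    """Multivector with coefficient 1 on the blade encoded by `mask`."""
    m = np.zeros(N_BLADES)
    m[mask] = 1.0
    return m


e0, e1, e2 = basis_blade(0b0001), basis_blade(0b0010), basis_blade(0b0100)
e12 = geometric_product(e1, e2)  # bivector e1e2
```

Because the product is linear in each argument, a layer built on it can mix the features of two residues into terms of higher grade, which is the sense in which such messages are "higher order".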
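The flow-matching idea described under Solutions can be sketched for the translational (Euclidean) coordinates alone. The sketch assumes a straight-line conditional path between a prior sample `x0` and a data sample `x1`, which is a common choice but is not confirmed by the summary above; residue rotations, which FrameFlow also models, are omitted. `velocity_fn` is a hypothetical stand-in for the trained network.

```python
import numpy as np


def interpolate(x0: np.ndarray, x1: np.ndarray, t: float) -> np.ndarray:
    """Point at time t on the straight line from x0 (prior) to x1 (data)."""
    return (1.0 - t) * x0 + t * x1


def flow_matching_loss(velocity_fn, x0: np.ndarray, x1: np.ndarray, t: float) -> float:
    """Squared error between the predicted velocity and the target x1 - x0."""
    xt = interpolate(x0, x1, t)
    target = x1 - x0  # time derivative of the straight-line path
    pred = velocity_fn(xt, t)
    return float(np.mean(np.sum((pred - target) ** 2, axis=-1)))


def euler_sample(velocity_fn, x0: np.ndarray, n_steps: int = 100) -> np.ndarray:
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x


# Toy check with an oracle velocity field that already knows the data sample.
rng = np.random.default_rng(0)
x0 = rng.normal(size=(8, 3))   # 8 residues, 3D positions drawn from the prior
x1 = rng.normal(size=(8, 3))   # stand-in for a training backbone
oracle = lambda x, t: x1 - x0
generated = euler_sample(oracle, x0)
```

With the oracle field the loss is zero and integration transports `x0` onto `x1`; a trained network would approximate this field in expectation over many data samples.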
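The designability criterion in goal 4 compares a generated backbone with the structures predicted for its designed sequences. A minimal sketch of that comparison, assuming the backbones are given as arrays of matching atom coordinates and using RMSD after optimal superposition (Kabsch algorithm), is shown below. The 2 Å threshold and the rule of taking the best of several refolded candidates are assumptions for illustration and are not stated in the summary above; the inverse-folding and structure-prediction steps are represented only by their outputs.

```python
import numpy as np


def kabsch_rmsd(P: np.ndarray, Q: np.ndarray) -> float:
    """RMSD between point sets P and Q (shape (N, 3)) after optimal rigid alignment."""
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    H = P.T @ Q                          # 3x3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that the result is a proper rotation.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    P_aligned = P @ R.T
    return float(np.sqrt(np.mean(np.sum((P_aligned - Q) ** 2, axis=1))))


def is_designable(backbone, refolded_candidates, threshold: float = 2.0) -> bool:
    """True if any refolded structure is within `threshold` (assumed Å) of the backbone."""
    best = min(kabsch_rmsd(backbone, cand) for cand in refolded_candidates)
    return bool(best < threshold)


# Toy data: a "refolded" structure that is just a rotated, shifted copy.
rng = np.random.default_rng(0)
backbone = rng.normal(size=(50, 3))
c, s = np.cos(0.7), np.sin(0.7)
rotation = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
refolded = backbone @ rotation.T + np.array([1.0, 2.0, 3.0])
unrelated = 10.0 * rng.normal(size=(50, 3))
```

Here `is_designable(backbone, [refolded])` holds because the two structures differ only by a rigid motion, while `unrelated` cannot be superimposed closely.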
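Whether generated backbones over-represent α-helices can be checked by comparing secondary-structure fractions between generated and natural structures. The sketch below assumes each structure has already been annotated with a DSSP-style per-residue string and uses the usual three-state reduction (H, G, I as helix; E, B as strand; everything else as coil). The function name and the example string are hypothetical.

```python
from collections import Counter

HELIX_CODES = set("HGI")   # alpha, 3-10 and pi helices
STRAND_CODES = set("EB")   # extended strand and isolated bridge


def secondary_structure_fractions(dssp: str) -> dict:
    """Fraction of residues in helix, strand and coil for one DSSP-style string."""
    if not dssp:
        raise ValueError("empty secondary-structure string")
    counts = Counter(
        "helix" if code in HELIX_CODES
        else "strand" if code in STRAND_CODES
        else "coil"
        for code in dssp
    )
    return {state: counts[state] / len(dssp) for state in ("helix", "strand", "coil")}


fractions = secondary_structure_fractions("HHHHEEEE--")
```

Averaging these fractions over a set of generated samples and over a reference set of natural proteins gives a simple way to see how far the two distributions differ.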