Assessing the feasibility of statistical inference using synthetic antibody-antigen datasets

Thomas Minotto, Philippe A. Robert, Ingrid Hobæk Haff, Geir K. Sandve
DOI: https://doi.org/10.1515/sagmb-2023-0027
2024-04-03
Statistical Applications in Genetics and Molecular Biology
Abstract: Simulation frameworks are useful to stress-test predictive models when data is scarce, or to assert model sensitivity to specific data distributions. Such frameworks often need to recapitulate several layers of data complexity, including emergent properties that arise implicitly from the interaction between simulation components. Antibody-antigen binding is a complex mechanism by which an antibody sequence wraps itself around an antigen with high affinity. In this study, we use a synthetic simulation framework for antibody-antigen folding and binding on a 3D lattice that includes full details on the spatial conformation of both molecules. We investigate how emergent properties arise in this framework, in particular the physical proximity of amino acids, their presence on the binding interface, or the binding status of a sequence, and relate that to the individual and pairwise contributions of amino acids in statistical models for binding prediction. We show that weights learnt from a simple logistic regression model align with some but not all features of amino acids involved in the binding, and that predictive sequence binding patterns can be enriched. In particular, main effects correlated with the capacity of a sequence to bind any antigen, while statistical interactions were related to sequence specificity.
Keywords: statistics & probability, mathematical & computational biology, biochemistry & molecular biology
What problem does this paper attempt to address?
The paper asks whether statistical inference is feasible on synthetic antibody-antigen datasets, particularly when data is scarce or when the sensitivity of prediction models to specific data distributions needs to be tested. Using a synthetic simulation framework that models the folding and binding of antibodies and antigens on a 3D lattice, the study examines the emergent properties that arise in this framework, in particular the physical proximity of amino acids, their presence on the binding interface, and the binding status of a sequence, and relates these properties to the individual and pairwise contributions of amino acids in statistical models. Specifically, the study focuses on the following points:

1. **Generation of synthetic data**: A synthetic antibody-antigen binding simulation framework is used that describes the spatial conformation of both molecules in full detail, so that the resulting datasets come with known ground-truth structural properties.
2. **Analysis of emergent properties**: The study examines how emergent properties arise in the simulation framework, such as the physical proximity of amino acids, their presence on the binding interface, and the binding status of a sequence, and how these properties relate to the individual and pairwise effects of amino acids.
3. **Application of statistical models**: A simple logistic regression model is fitted, and its learnt weights are compared with the features of the amino acids involved in binding. The weights align with some but not all of these features. Main effects correlate with the capacity of a sequence to bind any antigen, while statistical interactions relate to sequence specificity.
4. **Association of higher-order properties**: Higher-order binding properties, such as affinity, adhesiveness, and specificity, are related to the contributions of main effects and statistical interactions in binding prediction.

In summary, the paper uses synthetic datasets with known ground truth to evaluate how well statistical inference tools identify emergent properties in protein sequence datasets. The same tools can be applied directly to experimental datasets.
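The distinction between main effects and pairwise statistical interactions in a logistic regression can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the two-letter alphabet, the toy sequences and labels, the feature naming, and the plain stochastic gradient descent fit. The toy labels are chosen so that binding depends only on the combination of residues at the two positions, which a main-effects-only model cannot capture.

```python
import math
from itertools import combinations

def encode(seq, pairwise):
    """Sparse one-hot features: one main effect per (position, residue),
    plus one interaction per pair of (position, residue) if pairwise is True."""
    feats = [("main", i, aa) for i, aa in enumerate(seq)]
    if pairwise:
        feats += [("pair", i, a, j, b)
                  for (i, a), (j, b) in combinations(enumerate(seq), 2)]
    return feats

def predict_proba(weights, bias, feats):
    """Logistic model: sigmoid of the bias plus the weights of active features."""
    z = bias + sum(weights.get(f, 0.0) for f in feats)
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))

def fit(seqs, labels, pairwise, lr=0.5, epochs=200):
    """Fit by plain stochastic gradient descent on the log-loss (no regularization)."""
    weights, bias = {}, 0.0
    data = [(encode(s, pairwise), y) for s, y in zip(seqs, labels)]
    for _ in range(epochs):
        for feats, y in data:
            g = predict_proba(weights, bias, feats) - y  # gradient w.r.t. the logit
            bias -= lr * g
            for f in feats:
                weights[f] = weights.get(f, 0.0) - lr * g
    return weights, bias

def accuracy(model, seqs, labels, pairwise):
    """Fraction of sequences whose thresholded prediction matches the label."""
    weights, bias = model
    hits = sum(
        (predict_proba(weights, bias, encode(s, pairwise)) > 0.5) == bool(y)
        for s, y in zip(seqs, labels)
    )
    return hits / len(seqs)

# Hypothetical toy data: a sequence "binds" only if both residues are identical,
# so the label depends on the pair of residues, not on either position alone.
seqs = ["AA", "AC", "CA", "CC"]
labels = [1, 0, 0, 1]

acc_main = accuracy(fit(seqs, labels, pairwise=False), seqs, labels, pairwise=False)
acc_pair = accuracy(fit(seqs, labels, pairwise=True), seqs, labels, pairwise=True)
print("main effects only:", acc_main)
print("with pairwise interactions:", acc_pair)
```

In this toy setting the interaction weights carry all of the signal, because the labels were constructed that way. On real or simulated antibody sequences, the number of pairwise features grows quadratically with sequence length, so some form of regularization would normally be needed.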