Widespread misinterpretation of pKa terminology for zwitterionic compounds and its consequences

Jonathan Zheng, Ivo Leito, William Green
DOI: https://doi.org/10.26434/chemrxiv-2024-msd0q-v3
2024-10-10
Abstract: The acid dissociation constant (pKa), which quantifies the propensity for a solute to donate a proton to its solvent, is crucial for drug design and synthesis, environmental fate studies, chemical manufacturing, and many other fields. Unfortunately, the terminology used for describing acid-base phenomena is inconsistent, creating considerable potential for misinterpretation. In this work, we examine a systematic confusion underlying the definition of “acidic” and “basic” pKa values for zwitterionic compounds. Due to this confusion, some pKa data are misrepresented in data repositories, including the widely used and highly trusted ChEMBL Database. Such datasets are widely used to supply training data for pKa prediction models, and hence, confusion and errors in the data make model performance worse. Herein, we discuss the intricacies of this issue. We make suggestions for describing acid-base phenomena, training pKa prediction models, and stewarding pKa datasets, given the high potential for confusion and the potentially high impact of accurately describing acid-base phenomena.
Chemistry
What problem does this paper attempt to address?
This paper attempts to address the confusion surrounding the term acid dissociation constant (pKa) as applied to zwitterionic compounds. Specifically:

1. **Terminology Confusion**: For zwitterionic compounds (such as amino acids), the low pKa value is often incorrectly labeled as "acidic," while the high pKa value is incorrectly labeled as "basic." This labeling is contrary to the traditional definitions of acids and bases, leading to inconsistencies in data repositories (such as the widely used ChEMBL database).
2. **Data Contamination**: Because of the confusion in terminology, datasets used to train pKa prediction models have been contaminated, which degrades model performance. For example, the "most acidic" and "most basic" pKa values reported in the ChEMBL database actually correspond to the least basic and least acidic macroscopic values, rather than the expected transitions of +1 and -1 charge states.
3. **Downstream Effects**: These errors affect the prediction of ADME properties in drug design, including blood-brain barrier permeability, protein binding, and solubility. Kinetic simulations may also be affected, thereby influencing drug development and other chemical applications.

In summary, this paper aims to highlight the issue of terminology confusion and to propose corresponding solutions to improve future research.
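
As an illustration of the terminology confusion described above, the following Python sketch labels each macroscopic pKa by the charge transition it describes. It is not taken from the paper. The function names are hypothetical. The convention used here (the "basic" pKa is the +1 to 0 transition and the "acidic" pKa is the 0 to -1 transition) is an assumption inferred from the summary's reference to +1 and -1 charge states. The glycine values (about 2.34 and 9.60) are approximate textbook figures assumed for the example.

```python
def label_macroscopic_pkas(pkas, max_charge):
    """Pair each macroscopic pKa with the charge transition it describes.

    pkas: macroscopic pKa values, in any order.
    max_charge: net charge of the fully protonated species (an input the
        caller must supply; it is not derived from structure here).
    Returns a list of (pKa, charge_before, charge_after), lowest pKa first.
    """
    labels = []
    for i, pka in enumerate(sorted(pkas)):
        charge_before = max_charge - i    # species that gives up the proton
        charge_after = charge_before - 1  # species left after deprotonation
        labels.append((pka, charge_before, charge_after))
    return labels


def charge_transition_pkas(pkas, max_charge):
    """Return (pKa of +1 -> 0, pKa of 0 -> -1); None where no such step exists.

    Assumed convention: the first value is the "basic" pKa (pKa of the
    conjugate acid of the neutral form), the second is the "acidic" pKa.
    """
    plus_to_neutral = None
    neutral_to_minus = None
    for pka, before, after in label_macroscopic_pkas(pkas, max_charge):
        if (before, after) == (1, 0):
            plus_to_neutral = pka
        elif (before, after) == (0, -1):
            neutral_to_minus = pka
    return plus_to_neutral, neutral_to_minus


if __name__ == "__main__":
    # Glycine, approximate textbook values (assumed, not from the paper).
    basic_pka, acidic_pka = charge_transition_pkas([2.34, 9.60], max_charge=1)
    print("pKa of +1 -> 0 transition:", basic_pka)
    print("pKa of 0 -> -1 transition:", acidic_pka)
```

Under this charge-state convention, glycine's lower value belongs to the +1 to 0 transition and its higher value to the 0 to -1 transition. Labeling by functional group instead (carboxylic acid as "acidic," amine as "basic") assigns the two names the other way around, which is the kind of mismatch the summary describes.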