Alchemy: A Quantum Chemistry Dataset for Benchmarking AI Models

Guangyong Chen, Pengfei Chen, Chang-Yu Hsieh, Chee-Kong Lee, Benben Liao, Renjie Liao, Weiwen Liu, Jiezhong Qiu, Qiming Sun, Jie Tang, Richard Zemel, Shengyu Zhang
DOI: https://doi.org/10.48550/arXiv.1906.09427
2019-06-22
Abstract: We introduce a new molecular dataset, named Alchemy, for developing machine learning models useful in chemistry and material science. As of June 20th 2019, the dataset comprises 12 quantum mechanical properties of 119,487 organic molecules with up to 14 heavy atoms, sampled from the GDB MedChem database. The Alchemy dataset expands the volume and diversity of existing molecular datasets. Our extensive benchmarks of the state-of-the-art graph neural network models on Alchemy clearly demonstrate the usefulness of new data in validating and developing machine learning models for chemistry and material science. We further launch a contest to attract attention from researchers in the related fields. More details can be found on the contest website (https://alchemy.tencent.com). At the time of the benchmarking experiments, we had generated 119,487 molecules in our Alchemy dataset. More molecular samples have been generated since then. Hence, we provide a list of the molecules used in the reported benchmarks.
Machine Learning
What problem does this paper attempt to address?
The problem this paper attempts to solve is that existing molecular datasets lack the scale and diversity needed to develop machine learning (ML) models for chemistry and materials science. Specifically:

1. **Limitations of existing datasets**:
   - Existing molecular datasets such as QM9, although they contain a large number of molecules, are limited in composition to hydrogen (H), carbon (C), nitrogen (N), oxygen (O) and fluorine (F), and the molecules are small.
   - The amount of data is relatively limited and is not sufficient to fully train and validate complex deep-learning models.
2. **The need for high-quality data**:
   - High-quality data is crucial for developing high-performance machine-learning models. Examples include the ImageNet dataset in image recognition and the SQuAD dataset in natural language processing.
   - In chemistry, the existing MoleculeNet dataset provides some supervised learning tasks, but the amount and diversity of data are still insufficient.
3. **Creation of a new dataset**:
   - To address these problems, the authors created a new molecular dataset named Alchemy, which contains 12 quantum-mechanical properties of 119,487 organic molecules. These molecules contain at most 14 heavy atoms (C, N, O, F, S and Cl) and are sourced from the GDB MedChem database.
   - The Alchemy dataset expands the volume and diversity of molecular data, covering more atom types and larger molecular structures, which supports a more comprehensive evaluation and development of machine-learning models.
4. **Promotion of model development and evaluation**:
   - With the Alchemy dataset, researchers can better validate and develop machine-learning models for chemistry and materials science, especially graph neural networks (GNNs).
   - The authors also held a molecular property prediction competition based on the Alchemy dataset to attract more researchers to this field.

In conclusion, this paper aims to overcome the limitations of existing datasets by creating a larger and more diverse molecular dataset, thereby promoting the application and development of machine learning in chemistry and materials science.
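
The composition limits described in point 3 (at most 14 heavy atoms, drawn from C, N, O, F, S and Cl) can be illustrated with a minimal sketch. The helper name, constants, and the representation of a molecule as a plain list of element symbols are assumptions made for this example; they are not part of the Alchemy release or its tooling, and the real dataset is sampled from GDB MedChem rather than filtered this way.

```python
# Illustrative sketch only: names and the list-of-symbols input format are
# hypothetical and not taken from the Alchemy paper or its code.
ALCHEMY_HEAVY_ELEMENTS = {"C", "N", "O", "F", "S", "Cl"}
MAX_HEAVY_ATOMS = 14


def fits_alchemy_constraints(elements):
    """Check a molecule, given as a list of element symbols, against the
    stated limits: at most 14 heavy (non-hydrogen) atoms, each one of
    C, N, O, F, S or Cl. Hydrogen atoms are ignored."""
    heavy = [symbol for symbol in elements if symbol != "H"]
    if len(heavy) > MAX_HEAVY_ATOMS:
        return False
    return all(symbol in ALCHEMY_HEAVY_ELEMENTS for symbol in heavy)


# Ethanol (C2H6O): 3 heavy atoms, all allowed.
ethanol = ["C", "C", "O"] + ["H"] * 6
# Chloromethane (CH3Cl): contains Cl, which QM9's element set excludes.
chloromethane = ["C", "Cl"] + ["H"] * 3
# Bromomethane (CH3Br): Br is outside the allowed element set.
bromomethane = ["C", "Br"] + ["H"] * 3

print(fits_alchemy_constraints(ethanol))
print(fits_alchemy_constraints(chloromethane))
print(fits_alchemy_constraints(bromomethane))
```

In practice a cheminformatics toolkit would parse real structures; the plain list here only keeps the example self-contained.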