Generating QM1B with PySCF$_{\text{IPU}}$

Alexander Mathiasen, Hatem Helal, Kerstin Klaser, Paul Balanca, Josef Dean, Carlo Luschi, Dominique Beaini, Andrew Fitzgibbon, Dominic Masters
2023-11-02
Abstract: The emergence of foundation models in Computer Vision and Natural Language Processing has resulted in immense progress on downstream tasks. This progress was enabled by datasets with billions of training examples. Similar benefits are yet to be unlocked for quantum chemistry, where the potential of deep learning is constrained by comparatively small datasets with 100k to 20M training examples. These datasets are limited in size because the labels are computed using the accurate (but computationally demanding) predictions of Density Functional Theory (DFT). Notably, prior DFT datasets were created using CPU supercomputers without leveraging hardware acceleration. In this paper, we take a first step towards utilising hardware accelerators by introducing the data generator PySCF$_{\text{IPU}}$ using Intelligence Processing Units (IPUs). This allowed us to create the dataset QM1B with one billion training examples containing 9-11 heavy atoms. We demonstrate that a simple baseline neural network (SchNet 9M) improves its performance by simply increasing the amount of training data without additional inductive biases. To encourage future researchers to use QM1B responsibly, we highlight several limitations of QM1B and emphasise the low resolution of our DFT options, which also serves as motivation for even larger, more accurate datasets. Code and dataset are available on GitHub: http://github.com/graphcore-research/pyscf-ipu
Machine Learning, Chemical Physics
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper addresses the small size of the datasets available for deep learning in quantum chemistry. Specifically:

1. **Dataset Size Limitation**:
   - Current quantum chemistry datasets (such as QM9 and ANI-1) typically contain 100,000 to 20,000,000 training samples, which is too few to fully exploit the potential of deep learning models.
   - Dataset size is limited mainly by the high computational cost of generating labels with Density Functional Theory (DFT).
2. **Lack of Hardware Acceleration**:
   - Previous datasets were generated mainly on CPU supercomputers, without hardware accelerators such as GPUs or IPUs, which further limited their size.
3. **Exploration of Low-Resolution Datasets**:
   - To generate a larger dataset, the authors reduced the precision of the DFT calculations, producing more training samples within a fixed compute budget.
   - The open question with this approach is whether low-resolution labels harm the performance and accuracy of the neural networks trained on them.

### Solution

1. **PySCF$_{\text{IPU}}$**:
   - The authors introduced PySCF$_{\text{IPU}}$, a DFT data generator that uses Graphcore's Intelligence Processing Units (IPUs) for hardware acceleration.
   - Using IPUs, the authors generated a dataset of 1 billion training samples (QM1B), with each molecule containing 9-11 heavy atoms.
2. **Generation of Large-Scale Datasets**:
   - Generating QM1B took only 40,000 IPU hours, which the summary contrasts with CPU-based dataset generation (e.g., the PCQ dataset took two years to generate).
   - The larger sample count was made possible by reducing the precision of DFT, trading label accuracy for dataset size.
3. **Experimental Validation**:
   - The authors trained a simple baseline neural network (SchNet 9M) on subsets of QM1B of different sizes. As the number of training samples increased, the mean absolute error (MAE) on the validation set decreased significantly.
   - Pre-training SchNet 9M on QM1B and then fine-tuning it on QM9 reduced the validation MAE from 54.13 meV to 30.2 meV, demonstrating the value of the large-scale dataset.

### Future Work

1. **Larger Molecules and Higher-Resolution DFT**:
   - The authors plan to further optimize PySCF$_{\text{IPU}}$ to support larger molecules and higher-precision DFT calculations.
   - By improving memory management and computational strategies, they hope future versions can handle molecules with more atoms at higher DFT precision.
2. **Generation of Downstream Task Datasets**:
   - The authors hope that researchers will use PySCF$_{\text{IPU}}$ to generate datasets for fine-tuning foundation models on specific downstream tasks, further advancing molecular machine learning.

### Conclusion

By introducing PySCF$_{\text{IPU}}$ and generating the large-scale QM1B dataset, this paper demonstrates how hardware acceleration can overcome the size limitations of quantum chemistry datasets. This advance is expected to accelerate the design of new drugs and materials, but the low-resolution DFT labels may introduce biases and should be used with caution.
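
To put the quoted figures in perspective, the following is a rough back-of-the-envelope sketch in Python. It uses only numbers stated in the summary (1 billion samples, 40,000 IPU hours, and validation MAE of 54.13 meV versus 30.2 meV); the helper names are illustrative and not part of PySCF$_{\text{IPU}}$. The results are averages implied by those totals, not measured per-molecule timings.

```python
# Back-of-the-envelope arithmetic on figures quoted in the summary.
# Helper names are hypothetical; they do not come from the pyscf-ipu code base.


def samples_per_ipu_hour(n_samples: float, ipu_hours: float) -> float:
    """Average number of labelled samples produced per IPU hour."""
    return n_samples / ipu_hours


def relative_reduction(before: float, after: float) -> float:
    """Fractional decrease when going from `before` to `after`."""
    return (before - after) / before


QM1B_SAMPLES = 1_000_000_000  # one billion training examples
IPU_HOURS = 40_000            # generation cost quoted in the summary
MAE_BEFORE_MEV = 54.13        # QM9 validation MAE without QM1B pre-training
MAE_AFTER_MEV = 30.2          # QM9 validation MAE after pre-training + fine-tuning

rate = samples_per_ipu_hour(QM1B_SAMPLES, IPU_HOURS)
seconds_per_sample = 3600 / rate
mae_drop = relative_reduction(MAE_BEFORE_MEV, MAE_AFTER_MEV)

print(f"{rate:,.0f} samples per IPU hour")
print(f"{seconds_per_sample:.3f} IPU seconds per sample")
print(f"{mae_drop:.1%} lower validation MAE")
```

Under these assumptions the totals imply roughly 25,000 samples per IPU hour, or about 0.144 IPU seconds per sample on average, and a validation MAE reduction of about 44%.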