Con-CDVAE is a diffusion-based model that generates crystals with the physical properties you specify. It builds on CDVAE and is inspired by DALL·E2.


Figure 1: Training and generation flow chart of Con-CDVAE.

Figure 1 shows the model framework of Con-CDVAE, together with the training and generation processes. As in CDVAE, we use a VAE structure. The Encoder is a graph model that converts crystals into latent variables. The Decoder consists of several MLPs and another graph model. The MLPs predict the lattice constants and the number of atoms. Given these quantities, the graph model generates the atomic positions and atomic species through a diffusion process. The crystal properties are also fed to the Decoder, which lets it generate crystals with the target properties we set.

Note that we use a Predictor, which consists of MLPs, to predict crystal properties. This block takes the latent variables as input, so that crystals with similar properties lie close together in the latent space. This may help when the Prior is used to generate new latent variables for given properties.

The Prior block is inspired by DALL·E2. It is another diffusion model composed of MLPs. Taking properties as input, the Prior samples crystal latent variables from the latent space. Because the Prior requires crystal latent variables as labels, Con-CDVAE must be trained in two steps.
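To illustrate why two steps are needed, here is a minimal sketch of how the labels for the Prior could be assembled once step one has produced a trained Encoder. Everything in it is a toy stand-in and not the Con-CDVAE API: `build_prior_dataset` and `toy_encode` are hypothetical names, and a crystal is reduced to a plain list of atomic numbers.

```python
from typing import Callable, List, Sequence, Tuple

Crystal = List[int]   # toy stand-in: a crystal as a list of atomic numbers
Latent = List[float]


def build_prior_dataset(
    crystals: Sequence[Crystal],
    properties: Sequence[float],
    encode: Callable[[Crystal], Latent],
) -> List[Tuple[float, Latent]]:
    """Pair each property value with the latent produced by the step-one encoder."""
    return [(prop, encode(crystal)) for crystal, prop in zip(crystals, properties)]


def toy_encode(crystal: Crystal) -> Latent:
    """Hypothetical frozen encoder: mean atomic number and atom count."""
    return [sum(crystal) / len(crystal), float(len(crystal))]


# Step two would train the Prior to map each property to its paired latent.
prior_dataset = build_prior_dataset([[1, 1], [8]], [-0.5, -1.2], toy_encode)
print(prior_dataset)
```

The point of the sketch is only the ordering: the latents used as labels do not exist until the Encoder from step one is available.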

After training, Con-CDVAE can generate crystals based on target properties. First, the Prior samples latent variables conditioned on the properties. Then the latent variables and the properties are passed to the Decoder to generate crystals.
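The generation flow described above can be sketched as two chained calls. This is an illustrative outline under stated assumptions, not the project's implementation: `generate_crystals`, `toy_prior`, and `toy_decoder` are hypothetical names, the prior is replaced by Gaussian noise around the target value, and the decoder simply packages its inputs.

```python
import random
from typing import Callable, Dict, List

Latent = List[float]

_rng = random.Random(0)  # fixed seed so the toy example is repeatable


def generate_crystals(
    target_property: float,
    n_samples: int,
    sample_prior: Callable[[float], Latent],
    decode: Callable[[Latent, float], Dict[str, object]],
) -> List[Dict[str, object]]:
    """Sample latents from the prior, then decode each one with the same condition."""
    latents = [sample_prior(target_property) for _ in range(n_samples)]
    return [decode(z, target_property) for z in latents]


def toy_prior(target_property: float) -> Latent:
    """Hypothetical prior: a 4-dimensional latent drawn around the target value."""
    return [_rng.gauss(target_property, 1.0) for _ in range(4)]


def toy_decoder(latent: Latent, target_property: float) -> Dict[str, object]:
    """Hypothetical decoder: returns its inputs instead of a real crystal."""
    return {"latent": latent, "condition": target_property}


crystals = generate_crystals(-1.0, 3, toy_prior, toy_decoder)
print(len(crystals))
```

Note that the property is used twice, once by the prior and once by the decoder, which mirrors the description in the text.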


More details can be found in the paper and on GitHub:
Paper: Con-CDVAE: A method for the conditional generation of crystal structures
Code: Con-CDVAE



Let's train a Con-CDVAE that generates crystals based on formation energy. In this notebook, I use only the toy dataset, mptest, and train for just a few epochs to show how to run the code. To train a useful Con-CDVAE, you should train for enough epochs on a large enough dataset, such as the ones available from CDVAE or the Materials Project.

  1. First, after downloading the code from GitHub, you need to build the environment. We recommend using conda.
!git clone
Cloning into 'Con-CDVAE'...
remote: Enumerating objects: 265, done.
remote: Counting objects: 100% (178/178), done.
remote: Compressing objects: 100% (148/148), done.
remote: Total 265 (delta 42), reused 86 (delta 18), pack-reused 87
Receiving objects: 100% (265/265), 52.48 MiB | 4.87 MiB/s, done.
Resolving deltas: 100% (49/49), done.
!conda env create -f Con-CDVAE/environment.yml
Collecting package metadata (repodata.json): \ Killed
  1. Then modify the following environment variables in .env.
  • PROJECT_ROOT: path to the folder that contains this repo
  • HYDRA_JOBS: path to a folder to store hydra outputs
  • WABDB: path to a folder to store wabdb outputs
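Before launching training, it can help to confirm that all three variables are actually set. The following is a small sketch that assumes .env uses plain `KEY=VALUE` lines. The parser is deliberately simplistic and is not how the project itself loads the file, the helper names are hypothetical, and the paths in the sample text are placeholders.

```python
from typing import Dict, List

REQUIRED_VARS = ("PROJECT_ROOT", "HYDRA_JOBS", "WABDB")


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and '#' comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_vars(values: Dict[str, str]) -> List[str]:
    """Return the required variables that are absent or empty."""
    return [name for name in REQUIRED_VARS if not values.get(name)]


# Placeholder paths, for illustration only.
sample_env = """
# example .env
PROJECT_ROOT="/home/user/Con-CDVAE"
HYDRA_JOBS="/home/user/Con-CDVAE/output/hydra"
"""
print(missing_vars(parse_env_text(sample_env)))
```

To check a real file, read it with `open(".env").read()` and pass the text to `parse_env_text`.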
  1. Step-one training

    To train a Con-CDVAE, run the following command first.

    After training, model checkpoints can be found in $HYDRA_JOBS/singlerun/YYYY-MM-DD/model_expname.pth.

%cd /Con-CDVAE/
!python concdvae/ train=new data=mptest expname=test model=vae_mp_format
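The later commands in this notebook need the checkpoint folder and file name, so a small helper can keep them consistent. This sketch assumes the layout implied by the `--model_path` and `--model_file` arguments used in the step-two command of this notebook (a folder named after `expname` inside the dated run folder, holding `model_<expname>.pth`). `checkpoint_location` is a hypothetical helper, not part of Con-CDVAE.

```python
from pathlib import PurePosixPath
from typing import Tuple


def checkpoint_location(
    hydra_jobs: str, run_date: str, expname: str
) -> Tuple[PurePosixPath, str]:
    """Build the (model_path, model_file) pair for a single run.

    Assumed layout: <hydra_jobs>/singlerun/<YYYY-MM-DD>/<expname>/model_<expname>.pth
    """
    model_path = PurePosixPath(hydra_jobs) / "singlerun" / run_date / expname
    model_file = f"model_{expname}.pth"
    return model_path, model_file


model_path, model_file = checkpoint_location(
    "/Con-CDVAE/output/hydra", "2024-08-05", "test"
)
print(model_path, model_file)
```

The run date is the day training was launched, so it will differ from the one shown here.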
  1. Step-two training

    After finishing step-one training, you can train the Prior block with the following command.

    You will then find the default-condition Prior at /your_path_to_model_checkpoints/conz_model_your_label_diffu.pth.

!python scripts/ --model_path /Con-CDVAE/output/hydra/singlerun/2024-08-05/test --model_file model_test.pth --fullfea 0 --label conztest


To generate materials, you need to prepare a condition file. You can see examples in /output/hydra/singlerun/2024-01-25/test/, where "general_full.csv" is for the default or full strategy, and "general_less.csv" is for the less strategy.

For simplicity, we copy .../2024-01-25/test/general_full.csv directly into the new model_path, then run the following command:

!python scripts/ --model_path /Con-CDVAE/output/hydra/singlerun/2024-08-05/test --model_file model_test.pth --conz_file conz_model_conztest_diffu.pth --label test --prop_path general_full.csv
If you want to filter latent variables using the Predictor block, set --down_sample 100, which filters at a ratio of one hundred to one.
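As a rough illustration of what a one-hundred-to-one filter could look like, the sketch below ranks candidate latents by how close a predictor's output is to the target and keeps the best fraction. The ranking criterion is an assumption made for this example (the text only states the ratio), and `filter_latents` and the toy predictor are hypothetical.

```python
from typing import Callable, List

Latent = List[float]


def filter_latents(
    latents: List[Latent],
    target: float,
    predict: Callable[[Latent], float],
    down_sample: int,
) -> List[Latent]:
    """Keep len(latents) // down_sample latents with the smallest prediction error."""
    keep = max(1, len(latents) // down_sample)
    ranked = sorted(latents, key=lambda z: abs(predict(z) - target))
    return ranked[:keep]


# Toy data: 200 one-dimensional latents and a predictor that returns the value itself.
candidates = [[float(i)] for i in range(200)]
kept = filter_latents(candidates, 50.0, lambda z: z[0], down_sample=100)
print(kept)
```

With 200 candidates and `down_sample=100`, two latents survive, which is the one-hundred-to-one ratio described above.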

The generated crystals are stored in The number of structures under each is (num_batches_to_samples * batch_size / down_sample). So you can change the setting to control the number of generated structures.

For example, by using --num_batches_to_samples 10 --batch_size 500 --down_sample 1 you will get 5000 structures under each
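The count formula above is easy to check with a few lines of Python. This sketch only restates the arithmetic. The function name is hypothetical, and the use of integer division when `down_sample` does not divide evenly is an assumption.

```python
def expected_structure_count(
    num_batches_to_samples: int, batch_size: int, down_sample: int = 1
) -> int:
    """Return num_batches_to_samples * batch_size / down_sample as an integer."""
    if down_sample < 1:
        raise ValueError("down_sample must be at least 1")
    return num_batches_to_samples * batch_size // down_sample


print(expected_structure_count(10, 500, 1))    # 5000
print(expected_structure_count(10, 500, 100))  # 50
```

The first call reproduces the example in the text, and the second shows how the same settings shrink to 50 structures when filtering at one hundred to one.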

Then you can use Con-CDVAE/scripts/ to get the cif from

