Feature engineering and selection
Physical and chemical properties of lithium-ion silicate cathode materials are used to classify their crystal structure as monoclinic, orthorhombic, or triclinic. This case study demonstrates how feature engineering improves the classification results. See the Li-Ion Feature Engineering case study for additional information.
Background: Lithium-ion batteries are commonly used for portable electronics, electric vehicles, and aerospace applications. During discharge, lithium ions move from the negative electrode through an electrolyte to the positive electrode to create a voltage and current. During recharging, the ions migrate back to the negative electrode. The crystal structure (monoclinic, orthorhombic, triclinic) is available for 339 different lithium-containing compounds.
Lithium-ion Chemical Properties and Crystal Structure Data
url = 'http://apmonitor.com/pds/uploads/Main/lithium_ion.txt'
Objective: Predict the crystal structure type (monoclinic, orthorhombic, triclinic) from Lithium-ion physical and chemical compound information.
This tutorial covers the following:
- Categorical transformation techniques
- Feature creation
- Feature selection
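The data file at the URL above can be read with pandas. The sketch below is a minimal example, assuming the file is comma-separated with a header row; `load_lithium_data` is a hypothetical helper name, and the two in-memory rows (copied from the sample table in this tutorial) stand in for the download so the example runs offline.

```python
import io
import pandas as pd

url = 'http://apmonitor.com/pds/uploads/Main/lithium_ion.txt'

def load_lithium_data(source=url):
    """Hypothetical helper: read the data set into a DataFrame.

    Assumption: the file is comma-separated with a header row.
    """
    return pd.read_csv(source)

# Offline stand-in: two rows copied from the sample table in this tutorial
sample = io.StringIO(
    "Materials Id,Formula,Spacegroup,Formation Energy (eV),E Above Hull (eV),"
    "Band Gap (eV),Nsites,Density (gm/cc),Volume,Has Bandstructure,Crystal System\n"
    "mp-766984,Li2Fe(Si2O5)3,P21,-2.890,0.095,0.332,48,2.555,621.599,True,monoclinic\n"
    "mp-772589,Li2Fe(Si2O5)2,P1,-2.911,0.064,3.079,34,2.633,431.422,False,triclinic\n"
)
data = load_lithium_data(sample)
print(data.shape)
```

Calling `load_lithium_data()` with no argument would read from `url` instead; `data.sample(20)` then gives a random selection of rows like the table that follows.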
Index | Materials Id | Formula | Spacegroup | Formation Energy (eV) | E Above Hull (eV) | Band Gap (eV) | Nsites | Density (gm/cc) | Volume | Has Bandstructure | Crystal System |
---|---|---|---|---|---|---|---|---|---|---|---|
94 | mp-766984 | Li2Fe(Si2O5)3 | P21 | -2.890 | 0.095 | 0.332 | 48 | 2.555 | 621.599 | True | monoclinic |
122 | mp-763385 | Li2Co(Si2O5)2 | P21/c | -2.858 | 0.075 | 3.254 | 68 | 2.429 | 943.786 | False | monoclinic |
132 | mp-763500 | LiCoSiO4 | P21/c | -2.341 | 0.095 | 0.892 | 28 | 3.840 | 273.243 | True | monoclinic |
306 | mp-772589 | Li2Fe(Si2O5)2 | P1 | -2.911 | 0.064 | 3.079 | 34 | 2.633 | 431.422 | False | triclinic |
46 | mp-767077 | Li5Fe(SiO4)2 | C2 | -2.677 | 0.014 | 2.466 | 16 | 2.616 | 174.413 | True | monoclinic |
220 | mp-863888 | Li2Fe(SiO3)2 | Pmn21 | -2.730 | 0.076 | 2.624 | 66 | 3.023 | 731.236 | True | orthorhombic |
186 | mp-762581 | LiFeSiO4 | Pn21a | -2.604 | 0.018 | 2.961 | 28 | 2.890 | 355.979 | True | orthorhombic |
117 | mp-779186 | Li3Co2(SiO4)2 | P21 | -2.431 | 0.064 | 0.022 | 30 | 3.032 | 353.617 | True | monoclinic |
275 | mp-850159 | Li2Mn(Si2O5)2 | P1 | -2.958 | 0.054 | 3.036 | 34 | 2.633 | 430.361 | True | triclinic |
296 | mp-761820 | LiFeSi3O8 | P1 | -2.886 | 0.041 | 3.160 | 26 | 2.703 | 337.873 | True | triclinic |
61 | mp-762613 | Li2Fe2Si2O7 | P21/c | -2.598 | 0.051 | 3.159 | 52 | 3.149 | 619.645 | True | monoclinic |
155 | mp-761666 | Li3Mn(Si2O5)3 | Pcmn | -2.968 | 0.055 | 1.154 | 100 | 2.760 | 1165.318 | False | orthorhombic |
148 | mp-761776 | LiMn(SiO3)2 | Pbca | -2.824 | 0.036 | 0.037 | 80 | 3.343 | 850.626 | False | orthorhombic |
302 | mp-780681 | LiFe3(SiO4)2 | P1 | -2.468 | 0.058 | 0.631 | 42 | 3.133 | 570.329 | True | triclinic |
141 | mp-849238 | Li2MnSiO4 | Pmnb | -2.695 | 0.010 | 2.882 | 32 | 2.970 | 359.824 | True | orthorhombic |
204 | mp-762703 | LiFeSiO4 | P21nb | -2.566 | 0.055 | 2.630 | 28 | 2.882 | 356.872 | True | orthorhombic |
313 | mp-868319 | Li5Fe5Si7O24 | P1 | -2.646 | 0.072 | 2.598 | 41 | 2.546 | 583.402 | True | triclinic |
207 | mp-762570 | Li3FeSi2O7 | Pbnm | -2.691 | 0.057 | 2.512 | 52 | 2.890 | 562.749 | True | orthorhombic |
291 | mp-761459 | LiFeSi3O8 | P1 | -2.896 | 0.032 | 3.342 | 26 | 2.760 | 330.953 | False | triclinic |
168 | mp-775156 | LiMnSiO4 | Pbca | -2.595 | 0.082 | 1.267 | 56 | 2.994 | 683.102 | True | orthorhombic |
Observe datatypes
Column | dtype |
---|---|
Materials Id | object |
Formula | object |
Spacegroup | object |
Formation Energy (eV) | float64 |
E Above Hull (eV) | float64 |
Band Gap (eV) | float64 |
Nsites | int64 |
Density (gm/cc) | float64 |
Volume | float64 |
Has Bandstructure | bool |
Crystal System | object |

Statistic | Materials Id | Formula | Spacegroup | Has Bandstructure | Crystal System |
---|---|---|---|---|---|
count | 339 | 339 | 339 | 339 | 339 |
unique | 339 | 114 | 44 | 2 | 3 |
top | mp-849394 | LiFeSiO4 | P1 | True | monoclinic |
freq | 1 | 42 | 72 | 274 | 139 |
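The two summaries above can be produced with `DataFrame.dtypes` and `DataFrame.describe`. The sketch below uses a three-row frame built from rows of the sample table rather than the full data set, so the counts differ from those shown; excluding numeric columns is one way (an assumption about how the summary was made) to get the count/unique/top/freq statistics.

```python
import numpy as np
import pandas as pd

# Three rows copied from the sample table (a subset of the columns)
data = pd.DataFrame({
    'Formula': ['LiFeSiO4', 'LiFeSiO4', 'LiCoSiO4'],
    'Spacegroup': ['Pn21a', 'P21nb', 'P21/c'],
    'Band Gap (eV)': [2.961, 2.630, 0.892],
    'Nsites': [28, 28, 28],
    'Has Bandstructure': [True, True, True],
})

# Data type of every column
print(data.dtypes)

# Summary of the non-numeric columns: count, unique, top, freq
summary = data.describe(exclude=[np.number])
print(summary)
```

The `unique` row is the quickest guide to which encoding method suits each categorical column.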
Categorical encoding methods
1. One Hot Encoding
Method: Encode each category value into a binary vector, with size = # of distinct values. See https://towardsdatascience.com/understanding-feature-engineering-part-2-categorical-data-f54324193e63
Example: Has Bandstructure column has 2 distinct values, True and False. Create a new column where 1 = True and 0 = False.
Pros: simple and rugged method to get categorical features into unique and useful numerical features
Cons: m unique values result in m new features. This is fine when there are only 2-3 unique values (such as hi/lo, yes/no), but it becomes a problem with more: the encoded data is sparse, the model overfits more easily, and categories that weren't in the training data can't be encoded.
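A minimal sketch of one-hot encoding with pandas follows. The values are made up for illustration; a binary column only needs a single 0/1 column, while `pd.get_dummies` creates one indicator column per distinct value.

```python
import pandas as pd

# Binary column: a single 0/1 column carries all of the information
df = pd.DataFrame({'Has Bandstructure': [True, False, True, True]})
df['Has Bandstructure'] = df['Has Bandstructure'].astype(int)

# General case: one indicator column per distinct value
spacegroup = pd.Series(['P21', 'P21/c', 'P1', 'P21'], name='Spacegroup')
one_hot = pd.get_dummies(spacegroup, dtype=int)
print(one_hot)
```

Each row of `one_hot` contains exactly one 1, in the column that matches its category.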
2. Encode to ordinal variables
Method: assign each unique value to a unique number.
Example: Spacegroup = Pc is assigned to 0, Spacegroup = P21/c is assigned to 1, etc.
Pros: simple and quick, 1 column in -> 1 column out
Cons: residual "structure" (the number assigned is arbitrary, yet it leads algorithms to assume that a Spacegroup encoded as 20 is greater than a Spacegroup encoded as 1)
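One way to sketch ordinal encoding is `pd.factorize`, which numbers categories in order of first appearance. This is an assumption about tooling, not necessarily the method used to produce the outputs in this tutorial; the five values are the first five Spacegroup entries of the data set.

```python
import pandas as pd

spacegroup = pd.Series(['Pc', 'P21/c', 'Cc', 'C2/c', 'C2/c'], name='Spacegroup')

# codes: one integer per row; categories: the value each integer stands for
codes, categories = pd.factorize(spacegroup)
print(list(codes))
print(list(categories))
```

The repeated `C2/c` entries receive the same code, and `categories[code]` recovers the original label.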
3. Feature Hashing
Method: Encode each unique category into a non-binary vector
Example: Spacegroup = Pc is assigned to [1,0,0], Spacegroup = P21/c is assigned to [1,2,-1], etc. Specify number of columns (length of vector)
Pros: low dimensionality so really efficient.
Cons: potential collisions (for example the 1st value in example has both Spacegroups sharing a '1'); hashed features aren't interpretable so can't be used in feature importance.
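Feature hashing can be sketched with scikit-learn's `FeatureHasher`. This is a minimal example under the assumption that each Spacegroup is hashed as a single token into a vector of length 3; the exact numbers it produces may differ from the vectors shown elsewhere in this tutorial.

```python
from sklearn.feature_extraction import FeatureHasher

spacegroups = ['Pc', 'P21/c', 'Cc', 'C2/c', 'C2/c']

# n_features sets the length of the output vector
hasher = FeatureHasher(n_features=3, input_type='string')

# Each sample must be an iterable of tokens; here every sample has one token
hashed = hasher.transform([[sg] for sg in spacegroups]).toarray()
print(hashed)
```

Identical categories always hash to the same vector, but with only three columns different categories can also land on the same vector, which is the collision risk noted above.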
4. Other methods
These primarily rely on prior knowledge about the dataset: encode with your own algorithm so that closely related categories produce closely related features.
Variation on One Hot Encoding for large numbers of unique values: classify infrequent instances into "rare" category. May lose some granularity and important info, but also allows for new categories that aren't in training data
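The "rare" variation can be sketched as follows. The threshold `min_count` and the sample values are hypothetical; any category seen fewer times than the threshold is relabeled before one-hot encoding.

```python
import pandas as pd

spacegroup = pd.Series(['P1', 'P1', 'P1', 'P21/c', 'P21/c', 'Pbca', 'C2'])

min_count = 2  # hypothetical threshold for keeping a category
counts = spacegroup.value_counts()
frequent = counts[counts >= min_count].index

# Keep frequent categories, replace everything else with 'rare'
grouped = spacegroup.where(spacegroup.isin(frequent), 'rare')
one_hot = pd.get_dummies(grouped, dtype=int)
print(one_hot)
```

A category that first appears after training can be mapped to `rare` in the same way, because it is not in `frequent`.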
'Materials Id' column

Statistic | Materials Id |
---|---|
count | 339 |
unique | 339 |
top | mp-849394 |
freq | 1 |
339 unique values for 339 unique entries; there is no useful information in this column and it can be dropped
Index(['Formula', 'Spacegroup', 'Formation Energy (eV)', 'E Above Hull (eV)', 'Band Gap (eV)', 'Nsites', 'Density (gm/cc)', 'Volume', 'Has Bandstructure', 'Crystal System'], dtype='object')
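A short sketch of this step: a non-numeric column with as many unique values as rows behaves like an identifier and can be dropped. The three rows are taken from the sample table, and the uniqueness check is an illustrative heuristic rather than a rule from the tutorial.

```python
import pandas as pd

data = pd.DataFrame({
    'Materials Id': ['mp-762581', 'mp-762703', 'mp-763500'],
    'Formula': ['LiFeSiO4', 'LiFeSiO4', 'LiCoSiO4'],
    'Nsites': [28, 28, 28],
})

# Non-numeric columns where every row has a different value
id_like = [
    col for col in data.columns
    if not pd.api.types.is_numeric_dtype(data[col])
    and data[col].nunique() == len(data)
]
data = data.drop(columns=id_like)
print(data.columns)
```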
'Has Bandstructure' column
2 unique values, True and False. Classic example of when to use one-hot encoding
'Spacegroup' column
44 unique values, with most of them occurring multiple times
Option 1: One-hot encoding will result in 44 new feature columns; inefficient and memory-intensive.
Option 2: Encode to ordinal numbers. Will possibly work, but does leave a residual structure that may affect model performance
Option 3: Use Feature Hashing to create a vector representation of each unique Spacegroup. Note that a vector size of 44 gives the same dimensionality as one-hot encoding (although hash collisions are still possible), while a vector size of 1 collapses each Spacegroup into a single number, much like encoding to ordinal variables. Use vector size = 3 for this example.
Index | C2 | C2/c | C2/m | C222 | C2221 | C2cm | Cc | Ccme | Cmce | Cmcm | ... | Pcmn | Pm21n | Pmc21 | Pmn21 | Pmnb | Pn21a | Pna21 | Pnc2 | Pnca | Pnma |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
334 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
335 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
336 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
337 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
338 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
339 rows × 44 columns
Index | Spacegroup |
---|---|
0 | 32 |
1 | 22 |
2 | 7 |
3 | 2 |
4 | 2 |
... | ... |
334 | 17 |
335 | 17 |
336 | 17 |
337 | 17 |
338 | 17 |
Name: Spacegroup, Length: 339, dtype: int64

Index | Formula | Spacegroup | Formation Energy (eV) | E Above Hull (eV) | Band Gap (eV) | Nsites | Density (gm/cc) | Volume | Has Bandstructure | Crystal System | Spacegroup (ordinal) | Spacegroup0_ht | Spacegroup1_ht | Spacegroup2_ht |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Li2MnSiO4 | Pc | -2.699 | 0.006 | 3.462 | 16 | 2.993 | 178.513 | 1 | monoclinic | 0 | 0.0 | 0.0 | 1.0 |
1 | Li2MnSiO4 | P21/c | -2.696 | 0.008 | 2.879 | 32 | 2.926 | 365.272 | 1 | monoclinic | 1 | 1.0 | 0.0 | 0.0 |
2 | Li4MnSi2O7 | Cc | -2.775 | 0.012 | 3.653 | 28 | 2.761 | 301.775 | 1 | monoclinic | 2 | 1.0 | 0.0 | 0.0 |
3 | Li4Mn2Si3O10 | C2/c | -2.783 | 0.013 | 3.015 | 38 | 2.908 | 436.183 | 1 | monoclinic | 3 | 1.0 | 0.0 | 0.0 |
4 | Li2Mn3Si3O10 | C2/c | -2.747 | 0.016 | 2.578 | 36 | 3.334 | 421.286 | 1 | monoclinic | 3 | 1.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
334 | Li6Co(SiO4)2 | P1 | -2.545 | 0.071 | 2.685 | 17 | 2.753 | 171.772 | 1 | triclinic | 43 | 0.0 | 1.0 | 0.0 |
335 | LiCo3(SiO4)2 | P1 | -2.250 | 0.076 | 0.005 | 42 | 3.318 | 552.402 | 1 | triclinic | 43 | 0.0 | 1.0 | 0.0 |
336 | Li5Co4(Si3O10)2 | P1 | -2.529 | 0.082 | 0.176 | 35 | 2.940 | 428.648 | 1 | triclinic | 43 | 0.0 | 1.0 | 0.0 |
337 | LiCoSiO4 | P1 | -2.348 | 0.087 | 1.333 | 14 | 2.451 | 214.044 | 1 | triclinic | 43 | 0.0 | 1.0 | 0.0 |
338 | Li3Co2(SiO4)2 | P1 | -2.406 | 0.090 | 0.323 | 15 | 3.043 | 176.207 | 0 | triclinic | 43 | 0.0 | 1.0 | 0.0 |
339 rows × 14 columns
For now, keep both sets of new features, and we'll see which one performs better
'Formula' column

Formula | Count |
---|---|
LiFeSiO4 | 42 |
LiCoSiO4 | 29 |
Li2FeSiO4 | 15 |
Li2CoSiO4 | 14 |
Li2MnSiO4 | 12 |
... | ... |
Li3Co2Si3O10 | 1 |
Li10Co(SiO5)2 | 1 |
Li4Co2Si3O10 | 1 |
Li2FeSi4O11 | 1 |
Li5Co4(Si3O10)2 | 1 |
Name: Formula, Length: 114, dtype: int64

114 unique values, most occurring only once. One-hot encoding is out of the question.
Options 1, 2, 3: one-hot encoding, ordinal number encoding, and feature hashing all become inefficient with such variety.
Option 4: Use domain knowledge to create additional features. For example, we can look at the LiFeSiO4 formula, and turn it into 4 new columns, each one indicating how many of each atom are in the formula (for example, {Li: 1, Fe: 1, Si: 1, O: 4})
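The formula-to-counts idea can be sketched with a small regular-expression parser. `parse_formula` is a hypothetical helper, and it assumes at most one level of parentheses, as in `Li2Fe(Si2O5)3`; nested groups raise an error rather than being guessed at.

```python
import re
from collections import Counter

import pandas as pd

ELEMENT = re.compile(r'([A-Z][a-z]?)(\d*)')   # e.g. 'Li2' or 'O'
GROUP = re.compile(r'\(([^()]+)\)(\d*)')      # e.g. '(Si2O5)3'

def parse_formula(formula):
    """Hypothetical helper: count the atoms of each element in a formula."""
    counts = Counter()

    def expand(match):
        # Multiply every count inside the parentheses by the group multiplier
        inner, multiplier = match.group(1), int(match.group(2) or 1)
        for element, n in ELEMENT.findall(inner):
            counts[element] += int(n or 1) * multiplier
        return ''  # remove the group from the remaining text

    remainder = GROUP.sub(expand, formula)
    if '(' in remainder or ')' in remainder:
        raise ValueError(f'nested or unbalanced parentheses: {formula}')
    for element, n in ELEMENT.findall(remainder):
        counts[element] += int(n or 1)
    return dict(counts)

formulas = ['LiFeSiO4', 'Li2Fe(Si2O5)3', 'Li5Co4(Si3O10)2']
# One column per element; elements missing from a formula get a count of 0
atoms = pd.DataFrame([parse_formula(f) for f in formulas]).fillna(0).astype(int)
print(atoms)
```

For `LiFeSiO4` this gives Li: 1, Fe: 1, Si: 1, O: 4, matching the example above, and the resulting columns can be joined to the main data in place of the raw `Formula` text.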
'Crystal System' column
This is the target column, and there are 3 different types of crystal structures we're trying to classify. To properly transform this to numerical data, we have to understand if we are working on a multi-class problem or a multi-label problem.
- A multi-class problem is one in which there is only one distinct type of classification for each row. For example, a fruit is either an apple or an orange, but cannot be both. For a multi-class problem, the target value should be a single value, such as a 0 for apple and 1 for orange. In other words, it would be encoded to ordinal numbers.
- A multi-label problem is one in which there are possibly multiple labels for each row. For example, classifying pictures of apples and oranges can include a picture of an apple alone, an orange alone, or both an apple and an orange. For a multi-label problem, the target value should be a vector representation, such as [1,0] for apple, [0,1] for orange, and [1,1] for both apple and orange. In other words, we would have to one-hot encode the target feature.
Since each compound has exactly one crystal system, this is a multi-class problem. The target output should be encoded to a 0, 1, or 2. If it were a multi-label problem, the target output would have to be encoded to a vector of length 3.
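For the multi-class target, scikit-learn's `LabelEncoder` is one option (an assumption about tooling; any consistent mapping to 0, 1, 2 works). It sorts the class names alphabetically before numbering them.

```python
from sklearn.preprocessing import LabelEncoder

crystal_system = ['monoclinic', 'triclinic', 'orthorhombic', 'monoclinic']

encoder = LabelEncoder()
y = encoder.fit_transform(crystal_system)

print(list(y))
print(list(encoder.classes_))
```

`encoder.inverse_transform` maps predicted integers back to the crystal system names.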