Machine Learning with PyTorch and Scikit-Learn
-- Code Examples
Package version checks
Add the folder containing the check_packages.py script to the Python path so that it can be imported:
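A minimal sketch, assuming the script sits one directory above the notebook (the exact relative path may differ):

import sys

# Make the folder containing check_packages.py importable
sys.path.insert(0, '..')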
Check recommended package versions:
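A hedged sketch of the version check. The function name check_packages and its dict-based signature are assumptions about what the helper script provides, and the minimum versions listed are illustrative:

# Assumed helper: check_packages.py exposes check_packages(d), which compares
# the installed version of each package against the minimum listed in d
from check_packages import check_packages

d = {
    'numpy': '1.21.2',       # example minimum versions; adjust as needed
    'matplotlib': '3.4.3',
    'sklearn': '1.0',
    'pandas': '1.3.2',
}
check_packages(d)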
[OK] Your Python version is 3.9.7 | packaged by conda-forge | (default, Sep 29 2021, 19:24:02) [Clang 11.1.0 ]
[OK] numpy 1.22.1
[OK] matplotlib 3.5.1
[OK] sklearn 1.0.2
[OK] pandas 1.4.0
Chapter 6 - Learning Best Practices for Model Evaluation and Hyperparameter Tuning
Overview
Streamlining workflows with pipelines
...
Loading the Breast Cancer Wisconsin dataset
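A sketch of how the dataset can be loaded and prepared, assuming the wdbc.data file from the UCI repository and a stratified 80/20 train/test split (the split parameters are assumptions reused by the later examples):

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Column 0 is an ID, column 1 the diagnosis ('M' = malignant, 'B' = benign),
# columns 2-31 the 30 numeric features
df = pd.read_csv('https://archive.ics.uci.edu/ml/'
                 'machine-learning-databases'
                 '/breast-cancer-wisconsin/wdbc.data',
                 header=None)
df.head()

X = df.loc[:, 2:].values
y = df.loc[:, 1].values

# Encode the string labels as integers: 'B' -> 0, 'M' -> 1
le = LabelEncoder()
y = le.fit_transform(y)
le.classes_               # array(['B', 'M'], dtype=object)
le.transform(['M', 'B'])  # array([1, 0])

# Hold out 20% of the examples as a test set, stratified by class label
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=1)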
          0  1      2      3       4       5        6        7       8  \
0    842302  M  17.99  10.38  122.80  1001.0  0.11840  0.27760  0.3001
1    842517  M  20.57  17.77  132.90  1326.0  0.08474  0.07864  0.0869
2  84300903  M  19.69  21.25  130.00  1203.0  0.10960  0.15990  0.1974
3  84348301  M  11.42  20.38   77.58   386.1  0.14250  0.28390  0.2414
4  84358402  M  20.29  14.34  135.10  1297.0  0.10030  0.13280  0.1980

         9  ...     22     23      24      25      26      27      28      29  \
0  0.14710  ...  25.38  17.33  184.60  2019.0  0.1622  0.6656  0.7119  0.2654
1  0.07017  ...  24.99  23.41  158.80  1956.0  0.1238  0.1866  0.2416  0.1860
2  0.12790  ...  23.57  25.53  152.50  1709.0  0.1444  0.4245  0.4504  0.2430
3  0.10520  ...  14.91  26.50   98.87   567.7  0.2098  0.8663  0.6869  0.2575
4  0.10430  ...  22.54  16.67  152.20  1575.0  0.1374  0.2050  0.4000  0.1625

       30       31
0  0.4601  0.11890
1  0.2750  0.08902
2  0.3613  0.08758
3  0.6638  0.17300
4  0.2364  0.07678

[5 rows x 32 columns]
(569, 32)
array(['B', 'M'], dtype=object)
array([1, 0])
Combining transformers and estimators in a pipeline
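A minimal sketch of such a pipeline, assuming the train/test split from the loading step; the choice of StandardScaler, PCA, and logistic regression is illustrative:

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Chain scaling, dimensionality reduction, and a classifier into one estimator;
# fit() and predict() then run all steps in sequence
pipe_lr = make_pipeline(StandardScaler(),
                        PCA(n_components=2),
                        LogisticRegression())

pipe_lr.fit(X_train, y_train)
test_acc = pipe_lr.score(X_test, y_test)
print(f'Test accuracy: {test_acc:.3f}')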
Test accuracy: 0.956
Using k-fold cross validation to assess model performance
...
The holdout method
K-fold cross-validation
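A sketch of stratified k-fold evaluation of the pipeline above, first with an explicit loop over the folds and then with the cross_val_score convenience function:

import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Each of the 10 folds preserves the class proportions of y_train
kfold = StratifiedKFold(n_splits=10).split(X_train, y_train)

scores = []
for k, (train_idx, test_idx) in enumerate(kfold):
    pipe_lr.fit(X_train[train_idx], y_train[train_idx])
    score = pipe_lr.score(X_train[test_idx], y_train[test_idx])
    scores.append(score)
    print(f'Fold: {k+1:02d}, '
          f'Class distr.: {np.bincount(y_train[train_idx])}, '
          f'Acc.: {score:.3f}')

print(f'\nCV accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}')

# The same evaluation in a single call
scores = cross_val_score(estimator=pipe_lr, X=X_train, y=y_train,
                         cv=10, n_jobs=1)
print(f'CV accuracy scores: {scores}')
print(f'CV accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}')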
Fold: 01, Class distr.: [256 153], Acc.: 0.935
Fold: 02, Class distr.: [256 153], Acc.: 0.935
Fold: 03, Class distr.: [256 153], Acc.: 0.957
Fold: 04, Class distr.: [256 153], Acc.: 0.957
Fold: 05, Class distr.: [256 153], Acc.: 0.935
Fold: 06, Class distr.: [257 153], Acc.: 0.956
Fold: 07, Class distr.: [257 153], Acc.: 0.978
Fold: 08, Class distr.: [257 153], Acc.: 0.933
Fold: 09, Class distr.: [257 153], Acc.: 0.956
Fold: 10, Class distr.: [257 153], Acc.: 0.956

CV accuracy: 0.950 +/- 0.014
CV accuracy scores: [0.93478261 0.93478261 0.95652174 0.95652174 0.93478261 0.95555556
 0.97777778 0.93333333 0.95555556 0.95555556]
CV accuracy: 0.950 +/- 0.014
Debugging algorithms with learning curves
Diagnosing bias and variance problems with learning curves
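A sketch using scikit-learn's learning_curve to plot training and validation accuracy over increasing training-set sizes; the logistic-regression pipeline and plotting details are assumptions:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

pipe_lr = make_pipeline(StandardScaler(),
                        LogisticRegression(penalty='l2', max_iter=10000))

# Evaluate the pipeline at 10 evenly spaced training-set sizes,
# using 10-fold cross-validation at each size
train_sizes, train_scores, test_scores = learning_curve(
    estimator=pipe_lr, X=X_train, y=y_train,
    train_sizes=np.linspace(0.1, 1.0, 10), cv=10, n_jobs=1)

train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

# Mean accuracy with +/- one standard deviation bands
plt.plot(train_sizes, train_mean, 'o-', label='Training accuracy')
plt.fill_between(train_sizes, train_mean + train_std,
                 train_mean - train_std, alpha=0.15)
plt.plot(train_sizes, test_mean, 's--', label='Validation accuracy')
plt.fill_between(train_sizes, test_mean + test_std,
                 test_mean - test_std, alpha=0.15)
plt.xlabel('Number of training examples')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.show()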
Addressing over- and underfitting with validation curves
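A companion sketch with validation_curve, continuing from the learning-curve example and varying the inverse regularization strength C of the logistic-regression step (the parameter range is illustrative):

from sklearn.model_selection import validation_curve

param_range = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
train_scores, test_scores = validation_curve(
    estimator=pipe_lr, X=X_train, y=y_train,
    param_name='logisticregression__C',
    param_range=param_range, cv=10)

# Plot mean training vs. validation accuracy over the parameter values
plt.plot(param_range, np.mean(train_scores, axis=1), 'o-',
         label='Training accuracy')
plt.plot(param_range, np.mean(test_scores, axis=1), 's--',
         label='Validation accuracy')
plt.xscale('log')
plt.xlabel('Parameter C')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.show()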
Fine-tuning machine learning models via grid search
Tuning hyperparameters via grid search
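A sketch of an exhaustive grid search over an SVM pipeline; the parameter grid below is an assumption consistent with the best parameters reported underneath:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

pipe_svc = make_pipeline(StandardScaler(), SVC(random_state=1))

# Linear kernel: tune C only; RBF kernel: tune C and gamma
param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
param_grid = [{'svc__C': param_range,
               'svc__kernel': ['linear']},
              {'svc__C': param_range,
               'svc__gamma': param_range,
               'svc__kernel': ['rbf']}]

gs = GridSearchCV(estimator=pipe_svc,
                  param_grid=param_grid,
                  scoring='accuracy',
                  cv=10,
                  refit=True,
                  n_jobs=-1)
gs = gs.fit(X_train, y_train)
print(gs.best_score_)
print(gs.best_params_)

# With refit=True, best_estimator_ is already refit on the whole training set
clf = gs.best_estimator_
print(f'Test accuracy: {clf.score(X_test, y_test):.3f}')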
0.9846859903381642
{'svc__C': 100.0, 'svc__gamma': 0.001, 'svc__kernel': 'rbf'}
Test accuracy: 0.974
0.9737681159420291
{'svc__kernel': 'rbf', 'svc__gamma': 0.001, 'svc__C': 10.0}
Exploring hyperparameter configurations more widely with randomized search
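A sketch of randomized search over the same pipeline, drawing candidate values for C and gamma from a log-uniform distribution (the number of iterations is an assumption):

import numpy as np
import scipy.stats
from sklearn.model_selection import RandomizedSearchCV

# A log-uniform distribution spreads samples evenly across orders of magnitude
param_range = scipy.stats.loguniform(0.0001, 1000.0)

# Drawing 10 samples illustrates the kind of values it produces
np.random.seed(1)
param_range.rvs(10)

param_dist = [{'svc__C': param_range,
               'svc__kernel': ['linear']},
              {'svc__C': param_range,
               'svc__gamma': param_range,
               'svc__kernel': ['rbf']}]

rs = RandomizedSearchCV(estimator=pipe_svc,
                        param_distributions=param_dist,
                        scoring='accuracy',
                        n_iter=20,
                        cv=10,
                        random_state=1,
                        n_jobs=-1)
rs = rs.fit(X_train, y_train)
print(rs.best_score_)
print(rs.best_params_)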
array([8.30145146e-02, 1.10222804e+01, 1.00184520e-04, 1.30715777e-02,
       1.06485687e-03, 4.42965766e-04, 2.01289666e-03, 2.62376594e-02,
       5.98924832e-02, 5.91176467e-01])
More resource-efficient hyperparameter search with successive halving
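A sketch using scikit-learn's successive-halving variant of randomized search, which starts with many candidates on a small subset of the data and keeps only the most promising ones in each round; the single RBF parameter distribution reuses the log-uniform range from above, and the remaining settings are assumptions:

# Successive halving is still experimental, hence the enabling import
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingRandomSearchCV

param_dist = {'svc__C': param_range,
              'svc__gamma': param_range,
              'svc__kernel': ['rbf']}

hs = HalvingRandomSearchCV(pipe_svc,
                           param_distributions=param_dist,
                           n_candidates='exhaust',  # as many candidates as the budget allows
                           resource='n_samples',    # the resource increased each round
                           factor=1.5,              # retain roughly 1/factor of the candidates per round
                           random_state=1,
                           n_jobs=-1)
hs = hs.fit(X_train, y_train)
print(hs.best_score_)
print(hs.best_params_)
print(f'Test accuracy: {hs.score(X_test, y_test):.3f}')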
0.9676470588235293
{'svc__kernel': 'rbf', 'svc__gamma': 0.0001, 'svc__C': 100.0}
Test accuracy: 0.965
Algorithm selection with nested cross-validation
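A sketch of nested cross-validation, reusing pipe_svc and param_grid from the grid-search sketch: the inner loop tunes hyperparameters, the outer loop estimates generalization performance. The same procedure is then repeated for a decision tree tuned only over max_depth (the 2-fold inner / 5-fold outer configuration is an assumption):

import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Inner 2-fold grid search wrapped in an outer 5-fold cross-validation
gs = GridSearchCV(estimator=pipe_svc,
                  param_grid=param_grid,
                  scoring='accuracy',
                  cv=2)
scores = cross_val_score(gs, X_train, y_train, scoring='accuracy', cv=5)
print(f'CV accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}')

# Same nested procedure for a decision tree
gs = GridSearchCV(estimator=DecisionTreeClassifier(random_state=0),
                  param_grid=[{'max_depth': [1, 2, 3, 4, 5, 6, 7, None]}],
                  scoring='accuracy',
                  cv=2)
scores = cross_val_score(gs, X_train, y_train, scoring='accuracy', cv=5)
print(f'CV accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}')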
CV accuracy: 0.974 +/- 0.015
CV accuracy: 0.934 +/- 0.016
Looking at different performance evaluation metrics
...
Reading a confusion matrix
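A sketch that fits the SVM pipeline, computes the confusion matrix on the test set, and renders it as an annotated heat map:

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

pipe_svc.fit(X_train, y_train)
y_pred = pipe_svc.predict(X_test)
confmat = confusion_matrix(y_true=y_test, y_pred=y_pred)
print(confmat)

# Counts as text annotations on top of a blue color map
fig, ax = plt.subplots(figsize=(2.5, 2.5))
ax.matshow(confmat, cmap=plt.cm.Blues, alpha=0.3)
for i in range(confmat.shape[0]):
    for j in range(confmat.shape[1]):
        ax.text(x=j, y=i, s=confmat[i, j], va='center', ha='center')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()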
[[71  1]
 [ 2 40]]
Additional Note
Remember that we previously encoded the class labels so that malignant examples are the "positive" class (1) and benign examples are the "negative" class (0):
array([1, 0])
[[71  1]
 [ 2 40]]
Next, we printed the confusion matrix like so:
[[71  1]
 [ 2 40]]
Note that the (true) class 0 examples that are correctly predicted as class 0 (true negatives) are now in the upper-left corner of the matrix (index 0, 0). In order to change the ordering so that the true negatives are in the lower-right corner (index 1, 1) and the true positives are in the upper left, we can use the labels argument as shown below:
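A sketch of that call, reusing the test-set predictions from the confusion-matrix example:

# Listing class 1 first swaps the rows and columns accordingly
confusion_matrix(y_true=y_test, y_pred=y_pred, labels=[1, 0])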
[[40  2]
 [ 1 71]]
We conclude:
Assuming that class 1 (malignant) is the positive class in this example, our model correctly classified 71 of the examples that belong to class 0 (true negatives) and 40 of the examples that belong to class 1 (true positives). However, our model also misclassified 1 example from class 0 as class 1 (false positive), and it predicted that 2 examples are benign although they are malignant tumors (false negatives).
Optimizing the precision and recall of a classification model
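A sketch computing precision, recall, F1, and the Matthews correlation coefficient on the test-set predictions, followed by a grid search that optimizes a custom F1 scorer (treating class 0 as the positive label and using a reduced parameter range, both assumptions):

from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import matthews_corrcoef, make_scorer
from sklearn.model_selection import GridSearchCV

print(f'Precision: {precision_score(y_true=y_test, y_pred=y_pred):.3f}')
print(f'Recall: {recall_score(y_true=y_test, y_pred=y_pred):.3f}')
print(f'F1: {f1_score(y_true=y_test, y_pred=y_pred):.3f}')
print(f'MCC: {matthews_corrcoef(y_true=y_test, y_pred=y_pred):.3f}')

# make_scorer turns a metric into a scoring callable GridSearchCV can use
scorer = make_scorer(f1_score, pos_label=0)
c_gamma_range = [0.01, 0.1, 1.0, 10.0]
param_grid = [{'svc__C': c_gamma_range,
               'svc__kernel': ['linear']},
              {'svc__C': c_gamma_range,
               'svc__gamma': c_gamma_range,
               'svc__kernel': ['rbf']}]
gs = GridSearchCV(estimator=pipe_svc,
                  param_grid=param_grid,
                  scoring=scorer,
                  cv=10)
gs = gs.fit(X_train, y_train)
print(gs.best_score_)
print(gs.best_params_)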
Precision: 0.976
Recall: 0.952
F1: 0.964
MCC: 0.943
0.9861994953378878
{'svc__C': 10.0, 'svc__gamma': 0.01, 'svc__kernel': 'rbf'}
Plotting a receiver operating characteristic
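A minimal single-split sketch: compute class-membership probabilities with a logistic-regression pipeline, derive the ROC curve and its area under the curve, and plot it against the diagonal of random guessing (the pipeline and plotting details are assumptions):

import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import roc_curve, auc

pipe_lr = make_pipeline(StandardScaler(),
                        PCA(n_components=2),
                        LogisticRegression(random_state=1))
probas = pipe_lr.fit(X_train, y_train).predict_proba(X_test)

# False positive rate and true positive rate over all decision thresholds
fpr, tpr, thresholds = roc_curve(y_test, probas[:, 1], pos_label=1)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label=f'ROC (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], linestyle='--', color='gray',
         label='Random guessing')
plt.plot([0, 0, 1], [0, 1, 1], linestyle=':', color='black',
         label='Perfect performance')
plt.xlabel('False positive rate (FPR)')
plt.ylabel('True positive rate (TPR)')
plt.legend(loc='lower right')
plt.show()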
The scoring metrics for multiclass classification
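For multiclass problems, the binary metrics above are extended via averaging; a sketch of a micro-averaged precision scorer (the keyword settings are illustrative):

from sklearn.metrics import precision_score, make_scorer

# Micro-averaging pools the true/false positives and negatives of all classes
pre_scorer = make_scorer(score_func=precision_score,
                         pos_label=1,
                         greater_is_better=True,
                         average='micro')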
Dealing with class imbalance
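A sketch that builds an imbalanced version of the dataset (all benign examples but only 40 malignant ones, matching the counts shown below), measures the accuracy of always predicting the majority class, and then upsamples the minority class with sklearn.utils.resample:

import numpy as np
from sklearn.utils import resample

# Imbalanced subset: every class-0 example, only the first 40 class-1 examples
X_imb = np.vstack((X[y == 0], X[y == 1][:40]))
y_imb = np.hstack((y[y == 0], y[y == 1][:40]))

# Always predicting the majority class already yields ~90% accuracy
y_pred = np.zeros(y_imb.shape[0])
print(np.mean(y_pred == y_imb) * 100)

print('Number of class 1 examples before:', X_imb[y_imb == 1].shape[0])

# Draw bootstrap samples (with replacement) from the minority class until
# it matches the size of the majority class
X_upsampled, y_upsampled = resample(X_imb[y_imb == 1],
                                    y_imb[y_imb == 1],
                                    replace=True,
                                    n_samples=X_imb[y_imb == 0].shape[0],
                                    random_state=123)
print('Number of class 1 examples after:', X_upsampled.shape[0])

# Recombine; the majority-class baseline now drops to 50% accuracy
X_bal = np.vstack((X[y == 0], X_upsampled))
y_bal = np.hstack((y[y == 0], y_upsampled))
y_pred = np.zeros(y_bal.shape[0])
print(np.mean(y_pred == y_bal) * 100)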
89.92443324937027
Number of class 1 examples before: 40
Number of class 1 examples after: 357
50.0
Summary
...
Readers may ignore the next cell.
[NbConvertApp] Converting notebook ch06.ipynb to script
[NbConvertApp] Writing 18900 bytes to ch06.py