Comparison of supervised learning statistical methods for classifying commercial beers and identifying patterns
Dániel Koren, Laura Lőrincz, Sándor Kovács, Gabriella Kun‐Farkas, Beáta Vecseriné Hegyes, László Sipos
DOI: https://doi.org/10.1002/cem.3216
IF: 2.5
2020-04-01
Journal of Chemometrics
Abstract:<p>In this study, 13 properties (alcohol, real extract, flavonoid, anthocyanin, glucose, fructose, maltose, and sucrose content; EBC [European Brewery Convention] and L*a*b* color; bitterness) of 21 beers (alcohol‐free pale lagers, alcohol‐free beer‐based mixed drinks, beer‐based mixed drinks, international lagers, wheat beers, stouts, fruit beers) were determined. In the first step, multiple factor analysis (MFA) was performed on the whole data set and five clusters (target classes) were identified; bootstrapping was then applied to create a balanced data set in which every cluster contains 100 samples, for a total sample size of 500. In the second step, 12 supervised learning algorithms (random trees [RND], Quinlan's C4.5 decision tree algorithm [C4.5], Iterative Dichotomiser 3 algorithm [ID3], cost‐sensitive decision tree algorithm [CSMC4], cost‐sensitive classification tree [CSCRT], <i>k</i>‐nearest neighbors algorithm [KNN], radial basis function [RBF], multilayer perceptron neural network [MLP], prototype nearest neighbor [PNN], linear discriminant analysis [LDA], naïve Bayes with continuous variables [NBC], partial least squares discriminant analysis [PLS‐DA]) were applied to classify each brand into the target classes. Furthermore, several error rates were calculated: re‐substitution error rate (RER), cross‐validated error rate (CV), bootstrap error (BOOT), leave‐one‐out (LOO), and train‐test error rate (TRAIN). The MFA could discriminate five groups, which can be characterized by some analytical parameters, and the other multivariate methods performed similarly. The methods can be discriminated best based on BOOT, CV, and LOO. The best estimation methods are C4.5, CSMC4, and CSCRT; these performed best along the flavonoid content and EBC color. NBC was identified as the method most sensitive to the properties. The classification ability fluctuated greatly in the case of three properties (glucose, maltose, sucrose).
With the NBC method, remarkable fluctuation was observed for the L*a*b* color parameters, flavonoid content, EBC color, and bitterness.</p>
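The balancing step and two of the error rates named in the abstract (RER and LOO) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the authors' procedure or software: the toy two-feature data, the function names, the random seed, and the use of a plain 1-nearest-neighbor rule as a stand-in for the 12 classifiers.

```python
import random
from collections import Counter


def balanced_bootstrap(xs, ys, per_class=100, seed=0):
    """Resample with replacement so that every class has `per_class` rows."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(xs, ys):
        by_class.setdefault(y, []).append(x)
    out_x, out_y = [], []
    for y, rows in sorted(by_class.items()):
        for _ in range(per_class):
            out_x.append(rng.choice(rows))
            out_y.append(y)
    return out_x, out_y


def predict_1nn(train_x, train_y, x):
    """Stand-in classifier: label of the closest training row (squared Euclidean)."""
    best = min(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)),
    )
    return train_y[best]


def resubstitution_error(xs, ys):
    """RER: train and test on the same rows."""
    wrong = sum(predict_1nn(xs, ys, x) != y for x, y in zip(xs, ys))
    return wrong / len(xs)


def leave_one_out_error(xs, ys):
    """LOO: hold out each row in turn and predict it from the rest."""
    wrong = 0
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        wrong += predict_1nn(rest_x, rest_y, xs[i]) != ys[i]
    return wrong / len(xs)


# Hypothetical toy data: 21 rows in 5 unequal, well-separated classes.
class_sizes = [3, 6, 4, 5, 3]
xs, ys = [], []
for c, size in enumerate(class_sizes):
    for j in range(size):
        xs.append((c * 10 + j * 0.1, c * 5 - j * 0.1))
        ys.append(c)

bx, by = balanced_bootstrap(xs, ys, per_class=100, seed=0)
print(len(bx), dict(Counter(by)))
print("RER on balanced data:", resubstitution_error(bx, by))
print("LOO on original data:", leave_one_out_error(xs, ys))
```

In this sketch the resampled set contains many exact duplicates of the original rows, so a held-out row can have an identical copy in the training part. That is one reason the different error estimates may disagree on bootstrapped data.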
chemistry, analytical,instruments & instrumentation,mathematics, interdisciplinary applications,automation & control systems,computer science, artificial intelligence,statistics & probability