ABSTRACT Small sample sizes and loss of sequencing reads during microbiome data preprocessing can limit the statistical power to differentiate fresh produce phenotypes and can prevent the detection of important bacterial species associated with produce contamination or quality reduction. Here, we explored a machine learning-based k-mer hash analysis strategy to identify DNA signatures predictive of produce safety (PS) and produce quality (PQ) and compared it against the amplicon sequence variant (ASV) strategy, which uses a typical denoising step, and against the ASV-based taxonomy strategy. Random forest-based classifiers for PS and PQ using 7-mer hash data sets had significantly higher classification accuracy than those using the ASV data sets. We also demonstrated that integrating multiple data sets and leveraging a 7-mer hash strategy leads to better classification performance for PS and PQ than the ASV method but yields lower PS classification accuracy than the feature-selected ASV-based taxonomy strategy. Because taxonomy currently cannot be generated with the 7-mer hash strategy, the ASV-based taxonomy strategy, which requires markedly less computing time and memory, is more efficient for PS and PQ classification and is applicable to the identification of important taxa. The results of this study lay the foundation for future studies that need to incorporate and/or compare different microbiome sequencing data sets when applying machine learning to the microbial safety and quality of food.

IMPORTANCE Identification of generalizable indicators for produce safety (PS) and produce quality (PQ) improves the detection of produce contamination and quality decline. However, the loss of sequencing reads during microbiome data preprocessing and the limited sample size of individual studies restrict the statistical power to identify important features that differentiate PS and PQ phenotypes. We applied machine learning-based models to individual and integrated k-mer hash and amplicon sequence variant (ASV) data sets for PS and PQ classification and evaluated their classification performance. Random forest (RF)-based models using integrated 7-mer hash data sets achieved significantly higher PS and PQ classification accuracy. Because taxonomic analysis is limited for the 7-mer hash, we also developed RF-based models using feature-selected ASV-based taxonomic data sets, which achieved better PS classification performance than those using the integrated 7-mer hash data set. The RF feature selection method identified 480 PS indicators and 263 PQ indicators with a positive contribution to PS and PQ classification.
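As a rough illustration of what a 7-mer hash feature table might look like, the sketch below counts every overlapping 7-mer in a set of reads and maps each one to a hashed bucket. The abstract does not describe how the hashes were computed, so the function names, the MD5-based hashing, and the bucket count of 4096 are all assumptions made for illustration, not the authors' pipeline.

```python
import hashlib
from collections import Counter


def kmer_hash_counts(read, k=7, n_buckets=4096):
    """Count hashed k-mers in one read (hypothetical helper).

    MD5 and n_buckets are illustrative assumptions; the source does not
    specify a hash function or table size.
    """
    counts = Counter()
    read = read.upper()
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        # Skip k-mers containing ambiguous bases such as N.
        if set(kmer) <= set("ACGT"):
            digest = hashlib.md5(kmer.encode("ascii")).digest()
            bucket = int.from_bytes(digest[:8], "big") % n_buckets
            counts[bucket] += 1
    return counts


def sample_profile(reads, k=7, n_buckets=4096):
    """Aggregate hashed k-mer counts over all reads of one sample and
    convert them to relative abundances (hypothetical helper)."""
    total = Counter()
    for read in reads:
        total.update(kmer_hash_counts(read, k, n_buckets))
    n = sum(total.values())
    return {bucket: c / n for bucket, c in total.items()} if n else {}
```

Each sample's profile could then be expanded into a fixed-length vector and passed to a random forest implementation such as scikit-learn's `RandomForestClassifier`. Because no denoising step is involved, every read with unambiguous bases contributes to the features, which matches the motivation about read loss stated in the abstract.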

Systematic Comparisons for Composition Profiles, Taxonomic Levels, and Machine Learning Methods for Microbiome-Based Disease Prediction

Comparison of the effectiveness of different normalization methods for metagenomic cross-study phenotype prediction under heterogeneity

A Comparative Evaluation of Tools to Predict Metabolite Profiles From Microbiome Sequencing Data

A systematic machine learning and data type comparison yields metagenomic predictors of infant age, sex, breastfeeding, antibiotic usage, country of origin, and delivery type

A Review and Tutorial of Machine Learning Methods for Microbiome Host Trait Prediction

Machine learning approaches in microbiome research: challenges and best practices

Machine learning methods for microbiome studies

Microbial risk score for capturing microbial characteristics, integrating multi-omics data, and predicting disease risk

Leveraging Scheme for Cross-Study Microbiome Machine Learning Prediction and Feature Evaluations

A comparative study of supervised and unsupervised machine learning algorithms applied to human microbiome

Multimodal deep learning applied to classify healthy and disease states of human microbiome

Applications of Machine Learning in Human Microbiome Studies: A Review on Feature Selection, Biomarker Identification, Disease Prediction and Treatment

Microbiome-based classification models for fresh produce safety and quality evaluation

Correlation and association analyses in microbiome study integrating multiomics in health and disease

Dirichlet-tree multinomial mixtures for clustering microbiome compositions

Metagenomics Biomarkers Selected for Prediction of Three Different Diseases in Chinese Population

Microbiome Sample Comparison and Search: from Pair-Wise Calculations to Model-Based Matching

Gene-based microbiome representation enhances host phenotype classification

Longitudinal Microbiome-based Interpretable Machine Learning for Identification of Time-Varying Biomarkers in Early Prediction of Disease Outcomes

Faecal microbiome-based machine learning for multi-class disease diagnosis

Effects of Data Transformation and Model Selection on Feature Importance in Microbiome Classification Data