Classifying early infant feeding status from clinical notes using natural language processing and machine learning

Dominick J. Lemas, Xinsong Du, Masoud Rouhizadeh, Braeden Lewis, Simon Frank, Lauren Wright, Alex Spirache, Lisa Gonzalez, Ryan Cheves, Marina Magalhães, Ruben Zapata, Rahul Reddy, Ke Xu, Leslie Parker, Chris Harle, Bridget Young, Adetola Louis-Jaques, Bouri Zhang, Lindsay Thompson, William R. Hogan, François Modave
DOI: https://doi.org/10.1038/s41598-024-58299-x
IF: 4.6
2024-04-04
Scientific Reports
Abstract: The objective of this study is to develop and evaluate natural language processing (NLP) and machine learning models to predict infant feeding status from clinical notes in the Epic electronic health records system. The primary outcome was the classification of infant feeding status from clinical notes using Medical Subject Headings (MeSH) terms. Annotation of notes was completed using TeamTat to uniquely classify clinical notes according to infant feeding status. We trained 6 machine learning models to classify infant feeding status: logistic regression, random forest, XGBoost gradient descent, k-nearest neighbors, and support-vector classifier. Model comparison was evaluated based on overall accuracy, precision, recall, and F1 score. Our modeling corpus included an even number of clinical notes that was a balanced sample across each class. We manually reviewed 999 notes that represented 746 mother-infant dyads with a mean gestational age of 38.9 weeks and a mean maternal age of 26.6 years. The most frequent feeding status classification present for this study was exclusive breastfeeding [n = 183 (18.3%)], followed by exclusive formula bottle feeding [n = 146 (14.6%)], and exclusive feeding of expressed mother's milk [n = 102 (10.2%)], with mixed feeding being the least frequent [n = 23 (2.3%)]. Our final analysis evaluated the classification of clinical notes as breast, formula/bottle, and missing. The machine learning models were trained on these three classes after performing balancing and down sampling. The XGBoost model outperformed all others by achieving an accuracy of 90.1%, a macro-averaged precision of 90.3%, a macro-averaged recall of 90.1%, and a macro-averaged F1 score of 90.1%. Our results demonstrate that natural language processing can be applied to clinical notes stored in the electronic health records to classify infant feeding status. Early identification of breastfeeding status using NLP on unstructured electronic health records data can be used to inform precision public health interventions focused on improving lactation support for postpartum patients.
multidisciplinary sciences
What problem does this paper attempt to address?
This paper addresses the problem of predicting infant feeding status from clinical notes in an electronic health record (EHR) system using natural language processing (NLP) and machine learning. Specifically, the study aims to classify infant feeding status from clinical notes in order to inform lactation support interventions for postpartum patients. The paper describes how NLP and machine learning models can automatically identify and classify infant feeding methods, including exclusive breastfeeding, exclusive formula bottle feeding, and other feeding methods. This approach allows in-hospital feeding patterns to be identified early, which matters for increasing the exclusive breastfeeding rate.

### Research Background

- **Importance of Breast Milk**: Human milk is considered the best source of nutrition for infant health and development. It can promote neurocognitive development and protect infants from conditions such as infections, gastroenteritis, respiratory infections, obesity, diabetes, childhood leukemia, and sudden infant death syndrome.
- **Recommendations of the World Health Organization**: The WHO recommends that infants be exclusively breastfed for the first six months after birth and continue to be breastfed until the age of two or older.
- **Current Challenges**: Although most infants start breastfeeding at birth, the proportion who continue exclusive breastfeeding for six months is low. In addition, formula feeding during hospitalization is associated with a shorter duration of exclusive breastfeeding, so supporting lactation in the early postpartum period is crucial.

### Research Methods

- **Data Sources**: The study used electronic health records from the University of Florida Health System, including the clinical notes of mothers and infants.
- **Annotation Tools**: TeamTat was used to annotate clinical notes and classify them by infant feeding status.
- **Machine Learning Models**: Six machine learning models were trained to classify infant feeding status, including logistic regression, random forest, XGBoost gradient descent, k-nearest neighbors, and support-vector classifier.
- **Performance Evaluation**: The models were compared on overall accuracy, precision, recall, and F1 score.

### Main Results

- **Model Performance**: The XGBoost model performed best, achieving an accuracy of 90.1%, a macro-averaged precision of 90.3%, a macro-averaged recall of 90.1%, and a macro-averaged F1 score of 90.1%.
- **Classification Results**: The most common feeding status classification was exclusive breastfeeding (18.3%), followed by exclusive formula bottle feeding (14.6%) and exclusive feeding of expressed mother's milk (10.2%); mixed feeding was the least frequent (2.3%).

### Discussion

- **Main Contributions**:
  - Developed an NLP-based method that extracts infant feeding status from unstructured electronic health record data, which can enhance population-level breastfeeding estimates.
  - Provided multi-level tools for extracting social and behavioral determinants that affect the health of infants and mothers.
- **Advantages**:
  - Achieved high accuracy using conventional machine learning algorithms, which keeps the approach feasible and interpretable.
  - Can characterize in-hospital infant feeding trends quickly and regularly, without waiting for annual survey results.
- **Limitations**:
  - The data come from a single medical system, and terminology at other institutions may differ.
  - The "bottle feeding" category is ambiguous because it may cover either breast milk or formula, so a unified definition is required.
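The macro-averaged metrics reported under Main Results give each class equal weight: precision, recall, and F1 are computed per class and then averaged without weighting. The sketch below illustrates that calculation in plain Python. The function name `macro_metrics` and the toy labels are assumptions made for illustration only; they are not the study's code or data.

```python
from collections import Counter


def macro_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def safe_div(num, den):
        return num / den if den else 0.0

    precisions, recalls, f1s = [], [], []
    for label in labels:
        precision = safe_div(tp[label], tp[label] + fp[label])
        recall = safe_div(tp[label], tp[label] + fn[label])
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(safe_div(2 * precision * recall, precision + recall))

    n_classes = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "macro_precision": sum(precisions) / n_classes,
        "macro_recall": sum(recalls) / n_classes,
        "macro_f1": sum(f1s) / n_classes,
    }


# Toy, made-up labels using the three final classes named in the abstract.
toy_true = ["breast", "breast", "formula/bottle", "formula/bottle", "missing", "missing"]
toy_pred = ["breast", "formula/bottle", "formula/bottle", "formula/bottle", "missing", "missing"]
toy_scores = macro_metrics(toy_true, toy_pred)
```

When every class has the same number of test examples, as in this toy input, macro-averaged recall equals overall accuracy. That is consistent with the identical 90.1% reported for both, given the balanced corpus the abstract describes.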
### Future Directions

- **Consensus Definition**: Develop a consensus definition of early infant feeding so that hospital data are consistent with data collected at the national level.
- **Continuous Improvement**: Work with EHR vendors to ensure that the information entered is accurate, which would maximize the value of NLP for meaningful clinical data analysis.

Overall, the study demonstrated that NLP can classify infant feeding status from clinical notes with high accuracy, providing new tools and methods for efforts to increase the breastfeeding rate.
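The abstract states that the models were trained on the three final classes "after performing balancing and down sampling", but this summary does not give the exact procedure. One common way to do this is random under-sampling, where every class is reduced to the size of the smallest class. The sketch below assumes that approach; the function name `downsample_to_balance`, the `seed` parameter, and the placeholder notes are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter, defaultdict


def downsample_to_balance(notes, labels, seed=0):
    """Randomly down-sample each class to the size of the smallest class."""
    if len(notes) != len(labels):
        raise ValueError("notes and labels must have the same length")

    by_class = defaultdict(list)
    for note, label in zip(notes, labels):
        by_class[label].append(note)
    if not by_class:
        return [], []

    target = min(len(items) for items in by_class.values())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible

    balanced = []
    for label in sorted(by_class):
        # Sample without replacement so no note is duplicated.
        for note in rng.sample(by_class[label], target):
            balanced.append((note, label))
    rng.shuffle(balanced)  # avoid class-ordered training data

    return [n for n, _ in balanced], [l for _, l in balanced]


# Placeholder notes (made up): 4 breast, 2 formula/bottle, 3 missing.
toy_labels = ["breast"] * 4 + ["formula/bottle"] * 2 + ["missing"] * 3
toy_notes = [f"note_{i}" for i in range(len(toy_labels))]
balanced_notes, balanced_labels = downsample_to_balance(toy_notes, toy_labels)
class_counts = Counter(balanced_labels)
```

In this toy input the smallest class has two notes, so each class is reduced to two notes and the balanced sample contains six in total. Under-sampling discards data from the larger classes, which is the trade-off for equal class sizes.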