A Clinical Bacterial Dataset for Deep Learning in Microbiological Rapid On-Site Evaluation

Xiuli Wang, Yinghan Shi, Shasha Guo, Xuzhong Qu, Fei Xie, Zhimei Duan, Ye Hu, Han Fu, Xin Shi, Tingwei Quan, Kaifei Wang, Lixin Xie
DOI: https://doi.org/10.1038/s41597-024-03370-5
2024-06-09
Scientific Data
Abstract: Microbiological Rapid On-Site Evaluation (M-ROSE) is based on smear staining and microscopic observation, providing critical references for the diagnosis and treatment of pulmonary infectious diseases. Automatic identification of pathogens is the key to improving the quality and speed of M-ROSE. Recent advancements in deep learning have yielded numerous identification algorithms and datasets. However, most studies focus on artificially cultured bacteria and lack clinical data and algorithms. Therefore, we collected Gram-stained bacteria images from lower respiratory tract specimens of patients with lung infections at the Chinese PLA General Hospital, obtained by M-ROSE from 2018 to 2022, and desensitized them to produce 1,705 images (4,912 × 3,684 pixels). A total of 4,833 cocci and 6,991 bacilli were manually labelled and differentiated into Gram-negative and Gram-positive. In addition, we applied detection and segmentation networks for benchmark testing. The data and benchmark algorithms we provide may benefit the study of automated bacterial identification in clinical specimens.
multidisciplinary sciences
What problem does this paper attempt to address?
The paper aims to address the problem of automatic bacterial identification in Microbiological Rapid On-Site Evaluation (M-ROSE). Specifically, the study covers the following aspects:

1. **Background and Challenges**:
   - **Clinical Need**: Lower respiratory tract infections (LRTIs) are a global health threat. The high incidence of hospital-acquired pneumonia (HAP) and ventilator-associated pneumonia (VAP) in intensive care units (ICUs) in particular calls for early and accurate etiological diagnosis.
   - **Limitations of Existing Technology**: In the M-ROSE process, identifying bacterial types and Gram-staining results still relies on manual work by experienced professionals, which is time-consuming and labor-intensive.
2. **Dataset Creation**:
   - The researchers collected Gram-stained bacterial images from patients with lung infections at the Chinese PLA General Hospital from 2018 to 2022. After desensitization, the dataset contains 1,705 images (4,912 × 3,684 pixels), with 4,833 cocci and 6,991 bacilli annotated and each labelled as Gram-positive or Gram-negative.
3. **Algorithm Development**:
   - Deep-learning benchmark algorithms are provided for bacterial detection and segmentation: YOLOv5 for object detection (classifying Gram-positive and Gram-negative cocci and bacilli) and U-Net for semantic segmentation (distinguishing Gram-positive from Gram-negative bacteria).
4. **Experimental Validation**:
   - The annotations were validated in detail, including re-examination of difficult-to-annotate bacteria and analysis of the confusion matrix. Only a few inconsistent labels were found, indicating high overall annotation quality.
   - YOLOv5 and U-Net were trained and tested on the dataset, which validated their effectiveness in bacterial classification, with high accuracy and consistency reported.
Through the above work, the research team hopes to promote the development of automated bacterial identification technology, improving the speed and accuracy of M-ROSE in clinical applications, thereby enhancing the diagnosis and treatment of infectious diseases.
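The confusion-matrix check mentioned under annotation validation can be illustrated with a minimal Python sketch. It assumes that each bacterium receives one of four labels in a first annotation pass and again in a re-examination pass. The class names, the helper functions `confusion_matrix` and `agreement_rate`, and the toy labels are all hypothetical and are not taken from the paper.

```python
from collections import Counter

# Hypothetical label names for the four detection classes described above.
CLASSES = ["G+ cocci", "G- cocci", "G+ bacilli", "G- bacilli"]


def confusion_matrix(first_pass, second_pass, classes=CLASSES):
    """Count how often each first-pass label (row) received each second-pass label (column)."""
    if len(first_pass) != len(second_pass):
        raise ValueError("both label lists must have the same length")
    counts = Counter(zip(first_pass, second_pass))
    return [[counts[(a, b)] for b in classes] for a in classes]


def agreement_rate(matrix):
    """Fraction of objects on the diagonal, i.e. labelled identically in both passes."""
    total = sum(sum(row) for row in matrix)
    agreed = sum(matrix[i][i] for i in range(len(matrix)))
    return agreed / total if total else 0.0


# Toy labels invented purely for illustration (not data from the paper).
first = ["G+ cocci", "G+ cocci", "G- bacilli", "G- bacilli", "G+ bacilli", "G- cocci"]
second = ["G+ cocci", "G+ cocci", "G- bacilli", "G+ bacilli", "G+ bacilli", "G- cocci"]

matrix = confusion_matrix(first, second)
rate = agreement_rate(matrix)
for name, row in zip(CLASSES, matrix):
    print(f"{name:>10}: {row}")
print(f"agreement: {rate:.3f}")
```

Off-diagonal entries mark objects whose label changed between passes. In this toy example, one bacillus switches from Gram-negative to Gram-positive, so five of the six labels agree.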