Abstract: In a recent study, we found that the publicly available BCCD and BCD datasets have significant issues, including labeling errors, insufficient sample size, and poor data quality. To address these problems, we removed low-quality samples, re-labeled the remainder, and integrated the two datasets. We additionally introduced the PBC and Raabin-WBC datasets, and ultimately created a high-quality, sample-balanced new dataset, which we named TXL-PBC. The dataset contains 1008 training samples, 288 validation samples, and 144 test samples. First, the dataset underwent strict manual annotation, automatic annotation with the YOLOv8n model, and manual audit steps to ensure the accuracy and consistency of annotations. Second, we address the blood cell mislabeling problem of the original datasets: the distribution of label bounding box areas and the number of labels are better than in the BCCD and BCD datasets. Moreover, we trained the YOLOv8n model on each of the three datasets, and performance on TXL-PBC surpasses that on the two original datasets. Finally, we employed the YOLOv5n, YOLOv5s, YOLOv5l, YOLOv8s, and YOLOv8m detection models as baseline models for TXL-PBC. This study not only enhances the quality of the blood cell dataset but also supports researchers in improving models for blood cell target detection. We published our freely accessible TXL-PBC dataset at https://github.com/lugan113/TXL-PBC_Dataset.
What problem does this paper attempt to address?
The main objective of this paper is to address the issues present in existing publicly available blood cell datasets (such as BCCD and BCD), which include labeling errors, insufficient sample size, and poor data quality. To solve these problems, the authors conducted the following work:
1. **Sample Screening and Integration**: Low-quality samples were removed from the BCCD and BCD datasets, and the remaining samples were re-annotated and integrated.
2. **Introduction of New Datasets**: Two new datasets, PBC (Peripheral Blood Cells) and Raabin-WBC, were introduced, and five types of white blood cell samples were selected for semi-automatic annotation.
3. **Creation of New Dataset TXL-PBC**: The processed datasets were integrated, randomly shuffled, and renamed to ensure sample diversity and randomness. The resulting dataset, TXL-PBC, contains 1008 training samples, 288 validation samples, and 144 test samples.
4. **Quality Assurance**: A strict annotation process was followed to ensure the accuracy and consistency of the annotations, including manual annotation, automatic annotation using the YOLOv8n model, and manual review.
5. **Performance Evaluation**: The YOLOv8n model was trained on the TXL-PBC dataset and on each of the original datasets, and the resulting performance metrics (such as precision, recall, and mAP) were compared across categories. The results showed that the model trained on TXL-PBC performed significantly better than those trained on the other datasets.
6. **Baseline Models**: To further evaluate the quality of the TXL-PBC dataset, the paper also selected various object detection models (such as YOLOv5n, YOLOv5s, YOLOv5l, YOLOv8s, and YOLOv8m) as baseline models and compared their performance on the TXL-PBC dataset.
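The shuffle, rename, and split described in step 3 can be illustrated with a minimal Python sketch. This is not the authors' code. The 70/20/10 ratio is an assumption inferred from the reported counts (1008 + 288 + 144 = 1440), and the function name, input file names, seed, and zero-padded naming scheme are all hypothetical.

```python
import random


def shuffle_rename_split(names, percents=(70, 20, 10), seed=0):
    """Shuffle sample names, pair each with a new sequential name, and split.

    `percents` is an assumed train/val/test ratio; the source only
    reports the final counts, not the ratio itself.
    """
    rng = random.Random(seed)  # fixed seed so the shuffle is reproducible
    shuffled = list(names)
    rng.shuffle(shuffled)

    # Hypothetical naming scheme: (old name, zero-padded new index).
    renamed = [(old, f"{i:04d}") for i, old in enumerate(shuffled, start=1)]

    n = len(renamed)
    n_train = n * percents[0] // 100
    n_val = n * percents[1] // 100
    return {
        "train": renamed[:n_train],
        "val": renamed[n_train:n_train + n_val],
        "test": renamed[n_train + n_val:],  # remainder goes to test
    }


# Placeholder file names standing in for the merged image pool.
samples = [f"img_{i}.jpg" for i in range(1440)]
splits = shuffle_rename_split(samples)
print({name: len(items) for name, items in splits.items()})
# prints {'train': 1008, 'val': 288, 'test': 144}
```

Under these assumptions the split sizes match the counts reported for TXL-PBC. Shuffling before renaming means the new file names no longer reveal which source dataset a sample came from.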
In summary, this study not only improves the quality of blood cell datasets but also supports researchers in enhancing models for blood cell object detection. Additionally, the paper discusses future work directions, such as expanding the diversity and quantity of the dataset and exploring more advanced object detection models. Finally, the authors publicly release the TXL-PBC dataset to enable more researchers to use it for further research.