Abstract: In a recent study, we found that the publicly available BCCD and BCD datasets have significant issues, such as labeling errors, insufficient sample size, and poor data quality. To address these problems, we performed sample deletion, re-labeling, and integration of these two datasets. Additionally, we introduced the PBC and Raabin-WBC datasets, and ultimately created a high-quality, sample-balanced new dataset, which we named TXL-PBC. The dataset contains 1008 training, 288 validation, and 144 test samples. First, the dataset underwent strict manual annotation, automatic annotation with the YOLOv8n model, and manual audit steps to ensure the accuracy and consistency of annotations. Second, we address the blood cell mislabeling problem of the original datasets. The distribution of label bounding box areas and the number of labels are better than in the BCCD and BCD datasets. Moreover, we trained the YOLOv8n model on each of these three datasets, and its performance on the TXL-PBC dataset surpasses its performance on the original two datasets. Finally, we employed the YOLOv5n, YOLOv5s, YOLOv5l, YOLOv8s, and YOLOv8m detection models as baseline models for TXL-PBC. This study not only enhances the quality of the blood cell dataset but also supports researchers in improving models for blood cell target detection. We published our freely accessible TXL-PBC dataset at https://github.com/lugan113/TXL-PBC_Dataset.
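The abstract compares datasets by the number of labels and by the distribution of bounding box areas. A minimal sketch of how such statistics might be computed is given here, assuming the annotations are stored in YOLO text format (`class cx cy w h`, normalized to the image size). The storage format, the function name `label_stats`, and the class IDs in the sample are assumptions for illustration and are not taken from the paper.

```python
from collections import Counter, defaultdict

def label_stats(label_lines):
    """Count labels per class and compute the mean normalized box area per class.

    Assumes YOLO-style lines: "<class_id> <cx> <cy> <w> <h>" with values in [0, 1].
    """
    counts = Counter()
    areas = defaultdict(list)
    for line in label_lines:
        parts = line.split()
        if len(parts) != 5:
            continue  # skip blank or malformed lines
        cls = int(parts[0])
        w, h = float(parts[3]), float(parts[4])
        counts[cls] += 1
        areas[cls].append(w * h)  # area as a fraction of the image
    mean_area = {c: sum(a) / len(a) for c, a in areas.items()}
    return counts, mean_area

# Hypothetical label lines; class IDs 0 and 1 are placeholders.
sample = ["0 0.5 0.5 0.2 0.2", "1 0.3 0.3 0.1 0.1", "1 0.7 0.7 0.1 0.3", ""]
counts, mean_area = label_stats(sample)
print(dict(counts))
```

Running the same function over the label files of each dataset would give per-class counts and area summaries that can be compared side by side.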

What problem does this paper attempt to address?

The main objective of this paper is to address the issues in existing publicly available blood cell datasets (such as BCCD and BCD), which include labeling errors, insufficient sample size, and poor data quality. To solve these problems, the authors carried out the following work:

1. **Sample Screening and Integration**: Low-quality samples were removed from the BCCD and BCD datasets, and the remaining samples were re-annotated and integrated.
2. **Introduction of New Datasets**: Two additional datasets, PBC (Peripheral Blood Cells) and Raabin-WBC, were introduced, and samples of five types of white blood cells were selected from them for semi-automatic annotation.
3. **Creation of New Dataset TXL-PBC**: The processed datasets were integrated, randomly shuffled, and renamed to ensure sample diversity and randomness. The resulting dataset, TXL-PBC, contains 1008 training, 288 validation, and 144 test samples.
4. **Quality Assurance**: A strict annotation process was followed to ensure the accuracy and consistency of the annotations: manual annotation, automatic annotation using the YOLOv8n model, and manual review.
5. **Performance Evaluation**: The YOLOv8n model was trained on the TXL-PBC dataset and on each of the original datasets, and the resulting performance metrics (such as precision, recall, and mAP) were compared across categories. The model trained on TXL-PBC performed significantly better than the models trained on the original datasets.
6. **Baseline Models**: To further evaluate the quality of the TXL-PBC dataset, the paper also selected several object detection models (YOLOv5n, YOLOv5s, YOLOv5l, YOLOv8s, and YOLOv8m) as baseline models and compared their performance on the TXL-PBC dataset.

In summary, this study not only improves the quality of blood cell datasets but also supports researchers in enhancing models for blood cell object detection.
Additionally, the paper discusses future work directions, such as expanding the diversity and quantity of the dataset and exploring more advanced object detection models. Finally, the authors publicly release the TXL-PBC dataset to enable more researchers to use it for further research.
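
The shuffle-and-split step in item 3 can be illustrated with a short sketch. The split sizes 1008, 288, and 144 come from the paper and correspond to 70%, 20%, and 10% of 1440 samples. The function name `split_dataset`, the seed, and the file naming scheme are hypothetical, since the authors' actual procedure is not described in this summary.

```python
import random

def split_dataset(names, n_train=1008, n_val=288, n_test=144, seed=0):
    """Shuffle sample names and cut them into train/val/test subsets."""
    if len(names) != n_train + n_val + n_test:
        raise ValueError("split sizes must add up to the number of samples")
    shuffled = list(names)                 # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# Hypothetical file names; the real naming scheme is not given in the source.
names = [f"image_{i:04d}.jpg" for i in range(1440)]
train, val, test = split_dataset(names)
print(len(train), len(val), len(test))  # 1008 288 144
```

Fixing the seed is a design choice that keeps the three subsets identical across runs, which matters when several baseline models are compared on the same test split.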

TXL-PBC: a freely accessible labeled peripheral blood cell dataset
