Curated benchmark dataset for ultrasound based breast lesion analysis

Anna Pawłowska, Anna Ćwierz-Pieńkowska, Agnieszka Domalik, Dominika Jaguś, Piotr Kasprzak, Rafał Matkowski, Łukasz Fura, Andrzej Nowicki, Norbert Żołek
DOI: https://doi.org/10.1038/s41597-024-02984-z
2024-02-01
Scientific Data
Abstract: A new detailed dataset of breast ultrasound scans (BrEaST) containing images of benign and malignant lesions as well as normal tissue examples is presented. The dataset consists of 256 breast scans collected from 256 patients. Each scan was manually annotated and labeled by a radiologist experienced in breast ultrasound examination. In particular, each tumor was identified in the image using a freehand annotation and labeled according to BIRADS features and lexicon. The histopathological classification of the tumor was also provided for patients who underwent a biopsy. The BrEaST dataset is the first breast ultrasound dataset containing patient-level labels, image-level annotations, and tumor-level labels with all cases confirmed by follow-up care or core needle biopsy result. To enable research into breast disease detection, tumor segmentation and classification, the BrEaST dataset is made publicly available with the CC-BY 4.0 license.
multidisciplinary sciences
What problem does this paper attempt to address?
The paper addresses the quality and reliability of the datasets used for breast cancer detection, tumor segmentation, and classification. Specifically:

1. **Quality and reliability of datasets**: Existing breast ultrasound datasets have quality flaws such as duplicated images, incomplete labeling, and images containing non-breast regions. These flaws degrade the performance of machine-learning models trained on them.
2. **Labeling of multi-lesion images**: Existing datasets rarely contain detailed labeling of multi-lesion images, which limits their use in multi-lesion detection and segmentation tasks.
3. **Labeling of BI-RADS features**: Existing datasets lack detailed labeling of BI-RADS features, even though these features are important for clinical diagnosis.
4. **Pathological confirmation**: Only a few existing datasets provide pathological confirmation, which limits their use in verifying model performance.

To address these problems, the paper introduces a new breast ultrasound dataset (BrEaST) with the following characteristics:

- **High-quality data**: It contains 256 breast ultrasound scans collected from 256 patients, each manually annotated and labeled by a radiologist experienced in breast ultrasound examination.
- **Detailed labeling**: Each tumor is outlined with a freehand annotation and labeled according to BI-RADS features and lexicon.
- **Pathological confirmation**: All cases have been confirmed by follow-up care or core needle biopsy results.
- **Labeling of multi-lesion images**: The dataset contains detailed labeling of multi-lesion images, supporting multi-lesion detection and segmentation tasks.
- **Open access**: The dataset is publicly released under the CC-BY 4.0 license.

By providing a high-quality, thoroughly labeled, and pathologically confirmed dataset, the paper aims to advance research on breast disease detection, tumor segmentation, and classification.