Abstract: Background: The ability to efficiently search and filter datasets depends on access to high-quality metadata. While most biomedical repositories require data submitters to provide a minimal set of metadata, some, such as the Gene Expression Omnibus (GEO), allow users to specify additional metadata in the form of textual key-value pairs (e.g. sex: female). However, because no structured vocabulary guides submitters on which metadata terms to use, the 44,000,000+ key-value pairs in GEO suffer from numerous quality issues, including redundancy, heterogeneity, inconsistency, and incompleteness. Such issues hinder the ability of scientists to home in on datasets that meet their requirements and point to a need for accurate, structured, and complete descriptions of the data.

Methods: In this study, we propose a clustering-based approach to address data quality issues in biomedical, specifically gene expression, metadata. First, we present three different kinds of similarity measures to compare metadata keys. Second, we design a scalable agglomerative clustering algorithm to cluster similar keys together.

Results: Our agglomerative clustering algorithm identified metadata keys that were similar to each other, based on (i) name, (ii) core concept and (iii) value similarities, and grouped them together. We evaluated our method using a manually created gold standard in which 359 keys were grouped into 27 clusters based on six types of characteristics: (i) age, (ii) cell line, (iii) disease, (iv) strain, (v) tissue and (vi) treatment. The algorithm generated 18 clusters containing 355 keys (four clusters with only one key were excluded). Within these 18 clusters, keys were correctly identified as related to their cluster, with the exception of 13 keys that were not related to the cluster they were placed in. We compared our approach with four other published methods. Our approach significantly outperformed them for most metadata keys and achieved the best average F-Score (0.63).

Conclusion: Our algorithm identified keys that were similar to each other and grouped them together. The intuition underpinning cleaning by clustering is that dividing keys into different clusters resolves the scalability issues of data observation and cleaning, and that duplicates and errors among keys in the same cluster can easily be found. Our algorithm can also be applied to other biomedical data types.
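To make the idea of clustering similar metadata keys concrete, the following is a minimal sketch in Python. It is not the authors' algorithm: it uses only a name-based similarity (the standard library's `difflib.SequenceMatcher`, swapped in as an assumption), omits the core-concept and value similarities, and uses a naive average-linkage merge loop rather than a scalable one. The function names `name_similarity` and `cluster_keys` and the `threshold` parameter are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (illustrative measure only)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_keys(keys, threshold=0.6):
    """Naive average-linkage agglomerative clustering of metadata keys.

    Starts with one cluster per key and repeatedly merges the most similar
    pair of clusters until no pair reaches `threshold` (an assumed cutoff).
    """
    clusters = [[k] for k in keys]

    def linkage(c1, c2):
        # Average pairwise similarity between the members of two clusters.
        sims = [name_similarity(a, b) for a in c1 for b in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        # Find the pair of clusters with the highest average similarity.
        (i, j), best = max(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        # i < j always holds, so deleting index j leaves index i valid.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


if __name__ == "__main__":
    for cluster in cluster_keys(["age", "Age", "tissue", "Tissue"], threshold=0.9):
        print(cluster)
```

Each merge step recomputes all pairwise linkages, which is far too slow for millions of keys. The sketch only illustrates the grouping principle: once near-duplicate keys such as `age` and `Age` land in the same cluster, they are easy to spot and reconcile.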

An Aggregation Query Processing Method of Dirty Database Based on Clustering

Using Visualization to Improve Clustering Analysis on Heterogeneous Information Network.

A fuzzy grouping mechanism for distributed interactive simulation.

An Effective Aggregation Method in Distributed Virtual Environments

Impacts of Dirty Data: and Experimental Evaluation

A Data Cleansing Method for Clustering Large-scale Transaction Databases

A Clustered Dwarf Structure to Speed Up Queries on Data Cubes

AQP++: Connecting Approximate Query Processing with Aggregate Precomputation for Interactive Analytics

Solutions to General Clustering Algorithmic Issues

A Statistical Information-Based Clustering Approach in Distance Space

Clustering Algorithms Used in Data Mining

A rough set based clustering algorithm and the information theoretical approach to refine clusters

Approximation Algorithms for Aggregate Queries on Uncertain Data

CLINCH: clustering incomplete high-dimensional data for data mining application

Cleaning by clustering: methodology for addressing data quality issues in biomedical metadata

Efficient sorting, duplicate removal, grouping, and aggregation

An Hierarchical Clustering Method Based on Data Fields

New density clustering algorithm based on MapReduce

Akane: Perplexity-Guided Time Series Data Cleaning

Distance-based Data Cleaning: A Survey (Technical Report)

Optimization for Massive Data Query Method in Database