Abstract: Reusable, publicly available data is a pillar of open science and of rapid advancement in cancer imaging research. Sharing data from completed research studies not only saves the research dollars required to collect data, but also helps ensure that studies are both replicable and reproducible. The Cancer Imaging Archive (TCIA) is a global shared repository for imaging data related to cancer. Ensuring the consistency, scientific utility, and anonymity of data stored in TCIA is of utmost importance. As submissions to TCIA have increased in both volume and complexity of the DICOM objects stored, curation of collections has become a bottleneck in the acquisition of data. To increase the rate of curation of image sets, improve the quality of the curation, and better track the provenance of changes made to submitted DICOM image sets, a custom set of tools was developed, using novel methods for the analysis of DICOM data sets. These tools are written in the Perl programming language, use the open-source database PostgreSQL, make use of the Perl DICOM routines in the open-source package Posda, and incorporate DICOM diagnostic tools from other open-source packages, such as dicom3tools. These tools are referred to as the "Posda Tools." The Posda Tools are open source and available via git at https://github.com/UAMS-DBMI/PosdaTools.
In this paper, we briefly describe the Posda Tools and discuss the novel methods employed by these tools to facilitate rapid analysis of DICOM data, including the following: (1) use of a database schema that is more permissive than, and normalized differently from, traditional DICOM databases; (2) automatic integrity checks performed on a bulk basis; (3) application of revisions to DICOM datasets on a bulk basis, either through a web-based interface or via command-line executable Perl scripts; (4) tracking of all such edits in a revision tracker, so that they may be rolled back; (5) a UI for inspecting the results of such edits, to verify that they are what was intended; (6) identification of DICOM Studies, Series, and SOP instances using "nicknames" that are persistent and have well-defined scope, which makes reported DICOM errors easier to express and manage; and (7) rapid identification of potential duplicate DICOM datasets by pixel data, which can be used, e.g., to identify submission subjects that may relate to the same individual, without identifying the individual.
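
Item (7) can be illustrated with a minimal sketch. The abstract does not say how the Posda Tools compare pixel data, so the approach here (grouping instances by a SHA-256 digest of their raw pixel bytes) is an assumption, and the sketch is written in Python rather than the Perl used by Posda. The function names and the `(subject_id, sop_instance_uid, pixel_bytes)` tuple layout are hypothetical and chosen only for illustration.

```python
import hashlib
from collections import defaultdict


def pixel_digest(pixel_bytes: bytes) -> str:
    """Return a hex digest of the raw pixel bytes (assumed comparison key)."""
    return hashlib.sha256(pixel_bytes).hexdigest()


def find_pixel_duplicates(instances):
    """Group instances whose pixel data is byte-for-byte identical.

    `instances` is an iterable of (subject_id, sop_instance_uid, pixel_bytes)
    tuples; this layout is hypothetical, not Posda's schema.
    Returns {digest: [(subject_id, sop_instance_uid), ...]} for every digest
    shared by more than one instance.
    """
    by_digest = defaultdict(list)
    for subject_id, sop_uid, pixel_bytes in instances:
        by_digest[pixel_digest(pixel_bytes)].append((subject_id, sop_uid))
    return {d: members for d, members in by_digest.items() if len(members) > 1}


def subjects_sharing_pixels(instances):
    """Return pairs of distinct subject IDs that share identical pixel data."""
    pairs = set()
    for members in find_pixel_duplicates(instances).values():
        subjects = sorted({subject for subject, _ in members})
        for i, first in enumerate(subjects):
            for second in subjects[i + 1:]:
                pairs.add((first, second))
    return pairs


if __name__ == "__main__":
    # Toy data: two differently labeled subjects carry identical pixel bytes.
    demo = [
        ("subj-A", "1.2.3.1", b"\x00\x01\x02"),
        ("subj-B", "1.2.3.2", b"\x00\x01\x02"),
        ("subj-C", "1.2.3.3", b"\xff"),
    ]
    print(subjects_sharing_pixels(demo))
```

Because only digests and submission-level subject labels are compared, a match flags two submission subjects as possibly the same individual without revealing who that individual is. Exact-byte hashing only catches identical pixel data; images that were re-encoded or resampled would need a different comparison.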
