Serverless Workflows for Indexing Large Scientific Data

Tyler J. Skluzacek, Ryan Chard, Ryan Wong, Zhuozhao Li, Yadu N. Babuji, Logan Ward, Ben Blaiszik, Kyle Chard, Ian Foster
DOI: https://doi.org/10.1145/3366623.3368140
2019-01-01
Abstract: The use and reuse of scientific data ultimately depend on the ability to understand what those data represent, how they were captured, and how they can be used. In many ways, data are only as useful as the metadata available to describe them. Unfortunately, due to growing data volumes, large and distributed collaborations, and a desire to store data for long periods of time, scientific "data lakes" quickly become disorganized and lack the metadata necessary to be useful to researchers. New automated approaches are needed to derive metadata from scientific files and to use these metadata for organization and discovery. Here we describe one such system, Xtract, a service capable of processing vast collections of scientific files and automatically extracting metadata from diverse file types. Xtract relies on function-as-a-service (FaaS) models to enable scalable metadata extraction by orchestrating the execution of many short-running extractor functions. To reduce data transfer costs, Xtract can be configured to deploy extractors centrally or near the data (i.e., at the edge). We present a prototype implementation of Xtract and demonstrate that it can derive metadata from a 7 TB scientific data repository.
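The pattern the abstract describes, many short-running extractor functions dispatched per file, can be illustrated with a minimal sketch. This is not Xtract's implementation or API: the extractor names, the suffix-based dispatch table, and the use of a local thread pool as a stand-in for a FaaS platform are all assumptions made for illustration.

```python
import csv
import io
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


# Hypothetical extractors: each is a small, stateless function that takes a
# file name and its bytes and returns a metadata dictionary.
def extract_tabular(name: str, data: bytes) -> dict:
    rows = list(csv.reader(io.StringIO(data.decode("utf-8"))))
    header = rows[0] if rows else []
    return {"type": "tabular", "columns": header, "rows": max(len(rows) - 1, 0)}


def extract_json(name: str, data: bytes) -> dict:
    doc = json.loads(data.decode("utf-8"))
    keys = sorted(doc) if isinstance(doc, dict) else []
    return {"type": "json", "keys": keys}


def extract_generic(name: str, data: bytes) -> dict:
    # Fallback when no specialised extractor matches the file type.
    return {"type": "unknown", "size_bytes": len(data)}


# Assumed dispatch table keyed on file suffix (an illustrative simplification).
EXTRACTORS = {".csv": extract_tabular, ".json": extract_json}


def extract_one(name: str, data: bytes) -> dict:
    extractor = EXTRACTORS.get(Path(name).suffix.lower(), extract_generic)
    metadata = extractor(name, data)
    metadata["file"] = name
    return metadata


def extract_all(files: dict, max_workers: int = 4) -> list:
    # The thread pool stands in for a FaaS endpoint: one short task per file.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(extract_one, n, d) for n, d in files.items()]
        return [f.result() for f in futures]


if __name__ == "__main__":
    sample = {
        "a.csv": b"x,y\n1,2\n3,4\n",
        "b.json": b'{"k": 1, "a": 2}',
        "c.bin": b"\x00\x01",
    }
    for record in extract_all(sample):
        print(record)
```

Because each extractor is stateless and takes only the file contents, the same function could in principle run centrally or on a worker close to the data; only the place where `extract_one` is invoked would change.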