Scalable Fault-Tolerant Data Feeds in AsterixDB

Raman Grover, Michael J. Carey

DOI: https://doi.org/10.48550/arXiv.1405.1705

2014-05-08

Abstract: In this paper we describe the support for data feed ingestion in AsterixDB, an open-source Big Data Management System (BDMS) that provides a platform for storage and analysis of large volumes of semi-structured data. Data feeds are a mechanism for having continuous data arrive into a BDMS from external sources and incrementally populate a persisted dataset and associated indexes. The need to persist and index "fast-flowing" high-velocity data (and support ad hoc analytical queries) is ubiquitous. However, the state of the art today involves "gluing" together different systems. AsterixDB is different in being a unified system with "native support" for data feed ingestion. We discuss the challenges and present the design and implementation of the concepts involved in modeling and managing data feeds in AsterixDB. AsterixDB allows the runtime behavior, allocation of resources and the offered degree of robustness to be customized to suit the high-level application(s) that wish to consume the ingested data. Initial experiments that evaluate the scalability and fault tolerance of the AsterixDB data feeds facility are reported.

Databases

What problem does this paper attempt to address?

The paper addresses how to process continuous data streams (data feeds) from external sources efficiently and reliably in a Big Data Management System (BDMS). Specifically, it focuses on data feed ingestion in AsterixDB, an open-source BDMS, and identifies the following requirements:

1. **Genericity and Extensibility**: The system needs to work with multiple data sources and high-level applications, and to support plug-and-play modification of its functionality.
2. **Fetch-Once Compute-Many Model**: One data feed can drive multiple applications simultaneously, and these applications may need to process the arriving data in different ways.
3. **Data Feed Monitoring and Resource Management**: The system needs to monitor each feed and prevent or resolve bottlenecks by allocating resources effectively.
4. **Fault Tolerance**: Data feed ingestion is expected to run on commodity hardware and is therefore vulnerable to hardware failures. The system should provide a degree of robustness that minimizes data loss.
5. **Scalability**: As resources increase, the system should be able to handle an increasing volume of data, possibly arriving in parallel from multiple data feeds.

The paper notes that traditional data management systems usually require data to be loaded and indexed before ad hoc analytical queries can be performed. To keep up with "fast-moving" high-volume data, a BDMS must be able to ingest and persist data continuously. AsterixDB aims to address these challenges as a unified system with "native support" for data feed ingestion.
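The fetch-once compute-many requirement can be illustrated with a small sketch. This is not AsterixDB code or its API: the `FeedFanOut` class, the record layout, and the two consumers are hypothetical names chosen only to show the idea that each record is pulled from the external source a single time and then handed to every subscribed application.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]


class FeedFanOut:
    """Hypothetical illustration: one external source, many consumers."""

    def __init__(self) -> None:
        self._consumers: List[Callable[[Record], None]] = []

    def subscribe(self, consumer: Callable[[Record], None]) -> None:
        # Register an application that wants to see every arriving record.
        self._consumers.append(consumer)

    def ingest(self, source: Iterable[Record]) -> int:
        # Each record is fetched from the source exactly once ...
        fetched = 0
        for record in source:
            fetched += 1
            # ... and then delivered to every subscriber ("compute many").
            for consumer in self._consumers:
                consumer(record)
        return fetched


# Two hypothetical applications that process the same feed differently.
raw_store: List[Record] = []   # stands in for a persisted dataset
word_counts: List[int] = []    # stands in for a derived analysis

feed = FeedFanOut()
feed.subscribe(raw_store.append)
feed.subscribe(lambda r: word_counts.append(len(str(r["text"]).split())))

# A one-shot iterator: it cannot be re-read, so both consumers can only
# have been served from a single pass over the source.
source = iter([
    {"id": 1, "text": "hello feed"},
    {"id": 2, "text": "one two three"},
])
fetched = feed.ingest(source)
```

In this sketch the fan-out is synchronous and in-process; a real ingestion pipeline would also need buffering, monitoring, and failure handling, which correspond to requirements 3 and 4 above and are deliberately left out here.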
