Scalable Fault-Tolerant Data Feeds in AsterixDB

Raman Grover, Michael J. Carey
DOI: https://doi.org/10.48550/arXiv.1405.1705
2014-05-08
Abstract: In this paper we describe the support for data feed ingestion in AsterixDB, an open-source Big Data Management System (BDMS) that provides a platform for storage and analysis of large volumes of semi-structured data. Data feeds are a mechanism for having continuous data arrive into a BDMS from external sources and incrementally populate a persisted dataset and associated indexes. The need to persist and index "fast-flowing" high-velocity data (and support ad hoc analytical queries) is ubiquitous. However, the state of the art today involves "gluing" together different systems. AsterixDB is different in being a unified system with "native support" for data feed ingestion. We discuss the challenges and present the design and implementation of the concepts involved in modeling and managing data feeds in AsterixDB. AsterixDB allows the runtime behavior, allocation of resources and the offered degree of robustness to be customized to suit the high-level application(s) that wish to consume the ingested data. Initial experiments that evaluate the scalability and fault-tolerance of the AsterixDB data feeds facility are reported.
Databases
What problem does this paper attempt to address?
The paper addresses how to efficiently and reliably process continuous data streams (i.e., data feeds) arriving from external sources in a Big Data Management System (BDMS). Specifically, it focuses on data feed ingestion in AsterixDB, an open-source big data management platform, and on ensuring that ingestion has the following characteristics:

1. **Genericity and Extensibility**: The system needs to work with a variety of data sources and high-level applications and to support plug-and-play modification of its functionality.
2. **Fetch-Once Compute-Many Model**: One data feed can drive multiple applications simultaneously, and these applications may need to process the arriving data in different ways.
3. **Data Feed Monitoring and Resource Management**: The system needs to monitor each feed and prevent or resolve bottlenecks by allocating resources effectively.
4. **Fault Tolerance**: Data feed ingestion is expected to run on commodity hardware and is therefore vulnerable to hardware failures. The system should provide a degree of robustness that minimizes data loss.
5. **Scalability**: As resources are added, the system should be able to handle an increasing volume of data, possibly arriving in parallel from multiple data feeds.

The paper notes that traditional data management systems usually require data to be loaded and indexed before ad hoc analytical queries can be run. To keep up with "fast-moving" high-volume data, a BDMS must instead be able to ingest and persist data continuously. The state of the art typically glues separate systems together to achieve this, whereas AsterixDB provides a unified system with "native support" for data feed ingestion, aiming to address the challenges above.
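
The fetch-once, compute-many requirement described above can be illustrated with a minimal Python sketch. This is not AsterixDB code or its API: the `FanOut` class, the `sample_source` generator, and the two consumers are hypothetical names invented here, and the sketch assumes a single-process, in-memory setting with no fault tolerance. It only shows the idea that each record is fetched from the external source once and then handed to several applications that process it differently.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, object]


class FanOut:
    """Hypothetical fan-out point: every fetched record goes to all consumers."""

    def __init__(self) -> None:
        self._consumers: List[Callable[[Record], None]] = []

    def subscribe(self, consumer: Callable[[Record], None]) -> None:
        self._consumers.append(consumer)

    def run(self, source: Iterable[Record]) -> int:
        """Pull each record from the source once; return how many were fetched."""
        fetched = 0
        for record in source:
            fetched += 1
            for consumer in self._consumers:
                # Hand each consumer its own copy so one cannot mutate
                # the record seen by another.
                consumer(dict(record))
        return fetched


def sample_source() -> Iterator[Record]:
    """Stand-in for an external feed (assumption: three small records)."""
    yield {"id": 1, "text": "hello world"}
    yield {"id": 2, "text": "data feeds"}
    yield {"id": 3, "text": "hello again"}


# Consumer 1: persist the raw records (here, just an in-memory list).
raw_store: List[Record] = []


def persist_raw(record: Record) -> None:
    raw_store.append(record)


# Consumer 2: maintain a running word count over the same records.
word_counts: Dict[str, int] = {}


def count_words(record: Record) -> None:
    for word in str(record["text"]).split():
        word_counts[word] = word_counts.get(word, 0) + 1


fan_out = FanOut()
fan_out.subscribe(persist_raw)
fan_out.subscribe(count_words)
fetched = fan_out.run(sample_source())
```

Here the source is iterated a single time, yet both the raw store and the word counts are populated. A real ingestion pipeline would also need the monitoring, fault-tolerance, and parallelism properties listed above, which this sketch deliberately leaves out.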