Ensuring High Data Quality and Error Resilience in Autonomous Self-Schedulable Libraries for Heterogeneous Data Sources in Near-Real-Time Ingestion Pipelines

Venkata Tadi
DOI: https://doi.org/10.47363/jeast/2023(5)259
2023-04-30
Journal of Engineering and Applied Sciences Technology
Abstract: In the era of big data, enterprises increasingly rely on near-real-time data ingestion pipelines to drive advanced analytics and machine learning models. The complexity and diversity of heterogeneous data sources pose significant challenges to maintaining high data quality and error resilience in these pipelines. This paper investigates strategies to ensure robust data quality and error management within autonomous self-schedulable libraries designed for handling diverse data formats. We explore architectural designs, best practices, and innovative techniques that enable seamless integration and real-time processing of disparate data sources. Key areas of focus include error detection and correction mechanisms, data validation frameworks, and resilient pipeline orchestration. Through comprehensive case studies and experimental evaluations, we demonstrate the efficacy of these strategies in enhancing the reliability and accuracy of data ingestion processes. Our findings provide a roadmap for enterprises seeking to optimize their data pipelines, ensuring they are equipped to handle the complexities of heterogeneous data environments with minimal human intervention.