Abstract: Efficiently streaming high-volume data is essential for real-time data analytics, visualization, and AI and machine learning model training. Various streaming technologies and serialization protocols have been developed to meet different streaming needs. Together, they perform differently across various tasks and datasets. Therefore, when developing a streaming system, it can be challenging to make an informed decision on the suitable combination, as we encountered when implementing streaming for the UKAEA's MAST data or SKA's radio astronomy data. This study addresses this gap by proposing an empirical study of widely used data streaming technologies and serialization protocols. We introduce an extensible and open-source software framework to benchmark their efficiency across various performance metrics. Our findings reveal significant performance differences and trade-offs between these technologies. These insights can help in choosing suitable streaming and serialization solutions for contemporary data challenges. We aim to provide the scientific community and industry professionals with the knowledge to optimize data streaming for better data utilization and real-time analysis.
What problem does this paper attempt to address?
The paper addresses the question of **how to efficiently select and combine data-streaming technologies and serialization protocols to meet modern big-data challenges**. With the rise of large-scale scientific experiments and data-intensive machine-learning algorithms, traditional data-transmission methods no longer meet the requirements of real-time analysis, visualization of massive datasets, and training of AI and machine-learning models. The choice is especially pressing for data such as the MAST data from the UK Atomic Energy Authority (UKAEA) or the radio-astronomy data from the Square Kilometre Array (SKA).
To address this problem, the paper makes the following contributions:
1. **Evaluating multiple data-streaming technologies and serialization protocols**: The paper evaluates 11 widely used data-streaming technologies and 11 serialization protocols.
2. **Developing a benchmarking framework**: It introduces an extensible, open-source software framework that measures the efficiency of different technology and protocol combinations against 11 performance metrics.
3. **Detailed comparative analysis**: By testing 132 combinations, it provides a detailed performance comparison of these technologies and protocols on six different types of data.
4. **Revealing performance differences and trade-offs**: The results reveal the performance differences and trade-offs between these technologies and protocols, and the paper also discusses the limitations of the study and directions for future research.
### Specific problem description
- **Background requirements**: As the amount of data generated by scientific experiments grows exponentially and data-intensive machine-learning algorithms are applied in scientific computing, traditional data-transmission methods can no longer meet the requirements of real-time data analysis. For example, the MAST facility generates several gigabytes of data every day, and this data needs to be transmitted and processed efficiently.
- **Existing challenges**: There is currently no systematic evaluation framework to guide the selection of data-streaming technologies and serialization protocols, even though performance differs significantly across tasks and datasets.
- **Objectives**: Through empirical research, give the scientific community and industry the insight needed to select data-streaming technologies and serialization protocols suited to modern data challenges, in order to optimize data streaming, improve data utilization, and enhance real-time analysis capabilities.
### Solutions
The paper addresses these problems through the following steps:
1. **Comprehensive evaluation**: Evaluate 11 data-streaming technologies and 11 serialization protocols, covering their basic principles and operating frameworks.
2. **Developing a framework**: Design an extensible software framework for benchmarking the efficiency of different technology and protocol combinations, taking 11 performance metrics into account.
3. **Experimental verification**: Test 132 combinations and provide a detailed performance comparison across six different types of data.
4. **Result analysis**: Reveal the performance differences and trade-offs between the technologies and protocols, and discuss the limitations of the study and directions for future research.
Through these efforts, the paper aims to guide the scientific community and industry in selecting appropriate data-streaming technologies and serialization protocols, so that they can better cope with modern data challenges.
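
To make the benchmarking idea concrete, here is a minimal sketch of how one might compare serialization protocols on a single payload. It is not the paper's framework: the `CODECS` registry, the `benchmark_codec` and `run_all` helpers, the sample payload, and the three metrics (payload size, encode time, decode time) are all assumptions chosen for illustration. It uses only the Python standard library's `json` and `pickle` modules.

```python
import json
import pickle
import time
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical registry: protocol name -> (serialize, deserialize).
# A fuller benchmark would register more protocols here.
CODECS: Dict[str, Tuple[Callable[[Any], bytes], Callable[[bytes], Any]]] = {
    "json": (
        lambda obj: json.dumps(obj).encode("utf-8"),
        lambda raw: json.loads(raw.decode("utf-8")),
    ),
    "pickle": (pickle.dumps, pickle.loads),
}


def benchmark_codec(name: str, payload: Any, repeats: int = 100) -> Dict[str, Any]:
    """Measure payload size and mean encode/decode time for one protocol."""
    encode, decode = CODECS[name]

    start = time.perf_counter()
    for _ in range(repeats):
        raw = encode(payload)
    encode_s = (time.perf_counter() - start) / repeats

    start = time.perf_counter()
    for _ in range(repeats):
        restored = decode(raw)
    decode_s = (time.perf_counter() - start) / repeats

    return {
        "codec": name,
        "bytes": len(raw),
        "encode_s": encode_s,
        "decode_s": decode_s,
        "round_trip_ok": restored == payload,  # sanity check on correctness
    }


def run_all(payload: Any, repeats: int = 100) -> List[Dict[str, Any]]:
    """Run the benchmark for every registered protocol."""
    return [benchmark_codec(name, payload, repeats) for name in CODECS]


if __name__ == "__main__":
    # Illustrative payload only; real experiments would use domain datasets.
    sample = {"shot_id": 12345, "samples": [i * 0.5 for i in range(1000)]}
    for row in run_all(sample):
        print(row)
```

Extending the sketch to a streaming technology would mean timing the same payloads end to end through a producer and a consumer, so that transport cost is measured together with serialization cost.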