MatrixGate: A High-performance Data Ingestion Tool for Time-series Databases

Shuhui Wang, Zihan Sun, Chaochen Hu, Chao Li, Yong Zhang, Yandong Yao, Hao Wang, Chunxiao Xing
2024-06-08
Abstract: Recent years have seen massive time-series data generated in many areas. This scenario brings new challenges, particularly in terms of data ingestion, where existing technologies struggle to handle such massive time-series data, leading to low loading speed and poor timeliness. To address these challenges, this paper presents MatrixGate, a new and efficient data ingestion approach for massive time-series data. MatrixGate implements both single-instance and multi-instance parallel procedures, which are based on its unique ingestion strategies. First, MatrixGate uses policies to tune the slots that are synchronized with segments to ingest data, which eliminates the cost of starting transactions and enhances efficiency. Second, multiple coroutines are responsible for transferring data, which can increase the degree of parallelism significantly. Third, lock-free queues are used to enable direct data transfer without the need for disk storage or lodging in the master instance. Experimental results on multiple datasets show that MatrixGate outperforms state-of-the-art methods by 3 to 100 times in loading speed and cuts query latency by about 80%. Furthermore, MatrixGate scales out efficiently under a distributed architecture, achieving scalability of 86%.
Databases
What problem does this paper attempt to address?
The problem that this paper attempts to solve is: in time-series databases, existing technologies have difficulty in efficiently handling the ingestion of large-scale time-series data, resulting in low loading speed and poor real-time performance. Specifically, with the wide application of intelligent network devices, a large amount of time-series data is generated in various fields (such as large-scale equipment monitoring, smart cars, smart cities, environmental monitoring, telecommunications, and finance). These data are generated at extremely high frequencies and rates, posing a huge challenge to data ingestion.

### Specific problems:

1. **Low loading speed**: Existing technologies are unable to quickly load a large amount of time-series data into the database.
2. **Poor real-time performance**: Since the data ingestion speed cannot keep up with the data generation speed, it leads to data latency or even loss.
3. **Insufficient scalability**: Existing methods have poor scalability in a distributed architecture and cannot effectively utilize multi-node resources.

### Solutions:

To solve the above problems, the paper proposes MatrixGate, a high-performance time-series data ingestion tool. MatrixGate improves data ingestion efficiency through the following innovative strategies:

1. **Automatic slot adjustment**: Improve efficiency by automatically adjusting slots synchronized with segments to eliminate the cost of starting transactions.
2. **Multi-coroutine parallel processing**: Use multi-coroutines instead of multi-processes or multi-threads to achieve parallel processing and reduce scheduling overhead.
3. **Lock-free queue communication**: Build a data pipeline based on a lock-free queue to achieve direct data transfer without disk storage or main-instance staging, avoiding the single-point bottleneck problem.
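
The three strategies above can be illustrated with a small, self-contained sketch. It is not MatrixGate's implementation: the names `segment_for`, `segment_writer`, `ingest`, `NUM_SEGMENTS`, and `BATCH_SIZE` are hypothetical, the hash-based routing is an assumed stand-in for the paper's slot policy, and Python's `asyncio.Queue` only approximates a lock-free queue (it needs no locks because all coroutines share one event-loop thread). Each per-segment coroutine batches rows in memory and hands them to an in-memory sink, so nothing is staged on disk or collected in a single central buffer.

```python
import asyncio
import zlib

NUM_SEGMENTS = 3   # hypothetical number of database segments
BATCH_SIZE = 4     # hypothetical number of rows flushed per batch


def segment_for(device_id: str) -> int:
    """Route a row to a segment slot using a stable hash (assumed policy)."""
    return zlib.crc32(device_id.encode()) % NUM_SEGMENTS


async def segment_writer(queue: asyncio.Queue, sink: list) -> None:
    """Coroutine that drains one queue and flushes rows to its sink in batches."""
    batch = []
    while True:
        row = await queue.get()
        if row is None:          # sentinel: no more rows for this segment
            break
        batch.append(row)
        if len(batch) >= BATCH_SIZE:
            sink.append(batch)   # stands in for a write to the segment
            batch = []
    if batch:                    # flush the final partial batch
        sink.append(batch)


async def ingest(rows):
    """Fan rows out to one in-memory queue and one writer coroutine per segment."""
    queues = [asyncio.Queue() for _ in range(NUM_SEGMENTS)]
    sinks = [[] for _ in range(NUM_SEGMENTS)]
    writers = [
        asyncio.create_task(segment_writer(q, s)) for q, s in zip(queues, sinks)
    ]
    for row in rows:
        await queues[segment_for(row[0])].put(row)
    for q in queues:
        await q.put(None)        # tell every writer to finish
    await asyncio.gather(*writers)
    return sinks


def run_demo(n_rows: int = 20):
    """Build synthetic (device_id, timestamp, value) rows and ingest them."""
    rows = [(f"dev{i % 5}", i, float(i)) for i in range(n_rows)]
    return rows, asyncio.run(ingest(rows))
```

Calling `run_demo()` returns the synthetic input rows together with the per-segment batches. Coroutines are a natural fit here because the writers spend most of their time waiting on the queue, so switching between them is cheaper than switching between threads or processes.
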
### Experimental results:

The experimental results show that MatrixGate has a loading speed 3 to 100 times faster than existing methods on multiple datasets, and the query latency is reduced by approximately 80%. In addition, MatrixGate has good scalability in a distributed architecture, achieving an 86% expansion efficiency.

In conclusion, MatrixGate aims to solve the inefficiency problem of data ingestion in time-series databases and significantly improves the speed and real-time performance of data ingestion through innovative design and optimization strategies.
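
The summary does not say how the 86% figure was measured. One common way to express such a number, assumed here for illustration only, is scaling efficiency: the observed speedup divided by the ideal linear speedup. The function name `scaling_efficiency` and the throughput values in the usage note are hypothetical and are not taken from the paper.

```python
def scaling_efficiency(base_nodes: int, base_throughput: float,
                       scaled_nodes: int, scaled_throughput: float) -> float:
    """Observed speedup divided by ideal linear speedup (assumed definition)."""
    if min(base_nodes, scaled_nodes) <= 0:
        raise ValueError("node counts must be positive")
    if min(base_throughput, scaled_throughput) <= 0:
        raise ValueError("throughputs must be positive")
    observed_speedup = scaled_throughput / base_throughput
    ideal_speedup = scaled_nodes / base_nodes
    return observed_speedup / ideal_speedup
```

Under this definition, a hypothetical cluster that loads 100,000 rows/s on one node and 344,000 rows/s on four nodes has an efficiency of about 0.86, since the observed speedup of 3.44 is compared against an ideal of 4.
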