A Binary Feature Extraction Based Data Provenance System Implemented on Flink Platform.

Yangyizhou Wang, Lan Li, Lei Fan
DOI: https://doi.org/10.1109/cyberc.2018.00045
2018-01-01
Abstract: Data protection and control of information flow are basic requirements for the secure operation of enterprises and organizations. Document data provenance is a function that records the transmission of a specific document and its subsequent provenance. Although it is an important function of enterprise information security control, it suffers from high management costs. This paper therefore proactively monitors the enterprise's internal traffic data, restores document content from it, and uses the proposed algorithm to find the parent document accurately, removing the constraints of traditional document tracing. To keep streaming data restoration flexible and scalable, the algorithm modules are built on Flink, a stream processing platform, by migrating the key computing services to it. Capture agents placed at key nodes collect traffic data, which is fed into the stream processing system through a message queue. The stream processing system restores each file using the document restoration algorithm and hands it to the feature extraction module. After the feature extraction module completes the file analysis, it is stored on file systems or structured data storage systems, where it awaits document tracking requests. The resulting system is completely separated from the daily business of the enterprise, and the load it places on internal network traffic is very small. Experiments show that, relying on Flink's distributed features, the data provenance results are satisfactory.
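The pipeline described in the abstract (restore files from captured traffic, extract binary features, then look up a parent document) can be illustrated with a minimal sketch. The abstract does not specify the restoration or feature extraction algorithms, so everything here is an assumption made for illustration: the packet layout, the fixed-size byte-window hashing, the Jaccard similarity, the threshold, and all function names are hypothetical, and the code is plain Python rather than the Flink API.

```python
import hashlib
from typing import Dict, Iterable, Optional, Set, Tuple

# Hypothetical packet record: (flow_id, sequence_number, payload bytes).
Packet = Tuple[str, int, bytes]


def restore_documents(packets: Iterable[Packet]) -> Dict[str, bytes]:
    """Reassemble each flow's payload by ordering its packets by sequence number."""
    flows: Dict[str, Dict[int, bytes]] = {}
    for flow_id, seq, payload in packets:
        # A retransmitted sequence number overwrites the earlier copy.
        flows.setdefault(flow_id, {})[seq] = payload
    return {
        flow_id: b"".join(parts[seq] for seq in sorted(parts))
        for flow_id, parts in flows.items()
    }


def extract_features(data: bytes, window: int = 8) -> Set[str]:
    """Hash every overlapping `window`-byte slice into a set of features.

    Assumed stand-in for the paper's binary feature extraction.
    """
    if len(data) < window:
        return {hashlib.sha1(data).hexdigest()}
    return {
        hashlib.sha1(data[i:i + window]).hexdigest()
        for i in range(len(data) - window + 1)
    }


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two feature sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_parent(
    features: Set[str],
    index: Dict[str, Set[str]],
    threshold: float = 0.3,  # assumed cut-off, not taken from the paper
) -> Optional[str]:
    """Return the stored document most similar to `features`, or None."""
    best_id, best_score = None, threshold
    for doc_id, stored in index.items():
        score = jaccard(features, stored)
        if score >= best_score:
            best_id, best_score = doc_id, score
    return best_id


if __name__ == "__main__":
    packets = [
        ("f1", 1, b" world, this is a report"),
        ("f1", 0, b"hello"),
    ]
    restored = restore_documents(packets)
    index = {"parent": extract_features(restored["f1"])}
    edited = extract_features(b"hello world, this is a report v2")
    print(find_parent(edited, index))
```

In a streaming deployment, each function would correspond to one stage: restoration keyed by flow, feature extraction as a per-file map, and the index held in a file system or structured store that tracking requests query.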