TStego-THU: Large-Scale Text Steganalysis Dataset
Zhongliang Yang, Jin He, Siyu Zhang, Jinshuai Yang, Yongfeng Huang
DOI: https://doi.org/10.1007/978-3-030-78621-2_27
2021-01-01
Abstract: In recent years, with the development of natural language processing (NLP) technology, linguistic steganography has developed rapidly. However, to the best of our knowledge, there is currently no public dataset for text steganalysis, which makes it difficult to compare linguistic steganalysis methods fairly. Therefore, in this paper, we construct and release a large-scale linguistic steganalysis dataset called TStego-THU, which we hope will provide a fair platform for comparing linguistic steganalysis algorithms and further promote the development of the field. TStego-THU covers two kinds of text steganography modes, namely text modification-based and text generation-based, each represented by two recent or classical text steganography algorithms. All texts in TStego-THU come from three types of text media commonly transmitted in cyberspace: news, Twitter, and commentary text. In total, TStego-THU contains 240,000 sentences (120,000 cover-stego text pairs). Each steganographic sentence is generated by randomly choosing one of these four steganographic algorithms and embedding a random bitstream into randomly extracted normal text. We also evaluate several recent text steganalysis algorithms on TStego-THU as benchmarks; detailed results can be found in the experiments section. We hope that TStego-THU will further promote the development of universal text steganalysis technology. The description of TStego-THU and instructions for its use will be released here: https://github.com/YangzlTHU/Linguistic-Steganography-and-Steganalysis.
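The pair-construction procedure described in the abstract (randomly extract a normal sentence, randomly choose one of the available algorithms, embed a random bitstream) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: `build_pairs`, `random_bits`, the `(cover, bits) -> stego` embedder signature, and the `placeholder` embedders are hypothetical and are not the authors' code or the dataset's actual algorithms. In particular, the single embedder signature is a simplification, since generation-based methods may produce text rather than modify a given cover.

```python
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical embedder signature: (cover_text, bits) -> stego_text.
Embedder = Callable[[str, str], str]


def random_bits(n: int, rng: random.Random) -> str:
    """Return a random bitstream of length n as a string of '0'/'1'."""
    return "".join(rng.choice("01") for _ in range(n))


def build_pairs(
    covers: List[str],
    embedders: Dict[str, Embedder],
    n_pairs: int,
    n_bits: int = 8,
    seed: int = 0,
) -> List[Tuple[str, str, str]]:
    """Sample covers, pick one embedder at random per sentence, embed random bits.

    Returns (cover, stego, algorithm_name) triples.
    """
    rng = random.Random(seed)
    names = sorted(embedders)  # fixed order so a given seed is reproducible
    pairs = []
    for cover in rng.sample(covers, n_pairs):  # randomly extracted normal texts
        name = rng.choice(names)               # one algorithm per sentence
        bits = random_bits(n_bits, rng)        # random payload
        pairs.append((cover, embedders[name](cover, bits), name))
    return pairs


def placeholder(tag: str) -> Embedder:
    """Toy stand-in that only tags the text; NOT a real steganographic method."""
    return lambda cover, bits: f"{cover} [{tag}:{bits}]"


if __name__ == "__main__":
    covers = [f"sentence {i}" for i in range(10)]
    # Four hypothetical slots: two modification-based, two generation-based.
    embedders = {n: placeholder(n) for n in ("mod_a", "mod_b", "gen_a", "gen_b")}
    for cover, stego, name in build_pairs(covers, embedders, n_pairs=3):
        print(name, "|", cover, "->", stego)
```

Passing the embedders in as a dictionary of callables keeps the sampling logic independent of any particular steganography method, so real implementations could be substituted for the placeholders without changing `build_pairs`.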