SogouT-16

Cheng Luo,Yukun Zheng,Yiqun Liu,Xiaochuan Wang,Jingfang Xu,Min Zhang,Shaoping Ma
DOI: https://doi.org/10.1145/3077136.3080694
2017-01-01
Abstract:Web collection is essential for many Web based researches such as Web Information Retrieval (IR), Web data mining, Corpus linguistics and so on. However, it is usually expensive and time-consuming to collect a large scale of Web pages in lab-based environment and public-available collection becomes a necessity for these researches. In this study, we present a Chinese Web collection, SogouT-16, which is the largest free-of-charge public Chinese Web collection so far. We provide a variety of descriptive characteristics of SogouT-16 and discuss its adoption in a newly-designed ad-hoc retrieval task in NTCIR-13, We Want Web. SogouT-16 also provides online retrieval service and contains a number of auxiliary resources including hyperlink structure graph, query logs, word embedding, and etc. We believe that SogouT-16 will provide new opportunities for novel investigations and applications in IR and other related communities.
What problem does this paper attempt to address?