Oracle Bone Inscriptions Multi-modal Dataset

Bang Li, Donghao Luo, Yujie Liang, Jing Yang, Zengmao Ding, Xu Peng, Boyuan Jiang, Shengwei Han, Dan Sui, Peichao Qin, Pian Wu, Chaoyang Wang, Yun Qi, Taisong Jin, Chengjie Wang, Xiaoming Huang, Zhan Shu, Rongrong Ji, Yongge Liu, Yunsheng Wu
2024-07-04
Abstract: Oracle bone inscriptions (OBI) are the earliest developed writing system in China, bearing invaluable written evidence of early Shang history and paleography. Deciphering OBI, however, remains extremely challenging given the current state of scholarship: of the 4,500 oracle bone characters excavated, only a third have been successfully identified. Leveraging advanced AI technology to assist in the decipherment of OBI is therefore an important research topic. Fully utilizing AI's capabilities for this purpose depends on a comprehensive, high-quality annotated OBI dataset, whereas most existing datasets are annotated in only one or a few dimensions, which limits their potential applications. For instance, the Oracle-MNIST dataset offers only 30k images classified into 10 categories. This paper therefore proposes the Oracle Bone Inscriptions Multi-modal Dataset (OBIMD), which includes annotation information for 10,077 pieces of oracle bones. Each piece has two modalities: pixel-level aligned rubbings and facsimiles. For each oracle bone character, the dataset annotates the detection box, character category, transcription, corresponding inscription group, and reading sequence within the group, providing comprehensive and high-quality annotations. The dataset can be used for a variety of AI research tasks in the field of OBI, such as OBI Character Detection and Recognition, Rubbing Denoising, Character Matching, Character Generation, Reading Sequence Prediction, and Missing Character Completion. We believe that the creation and publication of such a dataset will significantly advance the application of AI algorithms in the field of OBI research.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the lack of a comprehensive, high-quality annotated dataset for applying artificial intelligence to the study of Oracle Bone Inscriptions (OBI). Interpreting OBI is extremely challenging: only about one-third of the 4,500 discovered characters have been successfully identified. The paper argues that advanced AI technology can assist in this interpretation, but that realizing this potential requires richly annotated data. Existing datasets are often annotated in only one or a few dimensions, which limits their application value. The paper therefore proposes a new dataset, the "Oracle Bone Inscriptions Multi-modal Dataset" (OBIMD). It contains annotation information for 10,077 pieces of oracle bones, each with two modalities: pixel-level aligned rubbings and facsimiles. For each character, the annotations cover the detection box, character category, transcription, corresponding inscription group, and reading order within the group, offering data support for a range of AI research tasks related to OBI. By creating and releasing this dataset, the researchers believe they can significantly advance the application of AI algorithms in OBI research, supporting tasks such as character detection and recognition, rubbing denoising, character matching, character generation, reading sequence prediction, and missing character completion.
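To make the five annotation dimensions concrete, the following is a minimal Python sketch of how one per-character record might be represented and how reading sequences could be reassembled from it. The source does not describe the dataset's actual file format, so the class name `CharAnnotation`, its field names, the `(x, y, width, height)` box convention, and the sample values are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CharAnnotation:
    """One annotated character on an oracle bone (hypothetical schema)."""
    bbox: tuple          # detection box, assumed (x, y, width, height) in pixels
    category: str        # character category label
    transcription: str   # transcription of the character
    group_id: int        # inscription group the character belongs to
    order: int           # reading position within that group


def read_groups(chars):
    """Join each group's transcriptions in reading order, keyed by group id."""
    groups = defaultdict(list)
    for c in chars:
        groups[c.group_id].append(c)
    return {
        gid: "".join(c.transcription for c in sorted(members, key=lambda c: c.order))
        for gid, members in sorted(groups.items())
    }


# Made-up placeholder values, not taken from the dataset.
sample = [
    CharAnnotation((120, 40, 30, 32), "cat_0002", "B", group_id=0, order=1),
    CharAnnotation((122, 5, 28, 30), "cat_0001", "A", group_id=0, order=0),
    CharAnnotation((40, 60, 26, 29), "cat_0003", "C", group_id=1, order=0),
]
print(read_groups(sample))
```

Keeping the group and the reading position as separate fields means one record could serve several of the listed tasks: the box and category for detection and recognition, and the group and position for reading sequence prediction.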