An open dataset for the evolution of oracle bone characters: EVOBC

Haisu Guan, Jinpeng Wan, Yuliang Liu, Pengjie Wang, Kaile Zhang, Zhebin Kuang, Xinyu Wang, Xiang Bai, Lianwen Jin

2024-02-13

Abstract: The earliest extant Chinese characters originate from oracle bone inscriptions, which are closely related to other East Asian languages. These inscriptions hold immense value for anthropology and archaeology. However, deciphering oracle bone script remains a formidable challenge, with only approximately 1,600 of the over 4,500 extant characters elucidated to date. Further scholarly investigation is required to comprehensively understand this ancient writing system. Artificial intelligence technology is a promising avenue for deciphering oracle bone characters, particularly by tracing their evolution. However, one of the challenges is the lack of datasets mapping the evolution of these characters over time. In this study, we systematically collected ancient characters from authoritative texts and websites spanning six historical stages: Oracle Bone Characters - OBC (15th century B.C.), Bronze Inscriptions - BI (13th century B.C. to 221 B.C.), Seal Script - SS (11th to 8th centuries B.C.), Spring and Autumn period Characters - SAC (770 to 476 B.C.), Warring States period Characters - WSC (475 B.C. to 221 B.C.), and Clerical Script - CS (221 B.C. to 220 A.D.). Subsequently, we constructed an extensive dataset, namely EVolution Oracle Bone Characters (EVOBC), consisting of 229,170 images representing 13,714 distinct character categories. We conducted validation and simulated deciphering on the constructed dataset, and the results demonstrate its high efficacy in aiding the study of oracle bone script. This openly accessible dataset aims to digitize ancient Chinese scripts across multiple eras, facilitating the decipherment of oracle bone script by examining the evolution of glyph forms.

Artificial Intelligence

What problem does this paper attempt to address?

The paper addresses the difficulty of recognizing oracle bone script characters and tracing their evolution. Of the more than 4,500 oracle bone characters discovered so far, only about 1,600 have been deciphered. To understand this ancient writing system more fully, researchers have proposed using artificial intelligence to assist in recognizing the characters and studying how they evolved. A major obstacle is the lack of a dataset that maps the evolution of these characters over time. The main goal of the paper is therefore to construct such a dataset, EVolution Oracle Bone Characters (EVOBC), which covers six historical stages of character evolution from oracle bone script to clerical script. Built from ancient characters collected from authoritative literature and websites, the dataset contains 229,170 images covering 13,714 distinct character categories. EVOBC aims to digitize ancient Chinese characters from multiple eras and so support the study of oracle bone script. The paper also demonstrates the dataset's usefulness for computer-assisted oracle bone script research through image classification tasks and technical validation.
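The structure described above can be sketched in a few lines of Python. This is a minimal illustration, not the dataset's actual layout or loader: the stage abbreviations, periods, and totals come from the abstract, while the Stage class, the group_by_category helper, and the (category, stage, path) record format are hypothetical names introduced only for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One historical stage, as listed in the abstract."""
    abbreviation: str
    name: str
    period: str


# The six stages and their periods, taken from the abstract.
STAGES = (
    Stage("OBC", "Oracle Bone Characters", "15th century B.C."),
    Stage("BI", "Bronze Inscriptions", "13th century B.C. to 221 B.C."),
    Stage("SS", "Seal Script", "11th to 8th centuries B.C."),
    Stage("SAC", "Spring and Autumn period Characters", "770 to 476 B.C."),
    Stage("WSC", "Warring States period Characters", "475 B.C. to 221 B.C."),
    Stage("CS", "Clerical Script", "221 B.C. to 220 A.D."),
)

# Totals reported in the abstract.
TOTAL_IMAGES = 229_170
TOTAL_CATEGORIES = 13_714


def mean_images_per_category(images: int, categories: int) -> float:
    """Average number of images per character category."""
    return images / categories


def group_by_category(records):
    """Group (category, stage, path) records into category -> stage -> paths.

    The record format is an assumption made for this sketch; the real
    dataset may organize its files differently.
    """
    valid = {stage.abbreviation for stage in STAGES}
    grouped = defaultdict(lambda: defaultdict(list))
    for category, stage, path in records:
        if stage not in valid:
            raise ValueError(f"unknown stage: {stage}")
        grouped[category][stage].append(path)
    return grouped


if __name__ == "__main__":
    # Hypothetical records for a single character category.
    sample = [
        ("cat_0001", "OBC", "cat_0001/OBC/a.png"),
        ("cat_0001", "CS", "cat_0001/CS/b.png"),
    ]
    chain = group_by_category(sample)["cat_0001"]
    print(sorted(chain))
    print(round(mean_images_per_category(TOTAL_IMAGES, TOTAL_CATEGORIES), 2))
```

Grouping by category first and by stage second mirrors the purpose stated in the abstract: each category becomes a chain of glyph forms across eras, which is what a decipherment study would compare. The reported totals work out to roughly 16.7 images per category on average, although the abstract does not say how images are distributed across categories or stages.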

An open dataset for oracle bone script recognition and decipherment

An open dataset for oracle bone character recognition and decipherment

OBC306: A Large-Scale Oracle Bone Character Recognition Dataset

Oracle Bone Inscriptions Multi-modal Dataset

Deciphering Oracle Bone Language with Diffusion Models

A dataset of oracle characters for benchmarking machine learning algorithms

Diff-Oracle: Deciphering Oracle Bone Scripts with Controllable Diffusion Model

A study on encoding-based oracle bone script recognition

Puzzle Pieces Picker: Deciphering Ancient Chinese Characters with Radical Reconstruction

Dynamic Dataset Augmentation for Deep Learning-based Oracle Bone Inscriptions Recognition

Oracle Bone Script Intelligent Recognition: Automatic Segmentation and Recognition of Original Rubbing Single Characters

An Exploration of the Historical and Cultural Value of the Yin Ruins Oracle Bone Inscriptions and their Impact on the Evolution of Chinese Calligraphy

Ancient Chinese Character Recognition with Improved Swin-Transformer and Flexible Data Enhancement Strategies

IsOBS: an Information System for Oracle Bone Script

The WuShu Database for Cursive Script Character and Style Recognition

OracleSage: Towards Unified Visual-Linguistic Understanding of Oracle Bone Scripts through Cross-Modal Knowledge Fusion

OBI-Bench: Can LMMs Aid in Study of Ancient Script on Oracle Bones?

Linking unknown characters via oracle bone inscriptions retrieval

Image Segmentation and Recognition of Oracle Bone Topographies Based on Deep Learning

A Cross-Font Image Retrieval Network for Recognizing Undeciphered Oracle Bone Inscriptions