Abstract: Record linkage is a bedrock of quantitative social science, as analyses often require linking data from multiple, noisy sources. Off-the-shelf string matching methods are widely used, as they are straightforward and cheap to implement and scale. Not all character substitutions are equally probable, and for some settings there are widely used handcrafted lists of the more likely string substitutions, which improve the accuracy of string matching. However, such lists do not exist for many settings, skewing research with linked datasets towards a few high-resource contexts that are not representative of the diversity of human societies. This study develops an extensible way to measure character substitution costs for OCR'ed documents, by employing large-scale self-supervised training of vision transformers (ViT) with augmented digital fonts. For each language written with the CJK script, we contrastively learn a metric space where different augmentations of the same character are represented nearby. In this space, homoglyphic characters (those with similar appearance, such as "O" and "0") have similar vector representations. Because OCR errors tend to be homoglyphic in nature, using the cosine distance between characters' representations as the substitution cost in an edit distance matching algorithm significantly improves record linkage compared to other widely used string matching methods. Homoglyphs can plausibly capture character visual similarity across any script, including low-resource settings. We illustrate this by creating homoglyph sets for 3,000-year-old ancient Chinese characters, which are highly pictorial. Fascinatingly, a ViT is able to capture relationships, previously noted in the archaeological literature, in how different abstract concepts were conceptualized by ancient societies.
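The matching step described in the abstract can be illustrated with a minimal sketch: a standard dynamic-programming edit distance in which the substitution cost is the cosine distance between two characters' vectors. Everything named here is an assumption made for illustration and is not taken from the paper: the function names, the unit insertion/deletion cost, the clamping of the substitution cost to at most 1, and the tiny hand-made `TOY_EMB` vectors, which stand in for learned ViT representations.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def homoglyph_edit_distance(s, t, emb, indel_cost=1.0, default_sub=1.0):
    """Edit distance whose substitution cost is an embedding cosine distance.

    emb maps a character to its vector. Characters missing from emb fall
    back to default_sub (an assumption of this sketch).
    """

    def sub_cost(a, b):
        if a == b:
            return 0.0
        if a in emb and b in emb:
            # Clamp to [0, default_sub] so a substitution never costs
            # more than the fallback (assumption of this sketch).
            dist = cosine_distance(emb[a], emb[b])
            return min(default_sub, max(0.0, dist))
        return default_sub

    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = [j * indel_cost for j in range(len(t) + 1)]
    for i, a in enumerate(s, 1):
        curr = [i * indel_cost]
        for j, b in enumerate(t, 1):
            curr.append(min(
                prev[j] + indel_cost,          # delete a
                curr[j - 1] + indel_cost,      # insert b
                prev[j - 1] + sub_cost(a, b),  # substitute a -> b
            ))
        prev = curr
    return prev[-1]


# Hypothetical 2-d vectors: "O" and "0" point in nearly the same direction,
# "X" is orthogonal to "O". Real representations would come from the model.
TOY_EMB = {"O": [1.0, 0.0], "0": [0.98, 0.2], "X": [0.0, 1.0]}

close = homoglyph_edit_distance("B00K", "BOOK", TOY_EMB)
far = homoglyph_edit_distance("BXXK", "BOOK", TOY_EMB)
print(close < far)
```

With these toy vectors, the string containing the homoglyphic "0" for "O" scores far closer to "BOOK" than the string containing "X", whereas a plain edit distance would rate both as two substitutions apart.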

Transliteration Pair Extraction from Classical Chinese Buddhist Literature Using Phonetic Similarity Measurement

Off-Line Chinese Writer Identification Based on Character-Level Decision Combination

A Character Recognition Scheme Based on Object Oriented Design for Tibetan Buddhist Texts

English-to-Chinese Transliteration with Phonetic Back-transliteration

How Transliterations Improve Crosslingual Alignment

Quantifying Character Similarity with Vision Transformers

Leveraging phone-level linguistic-acoustic similarity for utterance-level pronunciation scoring

Enhancing Cross-lingual Transfer via Phonemic Transcription Integration

Statistically-based Model for Computer-Aided Transcription Application

Heteronym Verification for Mandarin Speech Synthesis

When Does Classical Chinese Help? Quantifying Cross-Lingual Transfer in Hanja and Kanbun

Phonology-Augmented Statistical Framework for Machine Transliteration Using Limited Linguistic Resources

Distributional Similarity for Chinese: Exploiting Characters and Radicals

Efficient Entity Translation Mining

Cleansing Jewel: A Neural Spelling Correction Model Built On Google OCR-ed Tibetan Manuscripts

Two-Fold Linguistic Evidences on the Identification of Chinese Translation of Buddhist Sutras - Taking Buddhacarita as a Case

Efficient Entity Translation Mining: A Parallelized Graph Alignment Approach

Research on Computing Word Similarity in Pre-Qin Classics Language Network Oriented to Digital Humanities

Reflection on Textual Transformation between the Similar Languages

Word Segmentation for Classical Chinese Buddhist Literature

How does Burrows' Delta work on medieval Chinese poetic texts?