Survey on Text Analysis and Recognition for Multiethnic Scripts
Wang Weilan, Hu Jinshui, Wei Hongxi, Ubul Kurban, Shao Wenyuan, Bi Xiaojun, He Jianjun, Li Zhenjiang, Ding Kai, Jin Lianwen, Gao Liangcai
DOI: https://doi.org/10.11834/jig.240015
2024-01-01
Abstract: China's ethnic scripts differ in structure type, period of creation, and region and scope of use. The historical documents and literary materials written, recorded, and printed in these scripts are voluminous and constitute an invaluable resource for exploring the civilization and development history of different ethnic groups. Compared with mainstream languages, the study of ethnic minority scripts often faces low-resource conditions. In recent years, the protection and inheritance of the intangible cultural heritage of ethnic minorities have attracted increased national attention, which is of great importance and application value for protecting irreplaceable and diverse cultural resources. By applying traditional image processing, pattern recognition, and machine learning methods, certain results have been achieved in text recognition and document recognition for Mongolian, Tibetan, Uyghur, Kazakh, Korean, and other major languages. Compared with mainstream languages such as English and Chinese, however, research on character recognition, document image analysis, and application system development for minority languages lags behind. Since the beginning of the 21st century, the research and application of ethnic script text analysis and recognition have received extensive attention and made remarkable progress, driven by the continuous development of document image analysis and recognition technologies, and have become a research hotspot in document analysis and recognition and in artificial intelligence. However, many problems remain to be solved in minority script text analysis and recognition because of the large number of minority scripts, the wide range of application scenarios, and the scarcity of datasets. This study reviews the development history and recent progress in this field at home and abroad to summarize previous work and provide support for
subsequent research. It focuses on four subtasks for several minority scripts: printed text recognition, handwriting recognition, historical document recognition, and scene text recognition. The scripts covered are mainly Tibetan, Mongolian, Uyghur, Yi, Manchu, and Dongba. These studies mainly concern the following areas. 1) In the document image preprocessing stage, the system performs a series of operations on the input image, such as binarization, noise removal, skew correction, and image enhancement. The goal of preprocessing is to improve the accuracy of subsequent analysis and recognition. 2) Layout analysis, including layout segmentation, text line segmentation, and character segmentation, helps in understanding the organizational structure of documents and extracting useful information. 3) Text recognition, one of the core tasks of document image analysis, identifies the text in a document through various technical approaches. It may rely on traditional methods, such as recognition based on single-character classifiers, or on end-to-end text line recognition with deep learning. 4) Dataset construction produces the datasets used to train and evaluate algorithms, such as document image binarization datasets, layout analysis datasets, text line datasets, and character datasets. Among these tasks, the analysis and recognition of historical documents are particularly difficult because the paper of historical books is rough, degraded, and damaged, which results in severe background noise, touching text strokes, unclear handwriting, and damage in the document images. At present, a practical recognition system for historical documents is lacking. First, the importance and value of minority script text analysis and recognition are explained, and some minority script texts, especially historical documents, and their characteristics are introduced. Then, the development history of the field and the current state of research are reviewed, and the representative results
of traditional methods and the progress of deep learning methods are analyzed and summarized. Current research objects are expanding in depth and breadth, and processing methods have shifted comprehensively to deep neural network models and deep learning methods. Recognition performance has improved greatly, and application scenarios continue to expand. One of the reviewed studies achieves effective modeling under low-resource conditions and further proposes a unified multilingual joint modeling technique that recognizes multiple languages with a single model, which greatly reduces hardware resource overhead and significantly improves image and text recognition performance and generalization in multilingual scenarios. At present, it can recognize images and text in 18 key languages or ethnic languages, including English, French, German, Japanese, Russian, Korean, Arabic, Uyghur, Kazakh, and Inner Mongolian. Based on these analyses, obvious deficiencies in recognition accuracy and generalization ability are observed, and ethnic script text recognition is found to differ from Chinese text recognition. The characteristics of the characters and documents of each language differ fundamentally from those of Chinese characters and Chinese documents. For example, in the development of the Yi script, variant characters became particularly abundant due to various factors, and "one-to-many, many-to-one" correspondences between characters and interpretations are the norm. The arbitrariness and diversity of historical Yi handwriting pose great challenges for the recognition of historical Yi script. Moreover, Tibetan writing includes ornate styles with complex letter shapes; in the Ume (headless) script, strokes intertwine, some even span several neighboring characters, and the connections between letters are distinctive. Thus, the highly complex and difficult problem of multi-style Tibetan recognition must be solved to achieve true multi-font text
recognition. Finally, the main difficulties and challenges in minority text recognition are discussed, and future research trends and technical development goals are outlined. For example, research and application system development should be conducted in light of the characteristics of different languages, layout formats, and application scenarios. A gap remains between the recognition of most ethnic languages and that of Chinese, especially in applications related to education, security, and people's livelihood. This gap can be addressed by actively expanding new application directions. Opportunities for expansion are abundant, such as transferring large language models to ethnic minority script text recognition and developing unified multilingual joint modeling and application systems.
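
The preprocessing and layout analysis steps named in the abstract (binarization, then text line segmentation) can be illustrated with a minimal sketch. It is not taken from any of the surveyed systems: it assumes an 8-bit grayscale page stored as a list of rows, dark ink on a light background, and horizontally written text. It uses Otsu thresholding and a row projection profile as stand-ins for the traditional methods the abstract mentions. The function names `otsu_threshold`, `binarize`, and `segment_lines` are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the gray level t that maximizes between-class variance (Otsu)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels with gray level <= t
    sum0 = 0    # sum of gray levels of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(image):
    """Map a grayscale image (list of rows) to 1 = ink, 0 = background."""
    t = otsu_threshold(p for row in image for p in row)
    return [[1 if p <= t else 0 for p in row] for row in image]


def segment_lines(binary, min_ink=1):
    """Return (start_row, end_row) pairs, end exclusive, for each text line.

    A row belongs to a text line when it holds at least `min_ink` ink pixels.
    """
    lines, start = [], None
    for r, row in enumerate(binary):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = r
        elif not has_ink and start is not None:
            lines.append((start, r))
            start = None
    if start is not None:
        lines.append((start, len(binary)))
    return lines


# Synthetic 8x6 "page": light background with two dark two-row text lines.
page = [[220] * 6 for _ in range(8)]
for r in (1, 2, 5, 6):
    for c in range(1, 5):
        page[r][c] = 30

print(segment_lines(binarize(page)))  # [(1, 3), (5, 7)]
```

For vertically written scripts such as traditional Mongolian, the same profile would be taken along columns, for example by transposing the binary image first. Degraded historical pages with heavy background noise and touching strokes, as described above, generally defeat a global threshold and a plain projection profile, which is one reason the field has moved to learned models.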