Abstract: Addresses are one of the most important geographical reference systems in natural language. In China, because address planning is relatively underdeveloped, a large number of addresses are non-standard. Such unstructured text makes Chinese addresses much harder to manage and apply. Extracting computational representations of addresses, however, allows them to be structured and makes related applications easier to extend. This paper therefore uses a deep neural language model from natural language processing (NLP) to extract computational representations automatically through an address language model (ALM), which is trained in an unsupervised way and is suitable for a large-scale address corpus. We propose a solution that fuses address semantics and geospatial features, and we construct a geospatial-semantic address model (GSAM) that supports a variety of downstream tasks. The GSAM construction process consists of three phases. First, we build an ALM using Bidirectional Encoder Representations from Transformers (BERT) to learn the semantic representations of addresses. Then, a high-dimensional clustering algorithm produces fused clustering results from the semantic and geospatial information. Finally, we construct the GSAM from the fused clustering results using novel fine-tuning techniques. Furthermore, we apply the computational representations extracted from the GSAM to the address location prediction task. The experimental results indicate that the ALM reaches an accuracy of 90.79% on its target task, and that the semantic-geospatial fusion clustering results correlate strongly with fine-grained urban neighbourhood divisions. The GSAM identifies clustering labels accurately, with all evaluation metrics above 0.96. We also demonstrate that our model outperforms purely ALM-based and word2vec-based models on the address location prediction task.
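The second phase, fusing semantic and geospatial information before clustering, can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not name the clustering algorithm, so plain k-means stands in for it here, and the semantic vectors are toy values rather than ALM outputs. The function names and the `geo_weight` parameter are hypothetical and introduced only for illustration.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so semantic features share one scale."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def min_max_scale(points):
    """Rescale each coordinate dimension to the range [0, 1]."""
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]
    return [
        [(p[d] - lows[d]) / (highs[d] - lows[d]) if highs[d] > lows[d] else 0.0
         for d in range(dims)]
        for p in points
    ]

def fuse_features(semantic_vectors, coordinates, geo_weight=1.0):
    """Concatenate normalized semantic vectors with weighted, scaled coordinates.

    geo_weight is a hypothetical knob that balances the two feature groups.
    """
    scaled = min_max_scale(coordinates)
    return [
        l2_normalize(sem) + [geo_weight * c for c in geo]
        for sem, geo in zip(semantic_vectors, scaled)
    ]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20):
    """Plain k-means, used only as a stand-in for the unspecified algorithm."""
    centroids = [list(p) for p in points[:k]]  # deterministic init: first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy data: four addresses with 2-d "semantic" vectors and (lon, lat) pairs.
semantic = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
coords = [[120.10, 30.20], [120.11, 30.21], [120.30, 30.40], [120.31, 30.41]]

fused = fuse_features(semantic, coords, geo_weight=1.0)
labels = kmeans(fused, k=2)
```

With this toy input, the first two addresses fall into one cluster and the last two into another, since they agree in both semantic and spatial terms. Raising or lowering `geo_weight` shifts how much spatial proximity influences the grouping relative to semantic similarity.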

Using Multiple Sequence Alignment and Statistical Language Model to Integrate Multiple Chinese Address Recognition Outputs

An Efficient Post-Processing Approach for Off-Line Handwritten Chinese Address Recognition

A Post-processing Approach for Handwritten Chinese Address Recognition

GSAM: A Deep Neural Network Model for Extracting Computational Representations of Chinese Addresses Fused with Geospatial Feature

Recognition of Handwritten Chinese Address with Writing Variations

ASR-Based Input Method for Postal Address Recognition in Chinese Mandarin

Substring Alignment Method for Lexicon Based Handwritten Chinese String Recognition and Its Application to Address Line Recognition

A hidden Markov model based segmentation and recognition algorithm for Chinese handwritten address character strings

A hybrid handwritten Chinese address recognition approach

Multi-task deep learning model based on hierarchical relations of address elements for semantic address matching

A Novel Segmentation and Recognition Algorithm for Chinese Handwritten Address Character Strings

Recognition Method of New Address Elements in Chinese Address Matching Based on Deep Learning

Handwritten Chinese Address Segmentation and Recognition Based on Merging Strokes

Cost-Sensitive Transformation for Chinese Address Recognition

A Chinese OCR Spelling Check Approach Based on Statistical Language Models

A two-stage handwritten character segmentation approach in mail address recognition

Chinese Address Named Entity Recognition Based on BERT-BiLSTM-ATT-CRF Model

Combining Multiple Classifiers Based on Statistical Method for Handwritten Chinese Character Recognition

New method of combining multiple classifiers for Chinese character recognition

A Hybrid Post-Processing System For Offline Handwritten Chinese Character Recognition Based On A Statistical Language Model

A deep learning architecture for semantic address matching