Abstract: One of the most important trends in modern dialectology is the creation of new electronic resources. The article gives an overview of Russian resources of this kind, among which dialect corpora hold a special place. The author focuses on the Tomsk Dialect Corpus, which today includes more than 1,700,000 tokens. This resource is unparalleled in Russian scientific practice. It is designed as a universal information retrieval system with three modules: 1) textual, 2) grammatical, 3) lexicographic. The aim of the lexicographic component is to provide definitions of dialect lexemes. For this purpose, it is proposed to use the Dictionary of Russian Old-Timers’ Dialects of the Middle Part of the River Ob Basin (1964–1967), edited by V.V. Palagina, and two supplements to it (1975, 1983–1986). The article describes the phases of implementing the lexicographic module in the Tomsk Dialect Corpus. The first phase was the automatic recognition of the above-mentioned paper dictionary. The second phase is the editing of the dictionary. The principles of editing the source material follow from the fact that the lexicographic component is treated as part of a universal electronic system. The two basic editing principles are the possibility of processing a word automatically and the autonomous functioning of each dictionary entry. The vocabulary and the structure of the dictionary entry were formed in accordance with these principles. At the stage of forming the vocabulary, some dictionary entries (for example, two-word ones) were discarded. The dictionary entry has the following main fields: headword, definition and contexts. One of the main editing tasks is to combine dictionary entries for the same word from different volumes of the dictionary into one. The combined items are marked either as homonyms or as meanings of one word. Examples of dictionary entries before and after editing are presented in the article.
By now, about half of the original vocabulary has been processed (letters A to M, 12,450 entries). The final version of the electronic dictionary as part of the Tomsk Dialect Corpus is planned to be presented on the website of the Laboratory of General and Siberian Lexicography (http://losl.tsu.ru/) by June 2021. The prospects of the project include, first, the expansion of the vocabulary and, second, the implementation of corpus search by dictionary labels (diminutive, augmentative, etc.). The presented solutions can be used in the development of other dialect corpora.
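
The entry structure and the merging step described in the abstract can be illustrated with a minimal Python sketch. It is not the project's actual implementation: the class names `Sense` and `Entry`, the function `merge_entries`, the `homonyms` parameter (standing in for an editor's decision about which headwords are homonyms) and all sample data are hypothetical placeholders introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import AbstractSet, Dict, Iterable, List, Optional, Tuple

# Hypothetical raw record: (headword, definition, contexts, source volume).
RawEntry = Tuple[str, str, List[str], str]


@dataclass
class Sense:
    definition: str
    contexts: List[str] = field(default_factory=list)
    source: str = ""  # assumed label for the volume or supplement


@dataclass
class Entry:
    headword: str
    senses: List[Sense] = field(default_factory=list)
    homonym_index: Optional[int] = None  # set only for homonyms


def merge_entries(
    raw: Iterable[RawEntry],
    homonyms: AbstractSet[str] = frozenset(),
) -> List[Entry]:
    """Combine records for the same headword from different volumes."""
    grouped: Dict[str, List[Sense]] = {}
    for headword, definition, contexts, source in raw:
        # Multiword headwords are discarded so that every entry can be
        # matched against a single corpus token.
        if len(headword.split()) != 1:
            continue
        sense = Sense(definition, list(contexts), source)
        grouped.setdefault(headword, []).append(sense)

    merged: List[Entry] = []
    for headword, senses in grouped.items():
        if headword in homonyms and len(senses) > 1:
            # Homonyms stay separate, autonomous entries with an index.
            for i, sense in enumerate(senses, start=1):
                merged.append(Entry(headword, [sense], homonym_index=i))
        else:
            # Otherwise the records become meanings of one word.
            merged.append(Entry(headword, senses))
    return merged


# Placeholder data, not taken from the dictionary.
raw_entries = [
    ("word_a", "placeholder meaning 1", ["placeholder context"], "main dictionary"),
    ("word_a", "placeholder meaning 2", [], "supplement 1"),
    ("word_b", "placeholder meaning 3", [], "main dictionary"),
    ("word_b", "placeholder meaning 4", [], "supplement 2"),
    ("two words", "placeholder meaning 5", [], "main dictionary"),
]
entries = merge_entries(raw_entries, homonyms={"word_b"})
for entry in entries:
    print(entry.headword, entry.homonym_index, len(entry.senses))
```

In this sketch the choice between "homonym" and "meaning of one word" is passed in from outside, on the assumption that it remains an editorial decision rather than something derived automatically from the text.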

From “Abarmo” to “Yashchichishko”: Creating the Lexicographic Component of the Tomsk Dialect Corpus
