Abstract: Biomedical communication is an area that increasingly benefits from natural language processing (NLP) work. Biomedical named entity recognition (NER) in particular provides a foundation for advanced NLP applications, such as automated medical question-answering and translation services. However, while a large body of biomedical documents is available in an array of languages, most work in biomedical NER remains in English, with the remainder in official national or regional languages. Minority languages so far remain an underexplored area. The Hmong language, a minority language with sizable populations in several countries and without official status anywhere, represents an exceptional challenge for effective communication in medical contexts. Taking advantage of the large number of government-produced medical information documents in Hmong, we have developed the first named entity-annotated biomedical corpus for a resource-poor minority language. The Hmong Medical Corpus contains 100,535 tokens with 4,554 named entities (NEs) of three UMLS semantic types: diseases/syndromes, signs/symptoms, and body parts/organs/organ components. Furthermore, a subset of the corpus is annotated for word position and parts of speech, representing the first such gold-standard dataset publicly available for Hmong. The methodology presented provides a readily reproducible approach for the creation of biomedical NE-annotated corpora for other resource-poor languages.

The Hmong Medical Corpus: a biomedical corpus for a minority language