Abstract

Background: Data mining in the field of medical data analysis often needs to rely solely on the processing of unstructured data to retrieve relevant data. For German natural language processing, few open medical neural named entity recognition (NER) models have been published before this work. A major issue can be attributed to the lack of German training data.

Objective: We developed a synthetic data set and a novel German medical NER model for public access to demonstrate the feasibility of our approach. In order to bypass legal restrictions due to potential data leaks through model analysis, we did not make use of internal, proprietary data sets, which is a frequent veto factor for data set publication.

Methods: The underlying German data set was retrieved by translation and word alignment of a public English data set. The data set served as a foundation for model training and evaluation. For demonstration purposes, our NER model follows a simple network architecture that is designed for low computational requirements.

Results: The obtained data set consisted of 8599 sentences including 30,233 annotations. The model achieved a class frequency–averaged F1 score of 0.82 on the test set after training across 7 different NER types. Artifacts in the synthesized data set with regard to translation and alignment induced by the proposed method were exposed. The annotation performance was evaluated on an external data set and measured in comparison with an existing baseline model that has been trained on a dedicated German data set in a traditional fashion. We discussed the drop in annotation performance on an external data set for our simple NER model. Our model is publicly available.

Conclusions: We demonstrated the feasibility of obtaining a data set and training a German medical NER model by the exclusive use of public training data through our suggested method. The discussion on the limitations of our approach includes ways to further mitigate remaining problems in future work.
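
The Methods step of carrying annotations from English sentences over to their German translations by word alignment can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `project_spans`, the token-index span format, the example sentence, the alignment pairs, and the entity labels are all hypothetical, and the alignment is assumed to come from some external word aligner.

```python
from typing import List, Tuple

# (start_token, end_token_exclusive, label); this format is an assumption.
Span = Tuple[int, int, str]
# (source_token_index, target_token_index) pairs from a word aligner.
Alignment = List[Tuple[int, int]]


def project_spans(source_spans: List[Span], alignment: Alignment) -> List[Span]:
    """Project token-level entity spans from source to target tokens.

    Each source span is mapped to the smallest contiguous target span
    covering all target tokens aligned to it. Spans without any aligned
    target token are dropped.
    """
    projected: List[Span] = []
    for start, end, label in source_spans:
        targets = sorted(t for s, t in alignment if start <= s < end)
        if not targets:
            continue  # annotation is lost when no token is aligned
        projected.append((targets[0], targets[-1] + 1, label))
    return projected


# Hypothetical example; labels are illustrative only.
english = ["The", "patient", "takes", "20", "mg", "aspirin", "daily"]
german = ["Der", "Patient", "nimmt", "täglich", "20", "mg", "Aspirin"]
alignment = [(0, 0), (1, 1), (2, 2), (3, 4), (4, 5), (5, 6), (6, 3)]
english_spans = [(3, 5, "Strength"), (5, 6, "Drug"), (6, 7, "Frequency")]

for start, end, label in project_spans(english_spans, alignment):
    print(label, german[start:end])
```

Taking the minimum and maximum aligned index keeps every projected span contiguous, but a single misaligned token can stretch a span over unrelated words. This is one plausible source of the kind of alignment artifacts the abstract mentions.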

Annotated dataset creation through large language models for non-english medical NLP

Annotated Dataset Creation through General Purpose Language Models for non-English Medical NLP

German Medical Named Entity Recognition Model and Data Set Creation Using Machine Translation and Word Alignment: Algorithm Development and Validation

GERNERMED: An open German medical NER model

GERNERMED++: Semantic annotation in German medical NLP through transfer-learning, translation and word alignment

GERNERMED++: Transfer Learning in German Medical NLP

A tool for mapping medical narratives into medical ontologies in low resource settings: A case study for German

Comprehensive Study on German Language Models for Clinical and Biomedical Text Understanding

Information extraction from German radiological reports for general clinical text and language understanding

Viability of Open Large Language Models for Clinical Documentation in German Health Care: Real-World Model Evaluation Study

FRASIMED: a Clinical French Annotated Resource Produced through Crosslingual BERT-Based Annotation Projection

Automation of Trainable Datasets Generation for Medical-Specific Language Model: Using MIMIC-IV Discharge Notes

Healthcare NER Models Using Language Model Pretraining

Towards Democratizing Multilingual Large Language Models For Medicine Through A Two-Stage Instruction Fine-tuning Approach

Mapping SNOMED CT Codes to Semi-Structured Texts via an NLP Pipeline

Large Language Model Benchmarks in Medical Tasks

Evaluation of Language Models in the Medical Context Under Resource-Constrained Settings

SyntheT2C: Generating Synthetic Data for Fine-Tuning Large Language Models on the Text2Cypher Task

A Survey on Large Language Models from General Purpose to Medical Applications: Datasets, Methodologies, and Evaluations

Does Synthetic Data Generation of LLMs Help Clinical Text Mining?

Zero-Shot LLMs for Named Entity Recognition: Targeting Cardiac Function Indicators in German Clinical Texts