Abstract:

Purpose: This study generates metadata to enhance thematic transparency and facilitate research on the interview collections of the Research Documentation Centre, Centre for Social Sciences (TK KDK) in Budapest. It explores the use of artificial intelligence (AI) in producing, managing and processing social science data, and its potential to generate useful metadata describing the contents of such archives on a large scale.

Design/methodology/approach: The authors combined manual and automated/semi-automated methods of metadata development and curation. They developed a domain-oriented taxonomy to classify a large text corpus of semi-structured interviews. To this end, they adapted the European Language Social Science Thesaurus (ELSST) to produce a concise, hierarchical structure of topics relevant to the social sciences. They identified and tested the most promising natural language processing (NLP) tools supporting the Hungarian language. The results of manual and machine coding will be presented in a user interface.

Findings: The study describes how an international social scientific taxonomy can be adapted to a specific local setting and tailored for use by automated NLP tools. The authors show the potential and limitations of existing and new NLP methods for thematic assignment. The study also discusses the current possibilities of multi-label classification in social scientific metadata assignment, that is, the problem of automatically selecting relevant labels from a large pool.

Originality/value: Interview materials have not yet been used for building manually annotated training datasets for automated indexing of scientifically relevant topics in a data repository. Comparing various automated-indexing methods, this study shows a possible implementation of a researcher tool supporting custom visualizations and the faceted search of interview collections.

HuSpaCy: an industrial-strength Hungarian natural language processing toolkit

Advancing Hungarian Text Processing with HuSpaCy: Efficient and Accurate NLP Pipelines

Design and implementation of an open source Greek POS Tagger and Entity Recognizer using spaCy

Making Use of a ‘spaCy’ Module in the Natural Language Processing

ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing

"Approaches to sentiment analysis of Hungarian political news at the sentence level"

BEA-Base: A Benchmark for ASR of Spontaneous Hungarian

LatinCy: Synthetic Trained Pipelines for Latin NLP

Launching into clinical space with medspaCy: a new clinical text processing toolkit in Python

HugNLP: A Unified and Comprehensive Library for Natural Language Processing

NLPashto: NLP Toolkit for Low-resource Pashto Language

A New Massive Multilingual Dataset for High-Performance Language Technologies

A Multilingual Language Processing Tool for Uyghur, Kazak and Kirghiz

Data driven identification of international cutting edge science and technologies using SpaCy

Hespi: A pipeline for automatically detecting information from hebarium specimen sheets

Identification of social scientifically relevant topics in an interview repository: a natural language processing experiment

DaCy: A Unified Framework for Danish NLP

VNLP: Turkish NLP Package

Biomedical and clinical English model packages for the Stanza Python NLP library

Polypus: a Big Data Self-Deployable Architecture for Microblogging Text Extraction and Real-Time Sentiment Analysis