What's in a ? Cross-Lingual Topic Detection & Information Retrieval in Archives Portal Europe

Marta Musso, Kerstin Arnold, Federico Nanni, Beatrice Cannelli
DOI: https://doi.org/10.1145/3494572
2024-01-18
Journal on Computing and Cultural Heritage
Abstract: Archives Portal Europe (APE, www.archivesportaleurope.net) is the portal of European archives, an aggregator that brings together, in a single point of access, the catalogues and digitised archival material of all archives in and about Europe. It currently hosts material from more than 30 countries and from a variety of archival institutions (such as State archives, city archives, university and parish archives, private institutions, and more). It is maintained by the Archives Portal Europe Foundation, an international consortium of State archives and other archival institutions that aims to connect the archival material of individual institutions into one digital repository, in order to allow universal access to the archival heritage of Europe and to promote new forms of archival research beyond national or local boundaries. One of the research tools made available by Archives Portal Europe is search by topic; however, topics are currently maintained manually by archivists, and the vast amount of archival material ingested into the portal makes it impossible to maintain a comprehensive body of topics that describes the whole of the APE repository. Archives are traditionally not organised by their subject content, but around the entity (person, organisation, body) that created and/or collected the documents in the course of its activities. While this is an undisputed pillar of archival management, the availability of online digital repositories requires new tools for digital archival research, particularly when different archival traditions from different countries and different types of institutions are merged into a single research portal. Topic detection becomes a fundamental tool to guide archival research and to make archives accessible to potentially world-wide users, in a situation where national and linguistic barriers blur or are redefined.
This paper presents the preliminary results of, and plans for future iterations of, an AI tool for automated topic detection in a multilingual environment, where human-created taxonomies act as the basis for the algorithms to aggregate relevant material around a specific topic. The development is based on supervised machine learning, combining human input in different languages with the use of Wikipedia pages to model the relevant vocabulary and entities.
computer science, interdisciplinary applications
What problem does this paper attempt to address?
The paper attempts to address the issue of achieving automatic topic detection in a multilingual environment to improve the accessibility and research efficiency of archival materials in the Archives Portal Europe (APE). Specifically, the paper focuses on the following points:

1. **Limitations of traditional archival organization methods**: Traditional archives are usually organized according to the entities (such as individuals, organizations, institutions) that created or collected these documents, rather than by topic content. This makes it difficult to quickly find relevant materials by topic in large digital archives.
2. **Challenges in a multilingual environment**: APE brings together materials from over 30 countries and various archival institutions, described in multiple languages. Therefore, there is a need for an automatic topic detection tool that can overcome language barriers, allowing users to easily browse and search these materials.
3. **Insufficiency of manual topic annotation**: Currently, the topics in APE are manually maintained by archivists, but this method is unable to cover all materials, and the standards for topic annotation vary between countries and institutions, leading to incomplete and inconsistent topic coverage.
4. **Need for automated tools**: To overcome the above issues, the paper proposes an automatic topic detection tool based on supervised machine learning. This tool uses human-created taxonomies and Wikipedia pages to model relevant vocabularies and entities, thereby automatically identifying and aggregating materials related to specific topics in a multilingual environment.

By addressing these issues, the paper aims to enhance the research experience of users in a multilingual environment for European historical archives and promote cross-national and cross-language archival research.
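
To make the idea of modelling topic vocabulary from multilingual text more concrete, here is a minimal Python sketch. It is not the authors' implementation: the paper describes supervised machine learning, whereas this example swaps in a much simpler bag-of-words cosine similarity. The topic labels, the seed strings (hypothetical stand-ins for Wikipedia page text in English, Italian, and German), the function names, and the `threshold` value are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical seed vocabulary per topic. In a real setting this text might
# come from Wikipedia pages in each language; here it is invented by hand.
TOPIC_SEEDS = {
    "maritime history": [
        "ship navy port harbour sailor voyage fleet",   # English
        "nave marina porto marinaio viaggio flotta",    # Italian
        "schiff marine hafen seemann reise flotte",     # German
    ],
    "education": [
        "school university student teacher curriculum",  # English
        "scuola università studente insegnante",         # Italian
        "schule universität student lehrer lehrplan",    # German
    ],
}


def tokenize(text):
    """Lowercase the text and split it into word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())


def build_topic_vectors(seeds):
    """Merge the seed texts of every language into one term-count vector per topic."""
    return {
        topic: Counter(token for text in texts for token in tokenize(text))
        for topic, texts in seeds.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect_topic(description, topic_vectors, threshold=0.1):
    """Return the best-matching topic label, or None if no score reaches the threshold."""
    doc_vector = Counter(tokenize(description))
    scores = {topic: cosine(doc_vector, vec) for topic, vec in topic_vectors.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None


topic_vectors = build_topic_vectors(TOPIC_SEEDS)
italian_result = detect_topic("Giornale di bordo della nave, porto di Genova", topic_vectors)
german_result = detect_topic("Akten der Schule und der Universität", topic_vectors)
unmatched_result = detect_topic("Tax receipts 1820", topic_vectors)
```

Because each topic vector pools vocabulary from several languages, an Italian or German catalogue description can be matched to the same topic label without translation. A description that shares no terms with any topic returns `None`. Exact token matching is a deliberate simplification: inflected forms such as "navi" would not match "nave", which is one reason a production system would need richer language processing and trained models.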