MEGAnno: Exploratory Labeling for NLP in Computational Notebooks

Dan Zhang, Hannah Kim, Rafael Li Chen, Eser Kandogan, Estevam Hruschka
DOI: https://doi.org/10.48550/arXiv.2301.03095
2023-01-09
Abstract: We present MEGAnno, a novel exploratory annotation framework designed for NLP researchers and practitioners. Unlike existing labeling tools that focus on data labeling only, our framework aims to support a broader, iterative ML workflow including data exploration and model development. With MEGAnno's API, users can programmatically explore the data through sophisticated search and automated suggestion functions and incrementally update task schema as their projects evolve. Combined with our widget, the users can interactively sort, filter, and assign labels to multiple items simultaneously in the same notebook where the rest of the NLP project resides. We demonstrate MEGAnno's flexible, exploratory, efficient, and seamless labeling experience through a sentiment analysis use case.
Human-Computer Interaction, Computation and Language
What problem does this paper attempt to address?
This paper addresses the limitations of existing data annotation tools in natural language processing (NLP) research and practice. Specifically, these problems include:

1. **The gap between ML tools**: Most existing tools are designed independently and focus on specific steps in the machine-learning process, so researchers must frequently switch contexts and transfer data in their daily work.
2. **Lack of customization and fine-grained control**: Not all data points are equally important. Users may wish to prioritize certain batches of data (for example, for better category or domain coverage, or to focus on data points that downstream models cannot predict well). Although some active-learning-based tools can suggest the next batch of data, most tools do not provide customization and fine-grained control in combination with downstream models.
3. **Lack of support for project evolution**: Current annotation tools usually assume that data collection tasks are clearly defined and immutable. They ignore that annotation projects can evolve during exploration, which makes such changes difficult to apply.

To solve these problems, the authors propose **MEGAnno**, a flexible, exploratory, efficient, and seamless data annotation framework designed to support the iterative workflow of NLP researchers and practitioners throughout the machine-learning life cycle. The main features of MEGAnno include:

- **Seamless integration**: It supports data pre-processing, annotation, analysis, model development, and evaluation in the same Jupyter Notebook.
- **Customizable interface**: Through rich heuristic searches, automatic suggestions, and active-learning-based suggestions for the next batch of data, it helps users guide the project in the desired direction.
- **Support for project evolution**: It is designed with a flexible task schema and provides a built-in analysis dashboard to assist decision-making.

Through these features, MEGAnno aims to bridge the gap between existing tools, provide a more flexible and efficient annotation experience, and support the continuous evolution of projects.
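
To make the idea of prioritizing "data points that downstream models cannot predict well" concrete, the following is a minimal sketch of entropy-based uncertainty sampling, one common active-learning heuristic. It does not use MEGAnno's API and is not taken from the paper: the function names `entropy` and `suggest_next_batch`, the item ids, and the probabilities are all hypothetical and chosen only for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def suggest_next_batch(predictions, batch_size):
    """Return the ids of the items the model is least certain about.

    predictions: dict mapping an item id to a list of class probabilities.
    Items are ranked by prediction entropy, highest (most uncertain) first.
    """
    ranked = sorted(
        predictions,
        key=lambda item_id: entropy(predictions[item_id]),
        reverse=True,
    )
    return ranked[:batch_size]


# Hypothetical sentiment probabilities (positive, negative, neutral)
# produced by some downstream model for four unlabeled reviews.
predictions = {
    "review-1": [0.98, 0.01, 0.01],
    "review-2": [0.40, 0.35, 0.25],
    "review-3": [0.70, 0.20, 0.10],
    "review-4": [0.34, 0.33, 0.33],
}

batch = suggest_next_batch(predictions, batch_size=2)
print(batch)  # ['review-4', 'review-2']
```

In this sketch the two reviews with near-uniform predictions are surfaced first, which is the kind of batch an annotator might want to label next; other ranking criteria (for example, category or domain coverage) could replace the entropy key.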