Single Ground Truth Is Not Enough: Add Linguistic Variability to Aspect-based Sentiment Analysis Evaluation

Soyoung Yang, Hojun Cho, Jiyoung Lee, Sohee Yoon, Edward Choi, Jaegul Choo, Won Ik Cho
2024-10-13
Abstract: Aspect-based sentiment analysis (ABSA) is the challenging task of extracting sentiment along with its corresponding aspects and opinions from human language. Due to the inherent variability of natural language, aspect and opinion terms can be expressed in various surface forms, making their accurate identification complex. Current evaluation methods for this task often restrict answers to a single ground truth, penalizing semantically equivalent predictions that differ in surface form. To address this limitation, we propose a novel, fully automated pipeline that augments existing test sets with alternative valid responses for aspect and opinion terms. This approach enables a fairer assessment of language models by accommodating linguistic diversity, resulting in higher human agreement than single-answer test sets (up to 10%p improvement in Kendall's Tau score). Our experimental results demonstrate that Large Language Models (LLMs) show substantial performance improvements over T5 models when evaluated using our augmented test set, suggesting that LLMs' capabilities in ABSA tasks may have been underestimated. This work contributes to a more comprehensive evaluation framework for ABSA, potentially leading to more accurate assessments of model performance in information extraction tasks, particularly those involving span extraction.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper addresses the limitations of existing evaluation methods for the Aspect-Based Sentiment Analysis (ABSA) task. Current evaluation methods typically provide only a single ground truth (GT), which unfairly penalizes predictions that are semantically equivalent to the ground truth but differ in surface form. A single-answer standard fails to reflect the diversity of natural language and can underestimate the ability of advanced language models to recognize and generate diverse, semantically equivalent expressions.

To tackle this issue, the authors propose a fully automated pipeline, Zoom-In-N-Out, which expands an existing test set by adding valid alternative expressions for aspect and opinion terms. This accommodates linguistic diversity and makes the evaluation of language models fairer and more comprehensive. Experimental results show that, when evaluated with the expanded test set, large language models (LLMs) gain substantially more than T5 models, indicating that the capabilities of LLMs in the ABSA task may have been underestimated.

### Main Contributions

1. **Introduction of the Zoom-In-N-Out Pipeline**: A fully automated method that expands the existing ground truth set to cover various surface forms of aspect and opinion terms.
2. **Validation of Experimental Results**: The expanded ground truth set shows high validity in human evaluations and greater consistency with human judgments.
3. **Performance Improvement**: When evaluated with the expanded test set, LLMs show a much larger score increase than the T5 model in the ABSA task, revealing that the potential of LLMs in these tasks may have been underestimated.

### Method Overview

The Zoom-In-N-Out pipeline consists of three main steps:

1. **Zoom-In**: Starting from the original ground truth terms, generate alternative expressions by reshaping the given segments or extracting meaningful parts.
2. **Zoom-Out**: Search the entire sentence for nearby words and integrate them to generate more combinations.
3. **Judge and Filter**: Exclude inappropriate newly generated terms based on four criteria: relevance to aspects and categories, consistency with opinions and sentiments, extractability from the sentence, and independence from other terms.

### Experimental Results

1. **Dataset Validity**: Human annotators judged the expanded ground truth terms valid at high rates, with all percentages exceeding 90%.
2. **Consistency with Human Evaluations**: Scoring model predictions against the expanded ground truth set agrees more closely with human judgments than scoring against the single-answer set.
3. **Model Performance Improvement**: When evaluated with the expanded test set, the LLMs' average F1 score increased by 9.8 percentage points, while the T5 model's increased by only 2.3 percentage points.

### Conclusion

This paper addresses the limitations of existing ABSA evaluation methods by introducing the Zoom-In-N-Out pipeline, which provides a more comprehensive evaluation framework. Experimental results show that this method improves the fairness and accuracy of evaluation and reveals that the potential of LLMs in the ABSA task may have been underestimated. This provides new directions and tools for future ABSA research.
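To make the evaluation idea concrete, the following is a minimal sketch of how scoring against an expanded ground truth set might differ from single-answer scoring. It is not the authors' implementation: the triple format, the `normalize`, `matches`, and `f1_score` helpers, the dictionary layout of a gold entry, and the example sentence and terms are all assumptions made for illustration. A prediction counts as correct if its aspect and opinion each match any accepted surface form and the sentiment label agrees.

```python
from typing import Dict, List, Tuple

# Hypothetical simplification: a prediction is an (aspect, opinion, sentiment) triple.
Triple = Tuple[str, str, str]


def normalize(term: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count as errors."""
    return " ".join(term.lower().split())


def matches(pred: Triple, gold: Dict) -> bool:
    """A prediction matches a gold entry if its aspect and opinion each appear
    among the accepted surface forms and the sentiment label is identical."""
    aspect, opinion, sentiment = pred
    return (
        normalize(aspect) in gold["aspects"]
        and normalize(opinion) in gold["opinions"]
        and sentiment == gold["sentiment"]
    )


def f1_score(preds: List[Triple], golds: List[Dict]) -> float:
    """F1 for one sentence; each gold entry can be matched at most once."""
    used = set()
    true_positives = 0
    for pred in preds:
        for i, gold in enumerate(golds):
            if i not in used and matches(pred, gold):
                used.add(i)
                true_positives += 1
                break
    precision = true_positives / len(preds) if preds else 0.0
    recall = true_positives / len(golds) if golds else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented example sentence: "The battery life is really impressive."
# Accepted forms are stored already normalized (lowercase, single spaces).
single_gt = [
    {"aspects": {"battery life"}, "opinions": {"impressive"}, "sentiment": "positive"}
]
expanded_gt = [
    {
        "aspects": {"battery life", "battery"},
        "opinions": {"impressive", "really impressive"},
        "sentiment": "positive",
    }
]

predictions = [("battery life", "really impressive", "positive")]

single_f1 = f1_score(predictions, single_gt)
expanded_f1 = f1_score(predictions, expanded_gt)
print(single_f1, expanded_f1)
```

In this toy case the prediction "really impressive" is rejected under the single ground truth but accepted once the alternative opinion form is included, which is the kind of surface-form penalty the expanded test set is meant to remove.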