Abstract: In this work, we present a novel approach to lexical complexity prediction (LCP) that combines diverse linguistic features with encodings from deep neural networks. We explore the integration of 23 handcrafted linguistic features with embeddings from two well-known language models: BERT and XLM-RoBERTa. Our method concatenates the linguistic features with the embeddings before feeding them into various machine learning algorithms, including SVM, Random Forest, and fine-tuned transformer models. We evaluate our approach on two datasets: CompLex for English (a high-resource language) and CLexIS2 for Spanish (a relatively low-resource language), allowing us to study performance from a cross-lingual perspective. Our experiments involve different combinations of linguistic features with encodings from pretrained deep learning models, testing both token-based and sequence-related encodings. The results demonstrate the effectiveness of our hybrid approach. For the English CompLex corpus, our best model achieved a mean absolute error (MAE) of 0.0683, a 29.2% improvement over using linguistic features alone (MAE 0.0965). On the Spanish CLexIS2 corpus, we achieved an MAE of 0.1323, a 19.4% improvement. These findings show that handcrafted linguistic features play a fundamental role in achieving higher performance, particularly when combined with deep learning approaches. Our work suggests that hybrid approaches should be considered over full end-to-end solutions for LCP tasks, especially in multilingual contexts.
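The concatenation step and the reported improvement figure can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the feature values are placeholders, and the 768-dimensional embedding size is an assumption (the hidden size of BERT-base), since the abstract does not state which model variants were used.

```python
def concatenate_features(linguistic, embedding):
    """Join a handcrafted feature vector and an encoder embedding into one input vector."""
    return list(linguistic) + list(embedding)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between gold and predicted complexity scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_improvement(baseline_mae, model_mae):
    """Percentage reduction in MAE relative to the baseline."""
    return (baseline_mae - model_mae) / baseline_mae * 100.0


# 23 handcrafted features plus an assumed 768-dimensional embedding (placeholder zeros).
hybrid = concatenate_features([0.0] * 23, [0.0] * 768)
print(len(hybrid))

# MAE values reported for CompLex: linguistic features alone vs. the best hybrid model.
print(round(relative_improvement(0.0965, 0.0683), 1))
```

With the two CompLex MAE values from the abstract, the relative reduction rounds to the reported 29.2%. In a full pipeline, a vector such as `hybrid` would be the input to a regressor like an SVM or Random Forest.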

OCHADAI-KYOTO at SemEval-2021 Task 1: Enhancing Model Generalization and Robustness for Lexical Complexity Prediction

SemEval-2021 Task 1: Lexical Complexity Prediction

LCP-RIT at SemEval-2021 Task 1: Exploring Linguistic Features for Lexical Complexity Prediction

UPB at SemEval-2021 Task 1: Combining Deep Learning and Hand-Crafted Features for Lexical Complexity Prediction

LAST at SemEval-2021 Task 1: Improving Multi-Word Complexity Prediction Using Bigram Association Measures

Difficult for Whom? A Study of Japanese Lexical Complexity

Japanese Lexical Complexity for Non-Native Readers: A New Dataset

Large Language Models aren't all that you need

Deep Encodings vs. Linguistic Features in Lexical Complexity Prediction

Lexical Complexity Prediction: An Overview

Lexical Complexity Controlled Sentence Generation

A Context-Aware Approach for the Identification of Complex Words in Natural Language Texts

Enhancing Model Robustness Via Lexical Distilling

MANTIS at TSAR-2022 Shared Task: Improved Unsupervised Lexical Simplification with Pretrained Encoders

XRJL-HKUST at SemEval-2021 Task 4: WordNet-Enhanced Dual Multi-head Co-Attention for Reading Comprehension of Abstract Meaning

CIRCE at SemEval-2020 Task 1: Ensembling Context-Free and Context-Dependent Word Representations

Unsupervised Paraphrasing of Multiword Expressions

Exploiting Word Semantics to Enrich Character Representations of Chinese Pre-trained Models

Improving Lexical Embeddings for Robust Question Answering

ComplexityNet: Increasing LLM Inference Efficiency by Learning Task Complexity

UM6P-CS at SemEval-2022 Task 11: Enhancing Multilingual and Code-Mixed Complex Named Entity Recognition via Pseudo Labels using Multilingual Transformer