Abstract: Large language models (LLMs) have transformed human writing by improving grammar correction, content expansion, and stylistic refinement. However, their widespread use raises concerns about authorship, originality, and ethics, and may even threaten scholarly integrity. Existing detection methods, which rely mainly on single-feature analysis and binary classification, often fail to identify LLM-generated text in academic contexts. To address these challenges, we propose a novel Multi-level Fine-grained Detection (MFD) framework that detects LLM-generated text by integrating low-level structural, high-level semantic, and deep-level linguistic features, while conducting sentence-level evaluations of lexicon, grammar, and syntax for comprehensive analysis. To improve detection of subtle differences in LLM-generated text and to enhance robustness against paraphrasing, we apply two mainstream evasion techniques to rewrite the text. These rewritten variants, together with the original texts, are used to train a text encoder via contrastive learning, which extracts high-level sentence-level semantic features and improves detection generalization. Furthermore, we use an advanced LLM to analyze the entire text and extract deep-level linguistic features, enhancing the model's ability to capture complex patterns and nuances while incorporating contextual information. Extensive experiments on public datasets show that the MFD model outperforms existing methods, achieving an MAE of 0.1346 and an accuracy of 88.56%. Our research provides institutions and publishers with an effective mechanism for detecting LLM-generated text, mitigating the risk of compromised authorship. Educators and editors can use the model's predictions to refine verification and plagiarism-prevention protocols and to ensure adherence to standards.
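
The contrastive training step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the loss function, so the InfoNCE-style objective below is an assumption, as are the names `cosine`, `info_nce_loss`, and the `temperature` parameter. Each original sentence embedding is treated as an anchor, the embedding of its paraphrased rewrite as the positive, and the other rewrites in the batch as negatives. Embeddings are assumed to be non-zero vectors.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(
    anchors: Sequence[Vector],
    positives: Sequence[Vector],
    temperature: float = 0.1,  # hypothetical default, not from the paper
) -> float:
    """Mean InfoNCE loss over a batch.

    anchors[i] is the embedding of an original sentence and positives[i]
    the embedding of its rewritten variant; positives[j] for j != i act
    as in-batch negatives.
    """
    if not anchors or len(anchors) != len(positives):
        raise ValueError("anchors and positives must be non-empty and equal in length")

    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Log-sum-exp with max subtraction for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


if __name__ == "__main__":
    originals = [[1.0, 0.0], [0.0, 1.0]]
    rewrites = [[0.9, 0.1], [0.2, 0.8]]
    print(info_nce_loss(originals, rewrites))
```

The loss is small when each original sentence lies closer to its own rewrite than to the other rewrites in the batch, which is the property that would make an encoder robust to paraphrasing. In practice the embeddings would come from a trainable text encoder and the loss would be minimized with gradient descent.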


Unveiling Large Language Models Generated Texts: A Multi-Level Fine-Grained Detection Framework
