Abstract: Large language models (LLMs) are capable of writing grammatical text that follows instructions, answers questions, and solves problems. As they have advanced, it has become difficult to distinguish their output from human-written text. While past research has found some differences in surface features such as word choice and punctuation, and developed classifiers to detect LLM output, none has studied the rhetorical styles of LLMs. Using several variants of Llama 3 and GPT-4o, we construct two parallel corpora of human- and LLM-written texts from common prompts. Using Douglas Biber's set of lexical, grammatical, and rhetorical features, we identify systematic differences between LLMs and humans and between different LLMs. These differences persist when moving from smaller models to larger ones, and are larger for instruction-tuned models than base models. This demonstrates that despite their advanced abilities, LLMs struggle to match human styles, and hence more advanced linguistic features can detect patterns in their behavior not previously recognized.

What problem does this paper attempt to address?

The problem this paper attempts to address is: although large language models (LLMs) can generate grammatically correct, instruction-following text, their style and rhetoric still differ from human writing. Specifically, the paper systematically analyzes the differences in lexical, grammatical, and rhetorical features between LLM and human writing to explore whether these models can truly mimic human writing style.

### Main Research Questions:

1. **Is the writing style of LLMs similar to that of humans?**
   - Researchers constructed parallel corpora to compare the writing styles of LLMs and humans in different types of texts, including grammatical structures, lexical choices, and rhetorical techniques.
2. **Are there differences in writing styles between different LLMs?**
   - Researchers analyzed the stylistic differences in text generation among multiple versions of LLMs (such as Llama 3 and GPT-4o), particularly between instruction-tuned models and base models.
3. **How does the writing style of LLMs change with different sizes and training methods?**
   - Researchers explored the impact of model size (such as Llama 3 8B vs. 70B) and training methods (such as base models vs. instruction-tuned models) on writing style.
4. **How can language features distinguish LLMs from human writing?**
   - Researchers used Biber's set of linguistic feature tags to extract various linguistic features and evaluated how well these features distinguish LLM writing from human writing through classification models.

### Research Methods:

- **Data Collection**: Constructed two parallel corpora based on the Corpus of Contemporary American English (COCA) and a self-built Human-AI Parallel English Corpus (HAP-E).
- **Text Generation**: Used multiple LLMs (such as Llama 3 and GPT-4o) to generate texts similar in style to human texts.
- **Feature Extraction**: Used Biber's set of linguistic feature tags to extract lexical, grammatical, and rhetorical features from the texts.
- **Classification Models**: Evaluated how well these features distinguish LLM writing from human writing using random forest and LASSO-penalized logistic regression models.

### Research Results:

- **Overall Accuracy**: The random forest model achieved 66% test accuracy in distinguishing LLM writing from human writing, well above the 14% accuracy expected from random guessing.
- **Stylistic Differences**: Instruction-tuned LLMs showed a clear preference for certain grammatical structures (such as present participle clauses and nominalizations) and lexical choices, making them easier to distinguish.
- **Impact of Model Size and Training Methods**: Larger models are not necessarily closer to human writing, and instruction-tuned models differ more from human writing in certain features.
- **Lexical Choices**: LLMs differ significantly from humans in how often they use certain words, such as the overuse of some complex relational words and the low frequency of some uncommon words.

### Discussion:

- **Role of Instruction Tuning**: Instruction tuning significantly influences the writing style of LLMs, pushing them towards an information-dense, nominalized writing style rather than a more natural human style.
- **Importance of Linguistic Features**: Biber's set of linguistic feature tags performed well in modeling and classifying texts, revealing implicit differences between machine-generated texts and human writing.
- **Education and Application**: The results underscore the need to attend to the limitations of LLM-generated text, and to ways of improving it, in teaching and practical applications, especially in fields such as creative writing, teaching materials, and argumentative texts.
In summary, this paper uses systematic linguistic feature analysis to reveal the limitations of LLMs in writing style, and provides a useful reference point for future research and applications.
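
To make the feature-extraction and evaluation steps in this summary concrete, here is a minimal Python sketch. It is not the paper's pipeline. `rate_per_1000` is a hypothetical helper that approximates a single Biber-style feature (nominalizations) by suffix matching alone, whereas a real tagger relies on part-of-speech information, and the example sentences are invented. `chance_accuracy` assumes balanced classes; the seven-class setting used below is an inference from the quoted 14% figure (1/7 ≈ 14.3%), not something the summary states.

```python
import re

# Suffixes commonly used to spot nominalizations. Matching on suffix alone is
# an assumption made for illustration; it will misfire on words like "moment".
NOMINAL_SUFFIXES = ("tion", "tions", "ment", "ments", "ness", "nesses", "ity", "ities")
WORD = re.compile(r"[A-Za-z']+")


def is_nominalization(token: str) -> bool:
    """Crude suffix test; the length filter skips short words such as 'city'."""
    lowered = token.lower()
    return len(lowered) > 5 and lowered.endswith(NOMINAL_SUFFIXES)


def rate_per_1000(text: str) -> float:
    """Nominalizations per 1,000 word tokens, so texts of different lengths compare."""
    tokens = WORD.findall(text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if is_nominalization(token))
    return 1000.0 * hits / len(tokens)


def chance_accuracy(n_classes: int) -> float:
    """Expected accuracy of uniform random guessing over balanced classes."""
    if n_classes < 1:
        raise ValueError("n_classes must be positive")
    return 1.0 / n_classes


# Invented example sentences, not drawn from either corpus.
conversational = "We talked for a while and then she left because it was late."
nominalized = (
    "The implementation of the assessment requires consideration "
    "of organizational development."
)

if __name__ == "__main__":
    print(rate_per_1000(conversational))
    print(rate_per_1000(nominalized))
    # Assumed seven-way task: uniform guessing lands near the quoted 14%.
    print(round(100 * chance_accuracy(7)))
```

On these two invented sentences the nominalized one scores far higher per 1,000 tokens, which is the kind of gap the summary attributes to instruction-tuned models. A real study would compute dozens of such rates per text and feed them to a classifier.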

Do LLMs write like humans? Variation in grammatical and rhetorical styles

Contrasting Linguistic Patterns in Human and LLM-Generated Text

Whose LLM is it Anyway? Linguistic Comparison and LLM Attribution for GPT-3.5, GPT-4 and Bard

Contrasting Linguistic Patterns in Human and LLM-Generated News Text

A Confederacy of Models: a Comprehensive Evaluation of LLMs on Creative Writing

Human Variability vs. Machine Consistency: A Linguistic Analysis of Texts Generated by Humans and Large Language Models

Language models align with human judgments on key grammatical constructions

The LLM Effect: Are Humans Truly Using LLMs, or Are They Being Influenced By Them Instead?

Real or Robotic? Assessing Whether LLMs Accurately Simulate Qualities of Human Responses in Dialogue

Can AI writing be salvaged? Mitigating Idiosyncrasies and Improving Human-AI Alignment in the Writing Process through Edits

LLMs' Understanding of Natural Language Revealed

Vernacular? I Barely Know Her: Challenges with Style Control and Stereotyping

The Unlikely Duel: Evaluating Creative Writing in LLMs through a Unique Scenario

Easy Problems That LLMs Get Wrong

Large Language Models are as persuasive as humans, but how? About the cognitive effort and moral-emotional language of LLM arguments

"Which LLM should I use?": Evaluating LLMs for tasks performed by Undergraduate Computer Science Students

Measuring and Benchmarking Large Language Models' Capabilities to Generate Persuasive Language

Can Large Language Models Be an Alternative to Human Evaluations?

Do LLMs Find Human Answers To Fact-Driven Questions Perplexing? A Case Study on Reddit

Caveat Lector: Large Language Models in Legal Practice