LLM-Generated Natural Language Meets Scaling Laws: New Explorations and Data Augmentation Methods

Zhenhua Wang,Guang Xu,Ming Ren

2024-06-29

Abstract: With the rise of large language models (LLMs), natural language processing has seen advances such as LLM-based data augmentation. Nonetheless, prior research has two primary shortcomings: first, it has not considered whether the natural language generated by LLMs (LLMNL) truly aligns with human natural language (HNL), a critical foundational question; second, it overlooks that augmented data is randomly generated by LLMs, so not all of it may have equal training value, which could impede classifier performance. To address these challenges, we introduce scaling laws to intrinsically quantify LLMNL and HNL. Through extensive experiments, we reveal slight deviations (approximately 0.2 in the Mandelbrot exponent) from Mandelbrot's law in LLMNL, underscore a complexity advantage in HNL, and supplement an interpretive discussion on language style. This establishes a solid foundation for the expansion of LLMs. Further, we introduce a novel data augmentation method for few-shot text classification, termed ZGPTDA, which leverages fuzzy computing mechanisms driven by conformity to scaling laws to make decisions about GPT-4 augmented data. Extensive experiments conducted in real-world scenarios confirm the effectiveness (improving the F1 of BERT and RoBERTa by 7-10%) and competitiveness (surpassing the recent AugGPT and GENCO methods by about 2% accuracy on DeBERTa) of ZGPTDA. In addition, we reveal some interesting insights, e.g., that Hilberg's law and Taylor's law can impart more benefits to text classification.

Computation and Language

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

The paper primarily attempts to address two core issues:

1. **Whether the natural language generated by large language models truly aligns with human natural language**:
   - Although large language models (LLMs) have made significant progress in generating text, it remains an open question whether the natural language they generate (LLMNL) truly aligns with human natural language (HNL).
   - To examine this, the paper applies scaling laws to quantify the similarities and differences between LLMNL and HNL.
   - Experiments show that LLMNL deviates slightly from Mandelbrot's law (by about 0.2 in the Mandelbrot exponent) and that HNL has a complexity advantage.

2. **How to improve the effectiveness of LLM-based data augmentation methods**:
   - Current data augmentation methods typically use LLMs to generate additional training data, but this data is generated randomly, so not all of it may have equal training value.
   - To address this, the paper proposes a new data augmentation method, ZGPTDA, which uses a fuzzy computing mechanism driven by conformity to scaling laws to decide which GPT-4 augmented data to use.
   - In the reported experiments, ZGPTDA improves the F1 scores of BERT and RoBERTa classifiers by 7-10% and surpasses the accuracy of the recent AugGPT and GENCO methods on DeBERTa by about 2%.

In summary, the paper analyzes how closely LLMNL matches HNL through scaling laws and proposes a new data augmentation method to improve few-shot text classification.
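To make the "Mandelbrot exponent" above concrete, here is a minimal sketch of estimating it from a token list. It assumes the commonly used Zipf-Mandelbrot form f(r) = C / (r + b)^a, where r is the frequency rank and a is the exponent; the page does not state the paper's exact fitting procedure. The function names (`rank_frequencies`, `fit_line`, `fit_mandelbrot`) and the grid search over the shift b are hypothetical choices for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def rank_frequencies(tokens):
    """Return token frequencies sorted from most to least frequent."""
    return sorted(Counter(tokens).values(), reverse=True)


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse


def fit_mandelbrot(freqs, shifts=None):
    """Fit f(r) = C / (r + b) ** a in log space and return (a, b, C).

    For each candidate shift b (an assumed grid, not taken from the paper),
    log f is regressed on log(r + b); the b with the smallest squared
    error is kept.
    """
    if shifts is None:
        shifts = [i * 0.1 for i in range(0, 101)]  # b in [0, 10]
    log_f = [math.log(f) for f in freqs]
    best = None
    for b in shifts:
        log_r = [math.log(rank + b) for rank in range(1, len(freqs) + 1)]
        slope, intercept, sse = fit_line(log_r, log_f)
        if best is None or sse < best[0]:
            # The slope of the log-log line is -a; the intercept is log C.
            best = (sse, -slope, b, math.exp(intercept))
    _, a, b, c = best
    return a, b, c


if __name__ == "__main__":
    text = "the cat sat on the mat and the dog sat on the log"
    a, b, c = fit_mandelbrot(rank_frequencies(text.split()))
    print(f"exponent a={a:.3f}, shift b={b:.1f}, scale C={c:.2f}")
```

Fitting the same function to a human-written corpus and to an LLM-generated corpus, then comparing the two values of a, is one way to read the "about 0.2" deviation reported above. A toy sentence like the one in the demo is far too short to give a meaningful exponent.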

Large Language Models are legal but they are not: Making the case for a powerful LegalLLM

Has LLM Reached the Scaling Ceiling Yet? Unified Insights into LLM Regularities and Constraints

Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond

Leveraging Large Language Models for NLG Evaluation: Advances and Challenges

Large Language Models (LLMs): Deployment, Tokenomics and Sustainability

LLM4DS: Evaluating Large Language Models for Data Science Code Generation

Data Augmentation using Large Language Models: Data Perspectives, Learning Paradigms and Challenges

Understanding LLMs: A Comprehensive Overview from Training to Inference

LLMs vs Established Text Augmentation Techniques for Classification: When do the Benefits Outweight the Costs?

DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

Supervised Knowledge Makes Large Language Models Better In-context Learners

Temporal Scaling Law for Large Language Models

Genshin: General Shield for Natural Language Processing with Large Language Models

CodeGen2: Lessons for Training LLMs on Programming and Natural Languages

Assessing Hidden Risks of LLMs: An Empirical Study on Robustness, Consistency, and Credibility

Augmented Large Language Models with Parametric Knowledge Guiding

Leveraging Large Language Models for NLG Evaluation: A Survey

Scaling Generative Tabular Learning for Large Language Models

Breaking Language Barriers: Cross-Lingual Continual Pre-Training at Scale