LLM-Generated Natural Language Meets Scaling Laws: New Explorations and Data Augmentation Methods

Zhenhua Wang, Guang Xu, Ming Ren
2024-06-29
Abstract: With the ascent of large language models (LLMs), natural language processing has witnessed enhancements such as LLM-based data augmentation. Nonetheless, prior research harbors two primary concerns: first, a lack of contemplation regarding whether the natural language generated by LLMs (LLMNL) truly aligns with human natural language (HNL), a critical foundational question; second, an oversight that augmented data is randomly generated by the LLM, implying that not all data may possess equal training value, which could impede the performance of classifiers. To address these challenges, we introduce scaling laws to intrinsically calculate LLMNL and HNL. Through extensive experiments, we reveal slight deviations (approximately 0.2 in the Mandelbrot exponent) from Mandelbrot's law in LLMNL, underscore a complexity advantage in HNL, and supplement an interpretive discussion on language style. This establishes a solid foundation for the expansion of LLMs. Further, we introduce a novel data augmentation method for few-shot text classification, termed ZGPTDA, which leverages fuzzy computing mechanisms driven by conformity to scaling laws to make decisions about GPT-4 augmented data. Extensive experiments, conducted in real-world scenarios, confirm the effectiveness (improving the F1 of BERT and RoBERTa by 7-10%) and competitiveness (surpassing the recent AugGPT and GENCO methods by about 2% accuracy on DeBERTa) of ZGPTDA. In addition, we reveal some interesting insights, e.g., Hilberg's law and Taylor's law can impart more benefits to text classification.
Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

The paper addresses two core issues:

1. **Whether the natural language generated by large language models truly aligns with human natural language**:
   - Although large language models (LLMs) have made significant progress in generating text, it remains an open question whether the natural language they generate (LLMNL) truly aligns with human natural language (HNL).
   - To examine this, the paper introduces scaling laws to quantify the similarities and differences between LLMNL and HNL.
   - Experimental results show that LLMNL deviates slightly from Mandelbrot's law (by approximately 0.2 in the Mandelbrot exponent) and that HNL has a complexity advantage.

2. **How to improve the effectiveness of LLM-based data augmentation methods**:
   - Current data augmentation methods typically use LLMs to generate additional training data, but this data is generated randomly, so not all of it has equal training value, which can impede classifier performance.
   - To address this, the paper proposes a new data augmentation method for few-shot text classification, ZGPTDA, which uses a fuzzy computing mechanism driven by conformity to scaling laws to decide which GPT-4 augmented data to use.
   - In the reported experiments, ZGPTDA improves the F1 scores of BERT and RoBERTa classifiers by 7-10% and surpasses the recent AugGPT and GENCO methods by about 2% accuracy on DeBERTa.

In summary, the paper uses scaling laws to analyze how closely LLMNL matches HNL, and proposes a new data augmentation method to improve performance on text classification tasks.
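
To make the idea of measuring a text against Mandelbrot's law concrete, the sketch below fits the Zipf-Mandelbrot form f(r) = C / (r + b)^a to a rank-frequency list and compares the fitted exponent a between two corpora. It is only an illustration under stated assumptions, not the paper's procedure: the whitespace tokenizer, the grid of candidate shifts b, the log-log least-squares fit, and the function names `rank_frequencies`, `fit_mandelbrot`, and `exponent_gap` are all hypothetical choices. It also does not implement the fuzzy computing mechanism that ZGPTDA uses.

```python
import math
from collections import Counter


def rank_frequencies(text):
    """Return word frequencies sorted in descending order (rank 1 first).

    Whitespace tokenization is an illustrative assumption.
    """
    counts = Counter(text.lower().split())
    return sorted(counts.values(), reverse=True)


def fit_mandelbrot(freqs, shifts=(0.0, 0.5, 1.0, 2.0, 3.0, 5.0)):
    """Fit log f(r) = log C - a * log(r + b) by least squares.

    Each candidate shift b is tried in turn (the grid is an assumption),
    and the fit with the highest R^2 is returned as (a, b, r_squared).
    Requires at least two ranks.
    """
    n = len(freqs)
    ys = [math.log(f) for f in freqs]
    mean_y = sum(ys) / n
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    best = None
    for b in shifts:
        xs = [math.log(r + b) for r in range(1, n + 1)]
        mean_x = sum(xs) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        slope = sxy / sxx
        intercept = mean_y - slope * mean_x
        ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
        r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
        if best is None or r2 > best[2]:
            best = (-slope, b, r2)  # the exponent a is the negated slope
    return best


def exponent_gap(freqs_a, freqs_b):
    """Absolute difference between the fitted exponents of two corpora."""
    a1, _, _ = fit_mandelbrot(freqs_a)
    a2, _, _ = fit_mandelbrot(freqs_b)
    return abs(a1 - a2)
```

In practice, one would pass `rank_frequencies(human_corpus)` and `rank_frequencies(generated_corpus)` to `exponent_gap`. The fit is only meaningful on corpora large enough to give a stable rank-frequency curve, so scoring individual short augmented samples this way would need a different aggregation.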