TnT-LLM: Text Mining at Scale with Large Language Models

Mengting Wan, Tara Safavi, Sujay Kumar Jauhar, Yujin Kim, Scott Counts, Jennifer Neville, Siddharth Suri, Chirag Shah, Ryen W White, Longqi Yang, Reid Andersen, Georg Buscher, Dhruv Joshi, Nagu Rangan

2024-03-19

Abstract: Transforming unstructured text into structured and meaningful forms, organized by useful category labels, is a fundamental step in text mining for downstream analysis and application. However, most existing methods for producing label taxonomies and building text-based label classifiers still rely heavily on domain expertise and manual curation, making the process expensive and time-consuming. This is particularly challenging when the label space is under-specified and large-scale data annotations are unavailable. In this paper, we address these challenges with Large Language Models (LLMs), whose prompt-based interface facilitates the induction and use of large-scale pseudo labels. We propose TnT-LLM, a two-phase framework that employs LLMs to automate the process of end-to-end label generation and assignment with minimal human effort for any given use-case. In the first phase, we introduce a zero-shot, multi-stage reasoning approach which enables LLMs to produce and refine a label taxonomy iteratively. In the second phase, LLMs are used as data labelers that yield training samples so that lightweight supervised classifiers can be reliably built, deployed, and served at scale. We apply TnT-LLM to the analysis of user intent and conversational domain for Bing Copilot (formerly Bing Chat), an open-domain chat-based search engine. Extensive experiments using both human and automatic evaluation metrics demonstrate that TnT-LLM generates more accurate and relevant label taxonomies when compared against state-of-the-art baselines, and achieves a favorable balance between accuracy and efficiency for classification at scale. We also share our practical experiences and insights on the challenges and opportunities of using LLMs for large-scale text mining in real-world applications.

Computation and Language, Artificial Intelligence, Information Retrieval

What problem does this paper attempt to address?

The paper addresses the challenges of two core text mining tasks: taxonomy generation and text classification. Specifically, it targets the following issues:

1. **Costly reliance on domain experts and manual curation**: Most existing methods for creating taxonomies and building text-based classifiers still depend heavily on domain expertise and manual curation, which makes the process expensive and time-consuming.
2. **Lack of large-scale annotated data**: The problem is particularly acute when the label space is under-specified and large-scale data annotations are unavailable.

To address these issues, the paper proposes TnT-LLM (Taxonomy Generation and Text Classification with Large Language Models), a two-phase framework that uses large language models (LLMs) to automate the entire process:

- **Zero-shot, multi-stage reasoning**: In the first phase, LLMs iteratively generate and refine a label taxonomy to fit a specific use case (e.g., intent detection).
- **LLMs as data labelers**: In the second phase, LLMs label data to produce training samples, so that lightweight supervised classifiers can be reliably built, deployed, and served at scale.

The paper validates the framework by applying TnT-LLM to user intent analysis and conversational domain tagging in Microsoft Bing Copilot (formerly Bing Chat). Experiments with both human and automatic evaluation metrics show that the framework generates more accurate and relevant taxonomies than state-of-the-art baselines and achieves a favorable balance between classification accuracy and efficiency. The paper also shares the challenges encountered and insights gained from using LLMs for large-scale text mining in practical applications.
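The two-phase flow described above can be illustrated with a minimal, offline sketch. This is not the authors' implementation: the functions `stub_llm_propose_labels` and `stub_llm_assign_label` are hypothetical stand-ins for real LLM prompts (here they are simple keyword rules), the `KEYWORDS` table and the sample corpus are invented for illustration, and the word-count classifier is a deliberately tiny substitute for whatever lightweight supervised model one would actually train.

```python
import re
from collections import Counter, defaultdict
from typing import Dict, List

# Hypothetical keyword rules that stand in for an LLM's judgment (assumption).
KEYWORDS = {
    "coding": ["python", "bug", "code"],
    "travel": ["flight", "hotel", "trip"],
}

def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

def stub_llm_propose_labels(batch: List[str], taxonomy: List[str]) -> List[str]:
    """Phase 1 stand-in: extend the current taxonomy with labels the batch needs."""
    updated = list(taxonomy)
    for text in batch:
        for label, words in KEYWORDS.items():
            if label not in updated and any(w in tokenize(text) for w in words):
                updated.append(label)
    return updated

def stub_llm_assign_label(text: str, taxonomy: List[str]) -> str:
    """Phase 2 stand-in: pick one taxonomy label for a text, or 'other'."""
    tokens = tokenize(text)
    for label in taxonomy:
        if any(w in tokens for w in KEYWORDS.get(label, [])):
            return label
    return "other"

def generate_taxonomy(corpus: List[str], batch_size: int) -> List[str]:
    """Phase 1: refine the taxonomy iteratively, one minibatch at a time."""
    taxonomy: List[str] = []
    for start in range(0, len(corpus), batch_size):
        batch = corpus[start:start + batch_size]
        taxonomy = stub_llm_propose_labels(batch, taxonomy)
    return taxonomy

def train_classifier(texts: List[str], labels: List[str]) -> Dict[str, Counter]:
    """Phase 2: fit per-label word counts on the pseudo-labeled samples."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for text, label in zip(texts, labels):
        counts[label].update(tokenize(text))
    return dict(counts)

def predict(model: Dict[str, Counter], text: str) -> str:
    """Return the label whose training words overlap most with the text."""
    tokens = tokenize(text)
    return max(model, key=lambda label: sum(model[label][t] for t in tokens))

# Invented sample corpus (assumption, for illustration only).
corpus = [
    "How do I fix this Python bug?",
    "Find me a cheap flight to Rome",
    "Write code to sort a list",
    "Book a hotel for my trip",
]
taxonomy = generate_taxonomy(corpus, batch_size=2)
pseudo_labels = [stub_llm_assign_label(t, taxonomy) for t in corpus]
model = train_classifier(corpus, pseudo_labels)
```

Once trained, only the lightweight `model` is needed at inference time, for example `predict(model, "python code sort")`, which is the point of the second phase: the expensive labeler is used once to create training data rather than on every incoming text.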

Using Large Language Models to Generate, Validate, and Apply User Intent Taxonomies

Large Language Models Meet NLP: A Survey

AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

Large Language Models for Data Annotation: A Survey

Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond

LLM-in-the-loop: Leveraging Large Language Model for Thematic Analysis

Unleashing the Potential of Large Language Models for Predictive Tabular Tasks in Data Science

TPTU: Task Planning and Tool Usage of Large Language Model-based AI Agents

TPTU: Large Language Model-based AI Agents for Task Planning and Tool Usage

Interactive-T2S: Multi-Turn Interactions for Text-to-SQL with Large Language Models

Making Large Language Models Better Data Creators

LTNER: Large Language Model Tagging for Named Entity Recognition with Contextualized Entity Marking

Large Language Models for Data Annotation and Synthesis: A Survey

Adaptable and Reliable Text Classification using Large Language Models

Large Language Models for Social Networks: Applications, Challenges, and Solutions

Large Language Models Offer an Alternative to the Traditional Approach of Topic Modelling

MLCopilot: Unleashing the Power of Large Language Models in Solving Machine Learning Tasks

LLM-augmented Preference Learning from Natural Language