Abstract: Large Language Models (LLMs) demonstrate remarkable capabilities in replicating human tasks and boosting productivity. However, their direct application to data extraction is limited by a prioritisation of fluency over factual accuracy and a restricted ability to manipulate specific information. To overcome these limitations, this research combines the knowledge representation power of pre-trained LLMs with the targeted information access enabled by Retrieval-Augmented Generation (RAG) models, and investigates a general-purpose, accurate data scraping recipe for RAG models designed for language generation. To capture knowledge in a more modular and interpretable way, we use pre-trained language models with a latent knowledge retriever, which allows the model to retrieve and attend over documents from a large corpus. We use the RAG model architecture and analyse its capabilities in depth on three tasks: (i) semantic classification of HTML elements, (ii) chunking HTML text for effective understanding, and (iii) comparing results from different LLMs and ranking algorithms. While previous work has developed dedicated architectures and training procedures for HTML understanding and extraction, we show that LLMs pre-trained on standard natural language, combined with effective chunking, searching, and ranking algorithms, can serve as an efficient data scraping tool for extracting complex data from unstructured text. Future research directions include addressing the challenges of provenance tracking and dynamic knowledge updates within the proposed RAG-based data extraction framework. By overcoming these limitations, this approach holds the potential to revolutionise data extraction from vast repositories of textual information.
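
The chunking and ranking steps named in the abstract can be illustrated with a minimal sketch. The abstract does not specify the actual chunking, search, or ranking algorithms, so everything here is an assumption made for illustration: the helper names (`html_to_text`, `chunk_words`, `rank_chunks`) are hypothetical, chunks are fixed-size overlapping word windows, and ranking uses plain term-count cosine similarity as a stand-in for whatever ranker the work evaluates. The LLM call that would consume the top-ranked chunks is left out.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Hypothetical helper: reduce an HTML string to its visible text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def chunk_words(text, size=40, overlap=10):
    """Split text into word windows of `size` words sharing `overlap` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_chunks(query, chunks, top_k=3):
    """Return the top_k (score, chunk) pairs, highest score first."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(c))), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    page = (
        "<html><head><style>p{}</style></head><body>"
        "<h1>Acme Widget</h1><p>Price: 19.99 USD</p>"
        "<script>var x=1;</script><p>Ships in two days</p></body></html>"
    )
    chunks = chunk_words(html_to_text(page), size=4, overlap=1)
    for score, chunk in rank_chunks("price", chunks):
        print(f"{score:.2f}  {chunk}")
```

In a full retrieval-augmented pipeline, the highest-scoring chunks would be placed in the prompt of the generator model; a learned dense retriever would typically replace the term-count similarity used in this sketch.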

Self-Training for Label-Efficient Information Extraction from Semi-Structured Web-Pages

Label-Efficient Self-Training for Attribute Extraction from Semi-Structured Web Documents

Learning to Label with Active Learning and Reinforcement Learning

Data-Efficient Information Extraction from Form-Like Documents

Efficient Data Learning for Open Information Extraction with Pre-trained Language Models

SelfLRE: Self-refining Representation Learning for Low-resource Relation Extraction

Semi-supervised Label Enhancement Via Structured Semantic Extraction

A pre-training and self-training approach for biomedical named entity recognition

Large Language Model Is Not a Good Few-shot Information Extractor, but a Good Reranker for Hard Samples!

Extreme Multi-Label Skill Extraction Training using Large Language Models

Gradient Imitation Reinforcement Learning for General Low-Resource Information Extraction

Making Large Language Models Better Data Creators

Large Language Model-guided Document Selection

Leveraging Large Language Models for Web Scraping

Self-training Large Language Models through Knowledge Detection

EIGEN: Expert-Informed Joint Learning Aggregation for High-Fidelity Information Extraction from Document Images

Leveraging Web-Crawled Data for High-Quality Fine-Tuning

Self-Tuning: Instructing LLMs to Effectively Acquire New Knowledge through Self-Teaching

MetaIE: Distilling a Meta Model from LLM for All Kinds of Information Extraction Tasks

SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning