Abstract: Data-to-text generation is an important task in natural language processing: it converts structured data into descriptive, user-friendly text that helps people understand the data. Applications include news generation, financial report generation, and customer service. In practice, however, a model must adapt to new domains that may lack an annotated training corpus. Distantly-supervised data-to-text generation addresses this data scarcity by constructing the training corpus automatically, which makes it more practical for new domains where well-aligned data is expensive to obtain. The drawback is an over-generation problem: the automatically aligned text contains hallucinated expressions that cannot be inferred from the data and that misguide the model into producing unfaithful text. To exploit the noisy dataset while maintaining faithfulness, we use meta learning to dynamically increase the weights of well-aligned training instances and reduce the weights of low-quality ones when training the neural data-to-text model. To the best of our knowledge, we are the first to alleviate noise in distantly-supervised data-to-text generation via meta learning. In addition, we rewrite the low-quality texts to provide better training instances. Finally, we construct a new distantly-supervised dataset, DIST-ToTTo (Distantly-supervised Table-To-Text), and conduct experiments on both the benchmark WITA dataset (named after its data sources, Wikipedia and Wikidata) and DIST-ToTTo. The evaluation results show that our model improves on the state-of-the-art DSG (Distant Supervision Generation) model across all automatic evaluation metrics, with gains of 3.72% on WITA and 3.82% on DIST-ToTTo in BLEU (BiLingual Evaluation Understudy), a widely used metric. Furthermore, in human evaluation, our model generates more grammatically correct and more faithful text than the state-of-the-art DSG model.
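The abstract does not spell out the weighting rule, so the following is only a minimal sketch of the general idea of meta-learned instance weighting, not the paper's method. It assumes a gradient-agreement scheme in the style of learning-to-reweight approaches: a training instance receives a positive weight only if a gradient step on it would also reduce the loss on a small trusted meta set. The scalar regression model, the `train` and `meta` lists, and all function names are hypothetical stand-ins for a neural data-to-text model and its noisy corpus.

```python
def grad(w, x, y):
    """Gradient of the squared error 0.5 * (w * x - y) ** 2 with respect to w."""
    return (w * x - y) * x


def meta_weights(w, train, meta):
    """Weight each training pair by its gradient agreement with the meta set.

    A pair whose gradient points the same way as the average meta-set
    gradient gets a positive weight; a conflicting pair gets weight 0.
    Weights are normalized to sum to 1 when any of them is positive.
    """
    g_meta = sum(grad(w, x, y) for x, y in meta) / len(meta)
    raw = [max(0.0, grad(w, x, y) * g_meta) for x, y in train]
    total = sum(raw)
    return [r / total if total > 0 else 0.0 for r in raw]


def train_step(w, train, meta, lr=0.1):
    """One weighted gradient step; returns the new parameter and the weights used."""
    weights = meta_weights(w, train, meta)
    step = sum(a * grad(w, x, y) for a, (x, y) in zip(weights, train))
    return w - lr * step, weights


# Hypothetical toy data: the first two pairs follow y = 2x ("well aligned"),
# the last two are deliberately corrupted ("low quality").
train = [(1.0, 2.0), (2.0, 4.0), (1.0, -3.0), (2.0, -5.0)]
# Small trusted meta set, assumed to be clean.
meta = [(1.0, 2.0), (3.0, 6.0)]

w = 0.0
for _ in range(50):
    w, weights = train_step(w, train, meta)

print(f"learned w = {w:.6f}")
print("final instance weights:", [round(a, 3) for a in weights])
```

In this toy setting the two corrupted pairs receive zero weight throughout, so the parameter moves toward the slope of the clean pairs. An unweighted least-squares fit over the same four pairs would instead give w = -0.3, which illustrates why down-weighting poorly aligned instances can matter.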

Neural Data-to-Text Generation Based on Small Datasets: Comparing the Added Value of Two Semi-Supervised Learning Approaches on Top of a Large Language Model

Semi-Supervised Neural Text Generation by Joint Learning of Natural Language Generation and Natural Language Understanding Models

Neural Semi-supervised Learning for Text Classification Under Large-Scale Pretraining

A Semi-Supervised Approach for Low-Resourced Text Generation

Improving Text Classification with Large Language Model-Based Data Augmentation

SDA: Improving Text Generation with Self Data Augmentation

Enhancing Text Generation in Joint NLG/NLU Learning Through Curriculum Learning, Semi-Supervised Training, and Advanced Optimization Techniques

Quality Control for Distantly-Supervised Data-to-Text Generation Via Meta Learning

Leveraging Natural Supervision for Language Representation Learning and Generation

Text Generation with Speech Synthesis for ASR Data Augmentation

CMMQC: Cascaded Multi-Model Quality Control for Unsupervised Data-to-Text Generation

Semi-Supervised Learning for Neural Machine Translation

AUGNLG: Few-shot Natural Language Generation using Self-trained Data Augmentation

Neural Data Augmentation for Legal Overruling Task: Small Deep Learning Models vs. Large Language Models

Curriculum-Based Self-Training Makes Better Few-Shot Learners for Data-to-Text Generation

Have Your Text and Use It Too! End-to-End Neural Data-to-Text Generation with Semantic Fidelity

EvoText: Enhancing Natural Language Generation Models via Self-Escalation Learning for Up-to-Date Knowledge and Improved Performance

Bring Your Own Data! Self-Supervised Evaluation for Large Language Models

Weakly-Supervised Neural Text Classification

Dual Learning for Semi-Supervised Natural Language Understanding