Abstract: The performance of large language models (LLMs) in natural language processing (NLP) tasks is significantly influenced by the quality and diversity of data used for supervised fine-tuning (SFT). Current data selection methods often focus solely on quality or diversity, leading to underperforming models due to suboptimal training data. In this paper, we introduce GraphFilter, a novel method that represents the dataset as a bipartite graph, linking sentences to their constituent n-grams. This representation effectively captures the relationships between sentences and linguistic patterns, facilitating the selection of sentences that enhance n-gram diversity. To balance quality and diversity during selection, we propose a priority function that combines the quality metric with the diversity metric in a multiplicative manner. GraphFilter iteratively selects high-priority sentences, updates the bipartite graph by removing covered n-grams, and re-calculates priorities to reflect the evolving data landscape. We conduct extensive experiments using three model backbones across six widely used benchmarks. The results demonstrate that GraphFilter outperforms all nine baseline approaches, achieving superior model performance and computational efficiency. Our analyses validate the effectiveness of our design choices, examine the subsets selected by GraphFilter and other methods, highlight the importance of instruction diversity, and explore the role of quality and diversity in relation to subset sizes. GraphFilter establishes a new foundation for effective data selection strategies, encouraging further research in data selection for LLMs.

What problem does this paper attempt to address?

This paper addresses the following problem: in natural language processing (NLP) tasks, the performance of large language models (LLMs) is significantly affected by the quality and diversity of the data used for supervised fine-tuning (SFT). Current data selection methods usually focus on only quality or only diversity, which yields suboptimal training data and, in turn, underperforming models. The paper therefore proposes a method that balances quality and diversity during data selection in order to improve LLM performance. Specifically, the paper points out:

1. **Limitations of existing methods**: Existing data selection methods either focus too heavily on quality and ignore the diversity of language patterns, which hurts the model's ability to generalize, or over-emphasize diversity and admit low-quality data, which reduces overall model performance.
2. **Research objective**: To overcome these limitations, the paper proposes GraphFilter, a method that optimizes quality and diversity jointly during data selection by modeling the dataset as a bipartite graph.

### Main contributions of GraphFilter

1. **Bipartite graph model**: The dataset is represented as a bipartite graph in which sentences and n-grams form two disjoint sets of nodes, and an edge indicates that an n-gram occurs in a sentence. This representation captures the relationship between sentences and language patterns and makes it straightforward to select sentences that increase n-gram diversity.
2. **Priority function**: A priority function multiplies a quality metric by a diversity metric, so a sentence ranks highly only if it is both of high quality and contributes substantially to n-gram diversity.
3. **Iterative selection algorithm**: GraphFilter iteratively selects the highest-priority sentence, updates the bipartite graph by removing the n-grams that sentence covers, and recalculates the priorities of the remaining sentences. The ranking therefore adapts as selection proceeds, so that the final subset covers a wide range of language patterns while maintaining high quality.

### Experimental verification

The paper evaluates GraphFilter through extensive experiments with three model backbones (Gemma-2-2B, Mistral-7B-v0.3, Llama-3-8B) on six widely used benchmarks. The results show that GraphFilter outperforms all nine baseline methods, improving model performance while also being more computationally efficient.

### Summary

By introducing GraphFilter, this paper provides an effective way to balance quality and diversity in data selection and thereby improve LLM performance. The reported experiments show gains over the baselines in both model performance and computational efficiency.
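The selection procedure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`ngrams`, `graph_select`), whitespace tokenization, the use of unigrams and bigrams, and the diversity score (the count of a sentence's not-yet-covered n-grams) are all assumptions made for illustration. The summary only states that quality and diversity are combined multiplicatively and that covered n-grams are removed after each pick.

```python
from collections import defaultdict


def ngrams(text, n_max=2):
    """Return the set of word n-grams (n = 1..n_max) in a sentence."""
    tokens = text.lower().split()  # assumption: naive whitespace tokenization
    return {
        tuple(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    }


def graph_select(sentences, quality, budget, n_max=2):
    """Greedily pick up to `budget` sentence indices.

    Priority = quality score * number of still-uncovered n-grams
    (an assumed stand-in for the paper's diversity metric).
    """
    # Bipartite graph as two adjacency maps: sentence -> n-grams, n-gram -> sentences.
    sent_to_grams = {i: ngrams(s, n_max) for i, s in enumerate(sentences)}
    gram_to_sents = defaultdict(set)
    for i, grams in sent_to_grams.items():
        for g in grams:
            gram_to_sents[g].add(i)

    selected = []
    while sent_to_grams and len(selected) < budget:
        # Re-evaluate priorities on the current graph; ties go to the lower index.
        best = max(
            sent_to_grams,
            key=lambda i: (quality[i] * len(sent_to_grams[i]), -i),
        )
        selected.append(best)
        # Remove the chosen sentence and every n-gram it covers from the graph.
        covered = sent_to_grams.pop(best)
        for g in covered:
            for j in gram_to_sents.pop(g):
                if j in sent_to_grams:
                    sent_to_grams[j].discard(g)
    return selected


sentences = [
    "the cat sat on the mat",
    "the cat sat on the mat today",
    "graphs link sentences to patterns",
]
quality = [0.9, 0.8, 0.6]  # hypothetical quality scores
print(graph_select(sentences, quality, budget=2))  # [1, 2]
```

In this toy run, sentence 0 has the highest quality score but is skipped: once sentence 1 is selected, all of sentence 0's n-grams are already covered, so its priority drops to zero and the lower-quality but novel sentence 2 is chosen instead.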

The Best of Both Worlds: Bridging Quality and Diversity in Data Selection with Bipartite Graph
