Abstract: Large Language Models (LLMs) excel in general tasks but struggle in specialized domains like healthcare due to limited domain-specific knowledge. Supervised Fine-Tuning (SFT) data construction for domain adaptation often relies on heuristic methods, such as GPT-4 annotation or manual data selection, with a data-centric focus on presumed diverse, high-quality datasets. However, these methods overlook the model's inherent knowledge distribution and therefore introduce noise, redundancy, and irrelevant data. The result is a mismatch between the selected data and the model's learning task, and suboptimal performance. To address this, we propose a two-stage model-centric data selection framework, Decomposed Difficulty Data Selection (3DS), which aligns data with the model's knowledge distribution for optimized adaptation. In Stage 1, we apply Prompt-Driven Data Selection via Explicit Alignment, where the model filters irrelevant or redundant data based on its internal knowledge. In Stage 2, we perform Decomposed Difficulty Data Selection, where data selection is guided by our defined difficulty decomposition, using three metrics: Instruction Understanding, Response Confidence, and Response Correctness. Additionally, an attention-based importance weighting mechanism captures token importance for more accurate difficulty calibration. This two-stage approach ensures the selected data is not only aligned with the model's knowledge and preferences but also appropriately challenging for the model to learn, leading to more effective and targeted domain adaptation. In a case study of the medical domain, our extensive experiments on real-world healthcare datasets show that 3DS outperforms existing methods in accuracy by over 5.29%.
Our dataset and code will be open-sourced at <a class="link-external link-https" href="https://anonymous.4open.science/r/3DS-E67F" rel="external noopener nofollow">this https URL</a>.
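As an illustration only, the idea of attention-weighted difficulty scoring followed by difficulty-based selection might be sketched as below. The abstract does not give the formulas for the three metrics, so everything here is an assumption: the function names `weighted_nll` and `select_by_difficulty`, the use of an attention-weighted negative log-likelihood as a stand-in for a difficulty score, and the choice of a middle difficulty band as "appropriately challenging" are hypothetical and are not taken from the 3DS code release.

```python
import math
from typing import List, Sequence


def weighted_nll(token_logprobs: Sequence[float], attention: Sequence[float]) -> float:
    """Attention-weighted mean negative log-likelihood of one response.

    Assumption: tokens with larger attention weight count more toward the
    difficulty score. This is a hypothetical stand-in, not the paper's metric.
    """
    if not token_logprobs or len(token_logprobs) != len(attention):
        raise ValueError("token_logprobs and attention must be non-empty and equal length")
    total = sum(attention)
    if total <= 0:
        raise ValueError("attention weights must sum to a positive value")
    # Normalize by the total weight so the score is a weighted average.
    return -sum(a * lp for a, lp in zip(attention, token_logprobs)) / total


def select_by_difficulty(scores: Sequence[float], low: float, high: float) -> List[int]:
    """Return indices of samples whose score lies in [low, high].

    Assumption: "appropriately challenging" is modeled as a middle band,
    dropping samples that are too easy (< low) or too hard (> high).
    """
    return [i for i, s in enumerate(scores) if low <= s <= high]


if __name__ == "__main__":
    # Toy per-token log-probabilities and attention weights for three samples.
    samples = [
        ([math.log(0.9), math.log(0.95)], [1.0, 1.0]),   # easy
        ([math.log(0.5), math.log(0.25)], [1.0, 1.0]),   # medium
        ([math.log(0.01), math.log(0.02)], [1.0, 3.0]),  # hard
    ]
    scores = [weighted_nll(lp, att) for lp, att in samples]
    print(select_by_difficulty(scores, low=0.5, high=2.0))
```

In this toy run only the medium sample falls inside the band. The thresholds `low` and `high` are arbitrary; in practice they would have to be tuned on the model and dataset at hand.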

iCPE: A Hybrid Data Selection Model for SMT Domain Adaptation

A Systematic Comparison of Data Selection Criteria for SMT Domain Adaptation

Edit Distance: A New Data Selection Criterion for Domain Adaptation in SMT.

Towards Self-Similarity Consistency and Feature Discrimination for Unsupervised Domain Adaptation.

Data Selection Via Semi-supervised Recursive Autoencoders for SMT Domain Adaptation

Topic Model Based Adaptation Data Selection for Domain-Specific Machine Translation.

SMT Domain Adaptation Based on Monolingual Context Information

Data Selection via Optimal Control for Language Models

Connecting Phrase Based Statistical Machine Translation Adaptation.

An Empirical Investigation of Domain Adaptation Ability for Chinese Spelling Check Models

Machine Translation: 15th China Conference, CCMT 2019, Nanchang, China, September 27–29, 2019, Revised Selected Papers

Measuring Domain Similarity for Statistical Machine Translation

Bilingual Recursive Neural Network Based Data Selection for Statistical Machine Translation

A discriminative model selection approach and its application to text classification

CCDC: A Chinese-Centric Cross Domain Contrastive Learning Framework

EM-based Hybrid Model for Bilingual Terminology Extraction from Comparable Corpora.

SSE: Multimodal Semantic Data Selection and Enrichment for Industrial-scale Data Assimilation

Combining Statistical Model and Dictionary for Domain Adaption of Chinese Word Segmentation

A Graph-based Bilingual Corpus Selection Approach for SMT.

Adversarial Domain Adaptation For Chinese Semantic Dependency Graph Parsing

3DS: Decomposed Difficulty Data Selection's Case Study on LLM Medical Domain Adaptation