Abstract: This thesis provides methods and analysis of models which make progress on this goal. The techniques outlined are task-agnostic and should provide benefit when used with nearly any transformer LM. We introduce two new finetuning methods which add new capabilities to the models they are used on. The first adds a recurrence mechanism, which removes the fixed-window size constraint and improves the efficiency of a transformer decoder. The second allows masked language models (MLMs) to be used to initialize both the encoder and decoder of a non-autoregressive sequence-to-sequence transformer, opening up generative applications of models which were previously used only for natural language understanding tasks. We also introduce two new techniques for improving the quality of predictions of any transformer decoder without additional finetuning. One, hidden state optimization, improves the quality of predictions at inference time, especially for few-shot classification. The other, conditional beam search, allows practitioners to search for natural language generation (NLG) model outputs with high likelihood while conditioning on the event that the output is not degenerate (e.g., empty or repetitive). Finally, we provide theoretical and empirical insights on the divergence between model likelihood and output quality, which has been widely observed in prior work. These insights apply to any model which represents a distribution over text, including language models which are not transformers or even autoregressive. We argue that the NLP community has, to some extent, misunderstood the implications of these findings, and we encourage a more nuanced point of view.

Do You Have the Right Scissors? Tailoring Pre-trained Language Models via Monte-Carlo Methods

MolTailor: Tailoring Chemical Molecular Representation to Specific Tasks via Text Prompts

Mechanistic Behavior Editing of Language Models

Automated Data Curation for Robust Language Model Fine-Tuning

Cross-model Control: Improving Multiple Large Language Models in One-time Training

Tailoring Language Generation Models under Total Variation Distance

Improving Text Embeddings for Smaller Language Models Using Contrastive Fine-tuning

Model Tailor: Mitigating Catastrophic Forgetting in Multi-modal Large Language Models

Making Language Models Better Tool Learners with Execution Feedback

Distilling Knowledge Learned in BERT for Text Generation

Benchmarking Middle-Trained Language Models for Neural Search

Making the Most of your Model: Methods for Finetuning and Applying Pretrained Transformers

Empirical Analysis of Efficient Fine-Tuning Methods for Large Pre-Trained Language Models

TransTailor: Pruning the Pre-trained Model for Improved Transfer Learning

Better Language Models of Code through Self-Improvement

A Reproducibility Study of Goldilocks: Just-Right Tuning of BERT for TAR

Sparse is Enough in Fine-tuning Pre-trained Large Language Models

Improving Language Understanding by Generative Pre-Training

AnyTaskTune: Advanced Domain-Specific Solutions through Task-Fine-Tuning

Self-Distillation Bridges Distribution Gap in Language Model Fine-Tuning

Language Anisotropic Cross-Lingual Model Editing