Abstract: "Gold" and "ground truth" human-mediated labels contain error. The effects of this error can escape commonly reported metrics of label quality or obscure questions of accuracy, bias, fairness, and usefulness during model evaluation. This study demonstrates methods for answering such questions even in the context of very low reliabilities from expert humans. We analyze human labels, GPT model ratings, and transformer encoder model annotations describing the quality of classroom teaching, an important, expensive, and currently human-only task. We ask whether such a task can be automated using two Large Language Model (LLM) architecture families, encoders and GPT decoders, and apply novel approaches to evaluating label quality across six dimensions: Concordance, Confidence, Validity, Bias, Fairness, and Helpfulness. First, we demonstrate that using standard metrics in the presence of poor labels can mask both label and model quality: the encoder family of models achieves state-of-the-art, even "super-human", results across all classroom annotation tasks. But not all of these positive results remain under more rigorous evaluation measures, which reveal spurious correlations and nonrandom racial biases across models and humans. This study then extends these methods to estimate how model use would change human label quality if models were used in a human-in-the-loop context, finding that the variance captured in GPT model labels would worsen reliabilities for humans influenced by these models. We identify areas where some LLMs, within the generalizability of the current data, could improve the quality of expensive human ratings of classroom instruction.

AutoEval Done Right: Using Synthetic Data for Model Evaluation

Best Practices and Lessons Learned on Synthetic Data

Synthetic Data for Model Selection

Are Labels Always Necessary for Classifier Accuracy Evaluation?

Self-Taught Evaluators

GANs in the Panorama of Synthetic Data Generation Methods

Energy-based Automated Model Evaluation

A Study on Improving Realism of Synthetic Data for Machine Learning

A Multi-Faceted Evaluation Framework for Assessing Synthetic Data Generated by Large Language Models

"All that Glitters": Approaches to Evaluations with Unreliable Model and Human Annotations

Using GPT-2 to Create Synthetic Data to Improve the Prediction Performance of NLP Machine Learning Classification Models

Better Synthetic Data by Retrieving and Transforming Existing Datasets

Artificial Data, Real Insights: Evaluating Opportunities and Risks of Expanding the Data Ecosystem with Synthetic Data

Keeping Humans in the Loop: Human-Centered Automated Annotation with Generative AI

Active Testing: Sample-Efficient Model Evaluation

Bring Your Own Data! Self-Supervised Evaluation for Large Language Models

Multi-Armed Bandit Approach for Optimizing Training on Synthetic Data

Exploring the Potential of Synthetic Data to Replace Real Data

The Parrot Dilemma: Human-Labeled vs. LLM-augmented Data in Classification Tasks

Fill In The Gaps: Model Calibration and Generalization with Synthetic Data

Analyzing Effects of Fake Training Data on the Performance of Deep Learning Systems