Abstract:

Objectives: Generative large language models (LLMs) are a subset of transformer-based neural network models. LLMs have successfully leveraged a combination of an increased number of parameters, improvements in computational efficiency, and large pre-training datasets to perform a wide spectrum of natural language processing (NLP) tasks. Using a few examples (few-shot) or no examples (zero-shot) for prompt tuning has enabled LLMs to achieve state-of-the-art performance in a broad range of NLP applications. This article by the American Medical Informatics Association (AMIA) NLP Working Group characterizes the opportunities, challenges, and best practices for our community to effectively leverage and advance the integration of LLMs in downstream NLP applications. Such integration can be accomplished through a variety of approaches, including augmented prompting, instruction prompt tuning, and reinforcement learning from human feedback (RLHF).

Target audience: Our focus is on making LLMs accessible to the broader biomedical informatics community, including clinicians and researchers who may be unfamiliar with NLP. Additionally, NLP practitioners may gain insight from the described best practices.

Scope: We focus on 3 broad categories of NLP tasks, namely natural language understanding, natural language inferencing, and natural language generation. We review the emerging trends in prompt tuning, instruction fine-tuning, and evaluation metrics used for LLMs while drawing attention to several issues that impact biomedical NLP applications, including falsehoods in generated text (confabulation/hallucinations), toxicity, and dataset contamination leading to overfitting. We also review potential approaches to address some of these current challenges in LLMs, such as chain-of-thought prompting, and the phenomenon of emergent capabilities observed in LLMs, which can be leveraged to address complex NLP challenges in biomedical applications.

Best Practices for Text Annotation with Large Language Models

Large Language Models for Data Annotation: A Survey

Large Language Models for Data Annotation and Synthesis: A Survey

Prompt Refinement or Fine-tuning? Best Practices for using LLMs in Computational Social Science Tasks

How to use LLMs for Text Analysis

Evaluation is all you need. Prompting Generative Large Language Models for Annotation Tasks in the Social Sciences. A Primer using Open Models

Shaping the Emerging Norms of Using Large Language Models in Social Computing Research

An Interdisciplinary Outlook on Large Language Models for Scientific Research

Large Language Models as Financial Data Annotators: A Study on Effectiveness and Efficiency

Open-Source LLMs for Text Annotation: A Practical Guide for Model Setting and Fine-Tuning

Large language models for biomedicine: foundations, opportunities, challenges, and best practices

The Effectiveness of LLMs as Annotators: A Comparative Overview and Empirical Analysis of Direct Representation

Apprentices to Research Assistants: Advancing Research with Large Language Models

Large language models and academic writing: Five tiers of engagement

The long but necessary road to responsible use of large language models in healthcare research

Are LLMs Better than Reported? Detecting Label Errors and Mitigating Their Effect on Model Performance

Practical Applications of Large Language Models for Health Care Professionals and Scientists

LLMs as Research Tools: A Large Scale Survey of Researchers' Usage and Perceptions

Theoretical and Methodological Framework for Studying Texts Produced by Large Language Models

Are Large Language Models Reliable Argument Quality Annotators?

Large Language Models in Qualitative Research: Can We Do the Data Justice?