Large language models can help with biostatistics and coding needed in radiology research

Adarsh Ghosh, Hailong Li, Andrew T Trout
DOI: https://doi.org/10.1016/j.acra.2024.09.042
2024-10-14
Abstract:

Introduction: Original research in radiology often involves handling large datasets, data manipulation, statistical tests, and coding. Recent studies show that large language models (LLMs) can solve bioinformatics tasks, suggesting their potential in radiology research. This study evaluates the ability of LLMs to provide statistical and deep learning solutions and code for radiology research.

Materials and methods: We used web-based chat interfaces available for ChatGPT-4o, ChatGPT-3.5, and Google Gemini. EXPERIMENT 1: BIOSTATISTICS AND DATA VISUALIZATION: We assessed each LLM's ability to suggest biostatistical tests and generate the corresponding R code using a Cancer Imaging Archive dataset. Prompts were based on statistical analyses from a peer-reviewed manuscript. The generated code was tested in RStudio for correctness, runtime errors, and the ability to generate the requested visualization. EXPERIMENT 2: DEEP LEARNING: We used the RSNA-STR Pneumonia Detection Challenge dataset to evaluate the ability of ChatGPT-4o and Gemini to generate Python code for transformer-based image classification models (Vision Transformer ViT-B/16). The generated code was tested in a Jupyter Notebook for functionality and runtime errors.

Results: Out of the 8 statistical questions posed, correct statistical answers were suggested for 7 (ChatGPT-4o), 6 (ChatGPT-3.5), and 5 (Gemini) scenarios. The R code output by ChatGPT-4o had fewer runtime errors (6 out of the 7 total codes provided) compared to ChatGPT-3.5 (5/7) and Gemini (5/7). Both ChatGPT-4o and Gemini were able to generate the requested visualization with a few runtime errors. Iteratively copying runtime errors from the code generated by ChatGPT-4o into the chat helped resolve them. Gemini initially hallucinated during code generation but was able to provide accurate code on restarting the experiment. ChatGPT-4o and Gemini successfully generated initial Python code for deep learning tasks. Errors encountered during implementation were resolved through iterations using the chat interface, demonstrating the utility of LLMs in providing baseline code for further refinement and in resolving runtime errors.

Conclusion: LLMs can assist in coding tasks for radiology research, providing initial code for data visualization, statistical tests, and deep learning models, helping researchers with foundational biostatistical knowledge. While LLMs can offer a useful starting point, they require users to refine and validate the code, and caution is necessary due to potential errors, the risk of hallucinations, and data privacy regulations.

Summary statement: LLMs can help with coding and statistical problems in radiology research. This can help primary authors troubleshoot coding needed in radiology research.
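
The iterative error-resolution workflow described in the Results (run the generated code, copy any runtime error back into the chat, retry) can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' procedure: the study used web-based chat interfaces by hand, whereas this sketch automates the loop. The names `run_snippet`, `refine`, and `ask_llm` are hypothetical, and `ask_llm` stands in for whatever chat interface or API the reader uses.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, Optional, Tuple


def run_snippet(code: str, timeout: float = 30.0) -> Optional[str]:
    """Run `code` in a separate Python process.

    Returns None if it exits cleanly, otherwise the error text.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "snippet.py"
        script.write_text(code, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return f"Timed out after {timeout} seconds"
    if result.returncode == 0:
        return None
    return result.stderr.strip() or f"Exited with code {result.returncode}"


def refine(
    code: str,
    ask_llm: Callable[[str], str],  # hypothetical: prompt in, revised code out
    max_rounds: int = 3,
) -> Tuple[str, int, Optional[str]]:
    """Feed runtime errors back to the model until the code runs or rounds run out.

    Returns (final_code, rounds_used, last_error); last_error is None on success.
    """
    error = run_snippet(code)
    rounds = 0
    while error is not None and rounds < max_rounds:
        prompt = (
            "The following code fails at runtime.\n\n"
            f"Code:\n{code}\n\nError:\n{error}\n\n"
            "Return corrected code only."
        )
        code = ask_llm(prompt)
        rounds += 1
        error = run_snippet(code)
    return code, rounds, error


if __name__ == "__main__":
    # Stub "model" for demonstration only: it always returns a working snippet.
    final_code, rounds, error = refine("print(1 / 0)", lambda prompt: "print('ok')")
    print(rounds, error)
```

A clean run only shows that the code executes. As the Conclusion notes, the user still has to check that the statistical test or model is the right one and that the output is correct, and should avoid pasting protected data into a chat interface.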