Abstract:
Background: Large language models including GPT-4 (OpenAI) have opened new avenues in health care and qualitative research. Traditional qualitative methods are time-consuming and require expertise to capture nuance. Although large language models have demonstrated enhanced contextual understanding and inferencing compared with traditional natural language processing, their performance in qualitative analysis relative to that of humans remains unexplored.
Objective: We evaluated the effectiveness of GPT-4 versus human researchers in qualitative analysis of interviews with patients with adult-acquired buried penis (AABP).
Methods: Qualitative data were obtained from semistructured interviews with 20 patients with AABP. Human analysis followed a structured 3-stage process: initial observations, line-by-line coding, and consensus discussions to refine themes. In contrast, artificial intelligence (AI) analysis with GPT-4 proceeded in two phases: (1) a naïve phase, in which GPT-4 outputs were independently evaluated by a blinded reviewer to identify themes and subthemes, and (2) a comparison phase, in which AI-generated themes were compared with human-identified themes to assess agreement. We used a general qualitative description approach.
Results: The study population (N=20) comprised predominantly White (17/20, 85%), married (12/20, 60%), heterosexual (19/20, 95%) men, with a mean age of 58.8 years and BMI of 41.1 kg/m². Human analysis identified "urinary issues" in 95% (19/20) of interviews and GPT-4 in 75% (15/20), with the subtheme "spray or stream" noted in 60% (12/20) and 35% (7/20), respectively. "Sexual issues" were prominent in both analyses (19/20, 95% humans vs 16/20, 80% GPT-4), although humans identified a wider range of subthemes, including "pain with sex or masturbation" (7/20, 35%) and "difficulty with sex or masturbation" (4/20, 20%). Both analyses highlighted "mental health issues" at the same rate (11/20, 55%), although humans coded "depression" more frequently (10/20, 50% humans vs 4/20, 20% GPT-4). Humans frequently cited "issues using public restrooms" (12/20, 60%) as impacting social life, whereas GPT-4 emphasized "struggles with romantic relationships" (9/20, 45%). "Hygiene issues" were consistently recognized (14/20, 70% humans vs 13/20, 65% GPT-4). Only the human analysis identified "contributing factors" as a theme, doing so in all interviews. There was moderate agreement between human and GPT-4 coding (κ=0.401). Reliability assessments of GPT-4's analyses showed consistent coding for themes including "body image struggles," "chronic pain" (10/10, 100%), and "depression" (9/10, 90%). Other themes, such as "motivation for surgery" and "weight challenges," were reliably coded (8/10, 80%), while less frequent themes were variably identified across multiple iterations.
Conclusions: Large language models including GPT-4 can effectively identify key themes when analyzing qualitative health care data, showing moderate agreement with human analysis. Although human analysis provided a richer diversity of subthemes, the consistency of AI suggests its use as a complementary tool in qualitative research. With AI rapidly advancing, future studies should iterate analyses and circumvent token limitations by segmenting data, furthering the breadth and depth of large language model-driven qualitative analyses.
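The agreement statistic reported above (κ) can be illustrated with a minimal sketch. The abstract does not state which kappa variant was used or how codes were tabulated, so this assumes Cohen's kappa for two raters who each mark a theme as present (1) or absent (0) per interview. The function name `cohen_kappa` and the toy data are hypothetical and are not taken from the study, so the printed value does not reproduce κ=0.401.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items (assumed variant)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement expected from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy data (NOT from the study): 1 = theme coded, 0 = not coded.
human = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
gpt4 = [1, 0, 1, 0, 1, 1, 1, 0, 0, 1]
kappa = cohen_kappa(human, gpt4)
print(f"kappa = {kappa:.3f}")
```

Kappa corrects raw percent agreement for the agreement two raters would reach by chance given how often each applies a code, which is why it can be modest even when theme-level percentages look similar.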

Comparing the Efficacy of GPT-4 and Chat-GPT in Mental Health Care: A Blind Assessment of Large Language Models for Psychological Support

GPT is an effective tool for multilingual psychological text analysis

Insights into the potential benefits and challenges of AI-driven large language models/ChatGPT-4 for predicting Autism Spectrum Disorder

Can Large Language Models be Used to Provide Psychological Counselling? An Analysis of GPT-4-Generated Responses Using Role-play Dialogues

How ChatGPT works: a mini review

Evaluation of ChatGPT for NLP-based Mental Health Applications

Can AI Relate: Testing Large Language Model Response for Mental Health Support

Large Language Models for Individualized Psychoeducational Tools for Psychosis: A cross-sectional study

Risk of adverse outcome in patients referred by emergency medical services in Switzerland.

Applications of large language models in psychiatry: a systematic review

Application, investigation and prediction of ChatGPT/GPT-4 for clinical cases in medical field

Can ChatGPT provide a better support: a comparative analysis of ChatGPT and dataset responses in mental health dialogues

Evaluating cognitive performance: Traditional methods vs. ChatGPT

An Assessment on Comprehending Mental Health through Large Language Models

ChatGPT Assisting Diagnosis of Neuro-ophthalmology Diseases Based on Case Reports

The Now and Future of ChatGPT and GPT in Psychiatry

Beyond rating scales: With targeted evaluation, large language models are poised for psychological assessment

The Clinical Utility of Large Language Models in Diagnosing Neurocognitive Disorders among NACC Participants

Comparing GPT-4 and Human Researchers in Health Care Data Analysis: Qualitative Description Study

Leveraging ChatGPT to optimize depression intervention through explainable deep learning

Separation of effector cells mediating antibody-dependent cellular cytotoxicity (ADC) to erythrocyte targets from those mediating ADC to tumor targets.