Abstract:
Background: Substance misuse presents significant global public health challenges. Understanding transitions between substance types and the timing of shifts to polysubstance use is vital to developing effective prevention and recovery strategies. The gateway hypothesis suggests that high-risk substance use is preceded by lower-risk substance use. However, the source of this correlation is hotly contested. While some claim that low-risk substance use causes subsequent, riskier substance use, most people who use low-risk substances do not escalate to higher-risk substances. Social media data hold the potential to shed light on the factors contributing to substance use transitions.
Objective: Using social media data, this study aimed to gain a better understanding of substance use pathways. By identifying and analyzing individuals' transitions between different risk levels of substance use, we sought to find specific linguistic cues in their social media posts that could indicate escalating or de-escalating patterns in substance use.
Methods: We conducted a large-scale analysis of Reddit data collected between 2015 and 2019, consisting of over 2.29 million posts and approximately 29.37 million comments by around 1.4 million users of substance use subreddits. From these data, we created a risk transition data set reflecting the substance use behaviors of over 1.4 million users. We used deep learning and machine learning techniques to predict escalation or de-escalation transitions in risk levels, based on initial transition phases documented in posts and comments. We also conducted a linguistic analysis of the language patterns associated with transitions in substance use, emphasizing the role of n-gram features in predicting future risk trajectories.
Results: Our models showed promise in predicting escalation or de-escalation transitions in risk levels from Reddit users' historical data on initial transition phases across drug-related subreddits, with an accuracy of 78.48% and an F1-score of 79.20%. We highlighted key predictive features, such as specific substance names and tools indicative of future risk escalations. Our linguistic analysis showed that terms linked with harm reduction strategies were instrumental in signaling de-escalation, whereas descriptors of frequent substance use were characteristic of escalating transitions.
Conclusions: This study sheds light on the complexities surrounding the gateway hypothesis of substance use through an examination of web-based behavior on Reddit. While certain findings support the hypothesis, indicating a progression from lower-risk substances such as marijuana to higher-risk ones, a significant number of individuals did not show this transition. The research underscores the potential of combining machine learning with social media analysis to predict substance use transitions. Our results point toward future directions for leveraging social media data in substance use research and underline the importance of continued exploration before direct implications for interventions are suggested.
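The abstract refers to n-gram features and reports accuracy and F1-score without defining them. The following is a minimal illustrative sketch in plain Python, not the authors' pipeline, of how word n-grams can be counted from a post and how accuracy and F1 are computed for a binary escalation label. The function names, the example sentence, and the toy labels (1 = escalation, 0 = de-escalation) are all assumptions made for illustration and do not come from the study's data.

```python
from collections import Counter


def ngrams(text, n_max=2):
    """Count word n-grams (n = 1..n_max) in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels, treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


# Hypothetical post text, invented for illustration only.
features = ngrams("clean needle exchange")

# Toy labels, invented for illustration: 1 = escalation, 0 = de-escalation.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
acc, f1 = accuracy_and_f1(y_true, y_pred)
```

In practice, counts like these would be fed to a classifier as features. F1 is the harmonic mean of precision and recall, so it can differ from accuracy when the two classes are imbalanced or when errors are asymmetric.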

Practical foundations of machine learning for addiction research. Part I. Methods and techniques

Machine learning with neuroimaging biomarkers: Application in the diagnosis and prediction of drug addiction

Craving for a Robust Methodology: A Systematic Review of Machine Learning Algorithms on Substance-Use Disorders Treatment Outcomes

Applying Machine Learning Approaches to Suicide Prediction Using Healthcare Data: Overview and Future Directions

Multimodal-based machine learning approach to classify features of internet gaming disorder and alcohol use disorder: A sensor-level and source-level resting-state electroencephalography activity and neuropsychological study

A primer on the use of machine learning to distil knowledge from data in biological psychiatry

Machine Learning Analysis of Cocaine Addiction Informed by DAT, SERT, and NET-Based Interactome Networks

Genomic and Personalized Medicine Approaches for Substance Use Disorders (SUDs) Looking at Genome-Wide Association Studies

Artificial Intelligence-driven and technological innovations in the diagnosis and management of substance use disorders

Machine Learning of Functional Connectivity to Biotype Alcohol and Nicotine Use Disorders

The promise of machine learning in predicting treatment outcomes in psychiatry

Proteome-informed machine learning studies of cocaine addiction

Application of bi-modal signal in the classification and recognition of drug addiction degree based on machine learning

Digital Traces from a Substance Use Disorder Forum: Using Machine Learning of Online Expression to Explain Recovery Trajectories (Preprint)

Revolutionizing Addiction Medicine: The Role of Artificial Intelligence

Getting Started with Machine Learning for Experimental Biochemists and Other Molecular Scientists

Leveraging Large Language Models to Extract Information on Substance Use Disorder Severity from Clinical Notes: A Zero-shot Learning Approach

Examining the Gateway Hypothesis and Mapping Substance Use Pathways on Social Media: Machine Learning Approach

Utilizing deep learning and graph mining to identify drug use on Twitter data

Problematic internet use (PIU): Associations with the impulsive-compulsive spectrum. An application of machine learning in psychiatry

Neuroimaging Biomarkers in Addiction