Marijuana Addiction Prediction Models by Gender in Young Adults Using Random Forest


Citation: Choi, J., Jung, H. T., & Choi, J. (2021). Marijuana addiction prediction models by gender in young adults using random forest. Online Journal of Nursing Informatics (OJNI), 25(2).


Background: Research indicates that marijuana is the most used illegal substance in states where this substance remains criminalized, with increasing use among young adults. There are important gender differences in substance use behavior. Specific risk variables implying different marijuana use behavior by gender are unclear. 

Method: This was a data mining study using machine learning. Random Forest, a machine learning algorithm, was used to build prediction models and a Minimum Redundancy Maximum Relevance feature (variable) selection to identify important risk variables by gender in young adults who abuse marijuana. The Diagnostic and Statistical Manual of Mental Disorders, 5th edition (DSM-5) (American Psychiatric Association, 2013) was used to identify current marijuana abusers in the National Survey on Drug Use and Health survey data.

Results: A total of 22,411 participants were identified as marijuana abusers (male = 10,619; female = 11,792). A total of 2,651 variables were included in the Minimum Redundancy Maximum Relevance feature selection process. A prediction model built with 1% of all variables (n=27) showed the best performance (ROC = 0.9617) in the male group, and one built with 1.2% of the variables (n=32) showed the best performance (ROC = 0.9600) in the female group. Both groups used multiple substances besides marijuana. The male group showed use of substances that boost pleasure or relieve pain, while the female group showed use of anxiety-relieving substances.

Conclusion: The best performing marijuana addiction prediction models and identified risk variables by gender could be used for developing new effective marijuana prevention, treatment, and rehabilitation programs for men and women who overuse this substance.  


Marijuana is the most commonly used addictive substance in the United States (Miller et al., 2017). The rate of marijuana use has increased among young adults and pregnant women in recent decades (National Institute on Drug Abuse (NIDA), 2021b). The current climate of legalizing and decriminalizing marijuana use may contribute to further increases (Krauss et al., 2017; Pearson et al., 2017). Studies indicate that marijuana use is associated with various psychosocial and medical problems in young adults (Di Forti et al., 2019; NIDA, 2019), in addition to lower birth weight and harmful impacts on fetal brain development in pregnant women (Gunn et al., 2016). Despite the urgent need for treatments for this youth group, little information is available about the critical determinants leading to marijuana use.

Researchers indicate there are significant gender differences in symptoms related to marijuana addiction. Assari and colleagues found that male marijuana users were at greater risk of developing depressive symptoms than female users (Assari et al., 2018). A survey study found that female participants reported more marijuana withdrawal symptoms than male participants. In the same study, male participants indicated that they were likely to buy marijuana from dealers while female participants were likely to receive it as a gift, implying behavioral differences (Ho, 2020). The National Institute on Drug Abuse report revealed a wide gender gap in the prevalence of marijuana users (Carliner et al., 2017; NIDA, 2021a). The National Institute on Drug Abuse (2020) also showed that men have higher dependence on all types of drugs, including marijuana, than women. The report suggests that men with marijuana addiction often have other substance use disorders and antisocial personality disorders, while women experience panic attacks and anxiety disorders (NIDA, 2020). To develop marijuana addiction treatment and rehabilitation programs tailored to gender-specific needs, gender-specific prediction models are necessary.

The purpose of this study was to build marijuana addiction prediction models based on risk variables by gender for young adults (18-34 years) using Random Forest (RF), a machine learning algorithm. RF was selected because it has a proven record of building highly accurate prediction models across many diverse datasets (Mun & Geng, 2019; Na et al., 2018; Samad et al., 2019; Wongvibulsin et al., 2019). Data were drawn from the National Survey on Drug Use and Health (NSDUH), and Minimum Redundancy Maximum Relevance feature (variable) selection was used to identify risk variables. Substance-related variables in the prediction models of the male and female groups were analyzed to determine whether any differences exist.


Machine Learning

Due to advanced technology, a large volume of data has been generated and accumulated in every field and specialty. These data come in different formats and modalities, including numbers, text, videos, and signals from wearable devices. The fundamentally different characteristics of these data pose challenges for analysis. Machine learning techniques are popular because they can analyze large and complex sets of data (Blum & Langley, 1997; Cai et al., 2018). Prediction models built with significant predictors using machine learning have become useful tools in health care because they can be applied to new patients’ data to predict patient outcomes or prognoses (Krumholz, 2014; Symons et al., 2019).

National Survey on Drug Use and Health (NSDUH)

The Substance Abuse and Mental Health Services Administration (SAMHSA) and Center for Behavioral Health Statistics and Quality (CBHSQ) jointly conducted the NSDUH (2016). This was the 36th in a series of surveys measuring the prevalence and correlates of illicit drug, alcohol, and tobacco use. The target population was citizens of the U.S., including those living on military bases, who were 12 years of age or older at the time of the survey. Data were collected from 50 states and the District of Columbia using an audio computer-assisted self-interviewing method. The data are publicly available and have already undergone a confidentiality review; they were altered when necessary to reduce the risk of disclosure (USDHHS, 2017).

Study Design

The design was a data mining study using a machine learning approach. Random Forest and Minimum Redundancy Maximum Relevance were applied to survey data to build prediction models and identify risk variables by gender.

Study Sample

The sample comprised the NSDUH (2016) survey data for respondents 18 to 34 years of age (n=56,897), who were defined as young adults for the purpose of this study.

Defining Labels

To identify a “current marijuana abuser”, the Diagnostic and Statistical Manual of Mental Disorders, 5th edition (DSM-5) (American Psychiatric Association, 2013) and the codebook (USDHHS, 2017) published by SAMHSA and CBHSQ were used. Eleven survey questions were identified and validated by an opioid expert with more than 20 years of clinical experience in identifying a problematic pattern of marijuana use. A participant who answered yes to at least two of those survey questions (Table 1) was categorized as a “current marijuana abuser”. The rationale for this labeling was based on the DSM-5 diagnostic criteria for marijuana use disorder. Table 1 shows partial examples of the survey questions identifying a problematic pattern of marijuana use, which the opioid expert mapped to DSM-5 diagnostic criteria.

Table 1: Partial examples of survey questions mapped with DSM-5 diagnostic criteria
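The labeling rule described above (at least two "yes" answers among the eleven criteria questions) can be sketched in a few lines of Python. This is only an illustration: the item names `criterion_1` to `criterion_11` are hypothetical placeholders, not the actual NSDUH variable names, and the study itself did not publish this code.

```python
from typing import Mapping, Sequence

# Hypothetical placeholder names for the eleven criteria questions;
# the real NSDUH codebook names are not given in the text.
CRITERIA_ITEMS: Sequence[str] = [f"criterion_{i}" for i in range(1, 12)]
MIN_YES = 2  # threshold stated in the text


def is_current_abuser(responses: Mapping[str, str]) -> bool:
    """Return True when at least MIN_YES criteria items were answered 'yes'.

    Items absent from `responses` are treated as not endorsed.
    """
    yes_count = sum(
        1 for item in CRITERIA_ITEMS
        if responses.get(item, "").strip().lower() == "yes"
    )
    return yes_count >= MIN_YES


example = {"criterion_1": "yes", "criterion_4": "Yes", "criterion_7": "no"}
print(is_current_abuser(example))                 # two endorsed items
print(is_current_abuser({"criterion_1": "yes"}))  # only one endorsed item
```

Treating unanswered items as "not endorsed" is an assumption of this sketch; the article does not say how missing answers to the criteria questions were handled.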

Data Preprocessing

The total number of records in the data was 56,897 with 2,662 variables. The eleven survey questions (variables) used to identify marijuana abusers (labels) were removed from the data, leaving 2,651 variables for the computation. Records for males aged 18-34 years and females of the same age range numbered 10,619 and 11,792, respectively.

Imputation for Missing Data

A mode imputation method was used to fill in missing data in this study. Although this method is simple, studies indicate that it performs at a level similar to more complex imputation methods, such as expectation-maximization and hot-deck imputation (Gmel, 2001; Neurons, 2018). Some 14,183,342 missing responses were identified in the dataset, out of 151,459,814 expected responses (56,897 records × 2,662 variables). Hence, the missing data rate was 9.36%.
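As a quick check of the arithmetic above, and as a minimal sketch of what mode imputation does to a single variable, consider the following Python snippet. The helper `mode_impute` and the toy column are illustrative assumptions; the study's own preprocessing code is not shown in the article.

```python
from collections import Counter


def mode_impute(column):
    """Replace None entries with the most frequent observed value."""
    observed = [v for v in column if v is not None]
    if not observed:
        return list(column)  # no observed value to take a mode from
    mode_value = Counter(observed).most_common(1)[0][0]
    return [mode_value if v is None else v for v in column]


# Figures reported in the text.
records, variables = 56_897, 2_662
missing = 14_183_342

expected = records * variables
missing_rate = 100 * missing / expected

print(expected)                 # 151459814
print(round(missing_rate, 2))   # 9.36
print(mode_impute([1, 2, None, 2, None, 3]))  # [1, 2, 2, 2, 2, 3]
```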

Feature (Variable) Ranking and Selection

Since NSDUH data include a mix of continuous and categorical values, the Minimum Redundancy Maximum Relevance (MRMR) (Ding & Preng, 2017) method was used for the ranking and selection of variables. MRMR increases the representative power of the selected variables by favoring those that are strongly related to the label (maximum relevance) while being maximally dissimilar to each other (minimum redundancy). With this method, a smaller yet more representative set of variables can be selected. To identify the minimum set of variables that produces optimal performance, stepwise variable selection was performed. The performance of each model built with the selected variables was measured for comparison.
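To make the idea concrete, here is a minimal sketch of a greedy MRMR ranking for discrete variables, using mutual information for both relevance and redundancy and the common "relevance minus mean redundancy" score. It is an assumption-laden illustration rather than the procedure used in the study (which relied on MATLAB): the function names, the toy data, and the choice of the difference form are all ours.

```python
from collections import Counter
from math import log


def mutual_information(x, y):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * log(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )


def mrmr_rank(features, label, k):
    """Greedily pick k feature names: high relevance, low redundancy."""
    relevance = {
        name: mutual_information(col, label)
        for name, col in features.items()
    }
    selected, remaining = [], set(features)

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(
            mutual_information(features[name], features[s])
            for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while remaining and len(selected) < k:
        # sorted() makes tie-breaking deterministic.
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: the label depends on two independent bits, x and y.
x = [0, 0, 0, 0, 1, 1, 1, 1]
y = [0, 0, 1, 1, 0, 0, 1, 1]
label = [2 * a + b for a, b in zip(x, y)]
features = {"x": x, "x_dup": list(x), "y": y}

chosen = mrmr_rank(features, label, k=2)
print(chosen)
```

In this toy example `x_dup` is as relevant as `x` but fully redundant with it, so the ranking prefers `y` as the second variable even though all three have the same relevance to the label.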

Stepwise Variable Selection

Representative variables are needed from a large set of variables, especially in healthcare. Although the representative set is significantly smaller than the original set, it allows clinicians to examine the variables thoroughly and to infer important relationships with the dependent variable (label), because the selected variables capture most of the information in the original variables and have low redundancy (Borovicka et al., 2012; Feng Pan et al., 2005; Vabalas et al., 2019). The initial analysis was conducted to identify the approximate number of variables that yields optimal performance as representative variables. First, the performance of the prediction model with the best 10% of all variables (n=2,651) ranked by MRMR was measured. Next, the performance of the prediction model with the best 20% of variables was measured. This process was repeated until 100% of the variables were included in the measurement.

The subsequent analysis was conducted to identify more precisely the number of variables that yields optimal performance. The performance of the prediction model with the best 1% of all variables ranked by MRMR was measured, then with the best 2%, and so on until the best 10% were included. The model with the best 1% of variables performed the best in this round. Further feature selection was performed to determine whether even fewer variables could achieve optimal performance. The performance of the prediction model with the best 0.5% of all variables ranked by MRMR was measured, and the process was repeated in steps up to 1.4% of variables.
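The three rounds of stepwise selection can be written down as a schedule of percentages and the corresponding number of top-ranked variables. The sketch below assumes 0.1% steps in the final round and assumes that fractional counts are rounded up; neither is stated explicitly in the text, although rounding up does reproduce the reported counts of 27 variables at 1% and 32 variables at 1.2%.

```python
TOTAL_VARIABLES = 2_651


def top_k(tenths_of_percent: int, total: int = TOTAL_VARIABLES) -> int:
    """Number of top-ranked variables for a percentage given in 0.1% units.

    Uses integer ceiling division (rounding up is an assumption).
    """
    return (total * tenths_of_percent + 999) // 1000


# Percentages are expressed in tenths of a percent to avoid float error.
coarse_round = range(100, 1001, 100)  # 10%, 20%, ..., 100%
fine_round = range(10, 101, 10)       # 1%, 2%, ..., 10%
finest_round = range(5, 15)           # 0.5%, 0.6%, ..., 1.4% (assumed step)

for name, schedule in [
    ("coarse", coarse_round),
    ("fine", fine_round),
    ("finest", finest_round),
]:
    print(name, [(t / 10, top_k(t)) for t in schedule])
```

Each entry in the printed lists pairs a percentage with the number of MRMR-ranked variables a model would be trained on at that step.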

Applying Random Forest Algorithm

Random Forest is an ensemble algorithm that is computationally efficient on large datasets. The algorithm constructs many decision trees, each using randomly selected subsets of variables. The strengths of Random Forest are low bias, low variance, and low correlation between the constructed trees (Chen et al., 2017; Kesler et al., 2017; Oshiro et al., 2012). MATLAB 2019A (Natick, MA) was used for training the models. Random Forest was configured with sixty-four decision trees, which a previous study recommended as the optimal number of trees (Oshiro et al., 2012), and 10-fold cross validation was used to evaluate performance.
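The evaluation protocol can be illustrated with a small sketch of how 10-fold cross validation partitions the records. The study used MATLAB, so the Python below is only an illustration; the function name `k_fold_indices`, the fixed seed, and the toy sample size are assumptions, and the Random Forest training step itself is left as a comment.

```python
import random

N_TREES = 64  # number of trees reported in the text
N_FOLDS = 10  # folds used for cross validation


def k_fold_indices(n_samples, k=N_FOLDS, seed=0):
    """Yield (train, test) index lists for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(25))
for train, test in splits:
    # A model with N_TREES trees would be fit on `train` and scored on
    # `test` here; performance is then averaged over the folds.
    assert not set(train) & set(test)

print([len(test) for _, test in splits])
```

Every record appears in exactly one test fold, so each record contributes to the performance estimate once while never being used to train the model that scores it.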


Table 2 summarizes the demographic breakdown of the 22,411 respondents. About 56% of the male group and 55% of the female group reported being non-Hispanic white. About 20% of both males and females identified as Hispanic. About 65% of respondents in each group had graduated from high school and had some college or an associate degree.

Table 2: Demographics of respondents identified as marijuana abuser (18-34 years old)

A total of 2,651 variables were included in the Minimum Redundancy Maximum Relevance (MRMR) feature selection process. Five performance measures were used: sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and the area under the receiver operating characteristic (ROC) curve. In this round, the best-performing prediction models were those built with the best 1% of all variables (n=2,651), which corresponds to 27 variables. Table 3 shows partial lists of the highest-ranked variables by MRMR in the male and female groups.

Table 3: Partial list of highest ranked variables by Minimum Redundancy Maximum Relevance (MRMR) in male and female groups

Prediction models were also built with fewer than 1% of all variables, and their performance was measured. Table 4 shows the performance of the prediction models starting with 0.5% of the 2,651 variables. The model trained using the best 1% of the 2,651 variables showed the highest performance (ROC=0.9617) in the male group. Models trained using the best 1.2% and 1.3% of variables showed the same highest performance (ROC=0.9600) in the female group. The model with the best 1.2% of variables was considered the best for the female group in this study because it required fewer variables than the model with 1.3% of variables. This model included 32 variables.

Table 4: Performance of the prediction models built based on identified risk variables by Minimum Redundancy Maximum Relevance (MRMR) feature (variable) selection with 2,651 variables
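For readers less familiar with the performance measures reported here, the sketch below shows how sensitivity, specificity, PPV, and NPV follow from a confusion matrix, and how the area under the ROC curve can be computed as the fraction of abuser/non-abuser pairs that a model ranks correctly. All counts and scores in the example are made up for illustration; they are not results from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Threshold-based measures from confusion-matrix counts.

    Assumes every denominator is non-zero.
    """
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }


def roc_auc(scores_pos, scores_neg):
    """Area under the ROC curve via pairwise ranking (ties count half)."""
    wins = sum(
        (p > n) + 0.5 * (p == n)
        for p in scores_pos
        for n in scores_neg
    )
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical counts and predicted scores, for illustration only.
metrics = classification_metrics(tp=90, fp=5, tn=95, fn=10)
auc = roc_auc(scores_pos=[0.9, 0.8, 0.4], scores_neg=[0.5, 0.3, 0.2])

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
print(f"auc: {auc:.4f}")
```

The pairwise form of the AUC is quadratic in the number of records, which is fine for a small example but would normally be replaced by a sort-based computation on data of the size used in the study.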

Substance-related variables included in the best-performing prediction models were extracted for the male and female groups. Table 5 shows the risk variables related to substances.

Table 5: Risk variables related to substance use that are different in male and female groups


This study identified the best-performing marijuana addiction prediction models by gender in young adults (18-34 years) using Random Forest and Minimum Redundancy Maximum Relevance (MRMR) feature selection. The results show that the prediction models included different risk variables for the male and female groups.

Both groups indicated use of Demerol products in past years, but the male group indicated “Demerol products use” while the female group indicated “Demerol products misuse.” This implies that behavior toward pain reliever addiction differs between the male and female groups and influences marijuana addiction. Another behavioral difference in drug misuse was that males indicated first misuse before age 21 to seek pain relief or stimulation, whereas females indicated they misused for sedation (Table 3). The risk variables in Table 5 show that the types of substance use and misuse differ between the male and female groups. Males were often addicted to powerful hallucinogenic drugs and females to tranquilizers.

Machine learning applications have been introduced to healthcare to facilitate patient care. Prediction models built by machine learning have become a valuable tool for care providers because interpretable models can explain the reasoning behind diagnoses, care plans, and patient risk classifications (Ahmad et al., 2018; Wongvibulsin et al., 2019). The prediction models developed in this study could help care providers in substance and drug abuse specialties deliver care customized by gender. For example, they can develop educational materials focused more on the side effects of pain relievers and alternative pain relief methods for men, while emphasizing the harmful impacts of anxiety and anxiety-relieving methods for women.


Although the National Survey on Drug Use and Health (NSDUH) data are useful, some limitations exist. The data are based on self-reports of drug use and dependence. The validity of the data is contingent upon respondents’ honesty and the reliability of their memory. The degree of underreporting and over-reporting of information is unknown. The survey is cross-sectional, measuring responses at a single point in time, and therefore cannot capture how abuse and dependency behaviors develop over time.

Although the excluded percentage of the population is only 3%, their inclusion could generate different results. If other feature selection methods and prediction algorithms had been used, different risk variables with different optimal performance might have been found. If more than one opioid expert had validated the identified risk variables, the study results might have been different.


Researchers indicate there are significant gender differences in substance abuse behavior. This research focused on marijuana abuse and built marijuana addiction prediction models while finding differences in risk variables. The results of this study suggest that different risk variables exist by gender in young adults (18-34 years old) who abuse marijuana. The prediction models built with Random Forest and Minimum Redundancy Maximum Relevance (MRMR), together with the different risk variables found in the male and female groups, could form the basis of tailored marijuana and substance treatment and rehabilitation programs for men and women. For example, care providers can develop separate online peer support web portals with asynchronous chatrooms for men and women. Video education materials showing how substance abuse affects pregnancy could be a practical education or prevention tool aimed at women in treatment and rehabilitation programs. Further, the findings may better inform policymakers in primary school education when developing early prevention activities targeting at-risk youth.




The authors acknowledge support for this study from the University of North Carolina, School of Nursing, Irwin Belk Professor Faculty Mentoring Funds.


Ahmad, M., Eckert, C., Teredesai, A., & McKelvey, G. (2018). Interpretable Machine Learning in Healthcare. IEEE Intelligent Informatics Bulletin, 19(1), 1-7.

American Psychiatric Association. (2013). Diagnostic and Statistical Manual of Mental Disorders (DSM-5®), Fifth Edition. APA.

Assari, S., Mistry, R., Caldwell, C.H., & Zimmerman, M.A. (2018). Marijuana Use and Depressive Symptoms; Gender Differences in African American Adolescents. Frontiers in Psychology, 9, 2135.

Blum, A.L., & Langley, P. (1997). Selection of relevant features and examples in machine learning. Artificial Intelligence, 97, 245-271.

Borovicka, T., Jirina, M., Kordik, P., & Jiri, M. (2012). Selecting Representative Data Sets. In Advances in Data Mining Knowledge Discovery and Applications.

Cai, J., Luo, J., Wang, S., & Yang, S. (2018). Feature selection in machine learning: A new perspective. Neurocomputing, 300, 70-79.

Carliner, H., Mauro, P.M., Brown, Q.L., Shmulewitz, D., Rahim-Juwel, R., Sarvet, A.L., Wall, M.M., Martins, S.S., Carliner, G., & Hasin, D.S. (2017). The widening gender gap in marijuana use prevalence in the U.S. during a period of economic change, 2002-2014. Drug and Alcohol Dependence, 170, 51-58.

Chen, W., Xie, X., Wang, J., Pradhan, B., Hong, H., Bui, D.T., Duan, Z., & Ma, J. (2017). A comparative study of logistic model tree, random forest, and classification and regression tree models for spatial prediction of landslide susceptibility. Catena, 151, 147-160.

Di Forti, M., Quattrone, D., Freeman, T.P., Tripoli, G., Gayer-Anderson, C., Quigley, H., Rodriguez, V., Jongsma, H.E., Ferraro, L., La Cascia, C., La Barbera, D., Tarricone, I., Berardi, D., Szöke, A., Arango, C., Tortelli, A., Velthorst, E., Bernardo, M., Del-Ben, C.M., ... van der Ven, E. (2019). The contribution of cannabis use to variation in the incidence of psychotic disorder across Europe (EU-GEI): a multicentre case-control study. The Lancet Psychiatry, 6(5), 427-436.

Ding, C., & Preng, H. (2017). Minimum Redundancy Feature Selection from Microarray Gene Expression Data. Journal of Bioinformatics and Computational Biology, 3(2), 185-205.

Feng Pan, W.W., Tung, A.K., & Yang, J. (2005). Finding Representative Set from Massive Data. Fifth IEEE International Conference on Data Mining (ICDM'05), Houston, TX.

Gmel, G. (2001). Imputation of missing values in the case of a multiple item instrument measuring alcohol consumption. Statistics in Medicine, 20(15), 2369-2381.

Gunn, J., Rosales, C., Center, K., Núñez, A., Gibson, S., Christ, C., & Ehiri, J. (2016). Prenatal exposure to cannabis and maternal and child health outcomes: A systematic review and meta-analysis. BMJ Open, 6(4), e009986.

Ho, J. (2020). Gender Differences in Marijuana-related Problems among those who Self-reported Daily Smoking and Marijuana Use. Wright State University.

Kesler, S.R., Rao, A., Blayney, D.W., Oakley-Girvan, I.A., Karuturi, M., & Palesh, O. (2017). Predicting Long-Term Cognitive Outcome Following Breast Cancer with Pre-Treatment Resting State fMRI and Random Forest Machine Learning. Frontiers in Human Neuroscience, 11, 555.

Krauss, M.J., Rajbhandari, B., Sowles, S.J., Spitznagel, E.L., & Cavazos-Rehg, P. (2017). A latent class analysis of poly-marijuana use among young adults. Addictive Behaviors, 75, 159-165.

Krumholz, H.M. (2014). Big data and new knowledge in medicine: the thinking, training, and tools needed for a learning health system. Health Affairs, 33(7), 1163-1170.

Miller, N.S., Oberbarnscheidt, T., & Gold, M.S. (2017). Marijuana Addictive Disorders and DSM-5 Substance-Related Disorders. Journal of Addiction Research & Therapy, S11.

Mun, E., & Geng, F. (2019). Predicting post-experiment fatigue among healthy young adults: Random forest regression analysis. Psychological Test and Assessment Modeling, 61(4), 471-493.

Na, L., Yang, C., Lo, C.C., Zhao, F., Fukuoka, Y., & Aswani, A. (2018). Feasibility of Reidentifying Individuals in Large National Physical Activity Data Sets From Which Protected Health Information Has Been Removed With Use of Machine Learning. JAMA Network Open, 1(8), e186040.

Neurons, T. (2018). Comparison between Denoising Autoencoders and Random Forest for Imputation of Mixed Data from Electronic Medical Records. University of California Los Angeles.

National Institute on Drug Abuse (NIDA). (2019). Marijuana DrugFacts.

National Institute on Drug Abuse (NIDA). (2020). Sex and Gender Differences in Substance Use. Substance Use in Women Research Report.

National Institute on Drug Abuse (NIDA). (2021a). Sex and Gender Differences in Substance Use. Retrieved June 16, 2021, from

National Institute on Drug Abuse (NIDA). (2021b). What is the scope of marijuana use in the United States? Retrieved June 16, 2021, from

Oshiro, T., Santoro, P., & Baranauskas, J. (2012). How Many Trees in a Random Forest? In Perner (Ed.), Machine Learning and Data Mining in Pattern Recognition. (Vol. 7376). Springer, Berlin, Heidelberg.

Pearson, M.R., Liese, B.S., Dvorak, R.D., & Marijuana Outcomes Study Team. (2017). College student marijuana involvement: Perceptions, use, and consequences across 11 college campuses. Addictive Behaviors, 66, 83-89.

Samad, M.D., Ulloa, A., Wehner, G.J., Jing, L., Hartzel, D., Good, C.W., Williams, B.A., Haggerty, C.M., & Fornwalt, B.K. (2019). Predicting Survival From Large Echocardiography and Electronic Health Record Datasets: Optimization With Machine Learning. JACC Cardiovascular Imaging, 12(4), 681-689.

Symons, M., Feeney, G.F.X., Gallagher, M.R., Young, R.M., & Connor, J.P. (2019). Machine learning vs addiction therapists: A pilot study predicting alcohol dependence treatment outcome from patient data in behavior therapy with adjunctive medication. J Subst Abuse Treat, 99, 156-162.

USDHHS. (2017). 2016 National Survey on Drug Use and Health Public Use File Codebook.

Vabalas, A., Gowen, E., Poliakoff, E., & Casson, A.J. (2019). Machine learning algorithm validation with a limited sample size. PLoS One, 14(11), e0224365.

Wongvibulsin, S., Wu, K.C., & Zeger, S.L. (2019). Clinical risk prediction with random forests for survival, longitudinal, and multivariate (RF-SLAM) data analysis. BMC Medical Research Methodology, 20(1), 1.

Jeeyae Choi, PhD, RN, is an Associate Professor at the School of Nursing, University of North Carolina Wilmington, whose background is in nursing and computer systems engineering. Dr. Choi has designed and implemented multiple web-based decision support systems (DSSs) and mobile apps throughout 13 years of her career. While she is focusing on developing standalone applications on the web and mobile platforms, she also has concentrated on data science and data standards. It was a natural consequence because developed applications need to use accurate and sharable data.

Hee-Tae Jung, PhD, MS, is a postdoctoral research fellow at the College of Information and Computer Sciences, University of Massachusetts Amherst. Dr. Jung’s primary research focus is on improving the quality of contemporary rehabilitation services through a multidisciplinary approach that encompasses biomedical and health informatics, human-computer interaction, and clinical sciences. His research involves building prediction models based on big data sets with the goal of improving patients’ health outcomes.

Jeungok Choi, RN, PhD, MPH, is an Associate Professor at the University of Massachusetts, Amherst, College of Nursing. Her research interests focus on tablet-based interventions such as Tab-CBI (a tablet-based cognitive behavioral intervention application to improve fatigue symptoms for older adults with rheumatoid arthritis) and ASSISTWell (a tablet-based application designed to promote older adults’ self-management of chronic health problems). She also focuses on building prediction models using a machine learning approach and natural language processing.