Emerging Technologies

Development of a Breast Cancer Risk Assessment Model Using a Machine Learning Approach



Background: The National Cancer Institute (NCI) Breast Cancer Risk Assessment Tool (BCRAT, also known as the Gail model) is the most widely available tool of its kind. Although the Gail model is well calibrated, it shows low discriminatory accuracy.

Method: The purpose of this project was to develop breast cancer risk prediction models that outperform the Gail model using an innovative machine learning approach. Data from one breast cancer research center (n=393) were accessed and analyzed. Three supervised machine learning algorithms – logistic regression, J48 decision tree, and random forest with ten-fold cross-validation – were used to build and train the models. To select a highly relevant set of variables, we used correlation-based and wrapper-based feature selection. An expectation-maximization imputation method was used to replace missing data (8.28%). The sample was predominantly white (91.1%) and 45 years or older (83.2%).

Results: The six models were successfully built and demonstrated fair to good calibration (E/O = 0.75 to 0.80) and discriminatory accuracy (AUROC = 0.70 to 0.79). The six models outperformed the Gail model (AUROC = 0.70 to 0.79 for the machine learning models vs. 0.53 to 0.63 for the Gail model). The study findings may lead to the development of a new prediction tool for identifying women at high risk of breast cancer. Using a model with improved prediction accuracy, clinicians can tailor screening decisions and counsel behavioral changes (e.g., weight control, exercise, and moderating alcohol intake) to decrease risk.


Despite overall steady declines in death rate in the United States, breast cancer is still the second leading cause of cancer death among women according to the American Cancer Society (American Cancer Society [ACS], 2019). In 2019, there were about 268,600 new invasive cases and 41,760 deaths (ACS, 2019). The development of accurate breast cancer prediction models has become increasingly important to clinicians in identifying individuals at high risk of breast cancer. With accurate prediction information, clinicians can tailor screening decisions and counsel behavioral changes (e.g., weight control, exercise, and moderating alcohol intake) to decrease risk.

Multiple breast cancer risk prediction models have been developed. The NCI Breast Cancer Risk Assessment Tool (BCRAT, also known as the Gail model) (National Cancer Institute, n.d.) is the most widely available tool. Although the Gail model is well calibrated, it shows low discriminatory accuracy (the ability of the model to stratify women into those who do versus those who do not develop breast cancer). According to a recent meta-analysis, the Gail model showed an excellent calibration score (expected-to-observed [E/O] ratio = 0.98) but a low discriminatory accuracy score (area under the ROC curve [AUROC] = 0.61) (Wang et al., 2018). When the model is applied to women aged 75 and older, the discriminatory accuracy is much lower, with AUROC ranging from 0.56 to 0.58 (Schonberg et al., 2016), where an AUROC of 0.5 is equivalent to a coin toss. A model with low discriminatory accuracy generates too many false positive or false negative alerts for clinicians, resulting in unnecessary healthcare costs or threats to patients' health outcomes (Steyerberg et al., 2010). An inaccurate model makes it difficult for clinicians to identify which patients are at risk, which in turn makes it difficult to provide individually tailored preventive care and counseling. There is a need for models with better predictive performance, especially improved discrimination. Given the large amount of research already conducted, further improvement of existing models based on conventional regression is perhaps unlikely. The purpose of this project was to develop breast cancer risk prediction models that outperform the Gail model using an innovative machine learning approach.

Machine Learning Approach

Data mining using machine learning algorithms is gaining popularity due to the ability to process and analyze large and complex data sets. Machine learning, a subset of artificial intelligence, permits a computer to automatically learn from given data and improve from experience without being explicitly programmed to carry out a certain task (e.g., building a prediction model).

The machine learning approach has many strengths over the traditional statistical approach, which until now has been used to develop prediction models, including the Gail model. First, a machine learning approach can handle an enormous dataset with a large number of variables, which is not possible with a traditional statistical approach. Second, machine learning is flexible in handling complex healthcare phenomena such as breast cancer risk prediction, which are often characterized by non-linear relationships and complicated interactions among variables, thereby challenging the assumptions of a traditional statistical approach. Last, the most beneficial feature of a machine learning approach is its ability to automatically select the most useful variables from a given set and to continuously 'refine' a prediction model through an iterative process until the best-fitting model emerges. In the end, machine learning produces a model that best reflects the given dataset, resulting in a more accurate and reliable model than statistics-based prediction models (Deo, 2015). The machine learning approach has demonstrated significantly better predictive accuracy than traditional statistical modeling in hepatocellular carcinoma prediction (Singal et al., 2013), post-operative pain assessment (Tighe et al., 2015), emergency department visit prediction (Zitnik et al., 2019), and fall risk assessment (Lee et al., 2011).


Method

The data from one breast cancer research center were used in this analysis. After removing irrelevant variables (e.g., respondent ID and interview date), the dataset consisted of 393 patients with a total of 129 variables. The variables included baseline demographics (e.g., age, BMI, race/ethnicity, and type of medications previously taken) and pathologic findings (e.g., biopsy findings).

For this project, we chose three supervised machine learning algorithms: logistic regression, J48 decision tree, and random forest, with ten-fold cross-validation. Logistic regression (LR) is an algorithm that solves the binary classification problem; it predicts the probability of an event (developing invasive cancer in this study) by fitting the data to a logistic function. The J48 decision tree (DT) is a hierarchical model composed of decision rules that recursively split independent variables into homogeneous zones based on the most significant splitter among the input variables. Random forest (RF) is an algorithm that builds and then merges a group of classification trees to obtain a prediction of breast cancer incidence (Deo, 2015). LR and RF were chosen because they are less prone to overfitting, while DT is able to handle non-linear interactions among the variables. Weka (Waikato Environment for Knowledge Analysis) 3.8, an open-source data mining package, was used to develop the models. To examine each model's performance, we estimated calibration (E/O ratio) and discriminatory accuracy (sensitivity, specificity, positive predictive value [PPV], negative predictive value [NPV], and AUROC) (Simundic, 2009).
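The study itself used Weka 3.8; purely as an illustration, an analogous pipeline could be sketched in Python with scikit-learn, using synthetic stand-in data (not the study's dataset) and scikit-learn's decision tree in place of Weka's J48:

```python
# Illustrative sketch only: three supervised classifiers evaluated with
# ten-fold cross-validation, mirroring the workflow described above.
# The data below are synthetic stand-ins, not the study's dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier  # stand-in for Weka's J48 (a C4.5 implementation)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical dataset shaped like the study's (393 patients, 129 variables)
X, y = make_classification(n_samples=393, n_features=129, n_informative=10, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
}
for name, model in models.items():
    # Ten-fold cross-validated AUROC, the discrimination measure used in the study
    aucs = cross_val_score(model, X, y, cv=10, scoring="roc_auc")
    print(f"{name}: mean AUROC = {aucs.mean():.2f}")
```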

When developing a model, it is important to identify a small set of the most important variables, because using fewer variables in model construction reduces the complexity of the model, shortens training time, and facilitates data understanding and interpretation. To select a highly relevant set of variables, we used two feature selection techniques: correlation-based and wrapper-based feature selection (Guyon & Elisseeff, 2003). A correlation-based feature selection estimates the correlation of each variable with the outcome variable, ranks the variables by that score, and then decides whether to keep or remove each variable from the dataset (Guyon & Elisseeff, 2003). A wrapper method evaluates subsets of variables, which allows for the detection of possible interactions between variables, and assesses each subset according to its usefulness to a given predictor.
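For illustration, the two strategies might be sketched in Python with scikit-learn stand-ins on synthetic data (the study used Weka's implementations; here an ANOVA F-test filter stands in for correlation-based selection, and recursive feature elimination is shown as one wrapper-style method):

```python
# Illustrative sketch of filter vs. wrapper feature selection on synthetic data.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=393, n_features=30, n_informative=7, random_state=0)

# Filter (correlation-style): score each variable against the outcome
# independently, rank the scores, and keep the top-ranked variables.
filter_sel = SelectKBest(score_func=f_classif, k=7).fit(X, y)
print("filter-selected columns:", filter_sel.get_support(indices=True))

# Wrapper-style: evaluate subsets of variables by actually training the
# predictor, so interactions between variables can influence the choice.
wrapper_sel = RFE(LogisticRegression(max_iter=1000), n_features_to_select=7).fit(X, y)
print("wrapper-selected columns:", wrapper_sel.get_support(indices=True))
```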

Overall, 8.28% of the data were missing. An expectation-maximization (EM) method was used to replace the missing data. EM imputation is an iterative procedure of imputing a value (expectation) and evaluating whether that value is likely (maximization) until the most likely value is found (Nelwamondo, Mohamed, & Marwala, 2007). Figure 1 shows an overview of building the prediction models.
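To make the expectation/maximization loop concrete, here is a minimal sketch for a single numeric variable (an illustrative simplification; the study's imputation handled the full multivariate dataset in Weka, where the E-step would use conditional expectations given the other variables):

```python
# Minimal sketch of EM-style imputation for one numeric variable:
# fill missing entries with the current parameter estimate (E-step),
# then re-estimate the parameter from the completed data (M-step),
# iterating until the estimate stops changing.
import numpy as np

def em_impute(x, tol=1e-8, max_iter=100):
    x = np.asarray(x, dtype=float)
    missing = np.isnan(x)
    mu = np.nanmean(x)  # initial estimate from observed values only
    for _ in range(max_iter):
        x_filled = np.where(missing, mu, x)  # E-step: impute the expectation
        new_mu = x_filled.mean()             # M-step: re-estimate the mean
        if abs(new_mu - mu) < tol:
            break
        mu = new_mu
    return np.where(missing, mu, x)

print(em_impute([1.0, 2.0, np.nan, 4.0]))  # the NaN is replaced by the estimated mean
```

In this univariate case the loop converges immediately to the observed mean; the iteration only becomes informative with multiple correlated variables.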

Figure 1: Overview of Building Prediction Models


Results

Sample patient demographics are summarized in Table 1. The sample was predominantly white (91.1%), 45 years or older (83.2%), and educated to at least a high school level (93.9%). Six prediction models were built using the two feature selection techniques: correlation-based and wrapper-based feature selection.

Table 1: Sample Patient Demographics

Table 2 reports the performance of each model. The six models demonstrated calibration of 0.75 to 0.80 as measured by the expected/observed outcome ratio. The sensitivity and specificity scores ranged from 0.75 to 0.80 and 0.67 to 0.71, respectively. With correlation-based feature selection, the logistic regression, decision tree, and random forest algorithms achieved discriminatory accuracies, measured by AUROC, of 0.70, 0.79, and 0.78, respectively. These three models each used seven variables in their construction: history of needle or surgical breast biopsy; use of over-the-counter preparations for hormone replacement or post-menopausal symptoms; use of any oral or inhaled steroids; menstrual bleeding in the last three months; having any ancestors of Ashkenazi Jewish, Dutch, Icelandic, or French-Canadian descent; age at diagnosis; and blood relatives with breast cancer. With wrapper-based feature selection, the logistic regression, decision tree, and random forest models achieved discriminatory accuracies, measured by AUROC, of 0.72, 0.73, and 0.77, respectively. Each of these models used ten variables: body weight, age at first period, history of hysterectomy, menstrual bleeding in the last three months, a pregnancy that lasted more than five months, history of needle or surgical breast biopsy, history of infertility treatment, current use of an intrauterine device (IUD), age at diagnosis, and blood relatives with breast cancer.
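For reference, the sensitivity, specificity, PPV, and NPV reported above all derive from a confusion matrix. A minimal sketch with hypothetical counts (not the study's data; the E/O formula shown is one simple way to compare predicted against observed positives):

```python
# Illustrative discrimination metrics from a 2x2 confusion matrix.
# The counts below are hypothetical, chosen only to show the arithmetic.
tp, fn, fp, tn = 80, 20, 30, 70

sensitivity = tp / (tp + fn)  # proportion of true cases correctly flagged
specificity = tn / (tn + fp)  # proportion of non-cases correctly cleared
ppv = tp / (tp + fp)          # positive predictive value
npv = tn / (tn + fn)          # negative predictive value
e_o = (tp + fp) / (tp + fn)   # expected (predicted) vs. observed positives

print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}, "
      f"PPV={ppv:.2f}, NPV={npv:.2f}, E/O={e_o:.2f}")
```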

Table 2: Performance of Six Prediction Models


Discussion

The three machine learning algorithms – logistic regression, J48 decision tree, and random forest – built and trained the prediction models successfully. The models demonstrated reasonably high calibration and could be considered reliable for prediction (E/O = 0.75 to 0.80). The sensitivity (0.75 to 0.80) and specificity (0.67 to 0.71) scores indicated fair to good discrimination performance (Safari, Baratloo, Elfil, & Negida, 2016). The AUROC of each model varied from 0.70 to 0.79, indicating good discriminatory accuracy (Simundic, 2009). The six models outperformed the Gail model in discriminatory accuracy: AUROC = 0.70 to 0.79 for the six machine learning models vs. 0.53 to 0.63 for the Gail model (Meads, Ahmed, & Riley, 2012). Although the differences in AUROC among the six models were subtle, the model built by the decision tree with correlation-based feature selection produced the highest discriminatory accuracy (AUROC = 0.79).

There are two limitations to this study. First, the data used were mined from one research center; therefore, the six prediction models built in this study cannot be generalized until they are validated further with other, larger health datasets. Second, the prediction models were built on data from predominantly white women (91.1%), again raising an issue of generalizability. The models might not predict risk accurately for women of other races, such as African American women. Further validation of the models with datasets having a diverse racial/ethnic mix is necessary.

Conclusions and Implications for Nursing

Overall, all six models showed fair to good performance in terms of calibration and discriminatory accuracy, implying that the models were well developed and trained. Furthermore, all six models improved the discriminatory accuracy compared to the traditional Gail model. 

The findings of this study may lead to the development of improved prediction tools for identifying women at high risk of invasive breast cancer. Such models can support clinicians' decision making, helping them better identify patients at risk of breast cancer and provide more effective, individualized preventive care and counseling.

Data from healthcare settings are often large and complex and require systematic and efficient analytic techniques to discover meaningful information and new knowledge (Brennan & Bakken, 2015). Machine learning techniques have been gaining popularity for data mining due to their ability to handle large and complex data. The findings of this study help to establish that data mining using machine learning approaches produces models with better predictive performance than the traditional statistical approaches (e.g., logistic regression) used in developing current breast cancer risk prediction models. The findings demonstrate that machine learning approaches are feasible and appropriate for nursing research to mine new knowledge and to improve nursing practice and patient outcomes.


Online Journal of Nursing Informatics

Powered by the HIMSS Foundation and the HIMSS Nursing Informatics Community, the Online Journal of Nursing Informatics is a free, international, peer reviewed publication that is published three times a year and supports all functional areas of nursing informatics.


American Cancer Society (2019). Breast Cancer Facts & Figures 2019-2020. American Cancer Society, Inc.

Brennan, P., & Bakken, S. (2015). Nursing needs big data and big data needs nursing. Journal of Nursing Scholarship, 47(5), 477-484. doi:10.1111/jnu.12159

Delen, D., Walker, G., & Kadam, A. (2005). Predicting breast cancer survivability: A comparison of three data mining methods. Artificial Intelligence in Medicine, 34(2), 113-127. doi:10.1016/j.artmed.2004.07.002

Deo, R. (2015). Machine learning in medicine. Circulation, 132(20), 1920-30.

Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157-1182.

Lee, T., Liu, C., Kuo, Y., Mills, M., Fong, J., & Hung, C. (2011). Application of data mining to the identification of critical factors in patient falls using a web-based reporting system. International Journal of Medical Informatics, 80(2), 141-150. doi:10.1016/j.ijmedinf.2010.10.009

Meads, C., Ahmed, I., & Riley, R. (2012). A systematic review of breast cancer incidence risk prediction models with meta-analysis of their performance. Breast Cancer Research and Treatment, 132(2), 365-377. doi:10.1007/s10549-011-1818-2

National Cancer Institute (n.d.). Breast cancer risk assessment tool.

Nelwamondo, F. V., Mohamed, S., & Marwala, T. (2007). Missing data: A comparison of neural network and expectation maximization techniques. Current Science, 93(11), 1514-1521.

Safari, S., Baratloo, A., Elfil, M., & Negida, A. (2016). Evidence based emergency medicine; part 5 receiver operating curve and area under the curve. Emergency, 4(2), 111-113.

Schonberg, M. A., Li, V. W., Eliassen, A. H., Davis, R. B., LaCroix, A. Z., McCarthy, E. P., Rosner, B. A., Chlebowski, R. T., Rohan, T. E., Hankinson, S. E., Marcantonio, E. R., & Ngo, L. H. (2016). Performance of the Breast Cancer Risk Assessment Tool among women age 75 years and older. JNCI: Journal of the National Cancer Institute, 108(3), 1–11. doi:10.1093/jnci/djv348

Simundic, A. (2009). Measures of diagnostic accuracy: Basic definitions. eJIFCC, 19(4), 203-211.

Singal, A. G., Mukherjee, A., Elmunzer, B. J., Higgins, P. D. R., Lok, A. S., Zhu, J., Marrero, J. A., & Waljee, A. K. (2013). Machine learning algorithms outperform conventional regression models in predicting development of hepatocellular carcinoma. The American Journal of Gastroenterology, 108(11), 1723–1730.

Steyerberg, E. W., Vickers, A. J., Cook, N. R., Gerds, T., Gonen, M., Obuchowski, N., Pencina, M. J., & Kattan, M. W. (2010). Assessing the performance of prediction models: A framework for traditional and novel measures. Epidemiology (Cambridge, Mass.), 21(1), 128–138. doi:10.1097/EDE.0b013e3181c30fb2

Tighe, P., Harle, C., Hurley, R., Aytug, H., Boezaart, A., & Fillingim, R. (2015). Teaching a machine to feel postoperative pain: Combining high-dimensional clinical data with machine learning algorithms to forecast acute postoperative pain. Pain Medicine, 16(7), 1386-1401. doi:10.1111/pme.12713

Wang, X., Huang, Y., Li, L., Dai, H., Song, F., & Chen, K. (2018). Assessment of performance of the Gail model for predicting breast cancer risk: A systematic review and meta-analysis with trial sequential analysis. Breast Cancer Research, 20(1), 18. doi:10.1186/s13058-018-0947-5

Zitnik, M., Nguyen, F., Wang, B., Leskovec, J., Goldenberg, A., & Hoffman, M. (2019). Machine learning for integrating data in biology and medicine: Principles, practice, and opportunities. Information Fusion, 50, 71-91. doi:10.1016/j.inffus.2018.09.012

Author Bios:
Jeungok Choi, RN, PhD, MPH, is an Associate Professor at the University of Massachusetts, Amherst, College of Nursing. Her research interests focus on tablet-based interventions to improve chronic conditions, such as fatigue in older adults with arthritis, and on building prediction models using machine learning and natural language processing.

Hee-Tae Jung, PhD, MS, is a Postdoctoral Research Associate in the College of Information and Computer Science, University of Massachusetts, Amherst. He takes a holistic, end-to-end research approach with the goal of maximizing the overall quality of rehabilitation services through technology-assisted therapies. Toward that end, he focuses on three paths: 1) investigating ways to use artificial intelligence to rehabilitate patients and monitor their impairments and progress; 2) from a human-computer interaction perspective, analyzing the interaction dynamics of users during technology-assisted therapies in real-world settings to identify opportunities to advance state-of-the-art rehabilitation technology; and 3) conducting clinical studies to understand the therapeutic benefit of technology-assisted therapies in routine clinical settings.

Woo Jung Choi, BA, MPH, is a statistician. He graduated with a Master of Public Health degree from the Boston University School of Public Health. He works as an administrative statistician consultant for the Transform Alliance for Health clinic in Newton, Mass. His research focuses on improving healthcare outcomes using big data analysis techniques such as machine learning.