Improving Lung Cancer Risk Prediction: Integration of Novel Predictors and Modelling Using Machine Learning Random Forest versus the Validated PLCOm2012 Logistic Regression Model

Authors

Malik, Aleeza

Abstract

Lung cancer (LC) is the leading cause of cancer-related death among men and the second most common cause of cancer death among women worldwide. Symptoms of LC appear once the disease has progressed to an advanced stage, when curative treatments are ineffective, leading to poor prognosis. LC screening using low-dose computed tomography has been shown to be effective for early detection of LC and to reduce LC mortality. The goal of this study was to develop an LC risk prediction model that outperforms the established Prostate, Lung, Colorectal, Ovarian Cancer Screening Trial 2012 model (PLCOm2012) in selecting high-risk individuals for LC screening. The risk models were developed using data from the Prostate, Lung, Colorectal, Ovarian (PLCO) Cancer Screening Trial control arm (n=43,217) and validated using the PLCO intervention arm (n=42,493). Logistic regression (LR) and random forest (RF) models were fitted using R software. The models were evaluated on their ability to predict 6-year LC risk and assessed using predictive performance measures, including discrimination and calibration. The PLCOm2012 LR model showed superior predictive performance compared with the risk model developed using RF, with areas under the receiver operating characteristic curve (ROC-AUC) of 0.797 and 0.775, respectively (p<0.001). Adding supplemental β-carotene, dietary vitamin A, total isoflavone, and history of chest x-ray to the PLCOm2012 model increased the ROC-AUC from 0.797 to 0.810 (ΔROC-AUC=0.013, p<0.001). This study demonstrated that traditional LR exhibited superior predictive performance compared with the more advanced machine learning RF technique. Moreover, incorporating dietary variables and history of chest x-ray improved the predictive performance of the current standard PLCOm2012 model.
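
The two performance measures named in the abstract, discrimination and calibration, can be illustrated with a minimal sketch. This is not the study's code: the study used R, whereas the sketch below uses standard-library Python, and the data, the coefficients, and the function names `roc_auc` and `expected_observed_ratio` are all hypothetical. ROC-AUC is computed with the rank-based (Mann-Whitney) formula, and calibration is summarised by the expected/observed ratio, which is only one of several possible calibration measures.

```python
import math
import random


def roc_auc(y_true, y_score):
    """ROC-AUC: probability that a random case scores higher than a random non-case.

    Uses the rank-sum (Mann-Whitney) formula; tied scores share an average rank.
    """
    order = sorted(range(len(y_score)), key=lambda i: y_score[i])
    ranks = [0.0] * len(y_score)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of scores tied with position i.
        while j + 1 < len(order) and y_score[order[j + 1]] == y_score[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, y_true) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def expected_observed_ratio(y_true, y_prob):
    """Calibration-in-the-large: predicted number of events / observed number."""
    return sum(y_prob) / sum(y_true)


# Synthetic cohort (NOT PLCO data): one made-up predictor, with outcomes drawn
# from a logistic model so that the true risk is known by construction.
random.seed(0)
n = 5000
x = [random.gauss(0, 1) for _ in range(n)]
true_risk = [1 / (1 + math.exp(-(-3.0 + xi))) for xi in x]
y = [1 if random.random() < p else 0 for p in true_risk]

# A second "model" that ranks people identically but overstates everyone's risk
# (intercept shifted from -3 to -2 on the logit scale).
inflated_risk = [1 / (1 + math.exp(-(-2.0 + xi))) for xi in x]

auc_true = roc_auc(y, true_risk)
auc_inflated = roc_auc(y, inflated_risk)
eo_true = expected_observed_ratio(y, true_risk)
eo_inflated = expected_observed_ratio(y, inflated_risk)

print(f"AUC: {auc_true:.3f} vs {auc_inflated:.3f}")
print(f"E/O: {eo_true:.2f} vs {eo_inflated:.2f}")
```

Because the second set of predictions is a monotone transformation of the first, both rank individuals identically and so discriminate equally well, yet only the first is well calibrated. This is why a comparison such as the one reported above assesses both properties rather than ROC-AUC alone.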