Machine learning (ML) can support medical decisions in clinical settings. In this post, I demonstrate a workflow for training ML models that evaluate risk in pediatric pneumonia patients, using randomly generated electronic health records (EHRs). I start with data cleansing and exploratory analysis of physician-identified biological indices, such as pathogens, lab data, and vital signs. The goal is to train a model that classifies patients into two groups: high-risk patients who need intensive care and low-risk patients. Various ML models, such as logistic regression, boosted trees, and feedforward neural networks, were applied to the classification task. The workflow was applied to real-world EHRs at National Taiwan University Hospital and achieved an AUROC of 0.99. The detailed methodologies and results were published in JMIR Medical Informatics (Liu et al., 2022).
Quick Summary
Computer-aided evaluation of patient severity can improve the quality of healthcare in hospitals. This project demonstrates the workflow for building ML models that evaluate the need for intensive care within the first day of admission, using randomly generated electronic health records (EHRs). A quick project walkthrough can be found here. The key steps of the model-building workflow are detailed below. My paper, “Evaluation of the need for intensive care in children with pneumonia: machine learning approach” (Liu et al., 2022; see Reference), followed the same workflow with more features fine-tuned.
Step 1: Create Dataset
To demonstrate the workflow, randomly generated EHRs were used. Items relevant to building a disease-specific severity evaluation model, such as age, sex, diagnosis code, vital signs, and lab exams, were included. The generated data imitates the format of real-world EHRs extracted from health information systems.
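A minimal sketch of how such records could be generated with Python's standard library. The field names, value ranges, code pool, and the 10% ICU rate are all illustrative assumptions, not the schema or distributions used in the project.

```python
import random

def make_synthetic_ehrs(n_patients, seed=0):
    """Generate fake admission records; every field and range is illustrative."""
    rng = random.Random(seed)  # seeded so the demo is reproducible
    # Hypothetical pool mixing pneumonia codes with unrelated diagnoses.
    icd_pool = ["485", "486", "J18.0", "J18.9", "J45.9", "A09"]
    records = []
    for pid in range(n_patients):
        records.append({
            "patient_id": pid,
            "age_years": round(rng.uniform(0, 18), 1),
            "sex": rng.choice(["F", "M"]),
            "icd_code": rng.choice(icd_pool),
            "max_temp_c": round(rng.gauss(37.8, 0.8), 1),
            "wbc_k_per_ul": round(rng.uniform(3, 25), 1),
            # Assumed outcome label: 1 = needed intensive care.
            "icu_admission": int(rng.random() < 0.1),
        })
    return records

ehrs = make_synthetic_ehrs(5)
```

Because the labels are drawn independently of the features, models trained on data like this only exercise the pipeline; they say nothing about real predictive performance.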
Step 2: Select Cohort with Specific Disease
Because the condition of patients with different diseases is reflected by different physical indices, building an alarm system starts with selecting a cohort for model building. Here, we select the cohort by ICD (International Statistical Classification of Diseases and Related Health Problems) code.
The ICD is a detailed classification of known diseases that evolves over time. Because the classification is fine-grained, multiple ICD codes can be grouped as one disease. For example, ICD codes 485 (bronchopneumonia, organism unspecified) and 486 (pneumonia, organism unspecified) can both be considered pneumonia-related codes. New ICD versions periodically replace older ones. 485 and 486 are codes from ICD-9; the corresponding codes in ICD-10 are J18.0 (bronchopneumonia, unspecified organism), J18.8 (other pneumonia, unspecified organism), and J18.9 (pneumonia, unspecified organism). As these codes show, similar diseases have similar ICD codes. In ICD-10, codes beginning with J1 cover influenza and pneumonia; codes beginning with J18 are different forms of “pneumonia, unspecified organism.”
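Cohort selection by code can be sketched as a simple filter. This assumes each record carries a single `icd_code` string (a hypothetical field name) and that the cohort is defined only by the ICD-9 codes 485 and 486 plus the ICD-10 prefix J18 mentioned above; a real cohort definition would likely cover more codes.

```python
ICD9_PNEUMONIA = {"485", "486"}      # ICD-9 codes named in the text
ICD10_PNEUMONIA_PREFIX = "J18"       # "pneumonia, unspecified organism"

def is_pneumonia(code):
    """Return True if the code belongs to the assumed pneumonia group."""
    code = code.strip().upper()      # tolerate padding and lowercase input
    return code in ICD9_PNEUMONIA or code.startswith(ICD10_PNEUMONIA_PREFIX)

def select_cohort(records):
    """Keep only records whose diagnosis code is pneumonia-related."""
    return [r for r in records if is_pneumonia(r["icd_code"])]
```

Matching ICD-10 codes by prefix keeps J18.0, J18.8, and J18.9 together without listing each one, which is the grouping property the similar-codes observation relies on.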
Step 3: Data Preprocessing and Feature Engineering
a. Comorbid Conditions:
Comorbid conditions can affect prognosis. For example, pneumonia patients with cardiovascular diseases are at higher risk of ICU admission. We can therefore group the diagnosis codes of each comorbid condition (e.g., a patient with any of the diagnosis codes A, B, or C is considered to have Comorbid Condition 1). In the demonstration, five comorbid conditions were included in the model as binary features (with/without the comorbid condition).
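The grouping rule can be sketched as a set intersection per condition. The group names and the placeholder codes (A through E) are hypothetical stand-ins in the spirit of the example above; only two of the five conditions are shown.

```python
# Placeholder code groups; real groups would list actual diagnosis codes.
COMORBIDITY_GROUPS = {
    "comorbid_1": {"A", "B", "C"},
    "comorbid_2": {"D", "E"},
}

def comorbidity_features(patient_codes):
    """Map a patient's diagnosis codes to one 0/1 flag per comorbid condition."""
    codes = set(patient_codes)
    return {
        name: int(bool(codes & group))   # 1 if any code in the group is present
        for name, group in COMORBIDITY_GROUPS.items()
    }

flags = comorbidity_features(["B", "X"])
```

Here `flags` marks `comorbid_1` as present (code B is in its group) and `comorbid_2` as absent.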
b. Pathogen:
For infectious diseases, the pathogen is relevant to severity evaluation. For example, bacterial pneumonia tends to be more severe in children than viral pneumonia. In this demonstration, regular expressions were used to extract positive pathogen exam findings, which were added to the ML model as one binary feature (with/without positive pathogen findings).
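As a rough illustration of the regular-expression step, the sketch below assumes a hypothetical report format with one `name: result` pair per line, where a positive result is written as "positive" or "detected". Real exam reports vary widely, so the pattern would need to be adapted to the actual text.

```python
import re

# Matches lines such as "Influenza A: DETECTED". A result like "not detected"
# is skipped because the keyword must come directly after the colon.
POSITIVE_FINDING = re.compile(
    r"^[ \t]*(?P<pathogen>[^:\n]+?)[ \t]*:[ \t]*(?:positive|detected)\b",
    re.IGNORECASE | re.MULTILINE,
)

def positive_pathogens(report_text):
    """Return the names of pathogens reported as positive, in report order."""
    return [m.group("pathogen") for m in POSITIVE_FINDING.finditer(report_text)]

def has_positive_pathogen(report_text):
    """Binary model feature: 1 if any pathogen exam was positive, else 0."""
    return int(bool(positive_pathogens(report_text)))

report = (
    "Streptococcus pneumoniae: Positive\n"
    "RSV antigen: Negative\n"
    "Mycoplasma pneumoniae: not detected\n"
    "Influenza A: DETECTED"
)
found = positive_pathogens(report)
```

For this sample report, `found` contains the two positive findings, Streptococcus pneumoniae and Influenza A.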
c. Lab Data:
Biochemical tests conducted in labs (e.g., white blood cell count and C-reactive protein) can reveal patients’ general body functions. To evaluate inpatients’ need for intensive care on the first day, the initial records of lab data were included as features for the ML models.
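Picking the initial record of each lab test can be sketched as a single pass that keeps the earliest timestamp per patient and test. The row layout (`patient_id`, `test`, `time`, `value`) is an assumed long format, not necessarily how the source data is stored.

```python
from datetime import datetime

def first_lab_values(lab_rows):
    """Return {patient_id: {test_name: value}} using each test's earliest record."""
    earliest = {}
    for row in lab_rows:
        key = (row["patient_id"], row["test"])
        # Keep the row with the smallest timestamp for each patient/test pair.
        if key not in earliest or row["time"] < earliest[key]["time"]:
            earliest[key] = row
    features = {}
    for (pid, test), row in earliest.items():
        features.setdefault(pid, {})[test] = row["value"]
    return features

labs = [
    {"patient_id": 1, "test": "WBC", "time": datetime(2022, 1, 1, 9), "value": 14.2},
    {"patient_id": 1, "test": "WBC", "time": datetime(2022, 1, 2, 9), "value": 9.8},
    {"patient_id": 1, "test": "CRP", "time": datetime(2022, 1, 1, 10), "value": 6.5},
]
initial = first_lab_values(labs)
```

Tests that a patient never had are simply absent from the result, so missing values still need to be handled before modeling.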
d. Vital Signs:
Vital signs, such as temperature, pulse, and blood pressure, are important indicators of patient condition. In this demo, minimum systolic and diastolic blood pressure, maximum temperature, and pulse were recorded within the first day of admission and included as features in the model to predict ICU admission.
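The first-day aggregation can be sketched as a time-window filter followed by min/max reductions. The measurement field names are hypothetical, and taking the maximum pulse is an assumption, since the text does not state which pulse statistic was used.

```python
from datetime import datetime, timedelta

def first_day_vitals(admit_time, measurements):
    """Aggregate vital signs taken in the first 24 hours after admission."""
    cutoff = admit_time + timedelta(hours=24)
    window = [m for m in measurements if admit_time <= m["time"] < cutoff]
    if not window:
        return None  # no vitals in the window; treat as missing downstream
    return {
        "min_sbp": min(m["sbp"] for m in window),        # systolic pressure
        "min_dbp": min(m["dbp"] for m in window),        # diastolic pressure
        "max_temp_c": max(m["temp_c"] for m in window),
        "max_pulse": max(m["pulse"] for m in window),    # assumption: maximum
    }

admit = datetime(2022, 1, 1, 8, 0)
vitals = [
    {"time": admit + timedelta(hours=1), "sbp": 95, "dbp": 60, "temp_c": 38.2, "pulse": 120},
    {"time": admit + timedelta(hours=10), "sbp": 88, "dbp": 55, "temp_c": 39.1, "pulse": 135},
    {"time": admit + timedelta(hours=30), "sbp": 70, "dbp": 40, "temp_c": 40.0, "pulse": 160},
]
day_one = first_day_vitals(admit, vitals)
```

The measurement taken 30 hours after admission falls outside the window and is ignored, which keeps later deterioration from leaking into a first-day prediction.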
Step 4: Model Selection and Hyperparameter Tuning
In outcome prediction tasks where few features have fine temporal resolution, tree-based methods, such as XGBoost and random forest, can perform well. In these models, the maximum tree depth, learning rate, and regularization strength are among the hyperparameters that can be tuned to further improve predictive power. ML models can outperform the indices and severity scores that clinicians currently use, which were derived in previous studies using logistic regression models. In the demo, six ML models were chosen for the prediction task: boosted trees, random forest, feedforward neural network, logistic regression, support vector machine (SVM), and k-nearest neighbors (KNN).
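Hyperparameter tuning can be sketched as an exhaustive grid search scored by AUROC. This is a dependency-free illustration, not the tuning procedure used in the project: `fit_and_score` is a hypothetical callback that would train a model with the given settings and return its validation AUROC, and in practice a library routine would normally handle both steps.

```python
from itertools import product

def auroc(labels, scores):
    """Chance that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def grid_search(param_grid, fit_and_score):
    """Try every hyperparameter combination and keep the best-scoring one."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = fit_and_score(params)   # e.g. validation AUROC for these settings
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Illustrative grid over two of the hyperparameters mentioned above.
grid = {"max_depth": [2, 4, 6], "learning_rate": [0.05, 0.1]}
```

The pairwise AUROC is quadratic in the number of patients, which is fine for a demo but slower than rank-based implementations on large cohorts.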
Reference
Liu Y-C, Cheng H-Y, Chang T-H, Ho T-W, Liu T-C, Yen T-Y, et al. Evaluation of the need for intensive care in children with pneumonia: A machine learning approach. JMIR Med Inform. 2022;10(1):e28934. doi:10.2196/28934