
Original Article
J Lab Physicians. 2026;18(1):60-69
doi: 10.25259/JLP_44_2025

A machine learning approach for predicting mortality in unvaccinated COVID-19 patients using a novel web application

Department of Pharmacology, All India Institute of Medical Sciences, Kalyani, West Bengal, India
Department of Laboratory Medicine, Delhi State Cancer Institute, New Delhi, India
Department of Laboratory Medicine, All India Institute of Medical Sciences, New Delhi, India
Department of Biochemistry, All India Institute of Medical Sciences, Kalyani, West Bengal, India

*Corresponding author: Sudip Kumar Datta, Department of Laboratory Medicine, All India Institute of Medical Sciences, New Delhi, India. dr.sudipdatta@gmail.com

Licence
This is an open-access article distributed under the terms of the Creative Commons Attribution-Non Commercial-Share Alike 4.0 License, which allows others to remix, transform, and build upon the work non-commercially, as long as the author is credited and the new creations are licensed under the identical terms.

How to cite this article: Mukhopadhyay K, Chopra P, Datta SK, Goswami K. A machine learning approach for predicting mortality in unvaccinated COVID-19 patients using a novel web application. J Lab Physicians. 2026;18:60-9. doi: 10.25259/JLP_44_2025

Abstract

Objectives:

During a medical emergency of epidemic proportions, triaging and resource allocation pose a significant challenge. In COVID-19, early identification of high-risk patients was crucial. Several studies have shown that machine learning (ML) enhances predictive accuracy in these patients. Using data from unvaccinated COVID-19 patients, we developed and validated ML models to predict mortality and compared them with traditional statistical methods used in our previous study. Finally, we present a novel web application to predict mortality.

Materials and Methods:

We conducted a retrospective study of 401 COVID-19 patients admitted between July and December 2020. ML models, including support vector machine (SVM), Random Forest, and XGBoost, were developed using demographic, clinical, and laboratory data.

Statistical analysis:

Models were evaluated using metrics such as accuracy, sensitivity, specificity, and area under the receiver operating characteristic curve (AUC). Feature reduction was performed to enhance clinical relevance. A web application was developed for real-time mortality risk predictions.

Results:

The SVM model performed the best. After feature reduction to five key predictors (clinical severity, IL-6, lactate dehydrogenase [LDH], neutrophil percentage, and neutrophil-to-lymphocyte ratio [NLR]), the AUC improved from 0.82 to 0.83, with a sensitivity of 0.83 and specificity of 0.78. This outperformed our previous study's results, where individual AUCs ranged from 0.614 to 0.710. The final model achieved a positive predictive value of 0.63 and a negative predictive value of 0.91, indicating high reliability in identifying both high-risk patients and those likely to survive.

Conclusions:

ML models, particularly SVM, significantly improved COVID-19 mortality prediction. This approach could enhance early risk identification and resource allocation in clinical settings.

Keywords

COVID-19
Hematological markers
Inflammatory markers
Machine learning
Mortality prediction

INTRODUCTION

The COVID-19 pandemic profoundly impacted global health, leading to millions of infections and fatalities since its emergence in late 2019. The causative virus, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), presented particularly severe consequences for unvaccinated individuals, who faced a heightened risk of severe illness and mortality. In unvaccinated populations, the absence of immunity resulted in increased susceptibility to complications such as acute respiratory distress syndrome, multi-organ failure, and death.

Our team previously conducted a retrospective observational analysis of 401 COVID-19-positive patients admitted between July and December 2020, in which we meticulously collected clinical details and laboratory investigations within 3 days of detecting COVID-19 positivity and documented patient outcomes from medical records.[1] Our findings revealed significant alterations in total leukocyte count (TLC), absolute neutrophil count (ANC), neutrophil-to-lymphocyte ratio (NLR), C-reactive protein (CRP), IL-6, lactate dehydrogenase (LDH), ferritin, and lymphocyte-to-CRP ratio (LCR) at presentation in severe cases compared to non-severe cases and in deceased compared to surviving patients. Notably, a combination of CRP, IL-6, ferritin, and LCR robustly predicted mortality, even in patients initially presenting with non-severe disease.

Building on these insights, this study aims to enhance the predictive accuracy and clinical utility of our model through the application of machine learning (ML) algorithms. ML, a subset of artificial intelligence, entails the development of algorithms capable of learning from and making predictions based on data.[2] In the healthcare context, ML models can analyze complex datasets to identify patterns and predict outcomes, such as disease progression and patient mortality. Applying ML to COVID-19 mortality prediction involves training algorithms on clinical and biochemical data to discern the most indicative factors of severe outcomes. This approach promises to enhance early intervention strategies and optimize the allocation of medical resources by identifying high-risk patients requiring intensive care.

Several studies have underscored the utility of clinical and biochemical markers in predicting COVID-19 outcomes. For instance, Zhao et al. (2021)[3] reported that elevated D-dimer levels were associated with increased mortality in COVID-19 patients. Similarly, Devang et al. (2022)[4] demonstrated that markers such as CRP, LDH, and ferritin were significant predictors of severe disease and mortality.[4] These studies emphasize the critical role of biochemical markers in managing COVID-19; however, comprehensive models integrating a broader range of inflammatory and hematological markers are missing, especially in the Indian context. The unique demographic, genetic, and environmental factors in India necessitate localized studies to ensure the findings’ applicability and effectiveness within this population.

Traditional statistical methods, such as logistic regression and linear discriminant analysis, assume linear relationships between predictors and outcomes, which limits their ability to capture complex, non-linear interactions among clinical and laboratory variables in COVID-19 patients. To address this limitation, we aim to develop an ML-based prediction model that can identify intricate patterns and non-linear relationships in the data to predict mortality in COVID-19 patients using a combination of clinical, inflammatory, and hematological markers. Our objective is to harness these models' predictive accuracy and clinical utility by developing an ML-based web application for triaging patients in any future pandemic- or epidemic-like situation, to facilitate resource allocation and improved patient management.

MATERIALS AND METHODS

Data collection

This study was conducted at the All India Institute of Medical Sciences, New Delhi, India. We performed a retrospective observational analysis of 401 COVID-19-positive patients who were admitted to our institution between July and December 2020, as already published in our earlier article by Chopra et al. (2023).[1] The patients were unvaccinated, reflecting the population at the time of data collection. The data collection protocol was approved by the Institutional Ethics Committee (Ref No. IEC-578/19.06.2020, RP-03/2020). Here, we have used the same data with an ML-based approach for predictive modeling and compared the same with the traditional statistical methods used in our previous study.

The inclusion criteria were a COVID-19-positive status confirmed by reverse transcription-polymerase chain reaction (RT-PCR) and the availability of clinical details and laboratory investigations within 3 days of COVID-19 positivity. Patients with a history of recent surgery or hematological malignancies were excluded. Demographic information (age and sex), clinical data (severity of disease at presentation, where severe COVID-19 was diagnosed on the basis of specific clinical signs, including a respiratory rate over 30 breaths/min and oxygen saturation below 94%), outcome data (survival or death), and laboratory data were collected.[5] The laboratory data encompassed TLC, ANC, NLR, CRP, IL-6, LDH, ferritin, and LCR.

Clinical severity was classified as a binary variable (severe = 1, not severe = 0) based on the WHO interim guidance, determined by treating physicians at the time of data collection. This binary classification was the only severity-related data available in the original dataset. Continuous clinical variables such as respiratory rate, oxygen saturation values, detailed comorbidity indices, and standardized imaging scores were not systematically recorded or available for analysis.

Data preprocessing

Several preprocessing steps were undertaken to prepare the data for ML model development. Initially, we conducted an exploratory data analysis and descriptive statistics to understand the structure of the dataset, identify missing values, and examine the distributions of key variables. Descriptive statistics were computed for all features to gain insights into the data.

Handling missing values was a crucial step in our preprocessing pipeline. For numerical features, we used median imputation, and for categorical features, mode imputation was applied. This ensured that no critical information was lost. Next, we encoded the categorical variable “sex” using one-hot encoding to convert it into a format suitable for ML algorithms. Numerical features were then standardized to have a mean of 0 and a standard deviation of 1, ensuring that features with larger ranges did not dominate the model training process.

Model development

In our study, we employed various ML algorithms to develop predictive models for COVID-19 mortality. The primary goal was to identify the model that best predicted mortality with the highest sensitivity, given the critical nature of correctly identifying patients at risk of death.

The dataset was split into training (75% of observations) and testing (25%) sets; models were fit on the training set, while the held-out testing set was reserved for evaluating their performance.

Models and hyperparameters tuning

We evaluated several models, each with unique strengths and hyperparameters. Summaries of the models used, their brief descriptions, and the hyperparameters tested have been included [Table 1]. Each model was trained on the training set using grid search cross-validation to tune hyperparameters and identify the best-performing model. We used stratified K-fold cross-validation with 10 folds to ensure that the models were evaluated robustly and that the results were not biased by the particular train-test split. The primary metric for model evaluation was sensitivity, as it is crucial to minimize false negatives in predicting mortality. Sensitivity represents the proportion of actual positive cases (deaths) correctly identified by the model.

Table 1: Machine learning algorithms used for initial performance evaluation along with the hyperparameter tuning.
Model | Description | Hyperparameters tested
Logistic regression | Predicts the probability of an outcome based on input features using a logistic function. | C (0.1, 1, 10), solver (liblinear, lbfgs)
Decision trees | Splits the data into subsets based on feature values to make predictions. | max_depth (3, 5, 7, None), min_samples_split (2, 5, 10)
Random forest | Ensemble of decision trees that reduces overfitting and improves generalization. | n_estimators (100, 200, 500), max_depth (5, 10, None)
Support vector machine | Finds the optimal hyperplane to separate classes in the feature space. | C (0.1, 1, 10)
Ridge classifier | Linear classifier with L2 regularization to prevent overfitting. | alpha (0.1, 1, 10)
AdaBoost classifier | Combines weak classifiers to create a strong classifier through iterative reweighting. | n_estimators (50, 100, 200), learning_rate (0.1, 0.5, 1.0)
Gradient boosting | Builds an additive model in a forward stage-wise manner to optimize the loss function. | n_estimators (100, 200, 500), max_depth (3, 5, 7), learning_rate (0.01, 0.1, 0.5)
Extra trees | Randomized trees that split nodes randomly to reduce variance and improve generalization. | n_estimators (100, 200, 500), max_depth (5, 10, None)
K-nearest neighbors | Classifies based on the majority vote of the nearest neighbors. | n_neighbors (3, 5, 7), weights (uniform, distance)
Gaussian naive Bayes | Probabilistic classifier based on Bayes' theorem with strong independence assumptions. | No hyperparameters
XGBoost | Gradient boosting algorithm optimized for speed and performance. | n_estimators (100, 200, 500), max_depth (3, 5, 7)
LightGBM | Gradient boosting framework that uses tree-based learning algorithms. | n_estimators (100, 200, 500), max_depth (3, 5, 7)
CatBoost | Gradient boosting on decision trees with categorical feature support. | iterations (100, 200, 500), depth (3, 5, 7)
Linear discriminant analysis | Projects data to a lower dimension to maximize class separability. | No hyperparameters
Quadratic discriminant analysis | Uses quadratic decision surfaces to classify data. | No hyperparameters
Dummy classifier | Serves as a baseline by making random predictions based on class distribution. | No hyperparameters

XGBoost: eXtreme gradient boosting, LightGBM: Light gradient boosting machine, CatBoost: Categorical boosting
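The tuning procedure described above (grid search over each candidate model with 10-fold stratified cross-validation, scored on sensitivity/recall) can be sketched as follows. The synthetic dataset and the reduced candidate list stand in for the 401-patient cohort and the full model zoo of Table 1:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.svm import SVC

# Synthetic, class-imbalanced stand-in for the clinical dataset.
X, y = make_classification(n_samples=400, n_features=8, weights=[0.7, 0.3],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# A subset of the candidates in Table 1, with their tested grids.
candidates = {
    "logistic": (LogisticRegression(max_iter=1000),
                 {"C": [0.1, 1, 10], "solver": ["liblinear", "lbfgs"]}),
    "svm": (SVC(), {"C": [0.1, 1, 10]}),
    "random_forest": (RandomForestClassifier(random_state=0),
                      {"n_estimators": [100, 200], "max_depth": [5, 10, None]}),
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
best = {}
for name, (model, grid) in candidates.items():
    # Sensitivity (recall) is the primary tuning metric, to minimize false negatives.
    search = GridSearchCV(model, grid, scoring="recall", cv=cv, n_jobs=-1)
    search.fit(X_train, y_train)
    best[name] = (search.best_estimator_, search.best_score_)

for name, (_, score) in best.items():
    print(f"{name}: CV recall = {score:.2f}")
```

Stratification keeps the death/survival ratio constant across folds, which matters for a minority-class metric like recall.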

Evaluation metrics

The performance of the models was evaluated using several metrics, defined mathematically as follows:

Accuracy: The overall proportion of correct predictions. Accuracy = (TP + TN)/(TP + TN + FP + FN), where TP is True Positives, TN is True Negatives, FP is False Positives, and FN is False Negatives.

Sensitivity (recall): The proportion of actual positive cases (deaths) correctly identified by the model. Sensitivity = TP/(TP + FN)

Specificity: The proportion of actual negative cases (survivors) correctly identified by the model. Specificity = TN/(TN + FP)

Precision (positive predictive value [PPV]): The proportion of positive predictions that are actually positive. Precision = TP/(TP + FP)

Negative predictive value (NPV): The proportion of negative predictions that are actually negative. NPV = TN/(TN + FN)

Area under the receiver operating characteristic curve (AUC): A measure of the model’s ability to distinguish between classes, integrating the trade-offs between sensitivity and specificity across different thresholds.

The best model was determined based on its performance on the testing set, prioritizing sensitivity to ensure high-risk patients were correctly identified.
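The metric definitions above reduce to a small helper over confusion-matrix counts; the counts in the example call are illustrative only, not the study's confusion matrix:

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute the evaluation metrics defined above from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # recall: TP / (TP + FN)
        "specificity": tn / (tn + fp),   # TN / (TN + FP)
        "ppv": tp / (tp + fp),           # precision: TP / (TP + FP)
        "npv": tn / (tn + fn),           # TN / (TN + FN)
    }

# Illustrative counts only:
m = classification_metrics(tp=80, tn=60, fp=20, fn=40)
print({k: round(v, 2) for k, v in m.items()})
# accuracy 0.70, sensitivity 0.67, specificity 0.75, ppv 0.80, npv 0.60
```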

Feature reduction

After selecting the best-performing model, we conducted feature reduction to enhance clinical applicability. Initially, model development was done using all features. However, we reduced the feature set to the most relevant features to ensure that the model could be easily implemented in clinical practice. Feature importance was determined using “permutation feature importance analysis,” which measures the decrease in model performance (AUC) when each feature’s values are randomly shuffled. Features were ranked based on their permutation importance scores, and the top features with the highest importance scores were selected for the final reduced-feature model.
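Permutation feature importance as described (shuffle each feature, measure the drop in AUC) can be sketched with scikit-learn's `permutation_importance`; the model and synthetic data here stand in for the study's SVM and patient dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in: 8 features, a few genuinely informative.
X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=0)
model = SVC().fit(X_tr, y_tr)

# Shuffle each feature in turn and record the decrease in test-set AUC.
result = permutation_importance(model, X_te, y_te, scoring="roc_auc",
                                n_repeats=10, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]
top5 = ranking[:5]  # keep the five features whose shuffling hurts AUC most
print("Top-5 feature indices by AUC drop:", top5)
```

Because the importances are computed on held-out data, features that the model merely memorized contribute little, which is what makes this a reasonable basis for the reduced-feature model.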

Software and libraries

The research employed Python as the principal programming language for data analysis and ML tasks. Essential libraries such as pandas for data manipulation, numpy for numerical computations, and scikit-learn for model development were instrumental in the study. Gradient boosting models were implemented using XGBoost (eXtreme Gradient Boosting), LightGBM (Light Gradient Boosting Machine), and CatBoost (Categorical Boosting). Data visualization was facilitated by matplotlib and seaborn, providing valuable insights into the dataset and model performance.

Web application for clinical practice

To facilitate the use of the predictive model in clinical practice, we developed a web application using Streamlit. This application allows clinicians to feed patient data as input and receive real-time mortality risk predictions based on the best-performing model. The user-friendly interface ensures that the model can be easily accessed and utilized in various clinical settings, thereby aiding in timely and informed decision-making.
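A minimal sketch of such a Streamlit front end is shown below. All names here (`risk_app.py`, `svm_model.pkl`, `predict_risk`, the widget labels) are hypothetical illustrations, not the published app's code:

```python
# risk_app.py -- minimal sketch of a Streamlit triage app (hypothetical names).
import pickle

def predict_risk(model, severity, il6, ldh, neut_pct, nlr):
    """Return P(death) for one patient from the five reduced features."""
    return float(model.predict_proba([[severity, il6, ldh, neut_pct, nlr]])[0, 1])

def main():
    import streamlit as st  # imported lazily so predict_risk stays testable
    st.title("COVID-19 mortality risk prediction")
    severity = st.selectbox("Severe disease at presentation?", [0, 1])
    il6 = st.number_input("IL-6 (pg/mL)", min_value=0.0)
    ldh = st.number_input("LDH (U/L)", min_value=0.0)
    neut_pct = st.number_input("Neutrophil percentage", 0.0, 100.0)
    nlr = st.number_input("Neutrophil-to-lymphocyte ratio", min_value=0.0)
    with open("svm_model.pkl", "rb") as f:  # hypothetical serialized SVM pipeline
        model = pickle.load(f)
    if st.button("Predict"):
        st.metric("Predicted mortality risk",
                  f"{predict_risk(model, severity, il6, ldh, neut_pct, nlr):.0%}")

# Launch with: streamlit run risk_app.py  (Streamlit runs the script top to
# bottom, so a real app file would simply call main() at module level).

# Toy stand-in model to demonstrate predict_risk (the real app loads the trained SVM):
import numpy as np
from sklearn.svm import SVC
_X = np.array([[0, 10, 200, 60, 3], [1, 500, 900, 90, 20]] * 20, dtype=float)
_y = np.array([0, 1] * 20)
risk = predict_risk(SVC(probability=True).fit(_X, _y), 1, 500.0, 900.0, 90.0, 20.0)
```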

RESULTS

The study included 401 COVID-19-positive patients, with 277 survivors and 124 deaths. Missing data were minimal, with only four patients missing LDH values and six missing IL-6 values, both of which were imputed using median imputation. The mean age of the survivors was 51.52 years (standard deviation [SD] = 16.06), whereas the mean age of the deceased was 53.89 years (SD = 14.62), which was not statistically significant (P = 0.147). The sex distribution was similar between survivors and deceased patients, with males constituting 65.34% of survivors and 69.35% of deceased patients (P = 0.492). However, disease severity at presentation was significantly higher among those who died (90.32%) compared to those who survived (19.86%), with a P < 0.001.

Key laboratory findings revealed significant differences between the two groups. Deceased patients had higher levels of IL-6, LDH, ferritin, CRP, TLC, neutrophil percentage, and NLR, while survivors had higher lymphocyte percentage, monocyte percentage, platelet count, and lymphocyte-to-CRP ratio [Table 2].

Table 2: Clinical parameters and laboratory findings of COVID-19 patients, comparing survivors and non-survivors.
Parameter | Survived (n=277), Mean±SD | Death (n=124), Mean±SD | P-value
Age (years) | 51.52±16.06 | 53.89±14.62 | 0.147
Sex# (Male), Number (%) | 181 (65.34%) | 86 (69.35%) | 0.492
Disease severity# (Yes), Number (%) | 55 (19.86%) | 112 (90.32%) | <0.001*
Interleukin 6 (pg/mL) | 173.74±349.92 | 450.01±553.15 | <0.001*
Lactate dehydrogenase (U/L) | 670.49±558.84 | 835.35±641.71 | 0.014*
Ferritin (ng/mL) | 996.54±1573.49 | 1738.16±2892.18 | 0.008*
C-reactive protein (mg/dL) | 9.92±11.65 | 15.25±10.85 | <0.001*
Hemoglobin (g/dL) | 10.89±2.76 | 11.22±2.88 | 0.289
Total leukocyte count (×10³/μL) | 11.52±6.42 | 15.30±9.64 | <0.001*
Neutrophil percentage (%) | 79.35±13.11 | 83.93±12.06 | 0.001*
Lymphocyte percentage (%) | 13.00±9.88 | 9.83±10.27 | 0.004*
Monocyte percentage (%) | 5.72±4.30 | 4.68±2.78 | 0.004*
Platelet count (×10³/μL) | 219.38±118.32 | 183.30±110.65 | 0.003*
Absolute neutrophil count (×10³/μL) | 9.48±6.02 | 13.31±8.98 | <0.001*
Absolute lymphocyte count (×10³/μL) | 1.24±1.02 | 1.12±1.05 | 0.302
Neutrophil-to-lymphocyte ratio | 11.80±12.63 | 19.01±21.90 | 0.59 (corrected below)
Platelet-to-lymphocyte ratio | 301.39±760.01 | 273.17±283.36 | 0.59
Lymphocyte-to-C-reactive protein ratio | 11.27±32.83 | 1.83±4.37 | <0.001*
*P<0.05 (Student's independent t-test), #Chi-squared test, SD: Standard deviation

The performance of the various ML models was evaluated using multiple metrics, including accuracy, sensitivity, specificity, precision, NPV, and AUC, with detailed results provided [Supplementary Table 1]. The support vector machine (SVM) exhibited a test sensitivity of 0.84, test specificity of 0.74, and an AUC of 0.82, alongside a high training accuracy of 0.86 but a lower test accuracy of 0.77, suggesting a degree of overfitting. The extra trees model, another ensemble method, achieved a test sensitivity of 0.81, test specificity of 0.79, and an AUC of 0.81, with a consistent test accuracy of 0.79. Random forest had the highest training accuracy at 0.97 yet showed a test sensitivity of 0.74 and an AUC of 0.82, with a test accuracy of 0.80, indicating potential overfitting. In contrast, XGBoost presented perfect training metrics (accuracy, sensitivity, specificity, and AUC all equal to 1), but its test sensitivity dropped to 0.58 and test accuracy to 0.75, indicating severe overfitting. Overall, while ensemble methods such as extra trees, CatBoost, and random forest achieved high training performance, the disparity between their training and test metrics shows they were prone to overfitting, most markedly in XGBoost.


Figure 1 illustrates the performance metrics of the various models, showing training and test accuracy, test sensitivity, specificity, and AUC for each. The SVM model was noted for its balanced performance, particularly its sensitivity, which is crucial for minimizing false negatives in predicting mortality. SVM hyperparameters were optimized using GridSearchCV with 5-fold stratified cross-validation and recall scoring. The parameter grid included C = [0.01, 0.1, 1, 10], kernel = ["linear", "rbf", "poly"], gamma = ["scale", "auto"], and degree = [2, 3, 4] for polynomial kernels. The best estimator was selected based on cross-validated recall performance, balancing sensitivity and generalization.
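The SVM tuning procedure just described can be sketched as follows; the data are synthetic, but the parameter grid, 5-fold stratified cross-validation, and recall scoring mirror the text:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the five-feature clinical dataset.
X, y = make_classification(n_samples=300, n_features=5, weights=[0.7, 0.3],
                           random_state=0)

# Grid as described in the text; `degree` only affects the polynomial kernel.
param_grid = {
    "C": [0.01, 0.1, 1, 10],
    "kernel": ["linear", "rbf", "poly"],
    "gamma": ["scale", "auto"],
    "degree": [2, 3, 4],
}
search = GridSearchCV(
    SVC(),
    param_grid,
    scoring="recall",  # sensitivity-first model selection
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 2))
```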

Figure 1:
Performance metrics of various machine learning models. AUC: Area under curve

Given the high sensitivity of the SVM model, further optimization was performed by reducing the feature set to the top five most predictive features identified by permutation importance analysis based on AUC: clinical severity, IL-6, LDH, neutrophil percentage, and NLR. The number of features (five) was selected based on the AUC [Supplementary Figure 1]. This reduced-feature SVM model slightly increased the AUC from 0.82 to 0.83 and improved test accuracy from 0.77 to 0.78 [Figure 2]. In addition, the confusion matrices for the SVM models with all features and with reduced features demonstrated a decrease in false positives (from 18 to 15) while maintaining the same number of true positives (26). The final model's sensitivity, specificity, PPV, and NPV were 0.83, 0.78, 0.63, and 0.91, respectively.

Figure 2:
Receiver operating characteristic (ROC) curves and confusion matrices for support vector machine models with all features versus reduced features.

Overall, the reduced-feature SVM model provided a practical balance between complexity and predictive accuracy, making it well-suited for clinical implementation. The web application developed using Streamlit (https://covid-prediction.streamlit.app/) allows clinicians to feed patient data as input and receive real-time mortality risk predictions, enhancing decision-making in managing COVID-19 patients [Figure 3].

Figure 3:
COVID-19 survival prediction web application. (Available from - https://covid-prediction.streamlit.app/)

DISCUSSION

COVID-19 presented the medical fraternity with an enormous challenge in terms of resource allocation. This spurred extensive research into predictive modeling for patient outcomes, with ML techniques emerging as powerful tools for mortality prediction. In our previous study,[1] we used a combination of inflammatory and hematological markers with traditional statistical analysis, which helped us pinpoint key predictors of mortality such as CRP, IL-6, ferritin, and LCR. That approach allowed us to identify high-risk patients likely to face mortality irrespective of disease severity at onset. However, it could not capture the non-linear relationships or interactions between variables, which can be crucial in complex diseases like COVID-19.

Comparing the results of our previous study with the current ML approach reveals significant improvements in predictive accuracy. In the earlier study, individual laboratory parameters showed poor discriminatory performance, with AUC values ranging from 0.614 to 0.710.[1] The combination of four markers (CRP, IL-6, ferritin, and 1/LCR) improved prognostication but lacked a unified AUC measure. In contrast, our current ML model, particularly the optimized SVM with reduced features, achieved an AUC of 0.83. This marked improvement in AUC demonstrates the superior discriminative ability of the ML approach. Furthermore, the current model’s balanced performance across various metrics (sensitivity: 0.83, specificity: 0.78) suggests a more robust and clinically applicable tool for predicting COVID-19 mortality. ML models could uncover hidden patterns and interactions that traditional methods might miss. In addition, techniques like cross-validation and hyperparameter tuning make ML models more accurate and less prone to overfitting, particularly with large, multi-dimensional datasets. Another strength of ML is its flexibility, which allows for dynamic updates and feature reduction. For instance, with the SVM model, we were able to optimize the most important markers while keeping the model’s performance intact, thus bringing in cost-efficiency in laboratory testing.

Earlier studies have demonstrated the efficacy of various ML algorithms in identifying high-risk patients and predicting adverse outcomes.[6,7] Gao et al. (2020)[8] developed a mortality risk prediction model using four ML methods, achieving impressive AUC scores of 0.9621 in internal validation and 0.9760 and 0.9246 in two external validation cohorts. Their study highlighted the potential of ensemble-based models in accurately predicting COVID-19 mortality. Similarly, Subudhi et al. (2020)[9] compared 18 ML algorithms, finding that ensemble-based models consistently outperformed others in predicting both ICU admission and 28-day mortality. These studies underscored the utility of ML in processing complex clinical data to generate actionable insights for patient care. Vaid et al. (2020)[10] also demonstrated that ensemble models outperformed individual classifiers, with accuracy rates exceeding 90% in predicting near-term mortality.

Key predictive variables identified across multiple studies have included a combination of demographic factors, vital signs, and laboratory parameters. Gao et al. and Subudhi et al.[8,9] both emphasized the importance of age, vital signs (such as blood pressure, respiratory rate, and oxygen saturation), and laboratory markers (including LDH, D-dimer, troponin, and creatinine) in predicting mortality risk. Ikemura et al. (2021)[11] utilized automated ML (autoML) to develop high-performing models, identifying the top 10 influential variables for mortality prediction. Their approach demonstrated the potential of automated techniques in streamlining the model development process. Tezza et al. (2021)[12] found that Random Forest outperformed other ML techniques in predicting in-hospital mortality, achieving a receiver operating characteristic (ROC) area of 0.84. Meanwhile, Yu et al. (2021)[7] reported high accuracy (86.2%) and NPV (87.8%) using XGBoost for predicting the need for mechanical ventilation. Das et al. (2020)[13] developed an online prognostic tool using logistic regression, highlighting the importance of accessibility in clinical settings. Emami et al.[14] compared four ML algorithms, finding that Gradient Boosting Tree models performed best with an accuracy of 70% and ROC area of 0.857. These studies collectively highlighted the diverse array of ML algorithms being applied to COVID-19 outcome prediction, with ensemble methods often yielding superior results.

Our study builds upon this foundation while addressing several gaps in the existing literature, particularly in the context of the Indian population. Unlike many previous studies that focused solely on clinical or readily available laboratory parameters, our research incorporated a comprehensive set of inflammatory biomarkers, including IL-6, which has been less frequently studied in the Indian context. Our results are consistent with prior studies that emphasize the predictive value of laboratory markers for COVID-19 mortality. However, our study identified a unique combination of the top five most predictive features: Clinical severity, IL-6, LDH, neutrophil percent, and the NLR. This feature set is somewhat similar to those identified in the work of Alle et al. (2022),[15] indicating that the specific characteristics of the Indian patient population and the unvaccinated status of our cohort align with previous findings. The performance of our SVM model, with an AUC of 0.83 after feature reduction, compares favorably with previous studies, though it does not reach the exceptionally high AUC values reported by Gao et al. (2020).[8] However, our model’s balance of sensitivity (0.83) and specificity (0.78) suggests robust performance in identifying high-risk patients while maintaining a low false-positive rate. Notably, our study’s focus on unvaccinated patients provides valuable insights into mortality prediction for this vulnerable population, addressing a critical need as vaccination rates vary globally. Furthermore, our development of a web-based application for real-time risk prediction represents a significant step toward translating research findings into practical clinical tools, addressing the gap between model development and clinical implementation often seen in previous studies.

However, several limitations must be acknowledged. The data utilized were collected during the first wave of the COVID-19 pandemic, a period preceding the widespread availability of vaccines. Consequently, all patients in this study were unvaccinated, and the viral strains circulating at that time differ from current variants. Vaccinated individuals exhibit different immune responses, disease severity, and outcomes compared to unvaccinated patients. Therefore, models developed from pre-vaccination data may not accurately predict current patient outcomes, limiting the generalizability of our findings to the present scenario. Furthermore, the constantly evolving virus necessitates continuous updates to predictive models to maintain their relevance and accuracy. Hence, the value of this study lies less in the evaluation of COVID-19 patients per se than in the predictive-modeling framework we have developed.

The SVM model showed a decrease in AUC from 0.91 (training) to 0.82 (test), reflecting the common bias-variance tradeoff in ML. Overfitting, where models memorize training data rather than learning generalizable patterns, is a significant concern in clinical prediction modeling, potentially leading to inadequate or harmful clinical decisions (Kernbach and Staartjes, 2021).[16] As these authors emphasize, modest out-of-sample performance degradation is expected when prioritizing generalization, whereas massive performance gaps indicate problematic overfitting. Importantly, our SVM demonstrated the smallest performance degradation among all tested algorithms; XGBoost declined from AUC 0.99 to 0.75, while the SVM declined from 0.91 to 0.82, indicating effective regularization through feature reduction and hyperparameter optimization. Contributing factors for this overfitting may include limited sample size, single-center data collection, and inherent biological variability.

Another limitation is the absence of clinical symptoms and examination findings from the model. Parameters such as respiratory rate, oxygen saturation, comorbidities, and imaging results are crucial for a comprehensive assessment of a patient’s condition. The integration of clinical data could potentially enhance the model’s predictive power by providing additional context and capturing aspects of the disease not reflected in laboratory values alone. Initial clinical conditions in our patients were evaluated on the basis of their admission to the wards or the ICU. Future studies should consider incorporating a broader range of clinical features to develop more robust and accurate predictive models.

Despite these limitations, our study underscores the potential of ML approaches in improving mortality predictions in COVID-19 and similar infectious diseases. The ML model not only confirmed the importance of key biomarkers identified in our previous research but also enhanced predictive accuracy by effectively handling complex interactions between variables. This enhancement is evident in our results, where the SVM model with reduced features achieved an AUC of 0.83, compared to our previous study’s individual parameter AUCs ranging from 0.614 to 0.710. This advancement over traditional methods highlights the value of ML in clinical decision-making, enabling earlier identification of high-risk patients and more efficient allocation of medical resources.
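The gain from combining markers, as opposed to relying on any single one, can be illustrated on synthetic data (hypothetical features, not the study's markers):

```python
# Sketch on synthetic data: a multivariable SVM typically outperforms the best
# single marker, mirroring the single-parameter vs. combined-model comparison above.
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=5, n_informative=5,
                           n_redundant=0, n_clusters_per_class=1, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

# AUC of each marker used alone (its raw value as the risk score,
# taking whichever orientation ranks cases higher)
single_aucs = [max(roc_auc_score(y_te, X_te[:, j]),
                   roc_auc_score(y_te, -X_te[:, j])) for j in range(X.shape[1])]

svm = make_pipeline(StandardScaler(), SVC(probability=True, random_state=1))
svm.fit(X_tr, y_tr)
combined_auc = roc_auc_score(y_te, svm.predict_proba(X_te)[:, 1])

print("best single-marker AUC:", round(max(single_aucs), 3))
print("combined SVM AUC:      ", round(combined_auc, 3))
```

The multivariable model improves on every single marker because it exploits interactions between predictors, which is precisely the advantage over individual-parameter AUCs noted above.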

Above all, the risk-stratification approach adopted here, when coupled with the web application for mortality prediction, may prove very useful in any future unforeseen pandemics or epidemics.
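As a hypothetical, dependency-free sketch (standard library only), a prediction endpoint of the kind described might be structured as follows; the feature names and scoring rule are placeholders, not the deployed application's trained SVM:

```python
# Hypothetical sketch of a risk-prediction web endpoint; risk_score is a
# stand-in for a real model, which would load the serialized SVM pipeline
# and apply the same preprocessing used during training.
import json
import math
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

FEATURES = ["clinical_severity", "il6", "ldh", "neutrophil_pct", "nlr"]

def risk_score(values: dict) -> float:
    """Placeholder logistic-style score in (0, 1) from the five inputs."""
    z = sum(values[f] for f in FEATURES) / len(FEATURES)
    return 1.0 / (1.0 + math.exp(-z))

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        values = {f: float(payload[f]) for f in FEATURES}
        body = json.dumps({"mortality_risk": risk_score(values)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):  # keep the sketch quiet
        pass

def serve(port: int = 0) -> HTTPServer:
    """Start the endpoint on a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), PredictHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A deployed version would substitute the trained model for the placeholder score and add input validation, but the request/response shape, five predictors in, one risk probability out, would be similar.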

CONCLUSIONS

Machine learning, especially the support vector machine model, proved to be a reliable tool for predicting mortality in unvaccinated COVID-19 patients. Using just five routinely available markers (clinical severity, IL-6, LDH, neutrophil percentage, and NLR), the model achieved strong accuracy and was easy to apply in practice. The web-based tool developed from this model can help clinicians quickly identify high-risk patients and make better decisions about care and resource use during outbreaks.

Acknowledgment:

The authors extend their sincere gratitude to Tushar Sehgal, Ranjan Yadav, Suneeta Meena, Souvik Maitra, Kapil Dev Soni, Arulselvi Subramanian, Shyam Prakash, Purva Mathur, Sandeep Mittan, Sooyun Tavolacci, Ajeet Kaushik, Kiran Gulia, Ebrahim Mostafavi, Abhishek Gupta, Anjan Trikha, Ritu Gupta, Kunzang Chosdol, Anant Mohan, Kalaivani Mani, and Subrata Sinha for their invaluable contributions to the previous study. Their dataset and research have provided a strong foundation for the present work.

Authors' contributions:

SKD: Conceptualized the study; SKD, KG: Supervision; SKD, PC: Performed data extraction and initial analysis; KM: Carried out the machine learning-related formal analysis and prepared the initial draft. All authors contributed to manuscript writing, critical review, and editing, and approved the final version of the manuscript.

Ethical approval:

The research/study was approved by the Institutional Review Board at All India Institute of Medical Sciences, Delhi, approval number IEC-578/19.06.2020, RP-03/2020, dated 19th June 2020.

Declaration of patient consent:

Patient consent was not required as patients' identities are not disclosed or compromised.

Conflicts of interest:

Dr. Sudip Kumar Datta is a member of the editorial board of the journal.

Use of artificial intelligence (AI)-Assisted Technology for manuscript preparation:

The authors confirm that artificial intelligence (AI)-assisted technology [Claude 3.5 Sonnet (Anthropic)] was used for proofreading and grammar checks during manuscript writing.

Financial support and sponsorship: Nil.

References

  1. , , , , , , et al. A combination of inflammatory and hematological markers is strongly associated with the risk of death in both mild and severe initial disease in unvaccinated individuals with COVID-19 infection. EJIFCC. 2023;34:42-56.
    [Google Scholar]
  2. , , . Re-thinking data strategy and integration for artificial intelligence: Concepts, opportunities, and challenges. Appl Sci. 2023;13:7082.
    [CrossRef] [Google Scholar]
  3. , , , , , , et al. Associations of D-dimer on admission and clinical features of COVID-19 patients: A systematic review, meta-analysis, and meta-regression. Front Immunol. 2021;12:691249.
    [CrossRef] [PubMed] [Google Scholar]
  4. , , . Assessment of inflammatory markers and their association with disease mortality in severe COVID-19 patients of tertiary care hospital in South India. Egypt J Bronchol. 2022;16:55.
    [CrossRef] [Google Scholar]
  5. . Clinical management of severe acute respiratory infection (SARI) when COVID-19 disease is suspected. Interim guidance. Pediatr Med Rodz. 2020;16:9-26.
    [CrossRef] [Google Scholar]
  6. , , , , , , et al. COVID mortality prediction with machine learning methods: A systematic review and critical appraisal. J Pers Med. 2021;11:893.
    [CrossRef] [PubMed] [Google Scholar]
  7. , , , , , , et al. Machine learning methods to predict mechanical ventilation and mortality in patients with COVID-19. PLoS One. 2021;16:e0249285.
    [CrossRef] [PubMed] [Google Scholar]
  8. , , , , , , et al. Machine learning based early warning system enables accurate mortality risk prediction for COVID-19. Nat Commun. 2020;11:5033.
    [CrossRef] [PubMed] [Google Scholar]
  9. , , , , , , et al. Comparing machine learning algorithms for predicting ICU admission and mortality in COVID-19. NPJ Digit Med. 2020;4:87.
    [CrossRef] [PubMed] [Google Scholar]
  10. , , , , , , et al. Machine learning to predict mortality and critical events in a cohort of patients with COVID-19 in New York City: Model development and validation. J Med Internet Res. 2020;22:e24018.
    [CrossRef] [PubMed] [Google Scholar]
  11. , , , , , , et al. Using automated machine learning to predict the mortality of patients with COVID-19: Prediction model development study. J Med Internet Res. 2021;23:e23458.
    [CrossRef] [PubMed] [Google Scholar]
  12. , , , , , . Predicting in-hospital mortality of patients with COVID-19 using machine learning techniques. J Pers Med. 2021;11:343.
    [CrossRef] [PubMed] [Google Scholar]
  13. , , . Predicting CoVID-19 community mortality risk using machine learning and development of an online prognostic tool. PeerJ. 2020;8:e10083.
    [CrossRef] [PubMed] [Google Scholar]
  14. , , , . Predicting the mortality of patients with Covid-19: A machine learning approach. Health Sci Rep. 2023;6:e1162.
    [CrossRef] [PubMed] [Google Scholar]
  15. , , , , , , et al. COVID-19 risk stratification and mortality prediction in hospitalized Indian patients: Harnessing clinical data for public health benefits. PLoS One. 2022;17:e0264785.
    [CrossRef] [PubMed] [Google Scholar]
  16. , . Foundations of machine learning-based clinical prediction modeling: Part II-generalization and overfitting. Acta Neurochir Suppl. 2022;134:15-21.
    [CrossRef] [PubMed] [Google Scholar]