AI models show promise, not perfection, in heart transplant mortality prediction

CO-EDP, VisionRI | Updated: 05-04-2025 18:08 IST | Created: 05-04-2025 18:08 IST

Artificial intelligence models are increasingly being used to predict survival outcomes after heart transplantation, but current algorithms offer only moderate accuracy, according to a new comprehensive review published in Frontiers in Artificial Intelligence. The study, titled “Mortality Prediction of Heart Transplantation Using Machine Learning Models: A Systematic Review and Meta-Analysis,” analyzes data from 17 studies and reveals both the clinical potential and the limitations of machine learning (ML) in one of the most complex areas of cardiac medicine.

The researchers, led by a multidisciplinary team from Iranian medical institutions, assessed the predictive performance of ML algorithms such as random forests, CatBoost, support vector machines (SVMs), and artificial neural networks (ANNs) by calculating the pooled area under the curve (AUC), a key metric of diagnostic accuracy. The findings highlight a critical need for methodological standardization, external validation, and consistent data reporting to improve reliability and clinical applicability.

How accurate are machine learning models at predicting mortality after heart transplantation?

The authors examined the performance of ML algorithms in predicting mortality across timeframes ranging from three months to 10 years after transplantation. They found that while some models, especially CatBoost and ensemble methods, outperformed traditional regression-based approaches, the overall accuracy remained below the threshold generally considered clinically actionable. The average AUC of 0.65 falls short of the 0.90 benchmark typically associated with excellent diagnostic performance.
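The AUC can be read as the probability that a model ranks a randomly chosen patient who died above a randomly chosen survivor. A minimal pure-Python sketch, using made-up risk scores rather than any data from the review, shows how the metric is computed:

```python
def auc(labels, scores):
    """Concordance-based AUC: the fraction of (death, survivor) pairs
    in which the model assigns the death the higher risk score.
    Ties count as half a concordant pair."""
    pos = [s for y, s in zip(labels, scores) if y == 1]  # died
    neg = [s for y, s in zip(labels, scores) if y == 0]  # survived
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative risk scores for four patients (1 = died within follow-up).
labels = [0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80]
print(auc(labels, scores))  # 0.75: 3 of 4 pairs ranked correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect discrimination, which is why a pooled value of 0.65 counts as only moderately better than chance.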

Notably, when only the best-performing model from each study was pooled, the AUC rose to 0.73, demonstrating that well-designed ML applications have potential for improving survival prediction in heart transplant recipients. For instance, one of the highest-performing models, developed by Miller et al. using a random forest (RF) algorithm, achieved an AUC of 0.89 on a large dataset of over 67,000 cases. In contrast, the model with the lowest performance came from Nilsson et al., using an ANN with a reported AUC of 0.64.
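The pooling step can be illustrated with a sample-size-weighted average of per-study AUCs. This is a deliberately simplified fixed-effect sketch with hypothetical study numbers, not the review's actual data or method; a formal meta-analysis would typically use a random-effects model:

```python
# Hypothetical (study AUC, sample size) pairs -- invented for illustration.
studies = [(0.89, 67000), (0.64, 10000), (0.70, 5000)]

def pooled_auc(studies):
    """Sample-size-weighted mean AUC (fixed-effect simplification)."""
    total_n = sum(n for _, n in studies)
    return sum(a * n for a, n in studies) / total_n

print(round(pooled_auc(studies), 2))  # 0.85 for these made-up inputs
```

The weighting explains why one very large study, such as the 67,000-case dataset, can dominate a pooled estimate.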

The meta-regression analysis also found that model performance improved with longer prediction horizons: algorithms forecasting long-term mortality were more accurate than those targeting the early post-operative period. This suggests that ML algorithms may be particularly valuable for long-term monitoring and prognosis, rather than immediate post-operative risk stratification.

The type of algorithm was another important factor in performance. Ensemble models and gradient boosting approaches consistently outperformed single-layer perceptrons and simpler decision tree-based classifiers. Notably, the study found no statistically significant performance difference between traditional machine learning and deep learning models, challenging assumptions that neural networks are universally superior.
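Why ensembles tend to outperform single models can be seen in a toy example: two models that each rank patients no better than chance can be averaged into a score that ranks them perfectly, because their errors fall on different patients. The scores below are invented purely for illustration:

```python
def auc(labels, scores):
    """Fraction of (death, survivor) pairs ranked correctly (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels  = [0, 0, 1, 1]          # 1 = died within follow-up
model_a = [0.2, 0.9, 0.8, 0.7]  # badly overestimates patient 2's risk
model_b = [0.9, 0.1, 0.8, 0.7]  # badly overestimates patient 1's risk
ensemble = [(a + b) / 2 for a, b in zip(model_a, model_b)]

print(auc(labels, model_a))   # 0.5 -- no better than chance
print(auc(labels, model_b))   # 0.5
print(auc(labels, ensemble))  # 1.0 -- the two errors cancel out
```

Real ensembles such as random forests and gradient boosting exploit the same principle at scale, combining many weak learners whose mistakes are partly uncorrelated.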

What patient and donor characteristics matter most in mortality prediction?

Beyond raw performance metrics, the review also identified key clinical variables that consistently influenced the accuracy of ML models. Patient-related factors such as age, functional status, diagnosis, and creatinine or bilirubin levels were among the most important predictors across nearly all high-performing models. For pediatric patients, congenital heart defects and time in critical care (Status 1A) emerged as particularly salient indicators.

Donor characteristics also played a significant role. Donor age, ischemic time, and cytomegalovirus (CMV) status were repeatedly cited as top predictors in multiple studies. For example, Lisboa et al. and Nilsson et al. found donor age to be highly predictive of one-year mortality, while Zhou et al. emphasized the importance of graft-related indicators.

Transplant process factors, including ventilator use, total ischemic time, and the requirement for post-operative dialysis, were also frequently included as high-weight variables in ML models. Some studies, like that of Kampaktsis et al., found that post-transplant complications such as the need for hemodialysis strongly predicted early mortality.

These findings point to the potential of ML algorithms not only as predictive tools but also as diagnostic aids capable of refining clinical risk stratification. By accounting for a wide array of variables, many of which interact non-linearly, ML systems may eventually improve upon traditional models like the Donor Risk Index (DRI), Risk Stratification Score (RSS), and IMPACT.

What are the limitations of existing models and how can they be improved?

Despite the promise shown by certain algorithms, the review underscores the limitations that still hinder the real-world application of ML in heart transplantation. Eight of the 17 included studies were found to have a high risk of bias, particularly in the flow and timing of data collection. External validation was limited, with only four studies testing their models outside of the original datasets, raising questions about the generalizability of the results.

High heterogeneity was another major issue. Variations in study population characteristics, algorithm design, feature selection methods, hyperparameter tuning, and data preprocessing all contributed to performance variability across studies. For example, while some researchers used medical experts for feature selection, others relied on automated ML methods. Similarly, handling of missing data ranged from simple exclusions to sophisticated imputation techniques, with little consistency in reporting.

The authors call for future research to adhere to reporting standards such as the TRIPOD+AI guidelines, which aim to improve transparency, reproducibility, and quality in the development of clinical prediction models using machine learning. They also recommend the inclusion of more pediatric-specific studies, as only one pediatric-focused study was eligible for meta-analysis, limiting subgroup insights.

To sum up, while ML models are not yet ready to replace conventional clinical risk scores, they are steadily improving and could eventually play a central role in post-transplant decision-making. Better standardization, external validation, and integration of domain expertise are key steps toward achieving that goal.

FIRST PUBLISHED IN: Devdiscourse