Interpretable Machine Learning in Cardiovascular Diagnosing: The Role of Machine Learning in Applied Econometrics

Andreas C. Markussen

Student thesis: Master thesis

Abstract

The aim of this thesis is to combine the methods of econometrics with the scientific approach of medicine to predict the prevalence of heart arrhythmia in patients following a percutaneous coronary intervention procedure. The thesis takes a critical look at the shortcomings of using linear models to map complex functions. The research relies on registry-based patient data collected by the National Hospital of Denmark (Rigshospitalet) and provided by the National Statistics Authority of Denmark (Danmarks Statistik, or DST) through access to a research server. The standard research pipeline of the cardiovascular department of Rigshospitalet, which relies heavily on survival analysis, is compared with statistical approaches used in data science. Throughout the thesis, increasingly complex methods are applied in order to uncover causal relationships in cardiovascular disease. It is found that the results of unsupervised learning models are not significantly related to the outcome of arrhythmia and perform poorly compared to a simple medical segmentation, and therefore have little relevance. Supervised machine learning models outperform the baseline Cox regression and logistic regression models used in medicine on out-of-sample data, showing that reliance on these baseline models can lead researchers to incorrect conclusions. It is found that XGBoost, a boosted tree ensemble model, performs best, achieving a recall of 82.2% and a precision of 23.4%, resulting in an F2-score of 0.18. As it is a black box model, an implementation of a Shapley-based local surrogate model is introduced to produce actionable insights for medical personnel. It is found that the XGBoost model identifies different features as relevant than the predominantly used Cox regressions do. This implies that reliance on Cox regressions could lead a practitioner to draw incorrect conclusions. Furthermore, several non-linear relationships in arrhythmia determinants are uncovered.
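For readers who want to relate the reported recall and precision to an F-score, the following is a minimal sketch of the textbook F-beta definition, in which beta = 2 weights recall more heavily than precision. The function name `f_beta` is illustrative and is not taken from the thesis. Under this definition, the reported recall and precision correspond to an F2 of roughly 0.55 rather than 0.18; the abstract does not say how its figure was computed, so the two may rest on different thresholds or definitions.

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """Textbook F-beta score: a weighted harmonic mean of precision and recall.

    beta > 1 favours recall, beta < 1 favours precision.
    This is an illustrative helper, not code from the thesis.
    """
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    b2 = beta ** 2
    denominator = b2 * precision + recall
    if denominator == 0:
        # Both precision and recall are zero, so the score is defined as zero.
        return 0.0
    return (1 + b2) * precision * recall / denominator


# Values quoted in the abstract: precision 23.4%, recall 82.2%.
score = f_beta(precision=0.234, recall=0.822, beta=2.0)
print(round(score, 3))  # 0.547
```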
The main cause of the underperformance of the Cox and logistic regressions is determined to be their inherent linearity assumption, which XGBoost does not impose. This is demonstrated by inspecting the partial dependence plots of the underlying SHAP values, which show that the linearity assumption is violated. It is concluded that data science approaches can be used to generate high-level hypotheses for later medical verification, as well as to challenge existing assumptions about arrhythmia or similar cases based on complex data.
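The idea behind Shapley-based explanations can be sketched with a toy example. The code below computes exact Shapley values by enumerating all feature coalitions for a small hand-written risk function. Everything in it is an assumption made for illustration: the function `toy_risk`, the feature names (age, BMI, smoker) and the baseline patient are hypothetical and do not come from the thesis, and the brute-force enumeration stands in for the tree-specific algorithms that a SHAP implementation would normally use with a boosted tree model.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline).

    Features outside a coalition are set to their baseline value.
    The cost grows as 2^n, so this is only feasible for a few features.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude feature i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_risk(z):
    """Hypothetical non-linear risk score (not the thesis model)."""
    age, bmi, smoker = z
    return 0.01 * (age - 60) ** 2 + 0.5 * bmi * smoker


x = [70.0, 30.0, 1.0]         # hypothetical patient
baseline = [60.0, 25.0, 0.0]  # hypothetical reference patient
phi = shapley_values(toy_risk, x, baseline)

# The attributions sum to the difference between the two predictions.
print(phi, sum(phi), toy_risk(x) - toy_risk(baseline))
```

Recomputing the age attribution for a range of ages and plotting it against age would trace a curve rather than a straight line in this toy setting, which is the kind of pattern a dependence plot of SHAP values makes visible and a linear model cannot represent.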

Education: MSc in Applied Economics and Finance (Graduate Programme), Final Thesis
Language: English
Publication date: 2021
Number of pages: 99