Default Prediction with the use of Machine Learning

Andreas Kjøller-Hansen & Sara Skovhøj Jensen

Student thesis: Master thesis


This thesis investigates the classification of US companies as default or non-default over the period 1987-2015. The data is split according to two time horizons and according to whether market variables are included. This results in four data sets. The classification is done with five machine learning methods: logistic regression, neural network, linear SVM, RBF SVM, and random forest. The models are evaluated on accuracy, the distribution of type 1 and type 2 errors, and the ROC curve with its AUC measure. When only the accuracy and the distribution of the error types are taken into account, the best methods for predicting default on data including both accounting and market variables are neural network and linear SVM, whereas the best method on the data sets including only accounting variables is random forest. When the AUC measure and the ROC curve are taken into account, random forest is the best method for predicting default on all tested data sets. The overall conclusion is that random forest is, in general, the most appropriate method given the empirical results on the data sets used in this thesis. The thesis also investigates variable selection using logistic regression and random forest, and it concludes that the two methods conflict: random forest ranks some variables among the least important, while logistic regression includes those same variables in its models.
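The evaluation measures named above can be illustrated with a minimal Python sketch. It is not the thesis's code. The function names, the toy labels and scores, and the 0.5 threshold are all hypothetical. It also assumes that default is the positive class (label 1), that a type 1 error is a non-defaulting firm flagged as default, and that a type 2 error is a missed default; the thesis may define the two error types the other way round.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives, with 1 = default (assumed)."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 0 and p == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def error_summary(y_true, y_pred):
    """Accuracy plus type 1 (false positive) and type 2 (false negative) rates."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "type1_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "type2_rate": fn / (fn + tp) if (fn + tp) else 0.0,
    }


def auc(y_true, scores):
    """AUC as the share of (default, non-default) pairs ranked correctly.

    Ties count as half a correct ranking. This pairwise form is quadratic
    in the sample size, which is fine for a small illustration.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one default and one non-default")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = default, 0 = non-default.
y_true = [1, 1, 0, 0, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.3, 0.6, 0.1, 0.8, 0.2, 0.05]
threshold = 0.5  # assumed cut-off for turning scores into class labels
y_pred = [1 if s >= threshold else 0 for s in scores]

summary = error_summary(y_true, y_pred)
area = auc(y_true, scores)
print(summary, area)
```

Accuracy and the error rates depend on the chosen threshold, while the AUC depends only on how the scores rank the firms. That difference is one way two evaluation criteria can favour different methods, as reported above.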

Finally, the results of the thesis are transferred to non-listed Danish firms with a focus on the capital requirement of the credit lender. There are two approaches to calculating the capital requirement: the IRB approach and the standardized approach. The larger credit institutions in Denmark primarily use the IRB approach, which relies on the lender's own credit risk model to calculate values for PD, LGD, and EAD and has the benefit of setting lower capital requirements. A more precise credit risk model has other benefits as well, since it makes the calculation of provisions more accurate and the evaluation of potential customers more trustworthy and fair. The last part shows that the empirical results of the thesis are in accordance with results from previous default studies.
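To show how PD, LGD, and EAD combine, here is a small Python sketch of the standard expected-loss identity EL = PD × LGD × EAD. This is a textbook relation, not the thesis's calculation and not the IRB capital formula itself, which uses a more involved risk-weight function. The function name and the example figures are hypothetical.

```python
def expected_loss(pd_, lgd, ead):
    """Expected loss as PD x LGD x EAD.

    pd_ : probability of default over the horizon, between 0 and 1
    lgd : loss given default as a fraction of the exposure, between 0 and 1
    ead : exposure at default in currency units, non-negative
    """
    if not 0.0 <= pd_ <= 1.0:
        raise ValueError("PD must lie in [0, 1]")
    if not 0.0 <= lgd <= 1.0:
        raise ValueError("LGD must lie in [0, 1]")
    if ead < 0:
        raise ValueError("EAD must be non-negative")
    return pd_ * lgd * ead


# Hypothetical loan: 2% PD, 45% LGD, exposure of 1,000,000.
el = expected_loss(0.02, 0.45, 1_000_000)
print(round(el, 2))
```

Because the PD enters multiplicatively, an error in the model's PD estimate carries over proportionally to the expected loss, which is one reason a more precise default model matters for provisions.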

Educations: MSc in Finance and Accounting, (Graduate Programme) Final Thesis; MSc in Accounting, Strategy and Control, (Graduate Programme) Final Thesis
Publication date: 2020
Number of pages: 119
Supervisors: Jens Dick-Nielsen