Applying Machine Learning in Corporate Default Prediction

Hans Dall-Møller & Oscar Monberg

Student thesis: Master thesis


The thesis investigates the extent to which machine learning can be applied to corporate default prediction. To address this overall problem statement, it tests three research questions. First, it analyzes whether logistic regression and random forest differ in accuracy when predicting default. Second, it tests whether adding non-firm-specific variables has any effect on model accuracy. Third, it investigates whether the driving variables and the precision of the models are conditional on industry when predicting corporate default. The data used to test the research questions is private company information extracted from the Bureau van Dijk (BVD) database, subject to the following selection criteria: the data covers 2000-2017; the companies are from France, Italy, Portugal and Spain; they fall within the BVD size classifications Very Large, Large and Medium; and they are restricted to manufacturing and wholesale trade as defined by SIC code. The variables are either firm-specific or non-firm-specific. The firm-specific variables consist of financial ratios from three categories: profitability, asset efficiency and solvency. The non-firm-specific variables consist of macroeconomic indicators, a stock market index and commodity prices.
For the first research question, random forest outperforms logistic regression by 4.7 percentage points, indicating that random forest is better at predicting corporate default. For the second research question, adding non-firm-specific variables increases accuracy only marginally for random forest and has no effect on logistic regression. In conclusion, non-firm-specific variables do not materially increase accuracy. For the third research question, accuracy is highest for the total sample, indicating no gain from splitting samples by industry. This is supported by the finding that the driving variables of both models are very similar across industries. Taken together, this indicates that the driving variables are not conditional on industry.
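
The model comparison above is reported as a gap in accuracy measured in percentage points. As a minimal sketch of how such a gap can be computed, the following uses made-up labels and predictions (1 = default, 0 = no default); the function names and all values are illustrative assumptions and are not taken from the thesis or its data.

```python
def accuracy(y_true, y_pred):
    """Share of observations whose predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def percentage_point_gap(acc_a, acc_b):
    """Difference between two accuracies, expressed in percentage points."""
    return (acc_a - acc_b) * 100


# Hypothetical data, NOT from the thesis: 1 = default, 0 = no default.
y_true = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
pred_logit = [1, 0, 1, 0, 0, 1, 0, 1, 1, 0]   # stand-in for logistic regression
pred_forest = [1, 0, 0, 0, 0, 1, 0, 1, 1, 0]  # stand-in for random forest

acc_logit = accuracy(y_true, pred_logit)
acc_forest = accuracy(y_true, pred_forest)
gap = percentage_point_gap(acc_forest, acc_logit)

print(f"logistic regression accuracy: {acc_logit:.1%}")
print(f"random forest accuracy:       {acc_forest:.1%}")
print(f"gap: {gap:.1f} percentage points")
```

A gap in percentage points is the absolute difference between the two accuracy rates, not the relative improvement of one model over the other.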

Education: MSc in Finance and Investments (Graduate Programme), Final Thesis
Publication date: 2019
Number of pages: 95