Credit Risk Modeling With Text Data Using Supervised Learning

Nicklas Jensen & Anders Lynge Dissing

Student thesis: Master thesis


This research paper explores whether standard bankruptcy models that use only financial ratios can be improved by complementing them with information from unstructured text data (the management’s review and the auditor’s report in the annual report). The unstructured text data is quantified using three measures: Log(length), complexity (Fog Index), and sentiment. First, three standard models are constructed using three different machine learning methods: Logistic Regression, Random Forest, and Neural Network. All models are optimized on the performance measure Balanced Accuracy, and the training data is undersampled so as to achieve the best possible Balanced Accuracy. Afterwards, six new variables, consisting of the quantified unstructured text data, are added to the models, enabling a comparison of the models with and without the new variables. The paper finds no significant improvements from using the information in the management’s review and the auditor’s report. Although some significant effects are observed, these must be interpreted with caution, as they may result from the different undersampling strategies rather than from the additional information contributed by the new variables.
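The abstract names the text measures and the performance measure without defining them. The following is a minimal Python sketch of two of the text measures and of Balanced Accuracy, not the thesis's actual implementation. It assumes the standard Gunning Fog formula, 0.4 × (words per sentence + 100 × share of words with three or more syllables), a rough English vowel-group syllable count, and text length measured in words; the thesis may define any of these differently, especially for non-English reports. All function names are hypothetical, and the sentiment measure is left out because the abstract does not say how it is computed.

```python
import math
import re


def split_words(text: str) -> list[str]:
    """Extract alphabetic tokens (assumption: ASCII letters only)."""
    return re.findall(r"[A-Za-z]+", text)


def count_syllables(word: str) -> int:
    """Rough heuristic: one syllable per group of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def log_length(text: str) -> float:
    """Natural log of the word count (assumed unit of length)."""
    return math.log(max(1, len(split_words(text))))


def fog_index(text: str) -> float:
    """Gunning Fog: 0.4 * (words per sentence + 100 * share of complex words)."""
    words = split_words(text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return 0.0
    complex_words = [w for w in words if count_syllables(w) >= 3]
    words_per_sentence = len(words) / len(sentences)
    complex_share = 100 * len(complex_words) / len(words)
    return 0.4 * (words_per_sentence + complex_share)


def balanced_accuracy(y_true: list[int], y_pred: list[int]) -> float:
    """Mean of sensitivity and specificity for binary labels (1 = bankrupt)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    # Guard against a class with no observations.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return (sensitivity + specificity) / 2
```

Balanced Accuracy weights the bankrupt and non-bankrupt classes equally, which is presumably why it is preferred over plain accuracy here: bankruptcies are rare, so a model that always predicts "not bankrupt" would score high on plain accuracy but only 0.5 on Balanced Accuracy.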

Education: MSc in Finance and Accounting (Graduate Programme), Final Thesis
Publication date: 2021
Number of pages: 162