Predicting Dropout Probabilities using Supervised Machine Learning

Saifullah Babrak & Andreas Schade

Student thesis: Master thesis


In this master thesis we investigate how machine learning algorithms can be used in the education industry. More specifically, we investigate whether machine learning algorithms can be used to predict dropout probabilities for high school students. Using Edaptio's anonymized student data, we try different algorithms, namely Logistic Regression, Random Forest, XGBoost and Neural Networks, to see whether it is possible to predict a useful dropout probability. Lastly, we construct a Stacking model by combining all the previously mentioned models except the Neural Network. For every model type we fit both a balanced and an unbalanced version. Furthermore, we test different thresholds and find the best performing model. After testing the models, we conclude that the Logistic Regression models do not obtain any useful results. Random Forest and XGBoost, on the other hand, result in high-performing models which can be used for predicting dropout probabilities with a low misclassification rate. We also conclude that the Neural Network achieves the same performance level as the Logistic Regression models. Lastly, we conclude that Stacking obtains results similar to those of Random Forest and XGBoost.
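
The threshold testing mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's code: the function names, the toy labels and probabilities, and the candidate thresholds are all hypothetical, and it assumes the misclassification rate is the quantity being minimized, which the abstract does not state explicitly.

```python
def misclassification_rate(y_true, probs, threshold):
    """Share of observations whose predicted label differs from the true label."""
    # A student is predicted to drop out when the probability reaches the threshold.
    preds = [1 if p >= threshold else 0 for p in probs]
    errors = sum(1 for y, y_hat in zip(y_true, preds) if y != y_hat)
    return errors / len(y_true)


def best_threshold(y_true, probs, candidates):
    """Return the (threshold, rate) pair with the lowest misclassification rate.

    Ties are resolved in favour of the first candidate in the list.
    """
    rates = [(t, misclassification_rate(y_true, probs, t)) for t in candidates]
    return min(rates, key=lambda pair: pair[1])


# Hypothetical toy data: 1 = dropped out, 0 = stayed enrolled.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
# Hypothetical predicted dropout probabilities from some fitted model.
probs = [0.05, 0.10, 0.20, 0.15, 0.30, 0.35, 0.40, 0.45, 0.70, 0.90]
candidates = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]

threshold, rate = best_threshold(y_true, probs, candidates)
print(f"best threshold: {threshold}, misclassification rate: {rate}")
```

With unbalanced data, where dropouts are rare, the misclassification rate alone can favour thresholds that rarely predict dropout, so a criterion that weights missed dropouts more heavily may be a better fit in practice.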

Educations: MSc in Business Administration and Mathematical Business Economics, (Graduate Programme) Final Thesis
Publication date: 2021
Number of pages: 110
Supervisors: Anders Rønn-Nielsen