Machine learning in default prediction: the incremental power of machine learning techniques in mortgage default prediction

2019

In this thesis, alternative machine learning techniques are used to test whether they perform better than a Logistic Regression in predicting default on retail mortgages. It is found that the ROC AUC statistic is slightly better for the advanced machine learning techniques, i.e. the Neural Networks, Support Vector Machines and Random Forests. Importantly, all classifiers are trained on the same variables, which are all Weight of Evidence transformed. This enables us to compare the results and attribute the incremental predictive power solely to the classifiers. It also enables us to use the same methodology for probability of default modelling as practitioners currently use, i.e. with Weight of Evidence transformed variables.
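
To make the Weight of Evidence (WoE) transformation concrete, the following is a minimal sketch and not the thesis's own implementation. It assumes the common credit-scoring convention WoE = ln(share of non-defaults in a bin / share of defaults in a bin); the function name, the `smoothing` parameter and the loan-to-value bins are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict


def weight_of_evidence(bins, defaults, smoothing=0.5):
    """Return {bin: WoE} for one binned variable.

    Assumed convention: WoE = ln(share of non-defaults / share of defaults).
    `smoothing` is a hypothetical additive constant that avoids log(0)
    when a bin contains no defaults or no non-defaults.
    """
    good = defaultdict(float)  # non-default counts per bin
    bad = defaultdict(float)   # default counts per bin
    for b, d in zip(bins, defaults):
        if d:
            bad[b] += 1
        else:
            good[b] += 1

    categories = set(good) | set(bad)
    total_good = sum(good.values()) + smoothing * len(categories)
    total_bad = sum(bad.values()) + smoothing * len(categories)

    woe = {}
    for c in categories:
        share_good = (good[c] + smoothing) / total_good
        share_bad = (bad[c] + smoothing) / total_bad
        woe[c] = math.log(share_good / share_bad)
    return woe


# Toy data (not from the thesis): a binned loan-to-value variable.
ltv_bin = ["low"] * 4 + ["high"] * 4
default = [0, 0, 0, 1, 0, 1, 1, 1]
woe_map = weight_of_evidence(ltv_bin, default)

# Each loan's bin label is replaced by the WoE of that bin before modelling.
ltv_woe = [woe_map[b] for b in ltv_bin]
```

Under this convention, a bin with relatively few defaults receives a positive WoE and a bin with relatively many defaults a negative one, so every classifier sees the same monotone numeric encoding of the binned variable.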

The analysis is based on a dataset with observations on each loan issued by a financial services firm in the retail mortgage market in the years 2009-2017. After univariate and multivariate analysis, the number of candidate variables is reduced from 549 to 19.

The best model is the deep Neural Network, with an impressive ROC AUC of 0.902. This is very high for prediction of default. Still, the Logistic Regression model also achieves a very high ROC AUC of 0.882. A more primitive machine learning technique, the Decision Tree, is also included in the analysis. As expected, this classifier has the lowest ROC AUC, at 0.732. The exploratory analysis with WoE variables also reveals interesting relationships, which some readers may enjoy.
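
As a reminder of what the ROC AUC statistic measures, the sketch below computes it from its pairwise interpretation: the probability that a randomly chosen defaulted loan receives a higher score than a randomly chosen non-defaulted loan, with ties counted as one half. The function name and the toy scores are hypothetical and are not results from the thesis; the quadratic loop is chosen for clarity, not for efficiency.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison of defaulters and non-defaulters.

    labels: 1 for default, 0 for non-default.
    scores: predicted default scores, higher meaning riskier.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one default and one non-default")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # defaulter ranked above non-defaulter
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy data (not from the thesis): two hypothetical classifiers scored
# on the same six loans.
labels = [0, 0, 0, 1, 0, 1]
scores_a = [0.1, 0.2, 0.3, 0.4, 0.5, 0.9]  # one defaulter ranked below a non-defaulter
scores_b = [0.1, 0.2, 0.3, 0.6, 0.5, 0.9]  # both defaulters ranked on top

auc_a = roc_auc(labels, scores_a)
auc_b = roc_auc(labels, scores_b)
```

Because the statistic depends only on how the loans are ranked, classifiers trained on the same WoE variables can be compared directly, whatever the scale of their raw scores.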

Keywords – Probability of Default, PD, Mortgage default, Bankruptcy prediction, Weight of Evidence, Basel, IRB, Neural Network, Support Vector Machine, Random Forest, K-Nearest Neighbor, Decision Tree, Logistic Regression, ROC, Confusion Matrix