Machine Learning in Default Prediction: The Incremental Power of Machine Learning Techniques in Mortgage Default Prediction
Abstract
This thesis tests whether alternative machine learning techniques perform better than a
Logistic Regression in predicting default on retail mortgages. It is found that the
ROC AUC statistic is slightly better for the advanced machine learning techniques, i.e. the
Neural Networks, Support Vector Machines and Random Forests. Importantly, all classifiers
are trained on the same variables, which are all Weight of Evidence transformed. This
enables us to compare the results and attribute any incremental predictive power solely to
the classifiers. It also enables us to use the same methodology for probability of
default modelling as practitioners currently use, i.e. with Weight of Evidence transformed
variables.
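
As an illustration of the Weight of Evidence transformation referred to above, the following
is a minimal sketch and not the implementation used in the thesis. It assumes the common
convention WoE = ln(share of non-defaults in a bin / share of defaults in a bin); some
practitioners use the inverse ratio. The function name, the smoothing parameter and the toy
data are hypothetical.

```python
import math
from collections import defaultdict


def weight_of_evidence(bins, defaults, smoothing=0.5):
    """Return {bin label: WoE} for one binned variable.

    Assumed convention: WoE = ln(share of non-defaults / share of defaults).
    `smoothing` is added to every count so that a bin without defaults
    (or without non-defaults) does not produce log(0) or a division by zero.
    """
    goods = defaultdict(float)  # non-defaulted loans per bin
    bads = defaultdict(float)   # defaulted loans per bin
    for label, defaulted in zip(bins, defaults):
        if defaulted:
            bads[label] += 1
        else:
            goods[label] += 1

    labels = set(goods) | set(bads)
    total_good = sum(goods.values()) + smoothing * len(labels)
    total_bad = sum(bads.values()) + smoothing * len(labels)

    woe = {}
    for label in labels:
        good_share = (goods[label] + smoothing) / total_good
        bad_share = (bads[label] + smoothing) / total_bad
        woe[label] = math.log(good_share / bad_share)
    return woe


# Made-up example: a loan-to-value variable already grouped into two bins.
toy_bins = ["low", "low", "low", "high", "high", "high"]
toy_defaults = [0, 0, 1, 1, 1, 0]
toy_woe = weight_of_evidence(toy_bins, toy_defaults, smoothing=0)

# Each observation is then replaced by the WoE of its bin before model fitting.
toy_transformed = [toy_woe[label] for label in toy_bins]
```

Under this convention a positive WoE marks a bin with relatively fewer defaults than the
portfolio as a whole, and a negative WoE marks a riskier bin. Because every classifier is fed
the same transformed inputs, differences in performance can be attributed to the classifiers.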
The analysis is based on a dataset with observations on each loan issued by a financial
services firm in the market for retail mortgages in the years 2009-2017. After univariate and
multivariate analysis, the number of candidate variables is reduced from 549 to 19.
The best model is the deep Neural Network, with a ROC AUC of 0.902, which is very high for
default prediction. Still, the Logistic Regression model also achieves a very high ROC AUC
of 0.882. A more primitive machine learning technique, the Decision Tree, is also included
in the analysis. As expected, this classifier has the lowest ROC AUC, at 0.732.
The exploratory analysis with WoE variables also reveals some noteworthy relationships,
which may interest readers.
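
For readers unfamiliar with the ROC AUC statistic used to rank the classifiers, the sketch
below computes it from its pairwise definition: the probability that a randomly chosen
defaulted loan receives a higher score than a randomly chosen non-defaulted loan, with ties
counted as one half. This is an illustrative sketch under that definition, not the code used
in the thesis; the function name and the toy scores are hypothetical.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison of defaulted (1) and non-defaulted (0) loans."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one default and one non-default")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # defaulted loan ranked above non-defaulted loan
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Made-up predicted default probabilities for four loans.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.10, 0.40, 0.35, 0.80]
toy_auc = roc_auc(toy_labels, toy_scores)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the
scale on which the reported values of 0.902, 0.882 and 0.732 should be read. The double loop
is quadratic in the number of loans, so on a full portfolio a rank-based implementation
would be preferable.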
Keywords – Probability of Default, PD, Mortgage default, Bankruptcy prediction, Weight of
Evidence, Basel, IRB, Neural Network, Support Vector Machine, Random Forest, K-Nearest
Neighbor, Decision Tree, Logistic Regression, ROC, Confusion Matrix