dc.description.abstract | This thesis investigates the potential applicability of machine learning techniques in
predictive modelling on corporate insurance customers. The focus is on the binary
classification of claim occurrences and the prediction of a customer's total claim size. Additionally,
to illustrate practical usage, the respective best performing models were combined in an
experimental setting to predict total expected cost and to identify good customers.
The data set is supplied by Frende Forsikring and consists of aggregated customer data.
The aggregated data summarizes a company's characteristics, total premiums, number of
claims, claim sizes and the policies they hold. Prior to data preprocessing, the data consist
of 26 293 different companies totaling 116 219 observations and 436 variables.
The study is split in two. First, the machine learning techniques CART, Random Forest,
XGBoost and Neural Networks are compared with a benchmark GLM. Second, the thesis
explores the predictive gain of aggregated data by using three input groups: the premium
alone, the initial aggregated data, and the aggregated data with feature-engineered time
variables.
The results show that all machine learning models outperformed GLM when classifying
claim occurrences. Additionally, all models showed an increase in predictive capabilities
when including aggregated data, but little to no gain from including time variables. XGBoost
was the best performing model with an ROC-AUC of 0.8457. Resampling techniques
did not contribute significantly to the performance of any of the models. In terms of
predicting total claim size, no models produced satisfactory results. XGBoost performed
best with an RMSE of 271725. The majority of the models performed best with premium as
the only feature, indicating that aggregated data are not well suited to predicting
total claim size.
Overall, this study shows that machine learning can increase the predictive performance
compared to GLMs. The results also indicate that aggregated data have potential for
predicting claim occurrences and can be used as a supplement in the actuarial
world of risk assessment. | en_US |