The Impact of Machine Learning and Aggregated Data on Corporate Insurance Modelling: An Empirical Study on the Prospective Gains of Machine Learning Techniques Using New Data Sources in the Insurance Industry

2022

This thesis investigates the applicability of machine learning techniques in predictive modelling of corporate insurance customers. The focus is on two tasks: a binary classification of claim occurrence and a prediction of a customer's total claim size. Additionally, to illustrate practical usage, the best-performing models for the two tasks were combined in an experimental setting to predict total expected cost and to identify good customers.
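
The abstract does not state how the two models were combined. As a minimal sketch, assuming a two-part (frequency-severity style) decomposition in which expected cost is the predicted claim probability multiplied by the predicted claim size given a claim, the combination could look like the following. The function names, the `max_loss_ratio` threshold and the definition of a "good customer" are hypothetical and chosen only for illustration.

```python
def expected_cost(p_claim, claim_size_given_claim):
    """Expected total claim cost under an assumed two-part decomposition.

    p_claim: predicted probability of at least one claim (classifier output).
    claim_size_given_claim: predicted total claim size if a claim occurs
    (regressor output). Both inputs are hypothetical model outputs.
    """
    if not 0.0 <= p_claim <= 1.0:
        raise ValueError("p_claim must be a probability in [0, 1]")
    return p_claim * claim_size_given_claim


def is_good_customer(premium, p_claim, claim_size_given_claim, max_loss_ratio=1.0):
    """Flag a customer whose expected cost stays within a share of the premium.

    max_loss_ratio is an illustrative threshold, not a value from the thesis.
    """
    cost = expected_cost(p_claim, claim_size_given_claim)
    return cost <= max_loss_ratio * premium
```

Under this assumption, a customer with a 25 % claim probability and a predicted claim size of 1 000 has an expected cost of 250, and would be flagged as good at any premium of 250 or more.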

The data set is supplied by Frende Forsikring and consists of aggregated customer data. The aggregated data summarize each company's characteristics, total premiums, number of claims, claim sizes and the policies it holds. Prior to preprocessing, the data comprise 26 293 companies, totalling 116 219 observations and 436 variables.

The study is split in two. First, the machine learning techniques CART, Random Forest, XGBoost and Neural Networks are compared with a benchmark GLM. Second, the thesis explores the predictive gain of aggregated data by using three input groups: the premium alone, the initial aggregated data, and the aggregated data with feature-engineered time variables.

The results show that all machine learning models outperformed the GLM when classifying claim occurrences. Additionally, all models showed an increase in predictive capability when aggregated data were included, but little to no gain from the time variables. XGBoost was the best-performing model with an ROC-AUC of 0.8457. Resampling techniques did not contribute significantly to the performance of any of the models. In terms of predicting total claim size, no model produced satisfactory results. XGBoost performed best with an RMSE of 271 725. The majority of the models performed best with premium as the only feature, indicating that aggregated data are not well suited for predicting this response.
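
For readers unfamiliar with the two evaluation metrics, the following is a small illustrative sketch of how ROC-AUC and RMSE are defined. It uses only the Python standard library and is not the thesis's evaluation code; the function names and the toy inputs are assumptions made for the example.

```python
import math


def roc_auc(labels, scores):
    """ROC-AUC as the probability that a random positive outranks a random negative.

    labels: 0/1 claim indicators; scores: predicted claim probabilities.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def rmse(actual, predicted):
    """Root mean squared error between observed and predicted claim sizes."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - b) ** 2 for a, b in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))
```

An ROC-AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, while RMSE is expressed in the same unit as the claim sizes, so its magnitude has to be judged against the scale of the response.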

Overall, this study shows that machine learning can increase predictive performance compared to GLMs. The results also indicate that aggregated data have potential for predicting claim occurrences and can be used as a supplement in actuarial risk assessment.