The Impact of Machine Learning and Aggregated Data on Corporate Insurance Modelling: An Empirical Study on the Prospective Gains of Machine Learning Techniques Using New Data Sources In the Insurance Industry
- Master Thesis 
This thesis investigates the applicability of machine learning techniques in predictive modelling of corporate insurance customers. The focus is on two prediction tasks: a binary classification of claim occurrence and a customer's total claim size. Additionally, to illustrate practical usage, the best performing model for each task was combined in an experimental setting to predict total expected cost and to identify good customers. The data set is supplied by Frende Forsikring and consists of aggregated customer data. The aggregated data summarizes a company's characteristics, total premiums, number of claims, claim sizes and the policies it holds. Prior to preprocessing, the data covers 26 293 different companies, totaling 116 219 observations and 436 variables. The study is split in two. First, the machine learning techniques CART, Random Forest, XGBoost and Neural Networks are compared with a benchmark GLM. Second, the thesis explores the predictive gain of aggregated data by using three input groups: the premium alone, the initial aggregated data, and the aggregated data with feature-engineered time variables. The results show that all machine learning models outperformed the GLM when classifying claim occurrences. Additionally, all models showed an increase in predictive capability when aggregated data was included, but little to no gain from the time variables. XGBoost was the best performing model with an ROC-AUC of 0.8457. Resampling techniques did not contribute significantly to the performance of any of the models. In terms of predicting total claim size, no model produced satisfactory results. XGBoost performed best with an RMSE of 271725. The majority of the models performed best with premium as the only feature, indicating that aggregated data is not well suited to predicting total claim size. Overall, this study shows that machine learning can increase predictive performance compared to GLMs. The results also indicate that aggregated data has potential for predicting claim occurrences and can be used as a supplement in actuarial risk assessment.
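The abstract does not state how the classification and claim size models were combined. As a minimal sketch, assuming the common frequency-severity decomposition (expected cost = probability of a claim times predicted claim size given a claim), the combination could look like the following. The customer records, field names, and the rule that a "good" customer is one whose premium exceeds expected cost are all hypothetical and are not taken from the thesis.

```python
def expected_cost(claim_probability, predicted_claim_size):
    """Expected total cost under the assumed frequency-severity split."""
    return claim_probability * predicted_claim_size


def rank_customers(customers):
    """Return (name, expected margin) pairs, highest margin first.

    Expected margin is premium minus expected cost. This definition of a
    good customer is an illustrative assumption, not the thesis's own.
    """
    scored = []
    for c in customers:
        cost = expected_cost(c["claim_probability"], c["predicted_claim_size"])
        scored.append((c["name"], c["premium"] - cost))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical model outputs for three companies (made-up numbers).
customers = [
    {"name": "A", "premium": 50000.0, "claim_probability": 0.25,
     "predicted_claim_size": 100000.0},
    {"name": "B", "premium": 30000.0, "claim_probability": 0.5,
     "predicted_claim_size": 80000.0},
    {"name": "C", "premium": 20000.0, "claim_probability": 0.125,
     "predicted_claim_size": 40000.0},
]

ranking = rank_customers(customers)
good_customers = [name for name, margin in ranking if margin > 0]
```

In practice `claim_probability` would come from the classifier and `predicted_claim_size` from the regression model. Since the abstract reports that no claim size model was satisfactory, a ranking of this kind would carry that uncertainty with it.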