Predictive modelling of customer claims across multiple insurance policies : an empirical study of how individual customer insurance data can be used to assess customer risk across multiple insurance products by employing machine learning and advanced ensemble techniques
- Master Thesis 
In this master thesis, we have analysed how individual insurance customer data can be used to assess customer risk across multiple insurance policies. Our dataset, provided by Frende Forsikring, contains 63 variables describing the characteristics of each customer and five associated response variables. We have modelled the responses for claim propensity, claim frequency, and total claim size for each customer. To evaluate the value of this customer data, we have used multiple machine learning algorithms: XGBoost, LightGBM, random forest, GLM, and deep neural networks. We have also used different ensemble techniques to gain further performance improvements from these models. Comparing the results achieved using the customer insurance premium as the only explanatory variable with the results achieved using all the additional customer characteristics, we observed a considerable increase in predictive performance. Our findings show that gradient boosting techniques can increase performance compared to generalized linear models. We also observed that stacked ensembles built on multiple underlying models can increase performance compared to any single model when assessing customer claim propensity and frequency. For total claim size, however, we found a strong case for the use of generalized linear models. Our thesis proposes a novel three-step ensemble model that uses claim propensity and claim frequency to determine the total claim size of a customer, which may improve the performance of total claim predictions. Overall, our results show promise in using individual customer data to supplement the traditional individual policy risk assessments. The results also underline the potential of advanced ensembles to increase predictive performance on individual customer data.
The results accentuate the importance of selecting the appropriate models and suitable error metrics to achieve good predictive performance across different response variables. Our findings illustrate the transparency issues associated with using highly flexible statistical learning tools when compared to generalized linear models.
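The abstract does not spell out how the three steps of the proposed ensemble are combined. As a rough illustration only, the sketch below assumes one common decomposition: expected total claim size = claim propensity × expected number of claims given at least one claim × average cost per claim. The segment names, toy records, and function names are hypothetical, and simple per-segment averages stand in for the fitted models (XGBoost, LightGBM, GLM, and so on) that the thesis actually uses.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records: (segment, number_of_claims, total_claim_amount).
# These values are invented for illustration and are not from the thesis data.
RECORDS = [
    ("young", 0, 0.0),
    ("young", 2, 3000.0),
    ("young", 1, 1000.0),
    ("young", 0, 0.0),
    ("senior", 0, 0.0),
    ("senior", 0, 0.0),
    ("senior", 0, 0.0),
    ("senior", 1, 2000.0),
]


def fit_three_step(records):
    """Estimate the three components per segment with plain averages."""
    by_segment = defaultdict(list)
    for segment, n_claims, total in records:
        by_segment[segment].append((n_claims, total))

    model = {}
    for segment, rows in by_segment.items():
        claimants = [(n, t) for n, t in rows if n > 0]
        # Step 1: propensity = share of customers with at least one claim.
        propensity = len(claimants) / len(rows)
        if claimants:
            # Step 2: frequency = mean claim count among claimants.
            frequency = mean(n for n, _ in claimants)
            # Step 3: severity = average cost per individual claim.
            severity = sum(t for _, t in claimants) / sum(n for n, _ in claimants)
        else:
            frequency = 0.0
            severity = 0.0
        model[segment] = {
            "propensity": propensity,
            "frequency": frequency,
            "severity": severity,
        }
    return model


def predict_total(model, segment):
    """Chain the three estimates into an expected total claim size."""
    parts = model[segment]
    return parts["propensity"] * parts["frequency"] * parts["severity"]


if __name__ == "__main__":
    fitted = fit_three_step(RECORDS)
    for name in fitted:
        print(name, round(predict_total(fitted, name), 2))
```

With per-segment averages, the chained estimate reduces to the mean total claim per customer in that segment, which makes the decomposition easy to sanity-check. In a real setting each step would be replaced by a fitted model, and the chained prediction would no longer collapse to a simple average.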