Predictive modelling of customer claims across multiple insurance policies: an empirical study of how individual customer insurance data can be used to assess customer risk across multiple insurance products by employing machine learning and advanced ensemble techniques
Abstract
In this master's thesis, we have analysed how individual insurance customer data can be used to
assess customer risk across multiple insurance policies. Our dataset, provided by Frende
Forsikring, contains 63 variables describing the characteristics of each customer and five
associated response variables. We have modelled the responses for claim propensity, claim frequency,
and total claim size for each customer. To assess the value of this customer data, we have
used multiple machine learning algorithms: XGBoost, LightGBM, random forest, generalized
linear models (GLMs) and deep neural networks. We have also used different ensemble techniques to
gain further performance improvements from these models.
By comparing results achieved using customer insurance premium as the only explanatory
variable with the results achieved using all the additional customer characteristics, we
observed a considerable increase in predictive performance. Our findings show that gradient
boosting techniques can increase performance compared to generalized linear models. We
also observed that using multiple models in ensembles can increase performance compared to
any single model when assessing customer claim propensity and frequency. Although we
found stacked ensembles using multiple underlying models to provide increased performance
when used on claim propensity and frequency, we found a strong case for the use of
generalized linear models when modelling total claim size. Our thesis proposes a novel
three-step ensemble model that uses claim propensity and claim frequency to determine the total
claim size of a customer, which may improve performance of total claim predictions.
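The abstract does not spell out how the three steps are connected. One plausible reading, sketched here purely as an assumption, is a stacked pipeline: a propensity model and a frequency model are fitted first, and their predictions become the inputs of a total claim size model. The function names (`fit_glm`, `fit_three_step`, `predict_three_step`), the choice of simple GLM-style learners for every step, and the synthetic data are all hypothetical and are not taken from the thesis or from the Frende Forsikring dataset.

```python
import numpy as np

def add_intercept(X):
    """Prepend a column of ones so each model has an intercept."""
    return np.column_stack([np.ones(len(X)), X])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def safe_exp(z):
    """Exponential with a clipped argument to avoid overflow."""
    return np.exp(np.clip(z, -20.0, 20.0))

def fit_glm(X, y, inverse_link, steps=2000, lr=0.05):
    """Fit a canonical-link GLM (logistic or Poisson) by gradient descent."""
    A = add_intercept(X)
    w = np.zeros(A.shape[1])
    for _ in range(steps):
        # Gradient of the mean negative log-likelihood for a canonical link.
        w -= lr * A.T @ (inverse_link(A @ w) - y) / len(y)
    return w

def predict_glm(w, X, inverse_link):
    return inverse_link(add_intercept(X) @ w)

def stack_features(w_prop, w_freq, X):
    """Step 1 and step 2 predictions, used as inputs to step 3."""
    return np.column_stack([predict_glm(w_prop, X, sigmoid),
                            predict_glm(w_freq, X, safe_exp)])

def fit_three_step(X, n_claims, total_claim):
    # Step 1 (assumed): claim propensity, i.e. P(at least one claim).
    w_prop = fit_glm(X, (n_claims > 0).astype(float), sigmoid)
    # Step 2 (assumed): claim frequency, Poisson regression with log link.
    w_freq = fit_glm(X, n_claims.astype(float), safe_exp)
    # Step 3 (assumed): linear model for total claim size on the two predictions.
    Z = add_intercept(stack_features(w_prop, w_freq, X))
    w_size = np.linalg.lstsq(Z, total_claim, rcond=None)[0]
    return w_prop, w_freq, w_size

def predict_three_step(model, X):
    w_prop, w_freq, w_size = model
    return add_intercept(stack_features(w_prop, w_freq, X)) @ w_size

# Synthetic illustration only; these numbers have no connection to the thesis data.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
n_claims = rng.poisson(np.exp(-1.5 + 0.5 * X[:, 0]))
total_claim = n_claims * rng.gamma(2.0, 500.0, size=500)

model = fit_three_step(X, n_claims, total_claim)
pred = predict_three_step(model, X)
```

For brevity, this sketch fits the third step on in-sample predictions from the first two. A careful stacked ensemble would normally use out-of-fold predictions to limit leakage, and any of the three steps could be replaced by a gradient boosting model or a neural network.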
Overall, our results show promise in using individual customer data to supplement the
traditional individual policy risk assessments. The results also underline the potential of
advanced ensembles to increase predictive performance on individual customer data. They
further highlight the importance of selecting appropriate models and suitable error
metrics to achieve good predictive performance across different response variables. Our
findings illustrate the transparency issues associated with using highly flexible statistical
learning tools when compared to generalized linear models.