Method for Fusing Predictor Levels with Application to Insurance Data

2018

The aim of this thesis is to determine whether the prediction accuracy of a model can be improved by using a data-driven method to bin continuous variables and group the levels of categorical variables. We perform our analysis on data on the policyholders of one of Gjensidige's insurance products, and specifically aim to improve Gjensidige's Poisson regression model for predicting claim frequency, in which the predictors are currently binned and grouped manually.

We analyze the effect of a regularization framework that combines the Lasso with generalizations of the Lasso adapted to nominal and ordinal predictors. These generalizations penalize the coefficients and the differences between them, effectively selecting and fusing predictor levels. We estimate a penalized Poisson regression model by optimizing the resulting objective function in R with the newly developed smurf package (Reynkens, Devriendt & Antonio, 2018).
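
The objective function referred to above can be sketched as follows. The notation is an assumption made here for illustration and is not taken from the thesis: n observations, J predictors, coefficient vector \beta_j for the levels of predictor j, non-negative penalty weights w, a single tuning parameter \lambda, and no exposure offset. The two difference penalties are the ones commonly called the Fused Lasso (ordinal predictors, adjacent levels) and the Generalized Fused Lasso (nominal predictors, all pairs of levels).

```latex
% Penalized Poisson regression (sketch; notation assumed, not from the thesis)
\[
  \hat{\beta} \;=\; \arg\min_{\beta}\; -\frac{1}{n}\,\ell(\beta)
  \;+\; \lambda \sum_{j=1}^{J} P_j(\beta_j),
  \qquad
  \ell(\beta) \;=\; \sum_{k=1}^{n}\Bigl[\, y_k\, x_k^{\top}\beta - \exp\bigl(x_k^{\top}\beta\bigr) - \log(y_k!) \Bigr]
\]

% Penalty choices, one per predictor type
\begin{align*}
  \text{Lasso (selection):} \quad
    & P_j(\beta_j) = \sum_{i} w_{j,i}\,\lvert \beta_{j,i} \rvert \\
  \text{Fused Lasso (ordinal):} \quad
    & P_j(\beta_j) = \sum_{i} w_{j,i}\,\lvert \beta_{j,i+1} - \beta_{j,i} \rvert \\
  \text{Generalized Fused Lasso (nominal):} \quad
    & P_j(\beta_j) = \sum_{i > l} w_{j,il}\,\lvert \beta_{j,i} - \beta_{j,l} \rvert
\end{align*}
```

Because the absolute value is not differentiable at zero, a sufficiently large \lambda sets some coefficients or coefficient differences exactly to zero. A zero coefficient removes a level from the model, and a zero difference gives two levels the same coefficient, which fuses them.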

Because the penalties shrink the coefficient estimates, we then reestimate an unpenalized Poisson regression model with the selected and fused predictor levels as input in order to reduce the bias of the estimates. The resulting model is compared with the model Gjensidige currently uses for predicting claim frequency, to determine the effect of the data-driven approach. We validate the prediction models using mean squared error (MSE) and the Akaike information criterion (AIC) as performance measures and find that our reestimated model predicts slightly more accurately while also using fewer parameters. We conclude that regularization can be used as a data-driven method of binning and grouping predictor levels to improve prediction accuracy.
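
The reestimation and comparison step can be illustrated with a minimal Python sketch. It is not the thesis code and does not use the smurf package: the toy claim counts, the level names and the helper functions are all hypothetical, and the example relies on the fact that an unpenalized Poisson model with a single categorical predictor fits each level by its sample mean. Levels "B" and "C" are assumed to have been fused by the penalized fit, and both measures are computed in-sample.

```python
import math

def fit_group_means(y, groups):
    """Unpenalized Poisson fit with one categorical predictor:
    the fitted mean of each level equals the sample mean of that level."""
    totals, counts = {}, {}
    for yi, g in zip(y, groups):
        totals[g] = totals.get(g, 0.0) + yi
        counts[g] = counts.get(g, 0) + 1
    return {g: totals[g] / counts[g] for g in totals}

def poisson_log_likelihood(y, mu):
    """Poisson log-likelihood of counts y under fitted means mu (all mu > 0)."""
    return sum(yi * math.log(mi) - mi - math.lgamma(yi + 1)
               for yi, mi in zip(y, mu))

def aic(y, mu, n_params):
    """Akaike information criterion: 2k - 2 * log-likelihood."""
    return 2 * n_params - 2 * poisson_log_likelihood(y, mu)

def mse(y, mu):
    """Mean squared error between observed counts and fitted means."""
    return sum((yi - mi) ** 2 for yi, mi in zip(y, mu)) / len(y)

# Hypothetical toy data: claim counts for policyholders in four ordinal levels.
claims = [0, 0, 1, 0,  1, 0, 1, 0,  0, 1, 0, 1,  2, 1, 1, 2]
levels = ["A"] * 4 + ["B"] * 4 + ["C"] * 4 + ["D"] * 4

# Assumed outcome of the penalized fit: levels B and C are fused into "BC".
fusion = {"A": "A", "B": "BC", "C": "BC", "D": "D"}
fused_levels = [fusion[g] for g in levels]

# Reestimate on the original and on the fused levels.
means_full = fit_group_means(claims, levels)
means_fused = fit_group_means(claims, fused_levels)
mu_full = [means_full[g] for g in levels]
mu_fused = [means_fused[g] for g in fused_levels]

# Compare the two models on MSE and AIC (one parameter per level).
mse_full, mse_fused = mse(claims, mu_full), mse(claims, mu_fused)
aic_full = aic(claims, mu_full, n_params=len(means_full))
aic_fused = aic(claims, mu_fused, n_params=len(means_fused))

print(f"full : {len(means_full)} params, MSE={mse_full:.4f}, AIC={aic_full:.4f}")
print(f"fused: {len(means_fused)} params, MSE={mse_fused:.4f}, AIC={aic_fused:.4f}")
```

In this toy data the fused levels have identical sample means, so fusing them leaves the fitted values unchanged and the AIC falls only because one parameter is dropped. On real data the fit generally changes as well, so the comparison can go either way.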