Method for Fusing Predictor Levels with Application to Insurance Data
Abstract
The aim of this thesis is to determine whether the prediction accuracy of a model can be
improved by using a data-driven method to bin continuous variables and group the levels of
categorical variables. We use data on the policyholders of one of Gjensidige's insurance
products to perform our analysis, and specifically aim to improve Gjensidige's Poisson
regression model for predicting claim frequency, in which the predictors are currently
binned and grouped manually.
We analyze the effect of using a regularization framework that combines the Lasso with
generalizations of it that have been adapted to nominal and ordinal predictors.
These generalizations penalize the coefficients and the differences between them, effectively
fusing and selecting predictor levels. By optimizing the resulting objective function in R
using the newly developed smurf package (Reynkens, Devriendt & Antonio, 2018), we
estimate a penalized Poisson regression model.
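
As a rough sketch of the kind of objective function meant here, the penalized model can be written as a negative Poisson log-likelihood plus one penalty per predictor. The notation below (the symbols f, g_j, lambda, the weights w, and the index conventions) is our assumption for illustration and is not taken from the thesis itself; the three penalties are the standard Lasso, Fused Lasso (ordinal predictors), and Generalized Fused Lasso (nominal predictors) forms.

```latex
% Illustrative notation only: f is the scaled negative Poisson log-likelihood,
% \beta_j collects the level coefficients of predictor j, \lambda \ge 0 is the
% tuning parameter, and the w's are non-negative penalty weights.
\begin{align*}
  \hat{\beta} &= \arg\min_{\beta}\; f(\beta; X, y) + \lambda \sum_{j} g_j(\beta_j), \\
  g_{\text{Lasso}}(\beta_j) &= \sum_{i} w_{j,i}\, \lvert \beta_{j,i} \rvert, \\
  g_{\text{fLasso}}(\beta_j) &= \sum_{i \ge 2} w_{j,i}\, \lvert \beta_{j,i} - \beta_{j,i-1} \rvert, \\
  g_{\text{gfLasso}}(\beta_j) &= \sum_{i > l} w_{j,il}\, \lvert \beta_{j,i} - \beta_{j,l} \rvert .
\end{align*}
```

Under this formulation, a coefficient shrunk to zero removes a level from the model, and a difference shrunk to zero fuses two levels into one.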
To reduce the bias of the penalized estimates, we then re-estimate a Poisson regression
model using the selected and fused predictor levels as input. The resulting model is compared with the
model Gjensidige currently uses for predicting claim frequency, to determine the effect of
using the data-driven approach. We evaluate the prediction models using mean squared error (MSE)
and the Akaike information criterion (AIC), and find that our re-estimated model performs
slightly better in terms of prediction accuracy while also using fewer parameters.
We conclude that regularization can be used as a data-driven
method of binning and grouping predictor levels to improve prediction accuracy.
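
To illustrate why fusing levels can lower AIC, the following minimal Python sketch compares a Poisson model with one coefficient per level against a model in which two levels have been fused. It is not the thesis's implementation: the data are invented toy counts, the fusion is chosen by hand rather than by a penalty, and all function and variable names are hypothetical. It relies on the fact that, for a Poisson model with a single categorical predictor, the maximum likelihood fitted mean of each level is that level's sample mean.

```python
import math
from collections import defaultdict


def fit_level_means(levels, counts):
    """MLE fitted means of a Poisson model with one categorical predictor."""
    totals, sizes = defaultdict(float), defaultdict(int)
    for level, y in zip(levels, counts):
        totals[level] += y
        sizes[level] += 1
    return {level: totals[level] / sizes[level] for level in totals}


def poisson_loglik(levels, counts, means):
    """Poisson log-likelihood of the counts under the given per-level means."""
    return sum(
        y * math.log(means[level]) - means[level] - math.lgamma(y + 1)
        for level, y in zip(levels, counts)
    )


def aic(loglik, n_params):
    """Akaike information criterion: 2k - 2 * log-likelihood."""
    return 2 * n_params - 2 * loglik


# Invented toy data: four predictor levels with four policyholders each.
levels = ["a"] * 4 + ["b"] * 4 + ["c"] * 4 + ["d"] * 4
counts = [1, 0, 2, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0]

# Hand-picked fusion of levels "b" and "c" (a penalty would choose this from data).
fusion = {"b": "b+c", "c": "b+c"}
fused_levels = [fusion.get(level, level) for level in levels]

full_means = fit_level_means(levels, counts)
fused_means = fit_level_means(fused_levels, counts)

aic_full = aic(poisson_loglik(levels, counts, full_means), len(full_means))
aic_fused = aic(poisson_loglik(fused_levels, counts, fused_means), len(fused_means))

print(f"parameters: {len(full_means)} -> {len(fused_means)}")
print(f"AIC: {aic_full:.3f} -> {aic_fused:.3f}")
```

In this toy case the fused model loses a little log-likelihood but saves one parameter, so its AIC is lower. Whether the same holds on real data depends on how similar the fused levels actually are.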