Unsupervised machine learning on tax returns: investigating unsupervised and semisupervised machine learning methods to uncover anomalous faulty tax returns
Abstract
In this thesis we investigate whether unsupervised and semisupervised machine learning
methods can be applied to detect undiscovered erroneous tax returns, and how the
properties of the underlying data affect method performance. To do this we test the
two fully unsupervised clustering algorithms K-means and DBSCAN, as well as the two
semisupervised approaches One-Class Support Vector Machines and autoencoders. We
use a sample of real anonymous tax returns, and evaluate model performance in situations
where erroneous returns constitute a small percentage of the dataset.
Model performance suggests that our methods are not suited to serve as stand-alone
solutions for identifying faulty returns, with relatively low F1-scores between 0.1 and
0.15. Considering the resources needed to manually review a submitted tax return, relying
on these methods alone would likely not be economically feasible. The underwhelming performance is especially
clear when compared to a supervised boosted trees benchmark. However, a supervised
approach would most likely not be able to detect undiscovered errors on its own.
To further study the behaviour of the less supervised methods, we simulate new tax returns based
on the original sample, where the differences between normal and faulty tax returns are
exaggerated. We find that this improves model performance, but the most exaggerated
differences would perhaps not occur in real life. The largest improvement did, however,
stem from changes to the distribution of the tax return features, and this property might
be more closely linked to what can be found in the data population.
If another data sample with these traits exists in the Tax Administration's database, these
methods would be promising. Even if that is not the case, the possibility of using the
methods in combination with other approaches to uncover new errors is in itself worth
researching further.
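
To give a sense of what an F1-score in the reported range of 0.1 to 0.15 means in practice, the following minimal sketch computes precision, recall and F1 from confusion counts. The counts are hypothetical and chosen only for illustration; they are not results from the thesis, and the function name `f1_score` is our own.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return the F1-score (harmonic mean of precision and recall).

    Uses the equivalent count form 2*TP / (2*TP + FP + FN) and returns 0.0
    when the score is undefined (no positives predicted or present).
    """
    denominator = 2 * true_positives + false_positives + false_negatives
    if denominator == 0:
        return 0.0
    return 2 * true_positives / denominator


# Hypothetical example (NOT data from the thesis): 50 faulty returns exist,
# the model flags 200 returns, and 15 of the flagged returns are truly faulty.
tp, fp, fn = 15, 185, 35

precision = tp / (tp + fp)  # share of flagged returns that are actually faulty
recall = tp / (tp + fn)     # share of faulty returns that were flagged
f1 = f1_score(tp, fp, fn)

print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

With these assumed counts the F1-score is about 0.12: more than nine out of ten flagged returns would be reviewed manually without finding an error, which illustrates why scores in this range make a stand-alone solution costly.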