Unsupervised machine learning on tax returns: investigating unsupervised and semisupervised machine learning methods to uncover anomalous faulty tax returns

2020

In this thesis we investigate whether unsupervised and semisupervised machine learning methods can be applied to detect undiscovered erroneous tax returns, and how the properties of the underlying data affect method performance. To do this we test the two fully unsupervised clustering algorithms K-means and DBSCAN, as well as the two semisupervised approaches One-Class Support Vector Machines and autoencoders. We use a sample of real anonymised tax returns, and evaluate model performance in situations where erroneous returns constitute a minor percentage of the dataset.

Model performance suggests that our methods are not suited to serve as stand-alone solutions for identifying faulty returns, with relatively low F1-scores between 0.1 and 0.15. Considering the resources needed to manually control a submitted tax return, deploying them on their own would likely not be economically feasible. The underwhelming performance is especially clear when compared to a supervised boosted trees benchmark. However, a supervised approach would most likely not be able to detect undiscovered errors on its own.
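
To make the reported F1 range concrete, the following minimal sketch computes the F1-score from confusion-matrix counts. The counts are hypothetical and chosen only for illustration; they are not results from the thesis, and the function name `f1_score` is our own.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return the F1-score, the harmonic mean of precision and recall.

    tp: faulty returns correctly flagged
    fp: normal returns incorrectly flagged
    fn: faulty returns the model missed
    """
    if tp == 0:
        return 0.0  # no correct detections: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical scenario (an assumption, not thesis data):
# 50 faulty returns exist; the model flags 300 returns, 25 of them faulty.
tp, fp, fn = 25, 275, 25
example_precision = tp / (tp + fp)   # share of flagged returns that are faulty
example_recall = tp / (tp + fn)      # share of faulty returns that are flagged
example_f1 = f1_score(tp, fp, fn)

print(f"precision={example_precision:.3f} "
      f"recall={example_recall:.3f} F1={example_f1:.3f}")
```

In this hypothetical scenario the F1-score falls inside the 0.1 to 0.15 range, while roughly eleven out of every twelve flagged returns would be normal. This illustrates why manual control of every flagged return could be costly at such scores.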

To further study the behaviour of the less supervised methods, we simulate new tax returns based on the original sample, where the differences between normal and faulty tax returns are exaggerated. We find that this improves model performance, but the most exaggerated differences would perhaps not occur in real life. The largest improvement did, however, stem from changes to the distribution of the tax return features, and this property might be more closely linked to what can be found in the data population.

If another data sample with these traits exists in the Tax Administration's database, these methods would be promising. Even if that is not the case, the possibility of utilizing the methods in combination with other approaches to uncover new errors is by itself worth researching further.