Unsupervised machine learning on tax returns : investigating unsupervised and semisupervised machine learning methods to uncover anomalous faulty tax returns
- Master Thesis 
In this thesis we investigate whether unsupervised and semisupervised machine learning methods can be applied to detect undiscovered erroneous tax returns, and how the properties of the underlying data affect method performance. To do this we test two fully unsupervised clustering algorithms, K-means and DBSCAN, as well as two semisupervised approaches, One-Class Support Vector Machines and autoencoders. We use a sample of real anonymous tax returns, and evaluate model performance in situations where erroneous returns constitute a minor percentage of the dataset. Model performance suggests that our methods are not suited to serve as stand-alone solutions for identifying faulty returns, with relatively low F1-scores between 0.1 and 0.15. Considering the resources needed to manually review a submitted tax return, relying on these methods alone would likely not be economically feasible. The underwhelming performance is especially clear when compared to a supervised boosted trees benchmark. However, a supervised approach would most likely not be able to detect undiscovered errors on its own. To further study the behaviour of the less supervised methods, we simulate new tax returns based on the original sample, where the differences between normal and faulty tax returns are exaggerated. We find that this improves model performance, but the most exaggerated differences would perhaps not occur in real life. The largest improvement did, however, stem from changes to the distribution of the tax return features, and this property might be more closely linked to what can be found in the data population. If another data sample with these traits exists in the Tax Administration's database, these methods would be promising. Even if that is not the case, the possibilities of utilizing the methods in combination with other approaches to uncover new errors are by themselves worth researching further.
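To give a sense of what an F1-score in the reported range means when faulty returns are rare, the following is a minimal sketch in Python. The counts are hypothetical and chosen only for illustration; they are not taken from the thesis, and the helper `f1_score` is written here for the example rather than being the evaluation code the authors used.

```python
def f1_score(true_labels, predicted_labels):
    """Compute the F1-score for binary labels, where 1 marks a faulty return."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # faulty and flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # normal but flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # faulty but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical sample: 1000 returns, of which 50 (5%) are faulty.
true_labels = [1] * 50 + [0] * 950

# Hypothetical detector: flags 300 returns, catching 25 of the 50 faulty ones.
predicted_labels = [1] * 25 + [0] * 25 + [1] * 275 + [0] * 675

score = f1_score(true_labels, predicted_labels)
print(round(score, 3))  # 0.143
```

In this illustration the detector finds half of the faulty returns, yet only 25 of the 300 flagged returns are actually faulty, so most manual review effort would be spent on normal returns.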