Unsupervised machine learning on tax returns : investigating unsupervised and semisupervised machine learning methods to uncover anomalous faulty tax returns

Gedde, Nora; Sandvik, Ida-Sofie

dc.contributor.advisor	Andersson, Jonas
dc.contributor.author	Gedde, Nora
dc.contributor.author	Sandvik, Ida-Sofie
dc.date.accessioned	2020-09-25T10:42:43Z
dc.date.available	2020-09-25T10:42:43Z
dc.date.issued	2020
dc.identifier.uri	https://hdl.handle.net/11250/2679665
dc.description.abstract	In this thesis we investigate whether unsupervised and semisupervised machine learning methods can be applied to detect undiscovered erroneous tax returns, and how the properties of the underlying data affect method performance. To do this we test the two fully unsupervised clustering algorithms K-means and DBSCAN, as well as the two semisupervised approaches One-Class Support Vector Machines and autoencoders. We use a sample of real anonymous tax returns, and evaluate model performance in situations where erroneous returns constitute a small percentage of the dataset. Model performance suggests that our methods are not suited to serve as stand-alone solutions for identifying faulty returns, with relatively low F1-scores between 0.1 and 0.15. Considering the resources needed to manually check a submitted tax return, this would likely not be economically feasible. The underwhelming performance is especially clear when compared to a supervised boosted trees benchmark. However, a supervised approach would most likely not be able to detect undiscovered errors on its own. To further study the behaviour of the less supervised methods, we simulate new tax returns based on the original sample, where the differences between normal and faulty tax returns are exaggerated. We find that this improves model performance, but the most exaggerated differences would perhaps not occur in real life. The largest improvement did, however, stem from changes to the distribution of the tax return features, and this property might be more closely linked to what can be found in the data population. If another data sample with these traits exists in the Tax Administration's database, these methods would be promising. Even if that is not the case, the possibility of using the methods in combination with other approaches to uncover new errors is by itself worth researching further.	en_US
dc.language.iso	eng	en_US
dc.subject	business analytics	en_US
dc.title	Unsupervised machine learning on tax returns : investigating unsupervised and semisupervised machine learning methods to uncover anomalous faulty tax returns	en_US
dc.type	Master thesis	en_US
dc.description.localcode	nhhmas	en_US

Associated file(s)

Filename: masterthesis.pdf
Size: 1.263 MB
Format: PDF

This item appears in the following collection(s)

Master Thesis [4372]
