Balancing Data Protection and Model Accuracy: An Investigation of Protection Methods on Machine Learning Model Performance for a Bank Marketing Dataset
Abstract
The practice of sharing customer data among companies for marketing purposes is becoming
increasingly common. However, sharing customer-level data poses potential risks and serious
problems for businesses, such as substantial declines in brand value, erosion of customer trust,
loss of competitive advantage, and the imposition of legal penalties (Schneider et al. 2017).
These harms may eventually lead to financial loss and reputational damage. With the
growing awareness of the value of personal information, more companies and customers are
concerned about protecting data privacy.
In this paper, we used marketing data from a Portuguese bank to explore how to balance
prediction accuracy against customer data privacy using several machine learning and data
privacy techniques. The dataset includes observations from 45,211 respondents, collected
between May 2008 and November 2010. Our goal is to find a method that enables third
parties to share data with the bank while safeguarding customer privacy and maintaining
accuracy in predicting customer behaviour.
We tested three machine learning models (Logistic Regression, Random Forest, and a feedforward
Neural Network) on the original data and selected Random Forest, which gave the best
predictive performance, for further analysis. After applying two data privacy methods
(Sampling and Random Noise) to the original data, we found that the Random Forest model
achieved accuracy very close to the accuracy obtained on the unprotected data. This
demonstrates a way for companies to protect customer data privacy with little loss of
predictive accuracy. The results have significant implications for companies that seek to
share customer data while maintaining high levels of privacy and accuracy.
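To make the two protection methods concrete, the following is a minimal sketch of Sampling and Random Noise applied to a toy set of customer records. It is an illustration under stated assumptions, not the procedure used in this study: the abstract does not specify the sampling rate, the noise distribution, or which fields are perturbed, so the 50% fraction, the zero-mean Gaussian noise, the scale of 50.0, and the field names `age`, `balance`, and `subscribed` are all hypothetical choices.

```python
import random


def protect_by_sampling(rows, fraction, seed=0):
    """Release only a random subset of the customer records."""
    rng = random.Random(seed)
    k = max(1, round(len(rows) * fraction))
    return rng.sample(rows, k)


def protect_by_noise(rows, numeric_keys, scale, seed=0):
    """Add zero-mean Gaussian noise to the chosen numeric fields.

    The Gaussian distribution is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    protected = []
    for row in rows:
        noisy = dict(row)  # copy so the original record is left untouched
        for key in numeric_keys:
            noisy[key] = row[key] + rng.gauss(0.0, scale)
        protected.append(noisy)
    return protected


# Toy records; the field names are illustrative assumptions only.
customers = [
    {"age": 30 + i, "balance": 100.0 * i, "subscribed": i % 2}
    for i in range(10)
]

sampled = protect_by_sampling(customers, fraction=0.5)
noisy = protect_by_noise(customers, numeric_keys=["balance"], scale=50.0)
```

In a comparison of the kind described above, the same classifier would be trained once on the original records and once on each protected version, and the resulting accuracies compared on a common held-out test set.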