Balancing Data Protection and Model Accuracy: An Investigation of Protection Methods on Machine Learning Model Performance for a Bank Marketing Dataset

2023

The practice of sharing customer data among companies for marketing purposes is becoming increasingly common. However, sharing customer-level data poses serious risks for businesses, such as substantial declines in brand value, erosion of customer trust, loss of competitive advantage, and the imposition of legal penalties (Schneider et al. 2017). These may eventually lead to financial loss and reputational damage for the companies. With growing awareness of the value of personal information, more companies and customers are concerned about protecting data privacy.

In this paper, we use marketing data from a Portuguese bank to explore how to balance prediction accuracy and customer data privacy using various machine learning and data privacy techniques. The dataset includes observations from 45,211 respondents, collected from May 2008 to November 2010. Our goal is to find a method that enables third parties to share data with the bank while safeguarding customer privacy and maintaining accuracy in predicting customer behaviour.

We tested three machine learning models (Logistic Regression, Random Forest, and a feedforward Neural Network) on the original data and selected Random Forest, which gave the best prediction performance, for further analysis. After applying two data privacy methods (Sampling and Random Noise) to the original data, we found that the Random Forest model achieved accuracy very close to its accuracy on the unprotected data. This demonstrates that companies can protect customer data privacy with little loss of predictive accuracy. These results are relevant to companies that seek to share customer data while maintaining high levels of privacy and accuracy.
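
The abstract does not state how Sampling and Random Noise were implemented, so the following is only a minimal sketch under stated assumptions: sampling is taken to mean releasing a random subset of records, and random noise is taken to mean adding zero-mean Gaussian noise to numeric fields, scaled to each column's standard deviation. The function names, the field names (age, balance, subscribed), the toy records, and the default parameters are all hypothetical and are not taken from the study.

```python
import random
import statistics


def sample_rows(rows, fraction, seed=0):
    """Assumed form of 'Sampling': keep a random subset of the records."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    k = max(1, round(len(rows) * fraction))
    # Draw k distinct records without replacement.
    return rng.sample(rows, k)


def add_random_noise(rows, numeric_keys, scale=0.1, seed=0):
    """Assumed form of 'Random Noise': perturb numeric fields.

    Each listed field receives zero-mean Gaussian noise whose standard
    deviation is `scale` times that column's population standard deviation.
    Other fields (such as the label) are copied unchanged.
    """
    rng = random.Random(seed)
    sigmas = {
        key: scale * statistics.pstdev([row[key] for row in rows])
        for key in numeric_keys
    }
    protected = []
    for row in rows:
        noisy = dict(row)  # copy, so the original records are not modified
        for key in numeric_keys:
            noisy[key] = row[key] + rng.gauss(0.0, sigmas[key])
        protected.append(noisy)
    return protected


# Hypothetical toy records, for illustration only (not real data).
toy_rows = [
    {"age": 34, "balance": 1200.0, "subscribed": 0},
    {"age": 51, "balance": 310.0, "subscribed": 1},
    {"age": 28, "balance": -45.0, "subscribed": 0},
    {"age": 44, "balance": 2750.0, "subscribed": 1},
    {"age": 39, "balance": 80.0, "subscribed": 0},
    {"age": 60, "balance": 5400.0, "subscribed": 1},
]

sampled = sample_rows(toy_rows, fraction=0.5)
noisy = add_random_noise(toy_rows, numeric_keys=["age", "balance"])
```

In a comparison like the one described above, the same model would be trained once on the original records and once on each protected version, with accuracy measured on a common held-out set. The parameters `fraction` and `scale` control how strongly the data is altered, and so how much accuracy is likely to be traded for privacy.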