Predicting patent litigation: a comprehensive comparison of machine learning algorithm performance in predicting patent litigation

2020

Patents are designed to act as an incentive for innovation by awarding exclusive property
rights to the inventor. As such, patents are one of the main driving forces behind
innovation and, ultimately, economic growth (Lanjouw and Schankerman, 2004). Patent
litigation, the legal process associated with disputes over patent rights, is hard
to predict, surrounded by uncertainty, potentially ruinously expensive, and very difficult to
insure. Previous research has shown that there is potential for predicting patent litigation,
but only on the basis of limited data and relatively unsophisticated algorithms.

The purpose of this thesis is to evaluate the extent to which patent litigation can
be predicted, which machine learning method is most appropriate, and which
characteristics are important for the classifier. The goal is to help reduce the
uncertainty that threatens the incentives for innovation by introducing more information
through better patent litigation prediction. In particular, we focus on the patent litigation
insurance market as the most direct application of our research.

This thesis is inspired by the work of Lanjouw and Schankerman (2001), which forms the
basis of our research. Building on their work, we add more data and characteristics to
the analysis, and then employ and compare other, more sophisticated machine learning
algorithms. The work relates to anomaly detection and faces the challenges characteristic
of that area of research.

We find that patent litigation can to a large extent be predicted. Furthermore, adding
more characteristics and information increases the predictive power. The largest gains in
predictive power stem from the use of appropriate algorithms. Using the right algorithm
is much more important than using a more advanced or newer algorithm. The Random
Forest classifier is found to be the preferred method of predicting patent litigation on our
data, as it yields models with high levels of predictive power. We find patent family
size, whether or not the patent is owned by a US company, and the number of backward
citations to be the most important characteristics driving the prediction of litigation.

Keywords – NHH, Master Thesis, Patent Litigation Data, Patent Litigation Prediction,
Predictive Analysis, Logit, Random Forest, XGBoost, SVM