Do Fraudulent Companies Employ Different Linguistic Features in Their Annual Reports? An Empirical Study Using Logistic Regression and Random Forest Methodologies
Abstract
Textual analysis is widely used to uncover fraudulent behaviour in 10-K filings. Previous
studies have examined the Management's Discussion and Analysis (MD&A) section of
annual reports to predict illicit behaviour by analysing the tone of executives, and the
majority of those studies date back 10 years or more. The primary goal of this research is to
find patterns in the linguistic features of the entire annual reports of convicted public
companies, identified through the Corporate Prosecution Registry database, and to compare them with
non-fraudulent counterparts in the same industry. Logistic regression and random forest
models are used to identify important predictors and to classify companies as fraudulent or
non-fraudulent. Accuracy, ROC-AUC, and 10-fold cross-validation are used to evaluate the
performance of each method. The logistic regression results show that
fraudulent companies use a more negative, uncertain, and litigious tone. Furthermore, these
companies use language with higher lexical diversity and lower complexity. According to the
random forest model, litigious tone is the most important variable in predicting fraudulent
companies. Moreover, every validation measure indicates that the random forest model
outperforms logistic regression.
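
As a rough illustration of the kind of linguistic features described above, the following sketch computes the share of negative, uncertain, and litigious words in a text, along with lexical diversity measured as a type-token ratio. The word lists, the function names `tokenize` and `linguistic_features`, and the sample sentence are hypothetical and chosen only for illustration; the type-token ratio is assumed here as one common measure of lexical diversity and is not necessarily the measure used in this study.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny word lists for illustration only.
# A real study would rely on a full finance-specific tone dictionary.
TONE_WORDS = {
    "negative": {"loss", "decline", "adverse", "impairment"},
    "uncertain": {"may", "might", "approximately", "uncertain"},
    "litigious": {"litigation", "plaintiff", "settlement", "court"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def linguistic_features(text):
    """Return tone proportions and the type-token ratio for one document."""
    tokens = tokenize(text)
    total = len(tokens)
    if total == 0:
        # An empty document has no measurable tone or diversity.
        features = {name: 0.0 for name in TONE_WORDS}
        features["lexical_diversity"] = 0.0
        return features

    counts = Counter(tokens)
    # Share of all tokens that fall in each tone category.
    features = {
        name: sum(counts[word] for word in words) / total
        for name, words in TONE_WORDS.items()
    }
    # Type-token ratio: distinct words divided by total words.
    features["lexical_diversity"] = len(counts) / total
    return features


sample = "The court approved the settlement. The company may face a loss."
features = linguistic_features(sample)
print(features)
```

Feature vectors of this form, one per annual report, could then serve as inputs to a classifier such as logistic regression or a random forest. Note that a raw type-token ratio is sensitive to document length, so documents of very different sizes would need a length-adjusted measure.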