Do Fraudulent Companies Employ Different Linguistic Features in Their Annual Reports? An Empirical Study Using Logistic Regression and Random Forest Methodologies

2022

Textual analysis is widely used to uncover fraudulent activity in 10-K filings. Previous studies have examined the Management's Discussion and Analysis (MD&A) section of annual reports to predict illicit behaviour by analysing the tone of executives, and the majority of those studies date back 10 years or more. The primary goal of this research is to find patterns in the linguistic features of the entire annual reports of convicted public companies, identified using the Corporate Prosecution Registry database, and to compare them with non-fraudulent counterparts in the same industry. Logistic regression and random forest algorithms are implemented to identify important factors and make accurate predictions. The accuracy rate, ROC-AUC value, and 10-fold cross-validation are used to validate the performance of each method. The results of the logistic regression reveal that corrupt organisations use a more negative, uncertain, and litigious tone. Furthermore, these businesses tend to use wording with high lexical diversity and minimal complexity. Based on the random forest machine learning technique, the litigious variable is the most important variable in predicting fraudulent corporations. Moreover, each of the validation methods shows that the random forest methodology outperforms logistic regression.
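
As an illustration of the kind of linguistic features discussed above, the following minimal Python sketch computes the share of negative, uncertain, and litigious words in a text, together with a simple lexical diversity measure (the type-token ratio). It is not the procedure used in this study: the word lists, the tokenizer, and the function names `tokenize` and `linguistic_features` are hypothetical stand-ins chosen only for illustration, and a real analysis would rely on a full finance-specific dictionary and complete annual reports.

```python
import re

# Hypothetical mini word lists, for illustration only. A real analysis would
# use a full finance-specific dictionary rather than these four-word sets.
TONE_WORDS = {
    "negative": {"loss", "decline", "adverse", "impairment"},
    "uncertain": {"may", "could", "approximately", "uncertain"},
    "litigious": {"lawsuit", "plaintiff", "litigation", "settlement"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def linguistic_features(text):
    """Return tone shares and the type-token ratio for one document."""
    tokens = tokenize(text)
    n = len(tokens)
    if n == 0:
        # An empty document has no measurable tone or diversity.
        features = {name: 0.0 for name in TONE_WORDS}
        features["lexical_diversity"] = 0.0
        return features

    # Share of tokens that fall into each tone category.
    features = {
        name: sum(token in words for token in tokens) / n
        for name, words in TONE_WORDS.items()
    }
    # Type-token ratio: distinct words divided by total words.
    features["lexical_diversity"] = len(set(tokens)) / n
    return features


if __name__ == "__main__":
    sample = "The plaintiff filed a lawsuit. The lawsuit may cause a loss."
    for name, value in linguistic_features(sample).items():
        print(f"{name}: {value:.3f}")
```

Feature vectors of this kind, computed for each fraudulent and non-fraudulent report, could then serve as the explanatory variables of a classifier such as logistic regression or random forest.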