Do Fraudulent Companies Employ Different Linguistic Features in Their Annual Reports? An Empirical Study Using Logistic Regression and Random Forest Methodologies
- Master Thesis 
Textual analysis is widely used to uncover fraudulent actions in 10-K filings. Previous studies have examined the Management's Discussion and Analysis (MD&A) section of annual reports to predict illicit behaviour by analysing the tone of executives, and the majority of those studies date back 10 years or more. The primary goal of this research is to find patterns in the linguistic features of the entire annual reports of convicted public companies, identified through the Corporate Prosecution Registry database, and to compare them with non-fraudulent counterparts in the same industry. Logistic regression and random forest are applied to identify important factors and to make predictions. Accuracy, ROC-AUC, and 10-fold cross-validation are used to assess the performance of each method. The results of the logistic regression reveal that corrupt organisations use a more negative, uncertain, and litigious tone. Furthermore, these businesses use language with high lexical diversity and minimal complexity. According to the random forest model, the litigious variable is the most important variable in predicting fraudulent corporations. Moreover, each of the validation methods shows that random forest outperforms logistic regression.
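As an illustration of the kind of linguistic features described above, the following is a minimal sketch of how tone shares and lexical diversity might be computed for a single document. It is not the thesis's implementation: the word lists in `TONE_WORDS` are tiny hypothetical examples, the function names are invented for this sketch, and lexical diversity is assumed here to be the type-token ratio (distinct words divided by total words).

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny word lists for illustration only.
# The abstract does not state which dictionaries the thesis uses.
TONE_WORDS = {
    "negative": {"loss", "decline", "adverse", "failure"},
    "uncertain": {"may", "might", "uncertain", "approximately"},
    "litigious": {"litigation", "plaintiff", "settlement", "court"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def linguistic_features(text):
    """Return the share of tone words per category and the type-token ratio."""
    tokens = tokenize(text)
    total = len(tokens)
    if total == 0:
        # An empty document gets zero for every feature.
        empty = {name: 0.0 for name in TONE_WORDS}
        empty["lexical_diversity"] = 0.0
        return empty

    counts = Counter(tokens)
    # Share of tokens that fall in each tone category.
    features = {
        name: sum(counts[word] for word in words) / total
        for name, words in TONE_WORDS.items()
    }
    # Type-token ratio as an assumed proxy for lexical diversity.
    features["lexical_diversity"] = len(counts) / total
    return features


sample = "The company may face litigation and an adverse court settlement."
features = linguistic_features(sample)
```

Feature vectors of this form, computed for each annual report, could then serve as inputs to a classifier such as logistic regression or a random forest.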