Evaluating ChatGPT-4 and Bard in Categorizing Investor Risk Profiles
Abstract
This study examines the performance of OpenAI’s ChatGPT-4 and Google’s Bard in categorizing investors’ risk profiles. The objectives were to compare the chatbots’ assessments with those of financial advisors and to assess the consistency of their evaluations over time. Two research questions were posed: “How do ChatGPT and Bard categorize investor risk profiles compared to financial advisors?” and “How consistent are the chatbots’ categorizations over time?”
The study used ten distinct investor descriptions (client cases), which the chatbots were asked to categorize weekly from October 7 through November 25, 2023. The assessments of ChatGPT and Bard were compared with those of financial advisors at the same bank.
To compare the assessments from ChatGPT, Bard, and the bankers, multiple Kruskal-Wallis tests were conducted, followed by Dunn’s tests for post hoc analysis. Welch’s t-tests were additionally used to validate the results, checking whether the findings held across statistical analyses with different data assumptions. A qualitative analysis of the chatbots’ responses was conducted in instances where their assessments deviated from those of the bankers with statistical significance. A repeated measures ANOVA was used to assess the chatbots’ consistency over time.
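For readers who wish to reproduce a comparable analysis, the sketch below illustrates the general statistical pipeline in Python. It is a minimal example with hypothetical data, not the study’s actual code: the risk scale, number of weekly assessments, and all column and group names are illustrative assumptions.

```python
# Minimal sketch of the statistical pipeline: Kruskal-Wallis omnibus test,
# Dunn's post hoc test, Welch's t-test as a parametric cross-check, and a
# repeated measures ANOVA for consistency over time. All data are hypothetical.
import numpy as np
import pandas as pd
from scipy import stats
import scikit_posthocs as sp
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(0)

# Hypothetical weekly risk scores for one client case from each rater group
# (8 weekly assessments per rater, scores on an assumed numeric risk scale).
df = pd.DataFrame({
    "score": np.concatenate([
        rng.integers(3, 6, 8),   # ChatGPT
        rng.integers(3, 6, 8),   # Bard
        rng.integers(4, 7, 8),   # bankers
    ]),
    "rater": ["ChatGPT"] * 8 + ["Bard"] * 8 + ["Banker"] * 8,
})

# Kruskal-Wallis test across the three rater groups.
chatgpt = df.loc[df["rater"] == "ChatGPT", "score"]
bard = df.loc[df["rater"] == "Bard", "score"]
banker = df.loc[df["rater"] == "Banker", "score"]
h_stat, p_value = stats.kruskal(chatgpt, bard, banker)

# Dunn's post hoc test with Bonferroni correction if the omnibus test is significant.
if p_value < 0.05:
    print(sp.posthoc_dunn(df, val_col="score", group_col="rater",
                          p_adjust="bonferroni"))

# Welch's t-test (unequal variances) as a cross-check, e.g. ChatGPT vs. bankers.
t_stat, p_welch = stats.ttest_ind(chatgpt, banker, equal_var=False)

# Repeated measures ANOVA for consistency over time: hypothetical long-format
# data with one score per client case (10 cases) per week (8 weeks).
long = pd.DataFrame({
    "client": np.repeat(np.arange(10), 8),
    "week": np.tile(np.arange(8), 10),
    "score": rng.integers(3, 7, 80),
})
print(AnovaRM(data=long, depvar="score", subject="client", within=["week"]).fit())
```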
The results from the non-parametric tests indicated that ChatGPT’s and Bard’s assessments differed from those of the bankers for half of the clients. Among these clients, both chatbots assessed the risk profiles more conservatively for three and more aggressively for one. The results further indicated that the chatbots were relatively consistent in their assessments over time: although their assessed risk scores varied somewhat for each client, the variations were minor and did not indicate notable inconsistencies. The chatbots’ consistency was also supported by the absence of statistically significant differences in their assessments across the conversational methods used in the study.
The qualitative analysis revealed several weaknesses in the chatbots’ reasoning that affected their accuracy: a lack of personalized recommendations, reliance on general principles, an absence of factual support, and a lack of human-like understanding. These findings underscore the need for caution when relying on such tools for risk profiling.