Chunk Smarter, Retrieve Better: Enhancing LLMs in Finance: An Empirical Comparison of Chunking Techniques in Retrieval Augmented Generation for Financial Reports
Abstract
This thesis investigates how Retrieval-Augmented Generation (RAG) improves the ability of Large Language Models (LLMs) to filter information from financial documents. To this end, we first develop NorwegianFinanceQA, a dataset of 433 queries drawn from the financial reports of nine Norwegian companies and divided into text-related and table-related queries. We then evaluate the retrieval accuracy and efficiency of RAG systems under three chunking techniques: character-based, recursive, and semantic splitting. In addition, we propose a table-specific summarization approach. Our results suggest that table summaries achieve perfect accuracy on table queries while also increasing efficiency. However, this improvement comes at the expense of performance on text queries. Our findings highlight the importance of tailored chunking strategies when using LLMs and RAG systems for information retrieval in a financial context.
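To make the distinction between two of the chunking techniques named above concrete, the following is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the implementation evaluated in this thesis: the function names `character_chunks` and `recursive_chunks`, the separator list, and the sample text are all hypothetical. Semantic splitting is omitted because it depends on an embedding model.

```python
from typing import List, Sequence


def character_chunks(text: str, size: int, overlap: int = 0) -> List[str]:
    """Character-based chunking: cut every `size` characters, ignoring structure."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


def recursive_chunks(
    text: str,
    size: int,
    separators: Sequence[str] = ("\n\n", "\n", ". ", " "),  # assumed ordering, coarse to fine
) -> List[str]:
    """Recursive chunking: split on the coarsest separator first and fall back
    to finer separators only for pieces that are still longer than `size`.
    Simplification: separators at chunk boundaries are dropped."""
    if len(text) <= size:
        return [text] if text.strip() else []
    if not separators:
        # No structure left to exploit, so fall back to fixed-size cuts.
        return character_chunks(text, size)
    head, rest = separators[0], separators[1:]
    chunks: List[str] = []
    buffer = ""
    for piece in text.split(head):
        candidate = piece if not buffer else buffer + head + piece
        if len(candidate) <= size:
            buffer = candidate  # keep merging small pieces into one chunk
            continue
        if buffer:
            chunks.append(buffer)
        if len(piece) <= size:
            buffer = piece
        else:
            chunks.extend(recursive_chunks(piece, size, rest))
            buffer = ""
    if buffer:
        chunks.append(buffer)
    return chunks


sample = "Revenue rose.\n\nCosts fell. Margins improved over the year."
fixed = character_chunks(sample, 30)
structured = recursive_chunks(sample, 30)
```

In this sketch the character-based splitter may cut through the middle of a sentence, whereas the recursive splitter keeps the first paragraph intact as its own chunk and breaks the longer paragraph only at sentence or word boundaries.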