LONG2RAG: Evaluating Long-Context & Long-Form Retrieval-Augmented Generation with Key Point Recall

Abstract

Retrieval-augmented generation (RAG) is a promising approach to addressing the limitations of fixed knowledge in large language models (LLMs). However, current benchmarks for evaluating RAG systems suffer from two key deficiencies: (1) they fail to adequately measure LLMs’ capability to handle long-context retrieval because they lack datasets that reflect the characteristics of retrieved documents, and (2) they lack a comprehensive evaluation method for assessing LLMs’ ability to generate long-form responses that effectively exploit the retrieved information. To address these shortcomings, we introduce the LONG2RAG benchmark and the Key Point Recall (KPR) metric. LONG2RAG comprises 280 questions spanning 10 domains and 8 question categories, each associated with 5 retrieved documents with an average length of 2,444 words. KPR evaluates the extent to which LLMs incorporate key points extracted from the retrieved documents into their generated responses, providing a more nuanced assessment of their ability to exploit retrieved information.
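To make the metric concrete, below is a minimal sketch of a KPR-style computation based only on the abstract’s description: the score is the fraction of extracted key points that a generated response covers. The `is_covered` judge is a hypothetical placeholder (a naive token-overlap heuristic); the paper’s actual coverage judgment is not specified here and would more plausibly use an entailment model or an LLM judge.

```python
from typing import Callable, List


def is_covered(key_point: str, response: str) -> bool:
    """Hypothetical coverage check: naive token-overlap heuristic.

    Stand-in for the paper's actual judgment procedure, which is not
    described in the abstract (e.g., an NLI model or LLM-based judge).
    """
    kp_tokens = set(key_point.lower().split())
    resp_tokens = set(response.lower().split())
    # Count a key point as covered if most of its tokens appear in the response.
    return len(kp_tokens & resp_tokens) >= 0.8 * len(kp_tokens)


def key_point_recall(
    key_points: List[str],
    response: str,
    judge: Callable[[str, str], bool] = is_covered,
) -> float:
    """Fraction of key points from the retrieved documents that the
    generated response covers."""
    if not key_points:
        return 0.0
    covered = sum(judge(kp, response) for kp in key_points)
    return covered / len(key_points)


if __name__ == "__main__":
    kps = [
        "RAG mitigates the fixed-knowledge limitation of LLMs",
        "retrieved documents average 2,444 words in length",
    ]
    answer = (
        "Retrieval-augmented generation mitigates the fixed-knowledge "
        "limitation of LLMs by grounding answers in retrieved documents."
    )
    print(f"KPR = {key_point_recall(kps, answer):.2f}")
```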

Publication
Findings of the Association for Computational Linguistics: EMNLP 2024
Hao Zhang