LLMs are Biased Evaluators But Not Biased for Fact-Centric Retrieval Augmented Generation

Chen, Yen-Shan; Jin, Jing; Kuo, Peng-Ting; Huang, Chao-Wei; Chen, Yun-Nung

doi:10.18653/v1/2025.findings-acl.1369

Journal

Proceedings of the Annual Meeting of the Association for Computational Linguistics

Start Page

26669

End Page

26684

ISBN (of the container)

979-889176256-5

Date Issued

2025-07

URI

https://www.scopus.com/record/display.uri?eid=2-s2.0-105028573899&origin=resultslist

https://scholars.lib.ntu.edu.tw/handle/123456789/737222

Abstract

Recent studies have demonstrated that large language models (LLMs) exhibit significant biases in evaluation tasks, particularly a tendency to rate self-generated content more favorably. However, the extent to which this bias manifests in fact-oriented tasks remains unclear, especially within retrieval-augmented generation (RAG) frameworks, where keyword extraction and factual accuracy take precedence over stylistic elements. Our study addresses this knowledge gap by simulating two critical phases of the RAG framework. In the first phase, LLMs evaluate human-authored and model-generated passages, emulating the pointwise reranking phase. The second phase conducts pairwise reading comprehension tests to simulate the generation phase. Contrary to previous findings indicating a self-preference in rating tasks, our results reveal no significant self-preference effect in RAG frameworks. Instead, we observe that factual accuracy significantly influences LLMs' output, even in the absence of prior knowledge. These findings are consistent across three common QA datasets (NQ, MARCO, and TriviaQA) and five widely adopted language models (GPT-3.5, GPT-4o-mini, Gemini, LLaMA3, and Mistral). Our research contributes to the ongoing discourse on LLM biases and their implications for RAG-based systems, offering insights that may inform the development of more robust and unbiased LLM systems.
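The two phases described in the abstract could be summarized with statistics along the following lines. This is a minimal sketch, not the paper's actual metric: the record layouts, the function names `self_preference_gap` and `self_pick_rate`, and the toy values are all assumptions made for illustration. A gap or rate near the no-preference baseline would be in line with the reported absence of self-preference.

```python
from statistics import mean

def self_preference_gap(scores, evaluator):
    """Pointwise phase (hypothetical layout): scores is a list of
    (evaluator, author, score) tuples. Returns the mean score the
    evaluator gives its own passages minus the mean score it gives
    passages written by anyone else (humans or other models)."""
    own = [s for e, a, s in scores if e == evaluator and a == evaluator]
    other = [s for e, a, s in scores if e == evaluator and a != evaluator]
    if not own or not other:
        raise ValueError("need both self-written and other-written passages")
    return mean(own) - mean(other)

def self_pick_rate(choices, reader):
    """Pairwise phase (hypothetical layout): choices is a list of
    (reader, author_a, author_b, chosen_author) tuples. Considers only
    pairs where exactly one passage was written by the reader itself and
    returns the fraction in which the reader followed its own passage."""
    relevant = [
        chosen
        for r, a, b, chosen in choices
        if r == reader and (a == reader) != (b == reader)
    ]
    if not relevant:
        raise ValueError("no pairs with exactly one self-written passage")
    return sum(1 for chosen in relevant if chosen == reader) / len(relevant)

# Toy data, invented purely to exercise the functions.
toy_scores = [
    ("m1", "m1", 4.0),
    ("m1", "human", 4.0),
    ("m1", "m2", 3.0),
    ("m2", "m2", 5.0),
]
toy_choices = [
    ("m1", "m1", "human", "m1"),
    ("m1", "human", "m1", "human"),
    ("m1", "m2", "human", "human"),  # ignored: no self-written passage
]

gap = self_preference_gap(toy_scores, "m1")
rate = self_pick_rate(toy_choices, "m1")
```

With real data, a significance test over many questions would be needed before drawing any conclusion from either number; the sketch only shows what is being compared.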

Event(s)

63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025

Publisher

Association for Computational Linguistics

Type

conference paper
