The Viability of Crowdsourcing for RAG Evaluation
Authors: Lukas Gienapp, Tim Hagen, Maik Fröbe, Matthias Hagen, Benno Stein, Martin Potthast, Harrisen Scells
Published: 2025
Publication: ArXiv
The study investigates the feasibility of using crowdsourcing for retrieval-augmented generation (RAG) evaluation, finding that human pairwise judgments are reliable and cost-effective compared to LLM-based and automated methods.
Methods: Two complementary studies on response writing and response utility judgment, based on 903 human-written and 903 LLM-generated responses for 301 topics; pairwise judgments across seven utility dimensions were collected from both human and LLM evaluators (see the aggregation sketch after this record).
Key Findings: Humans are effective both at writing responses and at judging response utility in RAG scenarios, across discourse styles and utility dimensions such as coverage and coherence.
Limitations: Limited scope of topics (301 in total) and discourse styles; reliance on pairwise judgments might not capture nuanced differences between responses.
Institution: Bauhaus-Universität Weimar, Friedrich-Schiller-Universität Jena, Leipzig University, University of Kassel, ScaDS.AI, hessian.AI
Research Area: Crowdsourcing, Retrieval-Augmented Generation (RAG), RAG Evaluation, AI Evaluation
Discipline: Artificial Intelligence
Sample Size: 903 human-written and 903 LLM-generated responses (1,806 in total) across 301 topics
Citations: 4
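For intuition on how pairwise utility judgments like those in the Methods field can be aggregated, here is a minimal Python sketch that computes per-response win rates for each utility dimension. The data format, function name, and the example dimension are illustrative assumptions, not the paper's actual evaluation pipeline.

from collections import defaultdict

def win_rates(judgments):
    """Aggregate pairwise judgments into per-dimension win rates.

    `judgments` is an iterable of (dimension, response_a, response_b,
    winner) tuples, where `winner` equals response_a or response_b.
    This format is an assumption for illustration; the paper collects
    judgments across seven utility dimensions.
    """
    wins = defaultdict(lambda: defaultdict(int))
    comparisons = defaultdict(lambda: defaultdict(int))
    for dim, a, b, winner in judgments:
        comparisons[dim][a] += 1
        comparisons[dim][b] += 1
        wins[dim][winner] += 1
    # Win rate = fraction of comparisons a response won, per dimension.
    return {dim: {resp: wins[dim][resp] / comparisons[dim][resp]
                  for resp in comparisons[dim]}
            for dim in comparisons}

# Toy usage: one human-written and one LLM-generated response,
# judged twice on the "coherence" dimension.
example = [
    ("coherence", "human_1", "llm_1", "human_1"),
    ("coherence", "human_1", "llm_1", "llm_1"),
]
print(win_rates(example))  # {'coherence': {'human_1': 0.5, 'llm_1': 0.5}}

Win rates are one simple aggregate; preference-model fits such as Bradley-Terry are a common alternative when responses are compared in overlapping pairs.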