Authors: L Gienapp, T Hagen, M Fröbe, M Hagen, B Stein, M Potthast, H Scells
Year: 2025
Published in: ArXiv
Institution: Bauhaus-Universität Weimar, Friedrich-Schiller-Universität Jena, Leipzig University, University of Kassel, ScaDS.AI, hessian.AI
Research Area: Crowdsourcing, RAG Evaluation, Artificial Intelligence, AI Evaluation, RAG
Discipline: Artificial Intelligence
The study investigates the feasibility of using crowdsourcing for RAG evaluation, finding that human pairwise judgments are reliable and cost-effective compared to LLM-based or automated methods.
Methods: Two complementary studies on response writing and response utility judgment using 903 human-written and 903 LLM-generated responses for 301 topics; pairwise judgments across seven utility dimensions were collected via human and LLM evaluators.
Key Findings: Humans are effective at both writing and judging responses in RAG scenarios, with effects observed across discourse styles and utility dimensions such as coverage and coherence.
Citations: 4
Sample Size: 903
Authors: Pooja S. B. Rao, Sanja Šćepanović, Ke Zhou, Edyta Paulina Bogucka, D Quercia
Year: 2025
Published in: ArXiv
Institution: Nokia Bell Labs, University of Lausanne
Research Area: AI Risk Management, Model Risk Reporting, RAG Pipeline, RAG
Discipline: Artificial Intelligence
RiskRAG improves AI model risk reporting by offering pre-populated, contextualized risk reports that are preferred by developers, designers, and media professionals over standard model cards.
Methods: Developed a Retrieval-Augmented Generation system based on five design requirements co-created with 16 developers, using a dataset of 450K model cards and 600 real-world incidents. Evaluated RiskRAG in preliminary and final studies with a total of 125 participants.
Key Findings: RiskRAG improves risk reporting and supports better decision-making compared to standard model cards.
Sample Size: 125