Real-World Summarization: When Evaluation Reaches Its Limits
Authors: P Schmidtová, O Dušek, S Mahamood
Published: 2025
Publication: ArXiv
Summary: Simpler metrics such as word overlap correlate surprisingly well with human judgments in summarization evaluation and outperform more complex methods in out-of-domain settings, while LLMs remain unreliable as evaluators due to annotation biases.
Methods: Human evaluation campaigns with categorical error assessment, span-level annotations, and comparison of traditional metrics, trainable models, and LLM-as-a-judge approaches.
Key Findings: Assessment of how well different summarization evaluation methods correlate with human judgment, and analysis of the business impact of incorrect information in generated summaries.
Limitations: LLMs tend to under- or over-annotate errors during evaluation, crowdsourced annotation faces quality challenges, and method performance varies across domains.
Institution: Charles University, Trivago
Research Area: Summarization evaluation, Natural Language Processing, LLM-as-a-Judge, AI Evaluation
Discipline: Natural Language Processing
Citations: 1
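
Illustrative sketch: the summary and methods above describe comparing word-overlap metrics against human judgments. A minimal sketch of that kind of comparison is shown below, assuming a simple unigram-recall overlap score and Spearman correlation; the texts, ratings, and the word_overlap helper are illustrative placeholders, not data or code from the paper.

# Minimal sketch: correlate a simple word-overlap score with human ratings.
# All inputs below are hypothetical examples, not taken from the paper.
from scipy.stats import spearmanr

def word_overlap(summary: str, reference: str) -> float:
    """Fraction of reference tokens that also appear in the summary (unigram recall)."""
    summary_tokens = set(summary.lower().split())
    reference_tokens = reference.lower().split()
    if not reference_tokens:
        return 0.0
    return sum(t in summary_tokens for t in reference_tokens) / len(reference_tokens)

# Hypothetical system summaries, reference texts, and human quality ratings (1-5).
summaries = ["the hotel offers free breakfast",
             "rooms are large and clean",
             "pool closed in winter"]
references = ["guests praise the free breakfast at the hotel",
              "reviewers mention spacious and clean rooms",
              "several reviews note the indoor pool"]
human_ratings = [5, 4, 2]

metric_scores = [word_overlap(s, r) for s, r in zip(summaries, references)]
correlation, p_value = spearmanr(metric_scores, human_ratings)
print(f"Spearman rho = {correlation:.2f} (p = {p_value:.2f})")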