Language Model Evaluation: Research Area — Prolific Citations Library

Discover 3 peer-reviewed studies in Language Model Evaluation (2024–2026). Explore research findings powered by Prolific's diverse participant panel.

This page lists 3 peer-reviewed papers in the research area of Language Model Evaluation in the Prolific Citations Library, a curated collection of research powered by high-quality human data from Prolific.

Papers (3 of 3)

Bayesian teaching enables probabilistic reasoning in large language models

Authors: L Qiu, F Sha, K Allen, Y Kim, T Linzen, S van Steenkiste

Year: 2026

Published in: Nature …, 2026 - nature.com

Institution: Meta, Google DeepMind, Massachusetts Institute of Technology, Google Research, Google

Research Area: Probabilistic reasoning, Bayesian cognition, Neural language models, Reasoning, AI Evaluations

Discipline: Machine learning, Artificial intelligence

This paper sits at the intersection of machine learning and computational cognitive science, showing that large language models can acquire generalized probabilistic reasoning by being trained to imitate Bayesian belief updating rather than relying on prompting or heuristics.

Citations: 8

Trick or Neat: Adversarial Ambiguity and Language Model Evaluation

Authors: A Karamolegkou, O Eberle, P Rust, C Kauf, A Søgaard

Year: 2025

Published in: ArXiv

Institution: Aleph Alpha, Massachusetts Institute of Technology

Research Area: Adversarial Ambiguity, Language Model Evaluation, Artificial intelligence, Computation and Language, LLM, AI Evaluation, Red Teaming

Discipline: Natural Language Processing

The paper assesses language models' sensitivity to ambiguity using an adversarial dataset and finds that direct prompting identifies ambiguity poorly, whereas linear probes decode ambiguity from model representations with high accuracy.

Methods: An adversarial ambiguity dataset was introduced with various types of ambiguities and transformations; models were tested using direct prompts and linear probes trained on internal representations.

Key Findings: When prompted directly, language models detect ambiguity poorly across syntactic, lexical, and phonological types and under adversarial variations, while linear probes trained on their internal representations decode ambiguity with high accuracy.

Citations: 2

ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models

Authors: Thibaut Thonet, Jos Rozen, Laurent Besacier

Year: 2024

Published in: ArXiv

Institution: NAVER Labs

Research Area: Long-Context Language Models, Meeting Assistant Systems, Benchmark Evaluation

Discipline: Artificial Intelligence