Model Evaluation: Research Area — Prolific Citations Library

Discover 6 peer-reviewed studies in Model Evaluation (2018–2026). Explore research findings powered by Prolific's diverse participant panel.

This page lists 6 peer-reviewed papers in the research area of Model Evaluation in the Prolific Citations Library, a curated collection of research powered by high-quality human data from Prolific.

Papers (6 of 6)

Bayesian teaching enables probabilistic reasoning in large language models

Authors: L Qiu, F Sha, K Allen, Y Kim, T Linzen, S van Steenkiste

Year: 2026

Published in: Nature …, 2026 - nature.com

Institution: Meta, Google DeepMind, Massachusetts Institute of Technology, Google Research, Google

Research Area: Probabilistic reasoning, Bayesian cognition, Neural language models, Reasoning, AI Evaluations

Discipline: Machine learning, Artificial intelligence

This paper sits at the intersection of machine learning and computational cognitive science, showing that large language models can acquire generalized probabilistic reasoning by being trained to imitate Bayesian belief updating rather than relying on prompting or heuristics.

Citations: 8

Trick or Neat: Adversarial Ambiguity and Language Model Evaluation

Authors: A Karamolegkou, O Eberle, P Rust, C Kauf, A Søgaard

Year: 2025

Published in: ArXiv

Institution: Aleph Alpha, Massachusetts Institute of Technology

Research Area: Adversarial Ambiguity, Language Model Evaluation, Artificial intelligence, Computation and Language, LLM, AI Evaluation, Red Teaming

Discipline: Natural Language Processing

The paper assesses language models' sensitivity to ambiguity using an adversarial dataset and finds that direct prompting poorly identifies ambiguity, while linear probes achieve high accuracy in decoding ambiguity from model representations.

Methods: An adversarial ambiguity dataset was introduced with various types of ambiguities and transformations; models were tested using direct prompts and linear probes trained on internal representations.

Key Findings: Direct prompting identifies ambiguity poorly, whereas linear probes decode it from model representations with high accuracy. The evaluation covers syntactic, lexical, and phonological ambiguity, as well as adversarial variations.

Citations: 2

Benchmarking World-Model Learning

Authors: A Warrier, D Nguyen, M Naim, M Jain, Y Liang, K Schroeder, C Yang, JB Tenenbaum, S Vollmer, K Ellis, Z Tavares

Year: 2025

Published in: arXiv preprint arXiv …, 2025 - arxiv.org

Institution: Basis Research Institute, DFKI GmbH, Harvard University, Quebec AI Institute, University of Cambridge, Massachusetts Institute of Technology, Cornell University

Research Area: Agent learning, World Models, Benchmarking, Evaluation protocols, RLHF, LLM

Discipline: Computer Science, Artificial Intelligence, Machine Learning

The paper introduces WorldTest, a novel protocol for evaluating model-learning agents using reward-free exploration and behavior-based scoring, and demonstrates that humans outperform models on the AutumnBench suite of tasks, revealing significant gaps in world-model learning.

Methods: The authors proposed WorldTest, a protocol that separates reward-free interaction from scored tests in related environments. Evaluations were run on AutumnBench, a dataset of 43 grid-world environments and 129 tasks across prediction, planning, and causal dynamics.

Key Findings: Humans outperform model-learning agents on AutumnBench, whose tasks cover masked-frame prediction, planning, and causal dynamics, revealing significant gaps in world-model learning.

Citations: 1

Sample Size: 517

ImagenHub: Standardizing the evaluation of conditional image generation models

Authors: M Ku, T Li, K Zhang, Y Lu, X Fu, W Zhuang

Year: 2024

Published in: arXiv preprint arXiv …, 2023 - arxiv.org

Institution: University of Waterloo, Ohio State University, University of California Santa Barbara, University of Pennsylvania

Research Area: AI alignment, Representation learning, Cognitive computational modeling, Vision foundation models evaluation, Multimodal, Vision models

Discipline: Computer Science, Artificial Intelligence, Machine Learning

This paper presents ImagenHub, a library that standardizes the inference and evaluation of conditional image generation models so that different models can be compared fairly.

DOI: https://doi.org/10.48550/arXiv.2310.01596

Citations: 59

ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models

Authors: Thibaut Thonet, Jos Rozen, Laurent Besacier

Year: 2024

Published in: ArXiv

Institution: NAVER Labs

Research Area: Long-Context Language Models, Meeting Assistant Systems, Benchmark Evaluation

Discipline: Artificial Intelligence

Making better use of the crowd: How crowdsourcing can advance machine learning research

Authors: JW Vaughan

Year: 2018

Published in: Journal of Machine Learning Research, 2018 - jmlr.org

Institution: Microsoft Research

Research Area: Crowdsourcing for Machine Learning Research, including data generation, model evaluation, hybrid intelligence systems, and behavioral experiments

Discipline: Machine Learning

Citations: 264