Unpacking Human Preference for LLMs: Demographically Aware Evaluation with the HUMAINE Framework

Abstract

The evaluation of large language models faces significant challenges. Technical benchmarks often lack real-world relevance, while existing human preference evaluations suffer from unrepresentative sampling, superficial assessment depth, and single-metric reductionism. To address these issues, we introduce HUMAINE, a framework for multidimensional, demographically aware measurement of human-AI interaction. We collected multi-turn, naturalistic conversations from 23,404 participants that were stratified across 22 demographic groups, both in the US and UK, to evaluate 28 state-of-the-art models across five human-centric dimensions. We use a hierarchical Bayesian Bradley-Terry-Davidson (BTD) model, with post-stratification to census data, and our analysis reveals three key insights. (1) We establish a clear performance hierarchy where google/gemini-2.5-pro ranks first overall, with a 95.6% posterior probability of being the top-ranked model. (2) We uncover significant preference heterogeneity, with user age emerging as the primary demographic axis of disagreement; a model’s perceived rank can shift substantially across age groups, exposing failures in generalisation that unrepresentative samples typically mask. (3) We quantify the vast difference in discriminative power across evaluation dimensions, with ambiguous qualities like Trust, Ethics & Safety showing a 65% tie rate, in stark contrast to the decisive 10% tie rate for Overall Winner. Our work emphasises the need for a more multidimensional, demographically aware perspective in LLM evaluation. We release our complete dataset, interactive leaderboard, and open-source framework.

Evaluation
Dataset

Study specs

Multi-turn naturalistic conversations from participants stratified across 22 demographic groups, analyzed using a hierarchical Bayesian Bradley-Terry-Davidson model with post-stratification to census data.
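
For readers unfamiliar with the Bradley-Terry-Davidson model, the sketch below shows how the standard Davidson extension of Bradley-Terry assigns probabilities to a win, a loss, and a tie for a single pairwise comparison. It is only an illustration of the textbook formula, not the authors' hierarchical Bayesian implementation; the function name `btd_probabilities`, the log-strength parameters `theta_i` and `theta_j`, and the tie parameter `nu` are assumptions introduced here.

```python
import math


def btd_probabilities(theta_i: float, theta_j: float, nu: float):
    """Return (P(i wins), P(j wins), P(tie)) for one comparison.

    Davidson's model with strengths p = exp(theta):
        P(i wins) = p_i / (p_i + p_j + nu * sqrt(p_i * p_j))
        P(tie)    = nu * sqrt(p_i * p_j) / (same denominator)
    Dividing every term by sqrt(p_i * p_j) leaves only the difference
    d = theta_i - theta_j, which is the form computed below.
    """
    if nu < 0:
        raise ValueError("tie parameter nu must be non-negative")
    d = theta_i - theta_j
    win_i = math.exp(d / 2.0)
    win_j = math.exp(-d / 2.0)
    denom = win_i + win_j + nu
    return win_i / denom, win_j / denom, nu / denom


# Hypothetical values: model i slightly stronger than model j, moderate tie propensity.
p_win, p_loss, p_tie = btd_probabilities(0.4, 0.0, nu=1.0)
print(f"win={p_win:.3f} loss={p_loss:.3f} tie={p_tie:.3f}")
```

With `nu = 0` the tie probability vanishes and the expression reduces to the ordinary Bradley-Terry model. A larger `nu` shifts probability mass toward ties, which is how a dimension with a high tie rate would show up in a model of this kind.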

Institution
Prolific
Sample Size
N=23,404
Study Type
Evaluation Study
Year
2026
Human Data Platform
Prolific

Measured Outcomes

Performance of 28 LLMs across five human-centric dimensions, accounting for demographic-specific preferences.
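
Post-stratification, as mentioned in the study specs, can be sketched as a population-weighted average of per-group estimates. The example below is a minimal illustration under stated assumptions: the function `poststratify`, the age-group labels, the per-group scores, and the population shares are all hypothetical and are not taken from the study or from census data.

```python
import math


def poststratify(cell_estimates: dict, population_shares: dict) -> float:
    """Weight each group's estimate by that group's share of the target population."""
    missing = set(population_shares) - set(cell_estimates)
    if missing:
        raise KeyError(f"no estimate for groups: {sorted(missing)}")
    total_share = sum(population_shares.values())
    if total_share <= 0:
        raise ValueError("population shares must sum to a positive value")
    weighted = sum(cell_estimates[g] * share for g, share in population_shares.items())
    # Normalise so the shares need not sum exactly to 1.
    return weighted / total_share


# Hypothetical per-group scores for one model and hypothetical population shares.
scores = {"18-34": 0.60, "35-54": 0.50, "55+": 0.40}
shares = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}
population_score = poststratify(scores, shares)
print(round(population_score, 3))
```

If the sample over-represents one group, the unweighted mean of responses drifts toward that group's preferences. Reweighting by population shares corrects for this, provided each group's estimate is itself reliable.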

Peer Review & Critical Discussion

3 threads

Potential Selection Bias in 2023 Cohort

Dr. Sarah J.
Verified PhD Candidate
12 replies

The participant pool shows a concerning overrepresentation of users from high-income demographics. Looking at Table 3, we can see that 78% of respondents had annual incomes above $75k, which significantly limits the generalizability of these findings to broader populations.

2 hours ago

Non-naive Participants Issue

M. Chen (OpenAI)
Data Scientist
8 replies

I've noticed a methodological concern regarding participant naivety. Given that Prolific users often complete multiple studies, there's a real risk that participants had prior exposure to similar experimental paradigms, which could confound the results.

5 hours ago

RLHF Applicability to This Study Design

Prof. R. Williams
Verified Researcher
15 replies

The implications for RLHF training pipelines are understated. If we accept the authors' conclusions about preference stability, this has direct consequences for how we should structure reward model training. The temporal decay effect described in Section 4.2 is particularly relevant.

1 day ago
