Unpacking Human Preference for LLMs: Demographically Aware Evaluation with the HUMAINE Framework
Abstract
The evaluation of large language models faces significant challenges. Technical benchmarks often lack real-world relevance, while existing human preference evaluations suffer from unrepresentative sampling, superficial assessment depth, and single-metric reductionism. To address these issues, we introduce HUMAINE, a framework for multidimensional, demographically aware measurement of human-AI interaction. We collected multi-turn, naturalistic conversations from 23,404 participants that were stratified across 22 demographic groups, both in the US and UK, to evaluate 28 state-of-the-art models across five human-centric dimensions. We use a hierarchical Bayesian Bradley-Terry-Davidson (BTD) model, with post-stratification to census data, and our analysis reveals three key insights. (1) We establish a clear performance hierarchy where google/gemini-2.5-pro ranks first overall, with a 95.6% posterior probability of being the top-ranked model. (2) We uncover significant preference heterogeneity, with user age emerging as the primary demographic axis of disagreement; a model’s perceived rank can shift substantially across age groups, exposing failures in generalisation that unrepresentative samples typically mask. (3) We quantify the vast difference in discriminative power across evaluation dimensions, with ambiguous qualities like Trust, Ethics & Safety showing a 65% tie rate, in stark contrast to the decisive 10% tie rate for Overall Winner. Our work emphasises the need for a more multidimensional, demographically aware perspective in LLM evaluation. We release our complete dataset, interactive leaderboard, and open-source framework.
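To make the role of ties concrete, here is a minimal sketch of the outcome probabilities in Davidson's tie extension of the Bradley-Terry model, which the BTD name refers to. It is not the authors' hierarchical Bayesian implementation: the function name, parameter names, and input values are illustrative assumptions, and the strengths are treated as fixed positive numbers rather than posterior draws.

```python
import math


def btd_probabilities(strength_a, strength_b, tie_param):
    """Return (P(A wins), P(B wins), P(tie)) under the Davidson tie model.

    strength_a, strength_b: positive model strengths (hypothetical values).
    tie_param: non-negative tie parameter; 0 recovers plain Bradley-Terry.
    """
    if strength_a <= 0 or strength_b <= 0:
        raise ValueError("strengths must be positive")
    if tie_param < 0:
        raise ValueError("tie_param must be non-negative")

    # Tie mass grows with the geometric mean of the two strengths.
    tie_mass = tie_param * math.sqrt(strength_a * strength_b)
    total = strength_a + strength_b + tie_mass
    return strength_a / total, strength_b / total, tie_mass / total


if __name__ == "__main__":
    # Illustrative inputs only; these are not values from the study.
    p_a, p_b, p_tie = btd_probabilities(2.0, 1.0, 0.5)
    print(f"P(A)={p_a:.3f}  P(B)={p_b:.3f}  P(tie)={p_tie:.3f}")
```

A larger tie parameter moves probability mass from both win outcomes into the tie outcome, which is one way to read a dimension with a high tie rate as less discriminative than one with a low tie rate.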
Study specs
Multi-turn, naturalistic conversations from participants stratified across 22 demographic groups, analyzed using a hierarchical Bayesian Bradley-Terry-Davidson model with post-stratification to census data.
- Institution: Prolific
- Discipline: Machine Learning, Artificial Intelligence
- Sample Size: N=23,404
- Study Type: Evaluation Study
- Year: 2026
- Human Data Platform: Prolific
Measured Outcomes
Performance of 28 LLMs across five human-centric dimensions, accounting for demographic-specific preferences.
Peer Review & Critical Discussion
Potential Selection Bias in 2023 Cohort
The participant pool shows a concerning overrepresentation of users from high-income demographics. Looking at Table 3, we can see that 78% of respondents had annual incomes above $75k, which significantly limits the generalizability of these findings to broader populations.
Non-naive Participants Issue
I've noticed a methodological concern regarding participant naivety. Given that Prolific users often complete multiple studies, there's a real risk that participants had prior exposure to similar experimental paradigms, which could confound the results.
RLHF Applicability to This Study Design
The implications for RLHF training pipelines are understated. If we accept the authors' conclusions about preference stability, this has direct consequences for how we should structure reward model training. The temporal decay effect described in Section 4.2 is particularly relevant.