Medium Sample Studies

This page lists 39 peer-reviewed papers tagged with Medium Sample in the Prolific Citations Library, a curated collection of research powered by high-quality human data from Prolific.

Papers (20 of 39)

On the conversational persuasiveness of GPT-4

Authors: F Salvi, M Horta Ribeiro, R Gallotti, R West

Year: 2025

Published in: Nature Human Behaviour, 2025 - nature.com

Institution: EPFL, Fondazione Bruno Kessler, Princeton University

Research Area: Conversational Persuasion of LLM, Human-Computer Interaction, Behavioral Science, Large Language Models

Discipline: Behavioral Science

GPT-4 can use personalized arguments to be more persuasive in debates, outperforming humans in 64.4% of AI-human comparisons when personalization is applied.

Methods: Preregistered controlled study involving multiround debates with random assignment to conditions focusing on AI-human comparisons, personalization, and opinion strength.

Key Findings: Effectiveness of persuasion by GPT-4, especially when using personalized arguments, compared to humans in debates.

Citations: 65

Sample Size: 900

Can large language models assess personality from asynchronous video interviews? A comprehensive evaluation of validity, reliability, fairness, and rating patterns

Authors: T Zhang, A Koutsoumpis, JK Oostrom

Year: 2025

Published in: IEEE Transactions ..., 2024 - ieeexplore.ieee.org

Institution: Southeast University, Vrije Universiteit, Tilburg University

Research Area: LLM Personality Assessment, Human-AI Interaction, Large Language Models

Discipline: Human-AI Interaction, Social Science, Humanities

LLMs like GPT-3.5 and GPT-4 can rival or outperform task-specific AI models in assessing personality traits from asynchronous video interviews, but show uneven performance, low reliability, and potential biases, warranting cautious use in high-stakes scenarios.

Methods: The study evaluated GPT-3.5 and GPT-4 performance in assessing personality traits and interview performance using simulated AVI responses, comparing them with ratings from task-specific AI and human annotators.

Key Findings: Validity, reliability, fairness, and rating patterns of LLMs (GPT-3.5 and GPT-4) in personality assessment from asynchronous video interviews.

Citations: 31

Sample Size: 685

Impact of annotator demographics on sentiment dataset labeling

Authors: Y Ding, J You, TK Machulla, J Jacobs, P Sen

Year: 2025

Published in: Proceedings of the ..., 2022 - dl.acm.org

Institution: University of California Irvine, University of Florida, State University of New York at Buffalo, University of Waterloo, Virginia Tech

Research Area: Computational Social Science, Human-Computer Interaction, Sentiment Analysis

Discipline: Computational Social Science

Demographic differences among annotators significantly affect sentiment dataset labels, causing up to a 4.5% accuracy difference in sentiment prediction models.

Methods: Crowdsourced annotations from >1000 workers combined with demographic data; analysis of multimodal sentiment datasets and evaluation using machine learning models.

Key Findings: Impact of annotator demographics on sentiment labeling and its effect on model predictions.

DOI: https://doi.org/10.1145/3555632

Citations: 28

Sample Size: 1000

Multi-turn Evaluation of Anthropomorphic Behaviours in Large Language Models

Authors: L Ibrahim, C Akbulut, R Elasmar, C Rastogi, M Kahng, MR Morris, KR McKee, V Rieser, M Shanahan, L Weidinger

Year: 2025

Published in: arXiv preprint arXiv:2502.07077, 2025 - arxiv.org

Institution: Google DeepMind, Google, University of Oxford

Research Area: Multimodal Conversational AI, Conversational AI, Evaluation Methodology, Benchmarking

Discipline: Computer Science, Natural Language Processing, Human-Computer Interaction

The paper evaluates anthropomorphic behaviors in state-of-the-art LLMs using a multi-turn methodology, showing that such behaviors, including empathy and relationship-building, predominantly emerge after multiple interactions and influence user perceptions.

Methods: Multi-turn evaluation of 14 anthropomorphic behaviors using simulations of user interactions, validated by a large-scale human subject study.

Key Findings: Anthropomorphic behaviors in large language models, including relationship-building and pronoun usage, and their perception by users.

Citations: 26

Sample Size: 1101

How do people react to political bias in generative Artificial Intelligence?

Authors: U Messer

Year: 2025

Published in: Computers in Human Behavior: Artificial Humans, 2025 - Elsevier

Institution: Universität der Bundeswehr München

Research Area: Political Bias in Generative AI, Human-AI Interaction, Affective Computing, AI Bias

Discipline: Computer Science, Human-AI Interaction

People's acceptance of and reliance on generative AI (GAI) increase when they perceive alignment between their political orientation and the bias of GAI-generated content, leading to expanded trust in sensitive applications.

Methods: Three experiments analyzing behavioral reactions to politically biased content generated by GAI, including the impact of perceived alignment on acceptance and trust.

Key Findings: Participants' acceptance of, reliance on, and trust in GAI based on perceived alignment between the political bias of GAI-generated content and their own political beliefs.

DOI: https://doi.org/10.1016/j.chbah.2024.100108

Citations: 24

Sample Size: 513

Impact of Tone-Aware Explanations in Recommender Systems

Authors: A Okoso, K Otaki, S Koide, Y Baba

Year: 2025

Published in: ACM Transactions on Recommender Systems, 2025 - dl.acm.org

Institution: Toyota Central R and D Labs, Toyota

Research Area: Human-Computer Interaction

Discipline: Machine Learning, Artificial Intelligence

The study demonstrates that tailoring the tone of textual explanations in recommender systems to domains and user attributes, such as age and personality traits, can enhance users' perceptions and engagement.

Methods: Two online user studies: (1) 470 participants evaluated synthetic explanations with six tones across three domains (movies, hotels, and home products), (2) 103 participants engaged with a real-world dataset from the hotel domain using a personalized recommender system.

Key Findings: The perceived effects of different textual explanation tones on users, examined across domains (movies, hotels, home products) and user attributes (e.g., age, personality traits).

DOI: https://doi.org/10.1145/3718101

Citations: 13

Sample Size: 573

The Impact of Generative AI on Social Media: An Experimental Study

Authors: AG Møller, DM Romero, D Jurgens

Year: 2025

Published in: arXiv preprint arXiv ..., 2025 - arxiv.org

Institution: University of Copenhagen, University of Michigan, Pioneer Centre for AI

Research Area: Generative AI, Social Media, Human-Computer Interaction

Discipline: Computational Social Science

Generative AI tools on social media increase user engagement and content volume but reduce perceived quality and authenticity in discussions, highlighting challenges for ethical integration.

Methods: Controlled experiment with participants assigned to small discussion groups under distinct AI-assisted treatment conditions including chat assistance, conversation starters, feedback on comment drafts, and reply suggestions.

Key Findings: Impact of generative AI tools on user behavior, engagement, content volume, perceived quality, and authenticity in social media interactions.

DOI: https://doi.org/10.48550/arXiv.2506.14295

Citations: 9

Sample Size: 680

The Daily Lives of Crowdsourced US Respondents: A Time Use Comparison of MTurk, Prolific, and ATUS

Authors: RG Rinderknecht, L Doan

Year: 2025

Published in: Sociological ..., 2025 - journals.sagepub.com

Institution: RAND

Research Area: Crowdsourcing, Time Use Studies, Social Science

Discipline: Artificial Intelligence

Time use patterns of MTurk and Prolific respondents differ significantly from the general U.S. population (ATUS), including less housework and care work, more time at home and alone, even after accounting for demographic differences.

Methods: Time diaries were collected and analyzed for 136 MTurk and 156 Prolific respondents, then compared with 468 ATUS responses.

Key Findings: Daily time use patterns including work, housework, travel, leisure, and time spent alone or at home.

Citations: 6

Sample Size: 760

When AI is Fairer Than Humans: The Role of Egocentrism in Moral and Fairness Judgments of AI and Human Decisions

Authors: K Miazek, K Bocian

Year: 2025

Published in: Computers in Human Behavior Reports, 2025 - Elsevier

Institution: SWPS University

Research Area: Moral and Fairness Judgments of AI, Human Behavior, Egocentrism

Discipline: Social Science, Artificial Intelligence

The study found that egocentric biases influence fairness judgments, favoring decisions beneficial to self-interest, and that this bias is weaker for AI compared to human agents due to reduced perceived mind and liking for AI.

Methods: Three experiments with manipulated self-interest conditions analyzed perceptions of fairness and morality in decisions made by AI versus human agents using Prolific US samples.

Key Findings: Fairness and moral judgments in financial decision-making by AI and human agents, moderated by self-interest and social perceptions.

DOI: https://doi.org/10.1016/j.chbr.2025.100719

Citations: 6

Sample Size: 1880

Sycophantic AI decreases prosocial intentions and promotes dependence

Authors: M Cheng, C Lee, P Khadpe, S Yu, D Han

Year: 2025

Published in: arXiv preprint arXiv ..., 2025 - arxiv.org

Institution: Stanford University, Carnegie Mellon University

Research Area: Computer Science, Artificial Intelligence, Sycophancy

Discipline: Computer Science, Psychology

The study shows that sycophantic AI, which validates user inputs unquestioningly, reduces people's prosocial behavior and fosters dependence, despite users perceiving such AI as higher quality and more trustworthy.

Methods: The researchers conducted two preregistered experiments including a live-interaction study, where participants discussed real interpersonal conflicts with AI models. They evaluated responses from 11 state-of-the-art AI models on levels of sycophancy and its psychological effects on users.

Key Findings: The prevalence of sycophantic behavior in AI, users' prosocial intentions, conviction of being in the right, trust in AI, and willingness to reuse sycophantic AI models.

Citations: 5

Sample Size: 1604

Impact of AI-Assisted Diagnosis on American Patients' Trust in and Intention to Seek Help From Health Care Professionals: Randomized, Web-Based Survey ...

Authors: C Chen, Z Cui

Year: 2025

Published in: Journal of Medical Internet Research, 2025 - jmir.org

Institution: Medical College of Wisconsin

Research Area: Trust in AI, AI-assisted diagnosis, Health communication, Healthcare human-AI interaction

Discipline: Digital Health, Human-Computer Interaction, Behavioral Science

Patients place more trust in, and are more likely to seek help from, doctors who explicitly avoid AI-assisted diagnosis than doctors who use AI extensively or moderately, highlighting a strong aversion to AI in healthcare settings.

Methods: A randomized, web-based 4-group survey experiment was conducted with controls for sociodemographic factors and analysis using regression, mediation, and moderation techniques.

Key Findings: Trust in and intention to seek medical help from health care professionals using AI-assisted diagnosis versus those avoiding AI, and the influence of demographic, social, and experiential factors.

DOI: https://doi.org/10.2196/66083

Citations: 4

Sample Size: 1762

The Viability of Crowdsourcing for RAG Evaluation

Authors: L Gienapp, T Hagen, M Fröbe, M Hagen, B Stein, M Potthast, H Scells

Year: 2025

Published in: arXiv

Institution: Bauhaus-Universität Weimar, Friedrich-Schiller-Universität Jena, Leipzig University, University of Kassel, ScaDS.AI, hessian.AI

Research Area: Crowdsourcing, RAG Evaluation, Artificial Intelligence, AI Evaluation, RAG

Discipline: Artificial Intelligence

The study investigates the feasibility of using crowdsourcing for RAG evaluation, finding that human pairwise judgments are reliable and cost-effective compared to LLM-based or automated methods.

Methods: Two complementary studies on response writing and response utility judgment using 903 human-written and 903 LLM-generated responses for 301 topics; pairwise judgments across seven utility dimensions were collected via human and LLM evaluators.

Key Findings: Human effectiveness in writing and judging responses in RAG scenarios, considering discourse styles and utility dimensions like coverage and coherence.

Citations: 4

Sample Size: 903

Unlocking creativity with Artificial Intelligence: Field and experimental evidence on the Goldilocks (curvilinear) effect of human-AI collaboration

Authors: HCB Huang

Year: 2025

Published in: Journal of Experimental Psychology: General, 2025 - psycnet.apa.org

Institution: University of British Columbia

Research Area: Human-AI Collaboration, Creativity, Experimental Psychology

Discipline: Experimental Psychology

Moderate levels of human-AI collaboration enhance creative performance due to increased knowledge diversity, but excessive or minimal involvement diminishes this effect.

Methods: Two experiments assigned 139 business professionals and 319 working adults to collaborate with ChatGPT at varying levels, and a follow-up survey among 188 creative industry workers was conducted to replicate findings.

Key Findings: The impact of varying degrees of human-AI collaboration on creative performance, evaluated by human judges, entrepreneurs, and AI metrics.

Citations: 3

Sample Size: 646

Lay Perceptions of Algorithmic Discrimination in the Context of Systemic Injustice

Authors: G Lima, N Grgić-Hlača, M Langer, Y Zou

Year: 2025

Published in: Proceedings of the 2025 CHI ..., 2025 - dl.acm.org

Institution: University of Maryland, Max Planck Institute, Stanford University, Cornell University

Research Area: Algorithmic Fairness, Systemic Injustice, Social Perception of AI, Algorithmic Discrimination

Discipline: Computational Social Science

The study examines how contextualizing algorithms within systemic injustice impacts perceptions of algorithmic discrimination, finding disparate effects based on participant group identity and revealing unintended consequences of such contextualization.

Methods: 2x3 between-participants experiment using the hiring context as a case-study; examined the influence of systemic injustice information and algorithms' bias perpetuation on lay perceptions.

Key Findings: Impact of systemic injustice framing and explanation of algorithmic bias perpetuation on participants' views of algorithmic fairness and discrimination.

DOI: https://doi.org/10.1145/3706598.3713536

Citations: 2

Sample Size: 716

Benchmarking World-Model Learning

Authors: A Warrier, D Nguyen, M Naim, M Jain, Y Liang, K Schroeder, C Yang, JB Tenenbaum, S Vollmer, K Ellis, Z Tavares

Year: 2025

Published in: arXiv preprint arXiv …, 2025 - arxiv.org

Institution: Basis Research Institute, DFKI GmbH, Harvard University, Quebec AI Institute, University of Cambridge, Massachusetts Institute of Technology, Cornell University

Research Area: Agent learning, World Models, Benchmarking, Evaluation protocols, Reinforcement Learning from Human Feedback (RLHF), Large Language Models

Discipline: Computer Science, Artificial Intelligence, Machine Learning

The paper introduces WorldTest, a novel protocol for evaluating model-learning agents using reward-free exploration and behavior-based scoring, and demonstrates that humans outperform models on the AutumnBench suite of tasks, revealing significant gaps in world-model learning.

Methods: The authors proposed WorldTest, a protocol separating reward-free interaction from scored tests in related environments, with evaluations done using AutumnBench—a dataset of 43 grid-world environments and 129 tasks across prediction, planning, and causal dynamics.

Key Findings: Performance of model-learning agents and humans in acquiring world models for masked-frame prediction, planning, and understanding causal dynamics.

Citations: 1

Sample Size: 517

Caution when Crowdsourcing: Prolific as a Superior Platform Compared with MTurk

Authors: D O'Connell, A Bautista

Year: 2025

Published in: ... Student Journal of ..., 2025 - journals.library.columbia.edu

Institution: University of Houston, Webster University

Research Area: Crowdsourcing Research Methodology, Human-Computer Interaction

Discipline: Computational Social Science, Behavioral Research Methods

Prolific outperforms MTurk in participant data quality and affordability for online survey-based research.

Methods: Data from participants recruited via MTurk and Prolific were analyzed for cost, attention measures, participation duration, and internal consistency.

Key Findings: Comparison of data quality and cost-effectiveness between MTurk and Prolific for online survey recruitment.

Citations: 1

Sample Size: 699

Factors Shaping Perceptions of AI Tools Among a Nationally Representative Sample of US Adults

Authors: B Katz, N Abdelgawad, D Friedberg, P Roberts, S Misra

Year: 2025

Published in: Innovation in Aging, 2025 - pmc.ncbi.nlm.nih.gov

Institution: Virginia Tech

Research Area: Human-AI Interaction, Technology Perception

Discipline: Behavioral Science

Age significantly influences perceptions of generative AI tools, with older individuals perceiving more benefits and fewer risks compared to younger individuals; thinking dispositions also play a role.

Methods: A nationally representative survey of US adults conducted via the Prolific platform using various AI-relevant scales, including attitudes, risks, benefits, frequency of use, expertise, and literacy assessments.

Key Findings: Demographic factors, industry types, thinking dispositions, and attitudes toward generative AI tools, including risk and utility perceptions.

Citations: 1

Sample Size: 500

Scaling Laws for Economic Productivity: Experimental Evidence in LLM‑Assisted Consulting, Data Analyst, and Management Tasks

Authors: Ali Merali

Year: 2025

Published in: arXiv

Institution: Yale University

Research Area: LLM-Assisted Economic Productivity, Consulting, Data Analysis

Discipline: Economics, Artificial Intelligence

The paper identifies scaling laws linking LLM training compute to professional productivity gains, showing an 8% annual reduction in task time influenced by both compute and algorithmic advances, but with uneven impacts across task types.

Methods: A preregistered experiment involving professional tasks completed by consultants, data analysts, and managers using 13 different LLMs.

Key Findings: Economic productivity impacts of LLMs in professional settings, time savings across task categories, and contribution of compute versus algorithmic progress.

Citations: 1

Sample Size: 500

To Mask or to Mirror: Human-AI Alignment in Collective Reasoning

Authors: C Qian, AT Parisi, C Bouleau, V Tsai

Year: 2025

Published in: Proceedings of the ..., 2025 - aclanthology.org

Institution: Google, Google DeepMind

Research Area: Human-AI Alignment, Collective Reasoning, Social Biases, LLM Simulation of Human Behavior, AI Bias

Discipline: Natural Language Processing, Artificial Intelligence, Computational Social Science

This study examines human-AI alignment in collective reasoning using an empirical framework, demonstrating how LLMs either mirror or mask human biases depending on context, cues, and model-specific inductive biases.

Methods: The study uses the Lost at Sea social psychology task in a large-scale online experiment, simulating LLM groups conditioned on human decision-making data across varying conditions of visible or pseudonymous demographics.

Key Findings: Alignment of LLM behavior with human social reasoning, focusing on collective decision-making and biases in group interactions.

Citations: 1

Sample Size: 748

Whose view of safety? A deep dive dataset for pluralistic alignment of text-to-image models

Authors: C Rastogi, TH Teh, P Mishra, R Patel, D Wang, M Díaz, A Parrish, AM Davani, Z Ashwood

Year: 2025

Published in: arXiv preprint arXiv:2507.13383, 2025 - arxiv.org

Institution: Google DeepMind, Google Research, Google

Research Area: AI alignment, safety evaluation, AI Safety, Multimodal evaluation, Human-AI Interaction, Large Language Models

Discipline: Computer Science, Machine Learning, Artificial Intelligence

This research introduces the DIVE dataset to enable pluralistic alignment in text-to-image models by accounting for diverse safety perspectives, revealing demographic variations in harm perception and advancing T2I model alignment strategies.

Methods: The study involved collecting feedback across 1000 prompts from demographically intersectional human raters to capture diverse safety perspectives, with an emphasis on empirical and contextual differences in harm perception.

Key Findings: Safety perceptions of text-to-image (T2I) model outputs from diverse demographic viewpoints and the influence of these perspectives on alignment strategies.

Citations: 1

Sample Size: 1000