This page lists 43 peer-reviewed papers tagged with AI Safety in the Prolific Citations Library, a curated collection of research powered by high-quality human data from Prolific.
-
Authors: N Petrova, A Gordon, E Blindow
Year: 2026
Published in: Open review
Institution: Prolific
Research Area: Human-centered AI evaluation, Bayesian statistics, Responsible AI, AI alignment, LLM Evaluation
Discipline: Machine Learning, Artificial Intelligence
The study introduces HUMAINE, a multidimensional evaluation framework for LLMs, revealing demographic-specific preference variations and ranking google/gemini-2.5-pro as the top-performing model with a posterior probability of 95.6%.
Methods: Multi-turn naturalistic conversations analyzed using a hierarchical Bayesian Bradley-Terry-Davidson model with post-stratification to census data, stratified across 22 demographic groups.
Key Findings: Performance of 28 LLMs across five human-centric dimensions, accounting for demographic-specific preferences.
Sample Size: 23,404
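The analysis behind this ranking is a hierarchical Bayesian Bradley-Terry-Davidson model with post-stratification, which is well beyond a few lines of code, but a minimal classical Bradley-Terry fit illustrates the core step of turning pairwise preferences into model strengths. The function name and win-matrix layout below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def bradley_terry_strengths(wins: np.ndarray, n_iter: int = 500) -> np.ndarray:
    """Minimal Bradley-Terry fit via the classic MM updates.

    wins[i, j] = number of comparisons in which model i was preferred to model j.
    Returns strengths normalised to sum to 1; higher means more often preferred.
    """
    n = wins.shape[0]
    games = wins + wins.T               # total comparisons per model pair
    p = np.ones(n) / n                  # initial strengths
    for _ in range(n_iter):
        new_p = np.zeros(n)
        for i in range(n):
            denom = sum(games[i, j] / (p[i] + p[j]) for j in range(n) if j != i)
            new_p[i] = wins[i].sum() / denom if denom > 0 else p[i]
        p = new_p / new_p.sum()
    return p

# Toy example: three models, model 0 preferred most often.
wins = np.array([[0, 8, 9],
                 [2, 0, 6],
                 [1, 4, 0]], dtype=float)
print(bradley_terry_strengths(wins))
```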
-
Authors: S Shekar, P Pataranutaporn, C Sarabu, GA Cecchi
Year: 2025
Published in: NEJM AI, 2025 - ai.nejm.org
Institution: MIT Media Lab, IBM Research, Stanford University, Massachusetts Institute of Technology
Research Area: AI Ethics, Healthcare, Patient Trust, Medical Misinformation
Discipline: Artificial Intelligence, Human-Computer Interaction (HCI), AI Ethics
This paper presents a study by MIT researchers showing that patients trust AI-generated medical advice even when that advice is incorrect, raising concerns about misinformation in healthcare.
Citations: 19
-
Authors: K Zhou, JD Hwang, X Ren, N Dziri
Year: 2025
Published in: Proceedings of the ..., 2025 - aclanthology.org
Institution: Stanford University, University of Southern California, Carnegie Mellon University, Allen Institute for AI
Research Area: Human-LM Reliance, Interaction-Centered Framework, Human-Computer Interaction (HCI)
Discipline: Human-Computer Interaction (HCI), Artificial Intelligence
The study introduces Rel-A.I., an interaction-centered evaluation approach to measure human reliance on LLM responses, revealing that politeness and interaction context significantly influence user reliance.
Methods: Nine user studies in which participants interacted with LLMs, analyzing how communication features such as politeness and interaction context influence user reliance.
Key Findings: The degree of human reliance on LLM responses based on communication style (e.g., politeness) and interaction context (e.g., knowledge domain, prior interactions).
Citations: 18
Sample Size: 450
-
Authors: A Dahlgren Lindström, L Methnani, L Krause
Year: 2025
Published in: Ethics and Information ..., 2025 - Springer
Institution: Umeå University, Vrije Universiteit Amsterdam
Research Area: AI Alignment, AI Safety, Reinforcement Learning from Human Feedback (RLHF), Sociotechnical Systems
Discipline: Artificial Intelligence, Ethics
The paper critiques AI alignment efforts using RLHF and RLAIF, highlighting theoretical and practical limitations in meeting the goals of helpfulness, harmlessness, and honesty, and advocates for a broader sociotechnical approach to AI safety and ethics.
Methods: Sociotechnical critique of RLHF techniques with an analysis of theoretical frameworks and practical implementations.
Key Findings: The alignment of AI systems with human values and the efficacy of RLHF techniques in achieving the HHH principle (helpfulness, harmlessness, honesty).
DOI: https://doi.org/10.1007/s10676-025-09837-2
Citations: 14
-
Authors: L Muttenthaler, K Greff, F Born, B Spitzer, S Kornblith
Year: 2025
Published in: Nature, 2025 - nature.com
Institution: Google DeepMind, Google, Machine Learning Group, Technische Universität Berlin, BIFOLD, Berlin Institute for the Foundations of Learning and Data, Max Planck Institute
Research Area: Cognitive Alignment, Computer Vision, Multi-level Conceptual Knowledge
Discipline: Artificial Intelligence, Cognitive Science
This paper presents a method for aligning machine vision model representations with human visual similarity judgments across different abstraction levels, improving how well models reflect human perceptual and conceptual organization and enhancing generalization and uncertainty prediction.
Citations: 11
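As a loose illustration of what it means for a model to "reflect human perceptual organization", the sketch below computes a simple representational-similarity score: the Spearman correlation between a model's pairwise embedding similarities and human similarity ratings over the same items. This is a generic diagnostic under assumed inputs, not the alignment method the paper proposes.

```python
import numpy as np
from scipy.stats import spearmanr
from scipy.spatial.distance import pdist

def human_alignment_score(embeddings: np.ndarray, human_sim: np.ndarray) -> float:
    """Spearman correlation between model and human pairwise similarities.

    embeddings: (n_items, dim) model representations of the same n items.
    human_sim:  (n_items, n_items) symmetric matrix of human similarity ratings.
    """
    model_sim = 1.0 - pdist(embeddings, metric="cosine")   # condensed upper triangle
    iu = np.triu_indices(human_sim.shape[0], k=1)
    rho, _ = spearmanr(model_sim, human_sim[iu])
    return rho

# Toy usage with random data standing in for real embeddings and ratings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(10, 64))
ratings = rng.uniform(size=(10, 10))
ratings = (ratings + ratings.T) / 2
print(human_alignment_score(emb, ratings))
```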
-
Authors: M Cheng, C Lee, P Khadpe, S Yu, D Han
Year: 2025
Published in: arXiv preprint arXiv ..., 2025 - arxiv.org
Institution: Stanford University, Carnegie Mellon University
Research Area: Computers and Society, Artificial Intelligence, Sycophancy
Discipline: Computer Science, Psychology
The study shows that sycophantic AI, which validates user inputs unquestioningly, reduces people's prosocial behavior and fosters dependence, despite users perceiving such AI as higher quality and more trustworthy.
Methods: The researchers conducted two preregistered experiments, including a live-interaction study in which participants discussed real interpersonal conflicts with AI models, and evaluated responses from 11 state-of-the-art AI models for sycophancy and its psychological effects on users.
Key Findings: The prevalence of sycophantic behavior in AI, users' prosocial intentions, conviction of being in the right, trust in AI, and willingness to reuse sycophantic AI models.
Citations: 5
Sample Size: 1,604
-
Authors: A Karamolegkou, O Eberle, P Rust, C Kauf, A Søgaard
Year: 2025
Published in: ArXiv
Institution: Aleph Alpha, Massachusetts Institute of Technology
Research Area: Adversarial Ambiguity, Language Model Evaluation, Artificial Intelligence, Computation and Language, LLM, AI Evaluation, Red Teaming
Discipline: Natural Language Processing
The paper assesses language models' sensitivity to ambiguity using an adversarial dataset and finds that direct prompting poorly identifies ambiguity, while linear probes achieve high accuracy in decoding ambiguity from model representations.
Methods: An adversarial ambiguity dataset was introduced with various types of ambiguities and transformations; models were tested using direct prompts and linear probes trained on internal representations.
Key Findings: Language models' ability to detect ambiguity, including syntactic, lexical, and phonological types, as well as performance under adversarial variations.
Citations: 2
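The contrast this entry draws between direct prompting and linear probes can be made concrete with a small probing sketch: a logistic-regression classifier trained on frozen hidden-state vectors to predict whether a sentence is ambiguous. The data layout and stand-in features below are assumptions, not the paper's pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def train_ambiguity_probe(hidden_states: np.ndarray, labels: np.ndarray):
    """Fit a linear probe on frozen LM hidden states.

    hidden_states: (n_examples, hidden_dim) representations, e.g. the final-layer
                   activation at the last token of each sentence.
    labels:        (n_examples,) 1 if the sentence is ambiguous, else 0.
    """
    X_tr, X_te, y_tr, y_te = train_test_split(
        hidden_states, labels, test_size=0.2, random_state=0, stratify=labels
    )
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe, accuracy_score(y_te, probe.predict(X_te))

# Toy stand-in data; in practice these come from the LM being probed.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 768))
y = rng.integers(0, 2, size=200)
probe, acc = train_ambiguity_probe(X, y)
print(f"held-out probe accuracy: {acc:.2f}")
```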
-
Authors: Y Zhang, J Pang, Z Zhu, Y Liu
Year: 2025
Published in: arXiv preprint arXiv:2506.06991, 2025 - arxiv.org
Institution: Rutgers University, University of California Santa Cruz
Research Area: Artificial Intelligence, Computational Social Science
Discipline: Computational Social Science
The paper proposes a training-free scoring mechanism using peer prediction to detect and mitigate LLM-assisted cheating in crowdsourced annotation tasks, with theoretical guarantees and empirical validation.
Methods: A peer prediction-based mechanism quantifies correlations between worker answers while conditioning on LLM-generated labels, without requiring ground truth or high-dimensional training data.
Key Findings: Detection of LLM-assisted low-effort cheating in crowdsourced annotation tasks, focusing on theoretical effectiveness and empirical robustness.
DOI: https://doi.org/10.48550/arXiv.2506.06991
Citations: 1
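A rough feel for a correlation-style peer-prediction score, conditioning worker agreement on the LLM-generated label so that simply copying the LLM earns nothing, is given by the sketch below. It is a simplified stand-in under our own assumptions and does not carry the theoretical guarantees described in the paper.

```python
import numpy as np

def conditioned_agreement_score(a: np.ndarray, b: np.ndarray, llm: np.ndarray) -> float:
    """Agreement between two workers beyond what the LLM label already explains.

    a, b: binary answers of two workers on the same tasks.
    llm:  binary LLM-generated labels for those tasks.
    Within each LLM-label stratum we compare observed agreement with the agreement
    expected if the workers answered independently; a worker who copies the LLM
    verbatim therefore scores (near) zero.
    """
    score, total = 0.0, 0
    for v in np.unique(llm):
        mask = llm == v
        if mask.sum() < 2:
            continue
        ai, bi = a[mask], b[mask]
        observed = np.mean(ai == bi)
        expected = ai.mean() * bi.mean() + (1 - ai.mean()) * (1 - bi.mean())
        score += mask.sum() * (observed - expected)
        total += mask.sum()
    return score / total if total else 0.0

# Toy usage: two diligent workers vs. one who just copies the LLM label.
rng = np.random.default_rng(0)
truth = rng.integers(0, 2, 300)
llm = np.where(rng.random(300) < 0.7, truth, 1 - truth)   # noisy LLM labels
w1 = np.where(rng.random(300) < 0.9, truth, 1 - truth)    # diligent worker
w2 = np.where(rng.random(300) < 0.9, truth, 1 - truth)    # diligent worker
print(conditioned_agreement_score(w1, w2, llm))            # clearly positive
print(conditioned_agreement_score(llm.copy(), w2, llm))    # near zero
```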
-
Authors: A Qian, R Shaw, L Dabbish, J Suh, H Shen
Year: 2025
Published in: arXiv preprint arXiv ..., 2025 - arxiv.org
Institution: Carnegie Mellon University, University of Pittsburgh, University of Utah, Yale School of Medicine, Yale University
Research Area: Responsible AI, Content Moderation, Risk Disclosure, Worker Well-being in Human-Computer Interaction (HCI).
Discipline: Computational Social Science, Human-Computer Interaction (HCI)
The paper examines how task designers approach well-being risk disclosure in Responsible AI (RAI) content work, highlighting a need for better frameworks to communicate such risks effectively.
Methods: Interviews were conducted with 23 task designers from academic and industry sectors to gather insights on risk recognition, interpretation, and communication practices.
Key Findings: How task designers recognize, interpret, and communicate well-being risks in RAI content work.
Citations: 1
Sample Size: 23
-
Authors: C Rastogi, TH Teh, P Mishra, R Patel, D Wang, M Díaz, A Parrish, AM Davani, Z Ashwood
Year: 2025
Published in: arXiv preprint arXiv:2507.13383, 2025 - arxiv.org
Institution: Google DeepMind, Google Research, Google
Research Area: AI alignment, safety evaluation, AI Safety, Multimodal evaluation, Human–AI interaction, LLM
Discipline: Computer Science, Machine Learning, Artificial Intelligence
This research introduces the DIVE dataset to enable pluralistic alignment in text-to-image models by accounting for diverse safety perspectives, revealing demographic variations in harm perception and advancing T2I model alignment strategies.
Methods: The study involved collecting feedback across 1000 prompts from demographically intersectional human raters to capture diverse safety perspectives, with an emphasis on empirical and contextual differences in harm perception.
Key Findings: Safety perceptions of text-to-image (T2I) model outputs from diverse demographic viewpoints and the influence of these perspectives on alignment strategies.
Citations: 1
Sample Size: 1000
-
Authors: K Zhou
Year: 2025
Published in: 2025 - search.proquest.com
Institution: Stanford University
Research Area: Human-Centered Natural Language Interfaces (NLI)
Discipline: Artificial Intelligence
The research explores how to safely design natural language interfaces in AI by identifying safety risks, proposing a harm-focused evaluation framework, and advocating for a broader consideration of user needs.
Methods: The study includes a review of LLM safety risks, development of a harm-based evaluation framework, and conceptual exploration of broadening NLP research to underrepresented user needs.
Key Findings: Safety risks in LLM communication, behavioral impacts of human-LM interactions, and gaps in NLP addressing diverse user needs.
-
Authors: Paresh Chaudhary, Yancheng Liang, Daphne Chen, Simon S. Du, Natasha Jaques
Year: 2025
Published in: ArXiv
Institution: University of Washington
Research Area: Human-AI Coordination, Zero-Shot Coordination, Adversarial Training, Generative Models
Discipline: Artificial Intelligence, Human-Computer Interaction (HCI)
The paper introduces GOAT, a novel framework combining pretrained generative models and adversarial training to improve human-AI coordination, achieving state-of-the-art performance on the Overcooked benchmark with real human partners.
Methods: The study utilized a frozen pretrained generative model to simulate cooperative agent policies and applied adversarial training to dynamically generate challenging human-AI interaction scenarios for training.
Key Findings: The effectiveness of GOAT in generalizing human-AI coordination strategies and its performance on the Overcooked benchmark.
-
Authors: Elyas Meguellati, Assad Zeghina, Shazia Sadiq, Gianluca Demartini
Year: 2025
Published in: ArXiv
Institution: University of Queensland, University of Strasbourg
Research Area: Natural Language Processing, Harmful Content Detection
Discipline: Natural Language Processing
The paper introduces an approach using LLM-based semantic augmentation for harmful content detection on social media, achieving performance comparable to human-annotated models but at reduced cost.
Methods: The researchers utilize LLMs to clean noisy text and generate explanations for context-rich preprocessing, then evaluate the augmented training sets on multiple high-context datasets, including the SemEval 2024 persuasive meme, Google Jigsaw toxic comments, and Facebook hateful memes datasets.
Key Findings: The efficacy of LLM-based semantic augmentation in enhancing training sets for social media tasks such as propaganda detection, hateful meme classification, and toxicity identification.
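One way such semantic augmentation could be wired up is sketched below: an LLM call cleans each noisy post and appends a one-sentence rationale, and the augmented text feeds a standard classifier. The `llm_generate` helper is a hypothetical placeholder for whichever model client is used; nothing here is the authors' implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def llm_generate(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; swap in a real model client.
    It echoes the text after the final newline so the sketch runs end to end."""
    return prompt.rsplit("\n", 1)[-1]

def augment(post: str) -> str:
    """Clean a noisy post and append a one-sentence LLM rationale as extra context."""
    cleaned = llm_generate("Rewrite this post in plain, well-formed English:\n" + post)
    rationale = llm_generate("In one sentence, explain what this post is trying to do:\n" + cleaned)
    return cleaned + " [RATIONALE] " + rationale

def train_harm_classifier(posts: list[str], labels: list[int]):
    """Train a simple harmful-content classifier on the LLM-augmented text."""
    augmented = [augment(p) for p in posts]
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    return model.fit(augmented, labels)

# Toy usage with two-class stand-in data.
clf = train_harm_classifier(
    ["u r all idiots lol", "lovely weather today :)"] * 10,
    [1, 0] * 10,
)
print(clf.predict(["what a nice morning"]))
```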
-
Authors: C Heath, JM Williams, D Leightley
Year: 2025
Published in: JMIR mHealth and ..., 2025 - mhealth.jmir.org
Institution: Swansea University, King's College London, Reykjavík University
Research Area: mHealth Interventions, Crowdsourcing, Social Media Recruitment, Mental Health Research (PTSD, Harmful Gambling)
Discipline: Digital Health, Mental Health Research
Social media and online platforms such as Facebook and Prolific were effective, but recruiting and retaining military veterans with PTSD or harmful gambling for a digital mHealth intervention pilot study remained challenging.
Methods: Multiple recruitment strategies were used, including paid and unpaid advertisements on Facebook, Prolific, direct mailing, event hosting with veterans' charities, snowball sampling, and incentives.
Key Findings: The effectiveness of different recruitment strategies for enrolling military veterans with PTSD or harmful gambling into a digital intervention study.
Sample Size: 79
-
Authors: K Grosse, N Ebert
Year: 2025
Published in: ArXiv
Institution: IBM Research, ZHAW
Research Area: Security and privacy risks, LLM, human–AI interaction, AI Safety
Discipline: Computer Science
A survey of 3,270 UK adults reveals significant security and privacy risks in AI conversational agent usage: a third engage in risky behaviors that enable attacks, and many are unaware of how their data are used or of how to opt out.
Methods: Representative survey conducted via Prolific platform targeting UK adults, focusing on usage behaviors of AI conversational agents.
Key Findings: User behaviors related to security and privacy risks, data sanitization practices, attempts to jailbreak AI models, and awareness of data usage policies.
Sample Size: 3,270
-
Authors: TR McIntosh, T Susnjak, T Liu, P Watters
Year: 2024
Published in: ... on Cognitive and ..., 2024 - ieeexplore.ieee.org
Institution: Cyberoo, Massey University, Cyberstronomy, RMIT University
Research Area: Semantic Vulnerabilities in LLMs, Ideological Manipulation, Reinforcement Learning from Human Feedback (RLHF) Limitations
Discipline: Computer Science, Artificial Intelligence, Machine Learning
RLHF mechanisms are insufficient to prevent semantic manipulation of LLMs, allowing them to express extreme ideological viewpoints when subjected to targeted conditioning techniques.
Methods: Psychological semantic conditioning techniques were applied to assess the susceptibility of LLMs to ideological manipulation.
Key Findings: The ability of LLMs to resist or adopt extreme ideological viewpoints under semantic conditioning.
Citations: 219
-
Authors: T Haesevoets, B Verschuere, R Van Severen
Year: 2024
Published in: Government Information ..., 2024 - Elsevier
Institution: Ghent University, KU Leuven
Research Area: Public Sector AI, Citizen Perception, AI Ethics, Transparency
Discipline: Political Science, Public Administration
Citizens in the UK prefer AI to play a supporting role in public sector decisions rather than making decisions autonomously, with greater acceptance in contexts that are less ideologically charged.
Methods: Three studies surveying UK respondents on their perceptions of AI involvement in public sector decision-making.
Key Findings: Perception of AI's role in decision-making, its legitimacy compared to human decision-makers, and suitability for various types of decisions.
DOI: https://doi.org/10.1016/j.giq.2023.101906
Citations: 54
-
Authors: PW Mirowski, J Love, K Mathewson, S Mohamed
Year: 2024
Published in: ArXiv
Institution: Google DeepMind, Google
Research Area: AI Creativity, Humor Generation, Human-Computer Interaction (HCI)
Discipline: Artificial Intelligence
Professional comedians found LLMs insufficient as creativity support tools for comedy, citing bias, bland output, and reinforcement of hegemonic viewpoints.
Methods: Workshops conducted with professional comedians combining comedy writing sessions using LLMs, a Creativity Support Index questionnaire, and focus groups discussing their experiences and ethical concerns.
Key Findings: Effectiveness of LLMs as creativity support tools for comedy writing, ethical concerns (bias, censorship, copyright), and value alignment in AI outputs.
Citations: 52
Sample Size: 20
-
Authors: AYJ Ha, J Passananti, R Bhaskar, S Shan
Year: 2024
Published in: Proceedings of the ..., 2024 - dl.acm.org
Institution: University of California Santa Barbara, The University of Chicago, Institute of Education, University College London
Research Area: Human-Computer Interaction (HCI), Generative AI, Digital Forensics
Discipline: Human-Computer Interaction (HCI), Generative AI, Digital Forensics
The paper investigates the effectiveness of different approaches, including both human and automated detectors, in distinguishing human art from AI-generated images, finding that a combination of methods offers the best performance despite persistent weaknesses.
Methods: Comparison of human art across 7 styles with AI-generated images from 5 generative models, assessed using 5 automated detectors and 3 human groups (crowdworkers, professional artists, expert artists).
Key Findings: Detection accuracy and robustness of human and automated methods in identifying AI-generated images under benign and adversarial conditions.
DOI: https://doi.org/10.1145/3658644.3670306
Citations: 52
Sample Size: 3,993
-
Authors: T Eloundou, A Beutel, DG Robinson
Year: 2024
Published in: arXiv preprint arXiv ..., 2024 - arxiv.org
Institution: OpenAI, Google DeepMind, Google, University of Oxford
Research Area: Fairness in LLM, AI Bias, AI Ethics
Discipline: Artificial Intelligence, Social Science
The paper introduces a counterfactual approach to evaluate 'first-person fairness' in chatbots, demonstrating that reinforcement learning can mitigate biases based on demographics across extensive chatbot interactions.
Methods: The study uses a Language Model as a Research Assistant (LMRA) to quantitatively and qualitatively assess biases based on demographics across millions of chatbot interactions, covering 66 tasks in 9 domains and involving two genders and four races. Bias evaluations are corroborated by independent...
Key Findings: Demographic biases in chatbot responses, including harmful stereotypes and response differences by gender and race, across diverse tasks and domains.
DOI: https://doi.org/10.48550/arXiv.2410.19803
Citations: 33
Sample Size: 6,000,000
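The counterfactual idea, holding the task fixed and varying only the cue to the user's identity, can be sketched in a few lines. The `get_response` and `similarity` helpers and the name lists below are hypothetical placeholders; the paper's actual evaluation relies on a language model as a research assistant (LMRA) over millions of real conversations rather than this toy comparison.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical name lists, used only to signal a demographic cue in the prompt.
NAME_GROUPS = {"group_a": ["Emily", "Greg"], "group_b": ["Lakisha", "Jamal"]}

def get_response(prompt: str) -> str:
    """Hypothetical placeholder for a chatbot call; swap in a real client.
    Here it echoes the prompt so the sketch runs end to end."""
    return f"Sure, here is some advice for you: {prompt}"

def similarity(a: str, b: str) -> float:
    """Crude stand-in for a response-comparison metric (LMRA judgments in the paper)."""
    return SequenceMatcher(None, a, b).ratio()

def first_person_gap(task: str) -> float:
    """Average cross-group response similarity for the same task; lower = larger gap."""
    responses = {
        group: [get_response(f"My name is {name}. {task}") for name in names]
        for group, names in NAME_GROUPS.items()
    }
    pairs = [
        similarity(ra, rb)
        for ga, gb in combinations(responses, 2)
        for ra in responses[ga]
        for rb in responses[gb]
    ]
    return sum(pairs) / len(pairs)

print(first_person_gap("Can you suggest a career path for me?"))
```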