This page lists 43 peer-reviewed papers tagged with AI Safety in the Prolific Citations Library, a curated collection of research powered by high-quality human data from Prolific.
-
Authors: N Petrova, A Gordon, E Blindow
Year: 2026
Published in: Open review
Institution: Prolific
Research Area: Human-centered AI evaluation, Bayesian statistics, Responsible AI, AI alignment, LLM Evaluation
Discipline: Machine Learning, Artificial Intelligence
The study introduces HUMAINE, a multidimensional evaluation framework for LLMs, revealing demographic-specific preference variations and ranking google/gemini-2.5-pro as the top-performing model with a posterior probability of 95.6%.
Methods: Multi-turn naturalistic conversations analyzed using a hierarchical Bayesian Bradley-Terry-Davidson model with post-stratification to census data, stratified across 22 demographic groups.
Key Findings: Performance of 28 LLMs across five human-centric dimensions, accounting for demographic-specific preferences.
Sample Size: 23,404
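The analysis behind this ranking is a hierarchical Bayesian Bradley-Terry-Davidson model with post-stratification, which is well beyond a few lines of code, but a minimal classical Bradley-Terry fit illustrates the core step of turning pairwise preferences into model strengths. The function name and win-matrix layout below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def bradley_terry_strengths(wins: np.ndarray, n_iter: int = 500) -> np.ndarray:
    """Minimal Bradley-Terry fit via the classic MM updates.

    wins[i, j] = number of comparisons in which model i was preferred to model j.
    Returns strengths normalised to sum to 1; higher means more often preferred.
    """
    n = wins.shape[0]
    games = wins + wins.T               # total comparisons per model pair
    p = np.ones(n) / n                  # initial strengths
    for _ in range(n_iter):
        new_p = np.zeros(n)
        for i in range(n):
            denom = sum(games[i, j] / (p[i] + p[j]) for j in range(n) if j != i)
            new_p[i] = wins[i].sum() / denom if denom > 0 else p[i]
        p = new_p / new_p.sum()
    return p

# Toy example: three models, model 0 preferred most often.
wins = np.array([[0, 8, 9],
                 [2, 0, 6],
                 [1, 4, 0]], dtype=float)
print(bradley_terry_strengths(wins))
```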
-
Authors: S Shekar, P Pataranutaporn, C Sarabu, GA Cecchi
Year: 2025
Published in: NEJM AI, 2025 - ai.nejm.org
Institution: MIT Media Lab, IBM Research, Stanford University, Massachusetts Institute of Technology
Research Area: AI Ethics, Healthcare, Patient Trust, Medical Misinformation
Discipline: Artificial Intelligence, Human-Computer Interaction (HCI), AI Ethics
This paper presents a study by MIT researchers showing that patients trust AI-generated medical advice even when that advice is incorrect, raising concerns about misinformation in healthcare.
Citations: 19
-
Authors: K Zhou, JD Hwang, X Ren, N Dziri
Year: 2025
Published in: Proceedings of the ..., 2025 - aclanthology.org
Institution: Stanford University, University of Southern California, Carnegie Mellon University, Allen Institute for AI
Research Area: Human-LM Reliance, Interaction-Centered Framework, Human-Computer Interaction (HCI)
Discipline: Human-Computer Interaction (HCI), Artificial Intelligence
The study introduces Rel-A.I., an interaction-centered evaluation approach to measure human reliance on LLM responses, revealing that politeness and interaction context significantly influence user reliance.
Methods: Nine user studies in which participants interacted with LLMs, analyzing how communication features such as politeness and interaction context influence user reliance.
Key Findings: The degree of human reliance on LLM responses based on communication style (e.g., politeness) and interaction context (e.g., knowledge domain, prior interactions).
Citations: 18
Sample Size: 450
-
Authors: A Dahlgren Lindström, L Methnani, L Krause
Year: 2025
Published in: Ethics and Information ..., 2025 - Springer
Institution: Umeå University, Vrije Universiteit Amsterdam
Research Area: AI Alignment, AI Safety, Reinforcement Learning from Human Feedback (RLHF), Sociotechnical Systems
Discipline: Artificial Intelligence, Ethics
The paper critiques AI alignment efforts using RLHF and RLAIF, highlighting theoretical and practical limitations in meeting the goals of helpfulness, harmlessness, and honesty, and advocates for a broader sociotechnical approach to AI safety and ethics.
Methods: Sociotechnical critique of RLHF techniques with an analysis of theoretical frameworks and practical implementations.
Key Findings: The alignment of AI systems with human values and the efficacy of RLHF techniques in achieving the HHH principle (helpfulness, harmlessness, honesty).
DOI: https://doi.org/10.1007/s10676-025-09837-2
Citations: 14
-
Authors: L Muttenthaler, K Greff, F Born, B Spitzer, S Kornblith
Year: 2025
Published in: Nature, 2025 - nature.com
Institution: Google DeepMind, Google, Machine Learning Group, Technische Universität Berlin, BIFOLD, Berlin Institute for the Foundations of Learning and Data, Max Planck Institute
Research Area: Cognitive Alignment, Computer Vision, Multi-level Conceptual Knowledge
Discipline: Artificial Intelligence, Cognitive Science
This paper presents a method for aligning machine vision model representations with human visual similarity judgments across different abstraction levels, improving how well models reflect human perceptual and conceptual organization and enhancing generalization and uncertainty prediction.
Citations: 11
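As a loose illustration of what it means for a model to "reflect human perceptual organization", the sketch below computes a simple representational-similarity score: the Spearman correlation between a model's pairwise embedding similarities and human similarity ratings over the same items. This is a generic diagnostic under assumed inputs, not the alignment method the paper proposes.

```python
import numpy as np
from scipy.stats import spearmanr
from scipy.spatial.distance import pdist

def human_alignment_score(embeddings: np.ndarray, human_sim: np.ndarray) -> float:
    """Spearman correlation between model and human pairwise similarities.

    embeddings: (n_items, dim) model representations of the same n items.
    human_sim:  (n_items, n_items) symmetric matrix of human similarity ratings.
    """
    model_sim = 1.0 - pdist(embeddings, metric="cosine")   # condensed upper triangle
    iu = np.triu_indices(human_sim.shape[0], k=1)
    rho, _ = spearmanr(model_sim, human_sim[iu])
    return rho

# Toy usage with random data standing in for real embeddings and ratings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(10, 64))
ratings = rng.uniform(size=(10, 10))
ratings = (ratings + ratings.T) / 2
print(human_alignment_score(emb, ratings))
```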
-
Authors: M Cheng, C Lee, P Khadpe, S Yu, D Han
Year: 2025
Published in: arXiv preprint arXiv ..., 2025 - arxiv.org
Institution: Stanford University, Carnegie Mellon University
Research Area: Computers and Society, Artificial Intelligence, Sycophancy
Discipline: Computer Science, Psychology
The study shows that sycophantic AI, which validates user inputs unquestioningly, reduces people's prosocial behavior and fosters dependence, despite users perceiving such AI as higher quality and more trustworthy.
Methods: The researchers conducted two preregistered experiments, including a live-interaction study in which participants discussed real interpersonal conflicts with AI models, and evaluated responses from 11 state-of-the-art AI models for sycophancy and its psychological effects on users.
Key Findings: The prevalence of sycophantic behavior in AI, users' prosocial intentions, conviction of being in the right, trust in AI, and willingness to reuse sycophantic AI models.
Citations: 5
Sample Size: 1,604
-
Authors: A Karamolegkou, O Eberle, P Rust, C Kauf, A Søgaard
Year: 2025
Published in: ArXiv
Institution: Aleph Alpha, Massachusetts Institute of Technology
Research Area: Adversarial Ambiguity, Language Model Evaluation, Artificial Intelligence, Computation and Language, LLM, AI Evaluation, Red Teaming
Discipline: Natural Language Processing
The paper assesses language models' sensitivity to ambiguity using an adversarial dataset and finds that direct prompting poorly identifies ambiguity, while linear probes achieve high accuracy in decoding ambiguity from model representations.
Methods: An adversarial ambiguity dataset was introduced with various types of ambiguities and transformations; models were tested using direct prompts and linear probes trained on internal representations.
Key Findings: Language models' ability to detect ambiguity, including syntactic, lexical, and phonological types, as well as performance under adversarial variations.
Citations: 2
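The contrast this entry draws between direct prompting and linear probes can be made concrete with a small probing sketch: a logistic-regression classifier trained on frozen hidden-state vectors to predict whether a sentence is ambiguous. The data layout and stand-in features below are assumptions, not the paper's pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def train_ambiguity_probe(hidden_states: np.ndarray, labels: np.ndarray):
    """Fit a linear probe on frozen LM hidden states.

    hidden_states: (n_examples, hidden_dim) representations, e.g. the final-layer
                   activation at the last token of each sentence.
    labels:        (n_examples,) 1 if the sentence is ambiguous, else 0.
    """
    X_tr, X_te, y_tr, y_te = train_test_split(
        hidden_states, labels, test_size=0.2, random_state=0, stratify=labels
    )
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe, accuracy_score(y_te, probe.predict(X_te))

# Toy stand-in data; in practice these come from the LM being probed.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 768))
y = rng.integers(0, 2, size=200)
probe, acc = train_ambiguity_probe(X, y)
print(f"held-out probe accuracy: {acc:.2f}")
```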
-
Authors: Y Zhang, J Pang, Z Zhu, Y Liu
Year: 2025
Published in: arXiv preprint arXiv:2506.06991, 2025 - arxiv.org
Institution: Rutgers University, University of California Santa Cruz
Research Area: Artificial Intelligence, Computational Social Science
Discipline: Computational Social Science
The paper proposes a training-free scoring mechanism using peer prediction to detect and mitigate LLM-assisted cheating in crowdsourced annotation tasks, with theoretical guarantees and empirical validation.
Methods: A peer prediction-based mechanism quantifies correlations between worker answers while conditioning on LLM-generated labels, without requiring ground truth or high-dimensional training data.
Key Findings: Detection of LLM-assisted low-effort cheating in crowdsourced annotation tasks, focusing on theoretical effectiveness and empirical robustness.
DOI: https://doi.org/10.48550/arXiv.2506.06991
Citations: 1
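A rough feel for a correlation-style peer-prediction score, conditioning worker agreement on the LLM-generated label so that simply copying the LLM earns nothing, is given by the sketch below. It is a simplified stand-in under our own assumptions and does not carry the theoretical guarantees described in the paper.

```python
import numpy as np

def conditioned_agreement_score(a: np.ndarray, b: np.ndarray, llm: np.ndarray) -> float:
    """Agreement between two workers beyond what the LLM label already explains.

    a, b: binary answers of two workers on the same tasks.
    llm:  binary LLM-generated labels for those tasks.
    Within each LLM-label stratum we compare observed agreement with the agreement
    expected if the workers answered independently; a worker who copies the LLM
    verbatim therefore scores (near) zero.
    """
    score, total = 0.0, 0
    for v in np.unique(llm):
        mask = llm == v
        if mask.sum() < 2:
            continue
        ai, bi = a[mask], b[mask]
        observed = np.mean(ai == bi)
        expected = ai.mean() * bi.mean() + (1 - ai.mean()) * (1 - bi.mean())
        score += mask.sum() * (observed - expected)
        total += mask.sum()
    return score / total if total else 0.0

# Toy usage: two diligent workers vs. one who just copies the LLM label.
rng = np.random.default_rng(0)
truth = rng.integers(0, 2, 300)
llm = np.where(rng.random(300) < 0.7, truth, 1 - truth)   # noisy LLM labels
w1 = np.where(rng.random(300) < 0.9, truth, 1 - truth)    # diligent worker
w2 = np.where(rng.random(300) < 0.9, truth, 1 - truth)    # diligent worker
print(conditioned_agreement_score(w1, w2, llm))            # clearly positive
print(conditioned_agreement_score(llm.copy(), w2, llm))    # near zero
```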
-
Authors: A Qian, R Shaw, L Dabbish, J Suh, H Shen
Year: 2025
Published in: arXiv preprint arXiv ..., 2025 - arxiv.org
Institution: Carnegie Mellon University, University of Pittsburgh, University of Utah, Yale School of Medicine, Yale University
Research Area: Responsible AI, Content Moderation, Risk Disclosure, Worker Well-being in Human-Computer Interaction (HCI).
Discipline: Computational Social Science, Human-Computer Interaction (HCI)
The paper examines how task designers approach well-being risk disclosure in Responsible AI (RAI) content work, highlighting a need for better frameworks to communicate such risks effectively.
Methods: Interviews were conducted with 23 task designers from academic and industry sectors to gather insights on risk recognition, interpretation, and communication practices.
Key Findings: How task designers recognize, interpret, and communicate well-being risks in RAI content work.
Citations: 1
Sample Size: 23
-
Authors: C Rastogi, TH Teh, P Mishra, R Patel, D Wang, M Díaz, A Parrish, AM Davani, Z Ashwood
Year: 2025
Published in: arXiv preprint arXiv:2507.13383, 2025 - arxiv.org
Institution: Google DeepMind, Google Research, Google
Research Area: AI alignment, safety evaluation, AI Safety, Multimodal evaluation, Human–AI interaction, LLM
Discipline: Computer Science, Machine Learning, Artificial Intelligence
This research introduces the DIVE dataset to enable pluralistic alignment in text-to-image models by accounting for diverse safety perspectives, revealing demographic variations in harm perception and advancing T2I model alignment strategies.
Methods: The study involved collecting feedback across 1000 prompts from demographically intersectional human raters to capture diverse safety perspectives, with an emphasis on empirical and contextual differences in harm perception.
Key Findings: Safety perceptions of text-to-image (T2I) model outputs from diverse demographic viewpoints and the influence of these perspectives on alignment strategies.
Citations: 1
Sample Size: 1000
-
Authors: K Zhou
Year: 2025
Published in: 2025 - search.proquest.com
Institution: Stanford University
Research Area: Human-Centered Natural Language Interfaces (NLI)
Discipline: Artificial Intelligence
The research explores how to safely design natural language interfaces in AI by identifying safety risks, proposing a harm-focused evaluation framework, and advocating for a broader consideration of user needs.
Methods: The study includes a review of LLM safety risks, development of a harm-based evaluation framework, and conceptual exploration of broadening NLP research to underrepresented user needs.
Key Findings: Safety risks in LLM communication, behavioral impacts of human-LM interactions, and gaps in NLP addressing diverse user needs.
-
Authors: Paresh Chaudhary, Yancheng Liang, Daphne Chen, Simon S. Du, Natasha Jaques
Year: 2025
Published in: ArXiv
Institution: University of Washington
Research Area: Human-AI Coordination, Zero-Shot Coordination, Adversarial Training, Generative Models
Discipline: Artificial Intelligence, Human-Computer Interaction (HCI)
The paper introduces GOAT, a novel framework combining pretrained generative models and adversarial training to improve human-AI coordination, achieving state-of-the-art performance on the Overcooked benchmark with real human partners.
Methods: The study utilized a frozen pretrained generative model to simulate cooperative agent policies and applied adversarial training to dynamically generate challenging human-AI interaction scenarios for training.
Key Findings: The effectiveness of GOAT in generalizing human-AI coordination strategies and its performance on the Overcooked benchmark.
-
Authors: Elyas Meguellati, Assad Zeghina, Shazia Sadiq, Gianluca Demartini
Year: 2025
Published in: ArXiv
Institution: University of Queensland, University of Strasbourg
Research Area: Natural Language Processing, Harmful Content Detection
Discipline: Natural Language Processing
The paper introduces an approach using LLM-based semantic augmentation for harmful content detection on social media, achieving performance comparable to human-annotated models but at reduced cost.
Methods: The researchers utilize LLMs to clean noisy text and generate explanations for context-rich preprocessing, then evaluate the augmented training sets on multiple high-context datasets, including the SemEval 2024 persuasive meme, Google Jigsaw toxic comments, and Facebook hateful memes datasets.
Key Findings: The efficacy of LLM-based semantic augmentation in enhancing training sets for social media tasks such as propaganda detection, hateful meme classification, and toxicity identification.
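One way such semantic augmentation could be wired up is sketched below: an LLM call cleans each noisy post and appends a one-sentence rationale, and the augmented text feeds a standard classifier. The `llm_generate` helper is a hypothetical placeholder for whichever model client is used; nothing here is the authors' implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def llm_generate(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; swap in a real model client.
    It echoes the text after the final newline so the sketch runs end to end."""
    return prompt.rsplit("\n", 1)[-1]

def augment(post: str) -> str:
    """Clean a noisy post and append a one-sentence LLM rationale as extra context."""
    cleaned = llm_generate("Rewrite this post in plain, well-formed English:\n" + post)
    rationale = llm_generate("In one sentence, explain what this post is trying to do:\n" + cleaned)
    return cleaned + " [RATIONALE] " + rationale

def train_harm_classifier(posts: list[str], labels: list[int]):
    """Train a simple harmful-content classifier on the LLM-augmented text."""
    augmented = [augment(p) for p in posts]
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    return model.fit(augmented, labels)

# Toy usage with two-class stand-in data.
clf = train_harm_classifier(
    ["u r all idiots lol", "lovely weather today :)"] * 10,
    [1, 0] * 10,
)
print(clf.predict(["what a nice morning"]))
```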
-
Authors: C Heath, JM Williams, D Leightley
Year: 2025
Published in: JMIR mHealth and ..., 2025 - mhealth.jmir.org
Institution: Swansea University, King's College London, Reykjavík University
Research Area: mHealth Interventions, Crowdsourcing, Social Media Recruitment, Mental Health Research (PTSD, Harmful Gambling)
Discipline: Digital Health, Mental Health Research
Social media and online platforms such as Facebook and Prolific were effective, but recruiting and retaining military veterans with PTSD or harmful gambling for a digital mHealth intervention pilot study remained challenging.
Methods: Multiple recruitment strategies were used, including paid and unpaid advertisements on Facebook, Prolific, direct mailing, event hosting with veterans' charities, snowball sampling, and incentives.
Key Findings: The effectiveness of different recruitment strategies for enrolling military veterans with PTSD or harmful gambling into a digital intervention study.
Sample Size: 79
-
Authors: K Grosse, N Ebert
Year: 2025
Published in: ArXiv
Institution: IBM Research, ZHAW
Research Area: Security and privacy risks, LLM, human–AI interaction, AI Safety
Discipline: Computer Science
A survey of 3,270 UK adults reveals significant security and privacy risks in AI conversational agent usage: a third engage in risky behaviors that enable attacks, and many are unaware of how their data are used or of how to opt out.
Methods: Representative survey conducted via Prolific platform targeting UK adults, focusing on usage behaviors of AI conversational agents.
Key Findings: User behaviors related to security and privacy risks, data sanitization practices, attempts to jailbreak AI models, and awareness of data usage policies.
Sample Size: 3,270
-
Authors: TR McIntosh, T Susnjak, T Liu, P Watters
Year: 2024
Published in: ... on Cognitive and ..., 2024 - ieeexplore.ieee.org
Institution: Cyberoo, Massey University, Cyberstronomy, RMIT University
Research Area: Semantic Vulnerabilities in LLMs, Ideological Manipulation, Reinforcement Learning from Human Feedback (RLHF) Limitations
Discipline: Computer Science, Artificial Intelligence, Machine Learning
RLHF mechanisms are insufficient to prevent semantic manipulation of LLMs, allowing them to express extreme ideological viewpoints when subjected to targeted conditioning techniques.
Methods: Psychological semantic conditioning techniques were applied to assess the susceptibility of LLMs to ideological manipulation.
Key Findings: The ability of LLMs to resist or adopt extreme ideological viewpoints under semantic conditioning.
Citations: 219
-
Authors: T Haesevoets, B Verschuere, R Van Severen
Year: 2024
Published in: Government Information ..., 2024 - Elsevier
Institution: Ghent University, KU Leuven
Research Area: Public Sector AI, Citizen Perception, AI Ethics, Transparency
Discipline: Political Science, Public Administration
Citizens in the UK prefer AI to play a supporting role in public sector decisions rather than making decisions autonomously, with greater acceptance in contexts that are less ideologically charged.
Methods: Three studies surveying UK respondents on their perceptions of AI involvement in public sector decision-making.
Key Findings: Perception of AI's role in decision-making, its legitimacy compared to human decision-makers, and suitability for various types of decisions.
DOI: https://doi.org/10.1016/j.giq.2023.101906
Citations: 54
-
Authors: PW Mirowski, J Love, K Mathewson, S Mohamed
Year: 2024
Published in: ArXiv
Institution: Google DeepMind, Google
Research Area: AI Creativity, Humor Generation, Human-Computer Interaction (HCI)
Discipline: Artificial Intelligence
Professional comedians found LLMs insufficient as creativity support tools for comedy, citing bias, bland output, and reinforcement of hegemonic viewpoints.
Methods: Workshops conducted with professional comedians combining comedy writing sessions using LLMs, a Creativity Support Index questionnaire, and focus groups discussing their experiences and ethical concerns.
Key Findings: Effectiveness of LLMs as creativity support tools for comedy writing, ethical concerns (bias, censorship, copyright), and value alignment in AI outputs.
Citations: 52
Sample Size: 20
-
Authors: AYJ Ha, J Passananti, R Bhaskar, S Shan
Year: 2024
Published in: Proceedings of the ..., 2024 - dl.acm.org
Institution: University of California Santa Barbara, The University of Chicago, Institute of Education, University College London
Research Area: Human-Computer Interaction (HCI), Generative AI, Digital Forensics
Discipline: Human-Computer Interaction (HCI), Generative AI, Digital Forensics
The paper investigates the effectiveness of different approaches, including both human and automated detectors, in distinguishing human art from AI-generated images, finding that a combination of methods offers the best performance despite persistent weaknesses.
Methods: Comparison of human art across 7 styles with AI-generated images from 5 generative models, assessed using 5 automated detectors and 3 human groups (crowdworkers, professional artists, expert artists).
Key Findings: Detection accuracy and robustness of human and automated methods in identifying AI-generated images under benign and adversarial conditions.
DOI: https://doi.org/10.1145/3658644.3670306
Citations: 52
Sample Size: 3,993
-
Authors: T Eloundou, A Beutel, DG Robinson
Year: 2024
Published in: arXiv preprint arXiv ..., 2024 - arxiv.org
Institution: OpenAI, Google DeepMind, Google, University of Oxford
Research Area: Fairness in LLM, AI Bias, AI Ethics
Discipline: Artificial Intelligence, Social Science
The paper introduces a counterfactual approach to evaluate 'first-person fairness' in chatbots, demonstrating that reinforcement learning can mitigate biases based on demographics across extensive chatbot interactions.
Methods: The study uses a Language Model as a Research Assistant (LMRA) to quantitatively and qualitatively assess biases based on demographics across millions of chatbot interactions, covering 66 tasks in 9 domains and involving two genders and four races. Bias evaluations are corroborated by independent...
Key Findings: Demographic biases in chatbot responses, including harmful stereotypes and response differences by gender and race, across diverse tasks and domains.
DOI: https://doi.org/10.48550/arXiv.2410.19803
Citations: 33
Sample Size: 6,000,000
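The counterfactual idea, holding the task fixed and varying only the cue to the user's identity, can be sketched in a few lines. The `get_response` and `similarity` helpers and the name lists below are hypothetical placeholders; the paper's actual evaluation relies on a language model as a research assistant (LMRA) over millions of real conversations rather than this toy comparison.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical name lists, used only to signal a demographic cue in the prompt.
NAME_GROUPS = {"group_a": ["Emily", "Greg"], "group_b": ["Lakisha", "Jamal"]}

def get_response(prompt: str) -> str:
    """Hypothetical placeholder for a chatbot call; swap in a real client.
    Here it echoes the prompt so the sketch runs end to end."""
    return f"Sure, here is some advice for you: {prompt}"

def similarity(a: str, b: str) -> float:
    """Crude stand-in for a response-comparison metric (LMRA judgments in the paper)."""
    return SequenceMatcher(None, a, b).ratio()

def first_person_gap(task: str) -> float:
    """Average cross-group response similarity for the same task; lower = larger gap."""
    responses = {
        group: [get_response(f"My name is {name}. {task}") for name in names]
        for group, names in NAME_GROUPS.items()
    }
    pairs = [
        similarity(ra, rb)
        for ga, gb in combinations(responses, 2)
        for ra in responses[ga]
        for rb in responses[gb]
    ]
    return sum(pairs) / len(pairs)

print(first_person_gap("Can you suggest a career path for me?"))
```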