AI Alignment Studies

This page lists 34 peer-reviewed papers tagged with AI Alignment in the Prolific Citations Library, a curated collection of research powered by high-quality human data from Prolific.

Papers (20 of 34)

Moral Lenses, Political Coordinates: Towards Ideological Positioning of Morally Conditioned LLMs

Authors: C Yuan, B Ma, Z Zhang, B Prenkaj, F Kreuter, G Kasneci

Year: 2026

Published in: arXiv preprint arXiv:2601.08634, 2026 - arxiv.org

Institution: Munich Center for Machine Learning, LMU Munich, Technical University of Munich

Research Area: Artificial Intelligence, AI Ethics, AI Alignment, Political Science, Computational Social Science

Discipline: Computer Science, Natural Language Processing

This paper examines how the political outputs of large language models (LLMs) shift when the models are explicitly primed with different moral values. Rather than simply assigning personas (such as “pretend to be liberal”), the authors condition models to endorse or reject specific moral values (e.g., utilitarianism, fairness, authority). They then measure how those moral primes move the models’ positions in...

DOI: https://doi.org/10.48550/arXiv.2601.08634
How do people react to political bias in generative Artificial Intelligence?

Authors: U Messer

Year: 2025

Published in: Computers in Human Behavior: Artificial Humans, 2025 - Elsevier

Institution: Universität der Bundeswehr München

Research Area: Political Bias in Generative AI, Human-AI Interaction, Affective Computing, AI Bias

Discipline: Computer Science, Human-AI Interaction

People's acceptance of and reliance on Generative AI (GAI) increase when they perceive alignment between their political orientation and the bias of GAI-generated content, leading to expanded trust in sensitive applications.

Methods: Three experiments analyzing behavioral reactions to politically biased content generated by GAI, including the impact of perceived alignment on acceptance and trust.

Key Findings: Participants' acceptance of, reliance on, and trust in GAI depended on the perceived alignment between the political bias of GAI-generated content and their own political beliefs.

DOI: https://doi.org/10.1016/j.chbah.2024.100108

Citations: 24

Sample Size: 513
Large Language Models Are More Persuasive Than Incentivized Human Persuaders

Authors: P. Schoenegger, F. Salvi, J. Liu, X. Nan, R. Debnath, B. Fasolo, E. Leivada, G. Recchia, F. Günther, A. Zarifhonarvar, J. Kwon, Z. Ul Islam, M. Dehnert, D. Y. H. Lee, M. G. Reinecke, D. G. Kamper, M. Kobaş, A. Sandford, J. Kgomo, L. Hewitt, S. Kapoor, K. Oktar, E. E. Kucuk, B. Feng, C. R. Jones, I. Gainsburg, S. Olschewski, N. Heinzelmann, F. Cruz, B. M. Tappin, T. Ma, P. S. Park, R. Onyonka, A. Hjorth, P. Slattery, Q. Zeng, L. Finke, I. Grossmann, A. Salatiello, E. Karger

Year: 2025

Published in: arXiv preprint arXiv ..., 2025 - arxiv.org

Institution: London School of Economics and Political Science, University of Cambridge, University College London, Massachusetts Institute of Technology, University of Oxford, Modulo Research, Stanford University, Federal Reserve Bank of Chicago, ETH Zürich, University of Johannesburg

Research Area: Natural Language Processing

Discipline: Social Science, Artificial Intelligence

This paper compares a frontier LLM (Claude Sonnet 3.5) against incentivized human persuaders in a conversational quiz setting, finding that the AI's persuasion capabilities surpass those of humans with real-money bonuses tied to performance.

Citations: 16
Helpful, harmless, honest? Sociotechnical limits of AI alignment and safety through Reinforcement Learning from Human Feedback

Authors: A Dahlgren Lindström, L Methnani, L Krause

Year: 2025

Published in: Ethics and Information ..., 2025 - Springer

Institution: Umeå University, Vrije Universiteit Amsterdam

Research Area: AI Alignment, AI Safety, Reinforcement Learning from Human Feedback (RLHF), Sociotechnical Systems

Discipline: Artificial Intelligence, Ethics

The paper critiques AI alignment efforts using RLHF and RLAIF, highlighting theoretical and practical limitations in meeting the goals of helpfulness, harmlessness, and honesty, and advocates for a broader sociotechnical approach to AI safety and ethics.

Methods: Sociotechnical critique of RLHF techniques with an analysis of theoretical frameworks and practical implementations.

Key Findings: The alignment of AI systems with human values and the efficacy of RLHF techniques in achieving the HHH principle (helpfulness, harmlessness, honesty).

DOI: https://doi.org/10.1007/s10676-025-09837-2

Citations: 14
Aligning machine and human visual representations across abstraction levels

Authors: L Muttenthaler, K Greff, F Born, B Spitzer, S Kornblith

Year: 2025

Published in: Nature, 2025 - nature.com

Institution: Google DeepMind, Google, Machine Learning Group, Technische Universität Berlin, BIFOLD, Berlin Institute for the Foundations of Learning and Data, Max Planck Institute

Research Area: Cognitive Alignment, Computer Vision, Multi-level Conceptual Knowledge

Discipline: Artificial Intelligence, Cognitive Science

This paper presents a method for aligning machine vision model representations with human visual similarity judgments across different abstraction levels, improving how well models reflect human perceptual and conceptual organization and enhancing generalization and uncertainty prediction.

Citations: 11
Incentivizing High-Quality Human Annotations with Golden Questions

Authors: S Liu, Z Cai, H Wang, Z Ma, X Li

Year: 2025

Published in: arXiv preprint arXiv:2505.19134, 2025 - arxiv.org

Institution: Meta, Imperial College London

Research Area: Artificial Intelligence, Crowdsourcing, Large Language Models

Discipline: Artificial Intelligence

The paper develops a principal-agent model to incentivize high-quality human annotations using golden questions and identifies criteria for these questions to effectively monitor annotators' performance.

Methods: The authors use a principal-agent model with maximum likelihood estimators (MLE) and hypothesis testing to design incentive-compatible systems for annotators. Golden questions of high certainty and similar format to normal data were selected and validated through experiments.

Key Findings: The effectiveness of golden questions for incentivizing and monitoring high-quality human annotations in preference data.

DOI: https://doi.org/10.48550/arXiv.2505.19134

Citations: 1
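
The golden-question entry above mentions maximum likelihood estimation and hypothesis testing for monitoring annotators. As a rough illustration only, and not the authors' principal-agent model, the sketch below treats each golden question as an independent pass/fail trial, estimates an annotator's accuracy by MLE, and runs an exact one-sided binomial test. The function names, the 0.9 accuracy target, and the 0.05 significance level are all assumptions made for this example.

```python
from math import comb


def accuracy_mle(correct: int, total: int) -> float:
    """MLE of a Bernoulli accuracy: the share of golden questions answered correctly."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total


def binomial_lower_tail(correct: int, total: int, p0: float) -> float:
    """Exact P(X <= correct) for X ~ Binomial(total, p0), used as a one-sided p-value."""
    return sum(
        comb(total, k) * p0**k * (1 - p0) ** (total - k)
        for k in range(correct + 1)
    )


def flag_annotator(correct: int, total: int, p0: float = 0.9, alpha: float = 0.05) -> bool:
    """Flag the annotator when H0 (accuracy >= p0) is rejected at level alpha.

    p0 and alpha are illustrative defaults, not values taken from the paper.
    """
    return binomial_lower_tail(correct, total, p0) < alpha


# Hypothetical annotators scored on 20 golden questions each.
careful = flag_annotator(correct=19, total=20)   # high accuracy, not flagged
careless = flag_annotator(correct=10, total=20)  # low accuracy, flagged
```

A payment or retention rule could then be tied to the flag, which is the general idea behind using golden questions as a monitoring device.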
Influencing Humans to Conform to Preference Models for RLHF

Authors: S Hatgis-Kessell, WB Knox, S Booth, S Niekum

Year: 2025

Published in: arXiv preprint arXiv ..., 2025 - arxiv.org

Institution: Stanford University, UMass Amherst, Carnegie Mellon University

Research Area: Reinforcement Learning from Human Feedback (RLHF)

Discipline: Artificial Intelligence, Human-Computer Interaction

The paper investigates whether human preferences can be influenced to align more closely with assumed preference models in RLHF algorithms through interventions such as showing model-derived quantities, training on preference models, and modifying elicitation questions.

Methods: Three human studies were conducted where interventions were tested, including revealing model-derived quantities, training participants on a preference model, and altering how preference questions were framed.

Key Findings: Evaluated the impact of interventions on humans' expression of preferences to align better with the assumed preference models of RLHF algorithms.

DOI: https://doi.org/10.48550/arXiv.2501.06416

Citations: 1
To Mask or to Mirror: Human-AI Alignment in Collective Reasoning

Authors: C Qian, AT Parisi, C Bouleau, V Tsai

Year: 2025

Published in: Proceedings of the ..., 2025 - aclanthology.org

Institution: Google, Google DeepMind

Research Area: Human-AI Alignment, Collective Reasoning, Social Biases, LLM Simulation of Human Behavior, AI Bias

Discipline: Natural Language Processing, Artificial Intelligence, Computational Social Science

This study examines human-AI alignment in collective reasoning using an empirical framework, demonstrating how LLMs either mirror or mask human biases depending on context, cues, and model-specific inductive biases.

Methods: The study uses the Lost at Sea social psychology task in a large-scale online experiment, simulating LLM groups conditioned on human decision-making data across varying conditions of visible or pseudonymous demographics.

Key Findings: Alignment of LLM behavior with human social reasoning, focusing on collective decision-making and biases in group interactions.

Citations: 1

Sample Size: 748
Whose view of safety? a deep dive dataset for pluralistic alignment of text-to-image models

Authors: C Rastogi, TH Teh, P Mishra, R Patel, D Wang, M Díaz, A Parrish, AM Davani, Z Ashwood

Year: 2025

Published in: arXiv preprint arXiv:2507.13383, 2025 - arxiv.org

Institution: Google DeepMind, Google Research, Google

Research Area: AI alignment, safety evaluation, AI Safety, Multimodal evaluation, Human-AI Interaction, Large Language Models

Discipline: Computer Science, Machine Learning, Artificial Intelligence

This research introduces the DIVE dataset to enable pluralistic alignment in text-to-image models by accounting for diverse safety perspectives, revealing demographic variations in harm perception and advancing T2I model alignment strategies.

Methods: The study involved collecting feedback across 1000 prompts from demographically intersectional human raters to capture diverse safety perspectives, with an emphasis on empirical and contextual differences in harm perception.

Key Findings: Safety perceptions of text-to-image (T2I) model outputs from diverse demographic viewpoints and the influence of these perspectives on alignment strategies.

Citations: 1

Sample Size: 1000
Beyond Face Value: Visual and Auditory Signals in Human and Machine Trust Judgments

Authors: N Tyulina, Y Yu, TA Emmanouil, SI Levitan

Year: 2025

Published in: Proceedings of the 7th ACM ..., 2025 - dl.acm.org

Institution: University of Cambridge, University of Bath, University of Edinburgh, New York University

Research Area: Human-AI Interaction, Trust and Perception, Nonverbal Communication

Discipline: Applied Linguistics

Trust judgments are primarily influenced by auditory cues in both humans and multimodal models, though subtle differences in modality weighting exist between them.

Methods: Behavioral experiment with trust ratings of bimodal stimuli across four trust congruence conditions, combined with a multimodal model trained using HuBERT and ResNet-50 with late fusion, analyzed using Permutation Feature Importance (PFI).

Key Findings: The construction of trust from visual and auditory signals in both humans and multimodal models, focusing on modality dominance and feature weighting.

Sample Size: 150
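
The entry above analyzes a late-fusion model with Permutation Feature Importance (PFI). The sketch below shows the general PFI idea on toy data: shuffle one input column, re-score the model, and report the average drop in accuracy. It is a simplified stand-in, not the authors' pipeline; the toy model, the two-column layout ("audio" and "visual" cues), and all function names are hypothetical.

```python
import random
from typing import Callable, List

Row = List[float]


def accuracy(model: Callable[[Row], int], rows: List[Row], labels: List[int]) -> float:
    """Fraction of rows for which the model's prediction matches the label."""
    hits = sum(1 for row, label in zip(rows, labels) if model(row) == label)
    return hits / len(labels)


def permutation_importance(
    model: Callable[[Row], int],
    rows: List[Row],
    labels: List[int],
    column: int,
    repeats: int = 20,
    seed: int = 0,
) -> float:
    """Mean drop in accuracy after shuffling a single column across rows."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for _ in range(repeats):
        shuffled = [row[column] for row in rows]
        rng.shuffle(shuffled)
        # Rebuild each row with the shuffled value in place of the original one.
        permuted = [
            row[:column] + [value] + row[column + 1:]
            for row, value in zip(rows, shuffled)
        ]
        drops.append(baseline - accuracy(model, permuted, labels))
    return sum(drops) / repeats


def toy_model(row: Row) -> int:
    """Hypothetical classifier that relies only on column 0 (the 'audio' cue)."""
    return int(row[0] > 0.5)


# Toy data: column 0 = "audio" cue, column 1 = "visual" cue (both made up).
rows = [[i / 10, (i * 7 % 10) / 10] for i in range(10)]
labels = [toy_model(row) for row in rows]

audio_importance = permutation_importance(toy_model, rows, labels, column=0)
visual_importance = permutation_importance(toy_model, rows, labels, column=1)
```

Because the toy model ignores column 1, shuffling it leaves accuracy unchanged and its importance is zero, while shuffling column 0 can only lower accuracy. This contrast is what a modality-dominance analysis looks for.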
A survey of reinforcement learning from human feedback

Authors: T Kaufmann, P Weng, V Bengs, E Hüllermeier

Year: 2024

Published in: 2024 - epub.ub.uni-muenchen.de

Institution: Paderborn University, German Research Center for Artificial Intelligence (DFKI), Duke Kunshan University

Research Area: Reinforcement Learning from Human Feedback (RLHF), Large Language Models, Reward Modeling

Discipline: Artificial Intelligence

This paper surveys the fundamentals, diverse applications, and evolving impact of reinforcement learning from human feedback (RLHF), emphasizing its role in improving intelligent system alignment and performance.

Methods: The paper utilizes a survey-based approach to synthesize existing research, exploring the interactions between reinforcement learning algorithms and human input.

Key Findings: The study examines the principles, dynamics, applications, and trends in RLHF, offering insights into its role in enhancing large language models (LLMs) and intelligent systems.

Citations: 354
The inadequacy of reinforcement learning from human feedback - radicalizing large language models via semantic vulnerabilities

Authors: TR McIntosh, T Susnjak, T Liu, P Watters

Year: 2024

Published in: ... on Cognitive and ..., 2024 - ieeexplore.ieee.org

Institution: Cyberoo, Massey University, Cyberstronomy, RMIT University

Research Area: Semantic Vulnerabilities in LLMs, Ideological Manipulation, Reinforcement Learning from Human Feedback (RLHF) Limitations

Discipline: Computer Science, Artificial Intelligence, Machine Learning

RLHF mechanisms are insufficient to prevent semantic manipulation of LLMs, allowing them to express extreme ideological viewpoints when subjected to targeted conditioning techniques.

Methods: Psychological semantic conditioning techniques were applied to assess the susceptibility of LLMs to ideological manipulation.

Key Findings: The ability of LLMs to resist or adopt extreme ideological viewpoints under semantic conditioning.

Citations: 219
The PRISM alignment dataset: What participatory, representative and individualised human feedback reveals about the subjective and multicultural alignment of large ...

Authors: HR Kirk, M Bartolo, A Whitefield, P Rottger

Year: 2024

Published in: Advances in ..., 2024 - proceedings.neurips.cc

Institution: Meta, Cohere, AWS AI Labs, Contextual AI, Factored AI, University of Oxford, Bocconi University, Meedan, Hugging Face, University College London, ML Commons, University of Pennsylvania

Research Area: LLM Alignment, Human Feedback, Multicultural Studies

Discipline: Artificial Intelligence, Computational Social Science

The PRISM Alignment Dataset presents a large-scale, culturally diverse human feedback dataset linking sociodemographic profiles of 1,500 participants from 75 countries to their contextual preferences and fine‑grained ratings in 8,011 live conversations with 21 LLMs. This enables analysis of how subjective values vary across people and cultures in LLM alignment data.

DOI: https://doi.org/10.52202/079017-3342

Citations: 204
Why human-AI relationships need socioaffective alignment

Authors: HR Kirk, I Gabriel, C Summerfield, B Vidgen

Year: 2025

Published in: Humanities and Social ..., 2025 - nature.com

Institution: Oxford Internet Institute, University of Oxford

Research Area: Socioaffective Alignment in Human-AI Relationships, AI Ethics, Behavioral Science

Discipline: Artificial Intelligence, Behavioral Science

The paper emphasizes the need for socioaffective alignment in human-AI relationships to ensure AI systems support human psychological needs rather than exploit them, as interactions with AI transition from transactional to sustained engagement.

Methods: Conceptual analysis of socioaffective dynamics in human-AI interactions, framed through psychological theories and principles.

Key Findings: Exploration of how AI systems impact socioaffective relationships, psychological needs, autonomy, companionship, and human well-being.

DOI: https://doi.org/10.1057/s41599-025-04532-5

Citations: 59
A Robot Walks into a Bar: Can Language Models Serve as Creativity Support Tools for Comedy? An Evaluation of LLMs’ Humour Alignment with Comedians

Authors: PW Mirowski, J Love, K Mathewson, S Mohamed

Year: 2024

Published in: arXiv

Institution: Google DeepMind, Google

Research Area: AI Creativity, Humor Generation, Human-Computer Interaction

Discipline: Artificial Intelligence

Professional comedians found LLMs insufficient as creativity support tools for comedy, citing bias, bland output, and reinforcement of hegemonic viewpoints.

Methods: Workshops conducted with professional comedians combining comedy writing sessions using LLMs, a Creativity Support Index questionnaire, and focus groups discussing their experiences and ethical concerns.

Key Findings: Effectiveness of LLMs as creativity support tools for comedy writing, ethical concerns (bias, censorship, copyright), and value alignment in AI outputs.

Citations: 52

Sample Size: 20
Large language models can enhance persuasion through linguistic feature alignment

Authors: M Shin, J Kim

Year: 2024

Published in: Available at SSRN 4725351, 2024 - researchgate.net

Institution: Massachusetts Institute of Technology, Yale University

Research Area: Linguistic Feature Alignment, Persuasion, Large Language Models

Discipline: Artificial Intelligence, Computational Social Science

Citations: 11
Evaluating the alignment of AI with human emotions

Authors: JD Lomas, W van der Maden, S Bandyopadhyay

Year: 2024

Published in: Advanced Design ..., 2024 - Elsevier

Institution: Delft University of Technology, Playpower Labs, Hong Kong Polytechnic University, Utrecht University

Research Area: AI Alignment, Affective Computing, Emotional Expression in Generative AI, Human Perception of AI Emotions

Discipline: Affective Computing, Artificial Intelligence, Human-Computer Interaction

This study evaluates how well generative AI systems (like DALL·E 2/3 and Stable Diffusion) can generate emotionally expressive content that aligns with how humans perceive those emotions, finding that model performance varies by emotion type and model, with implications for designing more emotionally aligned AI.

DOI: https://doi.org/10.1016/j.ijadr.2024.10.002

Citations: 5
Improved emotional alignment of AI and humans: Human ratings of emotions expressed by Stable Diffusion v1, DALL-E 2, and DALL-E 3

Authors: JD Lomas, W van der Maden

Year: 2024

Published in: arXiv preprint arXiv ..., 2024 - arxiv.org

Institution: Delft University of Technology, Microsoft Research

Research Area: Affective Computing, Human-AI Interaction, Image Generation

Discipline: Artificial Intelligence

DOI: https://doi.org/10.48550/arXiv.2405.18510

Citations: 5
Evidence of human-like visual-linguistic integration in multimodal large language models during predictive language processing

Authors: V Kewenig, C Edwards

Year: 2024

Published in: ... and Rechardt, Akilles ..., 2023 - papers.ssrn.com

Research Area: Multimodal AI, Cognitive Science, Visual-Linguistic Integration

Discipline: Artificial Intelligence, Computational Linguistics, Cognitive Science

Citations: 2
Benchmarking Distributional Alignment of Large Language Models

Authors: N Meister

Year: 2024

Published in: arXiv

Institution: Stanford University

Research Area: Distributional Alignment of LLMs, LLM Benchmarking, AI Robustness, AI Fairness, AI Bias

Discipline: Artificial Intelligence