Browse 34 peer-reviewed papers in Ai Alignment. Discover studies powered by high-quality human data from Prolific.
This page lists 34 peer-reviewed papers tagged with Ai Alignment in the Prolific Citations Library, a curated collection of research powered by high-quality human data from Prolific.
-
Authors: C Yuan, B Ma, Z Zhang, B Prenkaj, F Kreuter, G Kasneci
Year: 2026
Published in: arXiv preprint arXiv:2601.08634, 2026•arxiv.org
Institution: Munich Center for Machine Learning, LMU Munich, Technical University of Munich
Research Area: Artificial Intelligence, AI Ethics, AI Alignment, Political Science, Computational Social Science
Discipline: Computer Science, Natural Language Processing
This paper examines how large language models’ (LLMs) political outputs shift when you explicitly prime them with different moral values. Instead of just assigning fake personas (like “pretend to be liberal”), the authors condition models to endorse or reject specific moral values (e.g., utilitarianism, fairness, authority). They then measure how those moral primes move the models’ positions in...
DOI: https://doi.org/10.48550/arXiv.2601.08634
-
Authors: U Messer
Year: 2025
Published in: Computers in Human Behavior: Artificial Humans, 2025 - Elsevier
Institution: Universität der Bundeswehr München
Research Area: Political Bias in Generative AI, Human-AI Interaction, Affective Computing, AI Bias
Discipline: Computer Science, Human-AI Interaction
People's acceptance and reliance on Generative AI (GAI) increase when they perceive alignment between their political orientation and the bias of GAI-generated content, leading to expanded trust in sensitive applications.
Methods: Three experiments analyzing behavioral reactions to politically biased content generated by GAI, including the impact of perceived alignment on acceptance and trust.
Key Findings: Participants' acceptance, reliance, and trust in GAI based on perceived alignment between political bias of GAI-generated content and their own political beliefs.
DOI: https://doi.org/10.1016/j.chbah.2024.100108
Citations: 24
Sample Size: 513
-
Authors: P. Schoenegger, F. Salvi, J. Liu, X. Nan, R. Debnath, B. Fasolo, E. Leivada, G. Recchia, F. Günther, A. Zarifhonarvar, J. Kwon, Z. Ul Islam, M. Dehnert, D. Y. H. Lee, M. G. Reinecke, D. G. Kamper, M. Kobaş, A. Sandford, J. Kgomo, L. Hewitt, S. Kapoor, K. Oktar, E. E. Kucuk, B. Feng, C. R. Jones, I. Gainsburg, S. Olschewski, N. Heinzelmann, F. Cruz, B. M. Tappin, T. Ma, P. S. Park, R. Onyonka, A. Hjorth, P. Slattery, Q. Zeng, L. Finke, I. Grossmann, A. Salatiello, E. Karger
Year: 2025
Published in: arXiv preprint arXiv ..., 2025 - arxiv.org
Institution: London School of Economics and Political Science, University of Cambridge, University College London, Massachusetts Institute of Technology, University of Oxford, Modulo Research, Stanford University, Federal Reserve Bank of Chicago, ETH Zürich, University of Johannesburg
Research Area: Natural Language Processing
Discipline: Social Science, Artificial Intelligence
This paper compares a frontier LLM (Claude Sonnet 3.5) against incentivized human persuaders in a conversational quiz setting, finding that the AI's persuasion capabilities surpass those of humans with real-money bonuses tied to performance.
Citations: 16
-
Authors: A Dahlgren Lindström, L Methnani, L Krause
Year: 2025
Published in: Ethics and Information ..., 2025 - Springer
Institution: Umeå University, Vrije Universiteit Amsterdam
Research Area: AI Alignment, AI Safety, Reinforcement Learning from Human Feedback (RLHF), Sociotechnical Systems
Discipline: Artificial Intelligence, Ethics
The paper critiques AI alignment efforts using RLHF and RLAIF, highlighting theoretical and practical limitations in meeting the goals of helpfulness, harmlessness, and honesty, and advocates for a broader sociotechnical approach to AI safety and ethics.
Methods: Sociotechnical critique of RLHF techniques with an analysis of theoretical frameworks and practical implementations.
Key Findings: The alignment of AI systems with human values and the efficacy of RLHF techniques in achieving the HHH principle (helpfulness, harmlessness, honesty).
DOI: https://doi.org/10.1007/s10676-025-09837-2
Citations: 14
-
Authors: L Muttenthaler, K Greff, F Born, B Spitzer, S Kornblith
Year: 2025
Published in: Nature, 2025 - nature.com
Institution: Google DeepMind, Google, Machine Learning Group, Technische Universität Berlin, BIFOLD, Berlin Institute for the Foundations of Learning and Data, Max Planck Institute
Research Area: Cognitive Alignment, Computer Vision, Multi-level Conceptual Knowledge
Discipline: Artificial Intelligence, Cognitive Science
This paper presents a method for **aligning machine vision model representations with human visual similarity judgments across different abstraction levels, improving how well models reflect human perceptual and conceptual organization and enhancing generalization and uncertainty prediction.
Citations: 11
-
Authors: S Liu, Z Cai, H Wang, Z Ma, X Li
Year: 2025
Published in: arXiv preprint arXiv:2505.19134, 2025 - arxiv.org
Institution: Meta, Imperial College London
Research Area: Artificial Intelligence, Crowdsourcing, Large Language Models
Discipline: Artificial Intelligence
The paper develops a principal-agent model to incentivize high-quality human annotations using golden questions and identifies criteria for these questions to effectively monitor annotators' performance.
Methods: The authors use a principal-agent model with maximum likelihood estimators (MLE) and hypothesis testing to design incentive-compatible systems for annotators. Golden questions of high certainty and similar format to normal data were selected and validated through experiments.
Key Findings: The effectiveness of golden questions for incentivizing and monitoring high-quality human annotations in preference data.
DOI: https://doi.org/10.48550/arXiv.2505.19134
Citations: 1
-
Authors: S Hatgis-Kessell, WB Knox, S Booth, S Niekum
Year: 2025
Published in: arXiv preprint arXiv ..., 2025 - arxiv.org
Institution: Stanford University, UMass Amherst, Carnegie Mellon University
Research Area: Reinforcement Learning from Human Feedback (RLHF)
Discipline: Artificial Intelligence, Human-Computer Interaction
The paper investigates whether human preferences can be influenced to align more closely with assumed preference models in RLHF algorithms through interventions such as showing model-derived quantities, training on preference models, and modifying elicitation questions.
Methods: Three human studies were conducted where interventions were tested, including revealing model-derived quantities, training participants on a preference model, and altering how preference questions were framed.
Key Findings: Evaluated the impact of interventions on humans' expression of preferences to align better with the assumed preference models of RLHF algorithms.
DOI: https://doi.org/10.48550/arXiv.2501.06416
Citations: 1
-
Authors: C Qian, AT Parisi, C Bouleau, V Tsai
Year: 2025
Published in: Proceedings of the ..., 2025 - aclanthology.org
Institution: Google, Google DeepMind
Research Area: Human-AI Alignment, Collective Reasoning, Social Biases, LLM Simulation of Human Behavior, AI Bias
Discipline: Natural Language Processing, Artificial Intelligence, Computational Social Science
This study examines human-AI alignment in collective reasoning using an empirical framework, demonstrating how LLMs either mirror or mask human biases depending on context, cues, and model-specific inductive biases.
Methods: The study uses the Lost at Sea social psychology task in a large-scale online experiment, simulating LLM groups conditioned on human decision-making data across varying conditions of visible or pseudonymous demographics.
Key Findings: Alignment of LLM behavior with human social reasoning, focusing on collective decision-making and biases in group interactions.
Citations: 1
Sample Size: 748
-
Authors: C Rastogi, TH Teh, P Mishra, R Patel, D Wang, M Díaz, A Parrish, AM Davani, Z Ashwood
Year: 2025
Published in: arXiv preprint arXiv:2507.13383, 2025•arxiv.org
Institution: Google DeepMind, Google Research, Google
Research Area: AI alignment, safety evaluation, AI Safety, Multimodal evaluation, Human-AI Interaction, Large Language Models
Discipline: Computer Science, Machine Learning, Artificial Intelligence
This research introduces the DIVE dataset to enable pluralistic alignment in text-to-image models by accounting for diverse safety perspectives, revealing demographic variations in harm perception and advancing T2I model alignment strategies.
Methods: The study involved collecting feedback across 1000 prompts from demographically intersectional human raters to capture diverse safety perspectives, with an emphasis on empirical and contextual differences in harm perception.
Key Findings: Safety perceptions of text-to-image (T2I) model outputs from diverse demographic viewpoints and the influence of these perspectives on alignment strategies.
Citations: 1
Sample Size: 1000
-
Authors: N Tyulina, Y Yu, TA Emmanouil, SI Levitan
Year: 2025
Published in: Proceedings of the 7th ACM ..., 2025 - dl.acm.org
Institution: University of Cambridge, University of Bath, University of Edinburgh, New York University
Research Area: Human-AI Interaction, Trust and Perception, Nonverbal Communication
Discipline: Applied Linguistics
Trust judgments are primarily influenced by auditory cues in both humans and multimodal models, though subtle differences in modality weighting exist between them.
Methods: Behavioral experiment with trust ratings of bimodal stimuli across four trust congruence conditions, combined with a multimodal model trained using HuBERT and ResNet-50 with late fusion, analyzed using Permutation Feature Importance (PFI).
Key Findings: The construction of trust from visual and auditory signals in both humans and multimodal models, focusing on modality dominance and feature weighting.
Sample Size: 150
-
Authors: T Kaufmann, P Weng, V Bengs, E Hüllermeier
Year: 2024
Published in: 2024 - epub.ub.uni-muenchen.de
Institution: Paderborn University, German Research Center for Artificial Intelligence (DFKI), Duke Kunshan University
Research Area: Reinforcement Learning from Human Feedback (RLHF), Large Language Models, Reward Modeling
Discipline: Artificial Intelligence
This paper surveys the fundamentals, diverse applications, and evolving impact of reinforcement learning from human feedback (RLHF), emphasizing its role in improving intelligent system alignment and performance.
Methods: The paper utilizes a survey-based approach to synthesize existing research, exploring the interactions between reinforcement learning algorithms and human input.
Key Findings: The study examines the principles, dynamics, applications, and trends in RLHF, offering insights into its role in enhancing large language models (LLMs) and intelligent systems.
Citations: 354
-
Authors: TR McIntosh, T Susnjak, T Liu, P Watters
Year: 2024
Published in: ... on Cognitive and ..., 2024 - ieeexplore.ieee.org
Institution: Cyberoo, Massey University, Cyberstronomy, RMIT University
Research Area: Semantic Vulnerabilities in LLMs, Ideological Manipulation, Reinforcement Learning from Human Feedback (RLHF) Limitations
Discipline: Computer Science, Artificial Intelligence, Machine Learning
RLHF mechanisms are insufficient to prevent semantic manipulation of LLMs, allowing them to express extreme ideological viewpoints when subjected to targeted conditioning techniques.
Methods: Psychological semantic conditioning techniques were applied to assess the susceptibility of LLMs to ideological manipulation.
Key Findings: The ability of LLMs to resist or adopt extreme ideological viewpoints under semantic conditioning.
Citations: 219
-
Authors: HR Kirk, M Bartolo, A Whitefield, P Rottger
Year: 2024
Published in: Advances in ..., 2024 - proceedings.neurips.cc
Institution: Meta, Cohere, AWS AI Labs, Contextual AI, Factored AI, University of Oxford, Bocconi University, Meedan, Hugging Face, University College London, ML Commons, University of Pennsylvania
Research Area: LLM Alignment, Human Feedback, Multicultural Studies
Discipline: Artificial Intelligence, Computational Social Science
The PRISM Alignment Dataset presents a large-scale, culturally diverse human feedback dataset linking sociodemographic profiles of 1,500 participants from 75 countries to their contextual preferences and fine‑grained ratings in 8,011 live conversations with 21 LLMs. This enables analysis of how subjective values vary across people and cultures in LLM alignment data.
DOI: https://doi.org/10.52202/079017-3342
Citations: 204
-
Authors: HR Kirk, I Gabriel, C Summerfield, B Vidgen
Year: 2024
Published in: Humanities and Social ..., 2025 - nature.com
Institution: Oxford Internet Institute, University of Oxford
Research Area: Socioaffective Alignment in Human-AI Relationships, AI Ethics, Behavioral Science
Discipline: Artificial Intelligence, Behavioral Science
The paper emphasizes the need for socioaffective alignment in human-AI relationships to ensure AI systems support human psychological needs rather than exploit them, as interactions with AI transition from transactional to sustained engagement.
Methods: Conceptual analysis of socioaffective dynamics in human-AI interactions, framed through psychological theories and principles.
Key Findings: Exploration of how AI systems impact socioaffective relationships, psychological needs, autonomy, companionship, and human well-being.
DOI: https://doi.org/10.1057/s41599-025-04532-5
Citations: 59
-
Authors: PW Mirowski, J Love, K Mathewson, S Mohamed
Year: 2024
Published in: ArXiv
Institution: Google DeepMind, Google
Research Area: AI Creativity, Humor Generation, Human-Computer Interaction
Discipline: Artificial Intelligence
Professional comedians found LLMs insufficient as creativity support tools for comedy, citing bias, bland output, and reinforcement of hegemonic viewpoints.
Methods: Workshops conducted with professional comedians combining comedy writing sessions using LLMs, a Creativity Support Index questionnaire, and focus groups discussing their experiences and ethical concerns.
Key Findings: Effectiveness of LLMs as creativity support tools for comedy writing, ethical concerns (bias, censorship, copyright), and value alignment in AI outputs.
Citations: 52
Sample Size: 20
-
Authors: M Shin, J Kim
Year: 2024
Published in: Available at SSRN 4725351, 2024 - researchgate.net
Institution: Massachusetts Institute of Technology, Yale University
Research Area: Linguistic Feature Alignment, Persuasion, Large Language Models
Discipline: Artificial Intelligence, Computational Social Science
Citations: 11
-
Authors: JD Lomas, W van der Maden, S Bandyopadhyay
Year: 2024
Published in: Advanced Design ..., 2024 - Elsevier
Institution: Delft University of Technolog, Playpower Labs, Hong Kong Polytechnic University, Utrecht University
Research Area: AI Alignment, Affective Computing, Emotional Expression in Generative AI, Human Perception of AI Emotions
Discipline: Affective Computing, Artificial Intelligence, Human-Computer Interaction
This study evaluates how well generative AI systems (like DALL·E 2/3 and Stable Diffusion) can generate emotionally expressive content that aligns with how humans perceive those emotions, finding that model performance varies by emotion type and model, with implications for designing more emotionally aligned AI.
DOI: https://doi.org/10.1016/j.ijadr.2024.10.002
Citations: 5
-
Authors: JD Lomas, W van der Maden
Year: 2024
Published in: arXiv preprint arXiv ..., 2024 - arxiv.org
Institution: Delft University of Technology, Microsoft Research
Research Area: Affective Computing, Human-AI Interaction, Image Generation
Discipline: Artificial Intelligence
DOI: https://doi.org/10.48550/arXiv.2405.18510
Citations: 5
-
Authors: V Kewenig, C Edwards
Year: 2024
Published in: ... and Rechardt, Akilles ..., 2023 - papers.ssrn.com
Research Area: Multimodal AI, Cognitive Science, Visual-Linguistic Integration
Discipline: Artificial Intelligence, Computational Linguistics, Cognitive Science
Citations: 2
-
Authors: N Meister
Year: 2024
Published in: ArXiv
Institution: Stanford University
Research Area: Distributional Alignment of LLMs, LLM Benchmarking, AI Robustness, AI Fairness, AI Bias
Discipline: Artificial Intelligence