Authors: C Rastogi, TH Teh, P Mishra, R Patel, D Wang, M Díaz, A Parrish, AM Davani, Z Ashwood
Year: 2025
Published in: arXiv preprint arXiv:2507.13383, 2025
Institution: Google DeepMind, Google Research, Google
Research Area: AI Alignment, AI Safety, Safety Evaluation, Multimodal Evaluation, Human–AI Interaction, LLMs
Discipline: Computer Science, Machine Learning, Artificial Intelligence
This research introduces the DIVE dataset to enable pluralistic alignment in text-to-image models by accounting for diverse safety perspectives, revealing demographic variations in harm perception and advancing T2I model alignment strategies.
Methods: The study collected safety feedback on 1000 prompts from demographically intersectional human raters, capturing diverse safety perspectives and emphasizing empirical and contextual differences in harm perception.
Key Findings: Perceptions of the safety of text-to-image (T2I) model outputs vary across demographic groups, and accounting for these diverse perspectives informs T2I alignment strategies.
Citations: 1
Sample Size: 1000
Authors: J Dai, X Pan, R Sun, J Ji, X Xu, M Liu, Y Wang
Year: 2023
Published in: arXiv preprint arXiv:2310.12773, 2023
Institution: Peking University
Research Area: Reinforcement Learning from Human Feedback (RLHF), Safe AI, Reinforcement Learning
Discipline: Artificial Intelligence, Machine Learning
DOI: https://doi.org/10.48550/arXiv.2310.12773
Citations: 598