SafetyAnalyst: Interpretable, transparent, and steerable LLM safety moderation
Authors: Jing-Jing Li, Valentina Pyatkin, Max Kleiman-Weiner, Liwei Jiang, Nouha Dziri, Anne G. E. Collins, Jana Schaich Borg, Maarten Sap, Yejin Choi, Sydney Levine
Published: 2024
Publication: arXiv
Institution: Allen Institute for AI; Duke University; University of California, Berkeley; University of Washington
Research Area: LLM Safety Moderation, Interpretable AI (XAI), LLM Alignment, Steerable AI
Discipline: Artificial Intelligence