Multi-turn Evaluation of Anthropomorphic Behaviours in Large Language Models

Authors: L Ibrahim, C Akbulut, R Elasmar, C Rastogi, M Kahng, MR Morris, KR McKee, V Rieser, M Shanahan, L Weidinger

Published: 2025

Publication: arXiv preprint arXiv:2502.07077, 2025•arxiv.org

The paper evaluates anthropomorphic behaviors in SOTA LLMs through a multi-turn methodology, showing that such behaviors, including empathy and relationship-building, predominantly emerge after multiple interactions and influence user perceptions.

Methods: Multi-turn evaluation of 14 anthropomorphic behaviors using simulations of user interactions, validated by a large-scale human subject study.

Key Findings: Anthropomorphic behaviors in large language models, including relationship-building and pronoun usage, and their perception by users.

Limitations: The study focuses on model behaviors in specific interactive settings and may not generalize across all possible contexts of LLM deployment.

Institution: Google DeepMind, Google, University of Oxford

Research Area: Multimodal conversational AI, conversational AI, Evaluation methodology, benchmarking

Discipline: Computer Science , Natural Language Processing (NLP), Human–Computer Interaction (HCI)

Sample Size: 1101 participants

Citations: 26