Can large language models assess personality from asynchronous video interviews? A comprehensive evaluation of validity, reliability, fairness, and rating patterns
Authors: T Zhang, A Koutsoumpis, JK Oostrom
Published: 2025
Publication: IEEE Transactions ..., 2024
LLMs like GPT-3.5 and GPT-4 can rival or outperform task-specific AI models in assessing personality traits from asynchronous video interviews, but show uneven performance, low reliability, and potential biases, warranting cautious use in high-stakes scenarios.
Methods: The study evaluated GPT-3.5 and GPT-4 in assessing personality traits and interview performance from simulated asynchronous video interview (AVI) responses, comparing their ratings with those of task-specific AI models and human annotators.
Key Findings: GPT-3.5 and GPT-4 rivaled or outperformed task-specific AI models in assessing personality traits, but showed uneven performance across traits, low test-retest reliability, and potential biases in their rating patterns.
Limitations: Uneven performance across personality traits, insufficient test-retest reliability, and biases in LLM ratings.
Institution: Southeast University, Vrije Universiteit, Tilburg University
Research Area: LLM Personality Assessment, Human-AI Interaction, LLM
Discipline: Human-AI Interaction, Social Science, Humanities
Sample Size: 685 participants
Citations: 31