Can large language models assess personality from asynchronous video interviews? A comprehensive evaluation of validity, reliability, fairness, and rating patterns

Authors: T Zhang, A Koutsoumpis, JK Oostrom

Published: 2025

Publication: IEEE Transactions ..., 2024 - ieeexplore.ieee.org

LLMs like GPT-3.5 and GPT-4 can rival or outperform task-specific AI models in assessing personality traits from asynchronous video interviews, but show uneven performance, low reliability, and potential biases, warranting cautious use in high-stakes scenarios.

Methods: The study evaluated how well GPT-3.5 and GPT-4 assess personality traits and interview performance from simulated asynchronous video interview (AVI) responses, comparing their ratings with those from task-specific AI models and human annotators.
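Illustration: The comparison described under Methods can be pictured with a small sketch. It is not the authors' analysis code, and the record does not say which statistics they used. It assumes validity is read as the Pearson correlation between LLM and human ratings, and test-retest reliability as the correlation between two LLM rating runs on the same responses. All rating values and variable names below are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of ratings."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant ratings")
    return cov / sqrt(var_x * var_y)


# Hypothetical ratings of one trait for five interview responses.
human_ratings = [3.0, 4.5, 2.0, 5.0, 3.5]   # human annotators (assumed)
llm_run_1 = [3.2, 4.0, 2.5, 4.8, 3.0]       # first LLM rating pass (assumed)
llm_run_2 = [2.8, 4.4, 2.1, 4.6, 3.9]       # repeated LLM rating pass (assumed)

# Agreement with human ratings, one way to look at validity.
validity = pearson(llm_run_1, human_ratings)

# Agreement between repeated runs, one way to look at test-retest reliability.
retest_reliability = pearson(llm_run_1, llm_run_2)

print(f"validity (LLM vs. human): {validity:.2f}")
print(f"test-retest (run 1 vs. run 2): {retest_reliability:.2f}")
```

A low value for the second correlation would correspond to the kind of insufficient test-retest reliability listed under Limitations.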

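Key Findings: Across validity, reliability, fairness, and rating patterns, GPT-3.5 and GPT-4 rivaled or outperformed task-specific AI models in assessing personality from asynchronous video interviews, but their performance was uneven across traits, their reliability was low, and their ratings showed potential biases.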

Limitations: Uneven performance across personality traits, insufficient test-retest reliability, and biases in LLM ratings.

Institution: Southeast University, Vrije Universiteit, Tilburg University

Research Area: LLM Personality Assessment, Human-AI Interaction, LLM

Discipline: Human-AI Interaction, Social Science, Humanities

Sample Size: 685 participants

Citations: 31