One-Minute Video Generation with Test-Time Training
Authors: K Dalal, D Koceja, G Hussein, J Xu, Y Zhao, Y Song, S Han, KC Cheung, J Kautz, C Guestrin, T Hashimoto, S Koyejo, Y Choi, Y Sun, X Wang
Published: 2025
Publication: ArXiv
The paper adds Test-Time Training (TTT) layers, whose hidden states are themselves small neural networks updated by gradient descent at inference time, to a pre-trained Transformer, enabling it to generate coherent one-minute videos from text storyboards; the approach outperforms baselines in storytelling coherence but still suffers from visual artifacts and an inefficient implementation.
Methods: Test-Time Training layers are embedded in a pre-trained Diffusion Transformer, fine-tuned on a dataset curated from Tom and Jerry cartoons, and compared against Mamba 2, Gated DeltaNet, and sliding-window attention layers (a minimal sketch of the TTT mechanism follows).
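To make the core mechanism concrete, here is a minimal, hedged PyTorch sketch of a TTT layer: its hidden state is the weight matrix of a small inner model, updated by one gradient step per token on a self-supervised reconstruction loss. The names (`TTTLinear`, `theta_q`, etc.), the single-head linear inner model, and the naive sequential loop are illustrative assumptions, not the paper's actual batched, chunk-parallel implementation.

```python
# A minimal sketch of a TTT layer with a linear inner model (assumptions:
# single head, naive per-token loop, plain MSE loss; the paper's version
# is chunked and heavily optimized).
import torch
import torch.nn as nn

class TTTLinear(nn.Module):
    def __init__(self, dim: int, lr: float = 0.1):
        super().__init__()
        # Learned "views" of each token, analogous to Q/K/V projections.
        self.theta_q = nn.Linear(dim, dim, bias=False)
        self.theta_k = nn.Linear(dim, dim, bias=False)
        self.theta_v = nn.Linear(dim, dim, bias=False)
        self.lr = lr  # inner-loop (test-time) learning rate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        B, T, D = x.shape
        # Hidden state = weights W of an inner linear model f(u; W) = u @ W,
        # updated by gradient descent as tokens stream in.
        W = torch.zeros(B, D, D, device=x.device, dtype=x.dtype)
        q, k, v = self.theta_q(x), self.theta_k(x), self.theta_v(x)
        outputs = []
        for t in range(T):
            qt, kt, vt = q[:, t], k[:, t], v[:, t]           # (B, D) each
            # Self-supervised loss: l(W; x_t) = ||k_t @ W - v_t||^2
            pred = torch.bmm(kt.unsqueeze(1), W).squeeze(1)  # (B, D)
            err = pred - vt
            # One gradient step on W: grad = 2 * k_t^T (k_t W - v_t)
            grad = 2 * torch.bmm(kt.unsqueeze(2), err.unsqueeze(1))
            W = W - self.lr * grad
            # Output token uses the *updated* state: z_t = q_t @ W_t
            outputs.append(torch.bmm(qt.unsqueeze(1), W).squeeze(1))
        return torch.stack(outputs, dim=1)                   # (B, T, D)
```

In the paper, layers of this kind are interleaved with the attention layers of the pre-trained Transformer, and the inner-loop updates are parallelized over chunks of tokens rather than applied one token at a time as above.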
Key Findings: TTT layers generate substantially more coherent multi-scene, one-minute videos than the baselines, leading by 34 Elo points in a human evaluation.
Limitations: Artifacts in the generated videos inherited from the pre-trained model's limitations, resource constraints that restrict experiments to one-minute videos, and an implementation whose efficiency still needs improvement.
Institution: NVIDIA, Stanford University, UT Austin, University of California Berkeley, University of California San Diego
Research Area: Video Generation, Diffusion Models, Test-Time Training
Discipline: Computer Science
Sample Size: 100 videos per method (human evaluation)
Citations: 52