One-Minute Video Generation with Test-Time Training
Abstract
Transformers today still struggle to generate one-minute videos because self-attention layers are inefficient for long context. Alternatives such as Mamba layers struggle with complex multi-scene stories because their hidden states are less expressive. We experiment with Test-Time Training (TTT) layers, whose hidden states themselves can be neural networks, therefore more expressive. Adding TTT layers into a pre-trained Transformer enables it to generate one-minute videos from text storyboards. For proof of concept, we curate a dataset based on Tom and Jerry cartoons. Compared to baselines such as Mamba 2, Gated DeltaNet, and sliding-window attention layers, TTT layers generate much more coherent videos that tell complex stories, leading by 34 Elo points in a human evaluation of 100 videos per method. Although promising, results still contain artifacts, likely due to the limited capability of the pre-trained 5B model. The efficiency of our implementation can also be improved. We have only experimented with one-minute videos due to resource constraints, but the approach can be extended to longer videos and more complex stories. Sample videos, code and annotations are available
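To make the idea of a hidden state that is itself a model more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes the simplest possible case: the hidden state is a single linear map W, the self-supervised task is to reconstruct each token from itself (identity projections, where a real layer would presumably use learned ones), and the update is one gradient step per token. The names `matvec`, `ttt_linear_sketch`, and `lr` are hypothetical.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]


def ttt_linear_sketch(tokens, lr=0.1):
    """Run a toy TTT-style layer over a sequence of token vectors.

    The hidden state W is a d x d linear model. For every token it takes one
    gradient step on the reconstruction loss ||W x - x||^2, then emits W x
    using the updated weights. Assumption: identity projections, so the
    training input, target, and query are all the token itself.
    """
    d = len(tokens[0])
    W = [[0.0] * d for _ in range(d)]  # hidden state starts at zero
    outputs, losses = [], []
    for x in tokens:
        # Loss is measured before the update, with the current hidden state.
        err = [p - t for p, t in zip(matvec(W, x), x)]
        losses.append(sum(e * e for e in err))
        # Gradient of ||W x - x||^2 with respect to W is 2 * err * x^T.
        for i in range(d):
            for j in range(d):
                W[i][j] -= lr * 2.0 * err[i] * x[j]
        # The output is produced by the updated hidden state.
        outputs.append(matvec(W, x))
    return outputs, losses


if __name__ == "__main__":
    sequence = [[1.0, 0.0]] * 3
    outs, losses = ttt_linear_sketch(sequence)
    print(losses)
```

Feeding the same token repeatedly makes the per-token loss shrink, which is the sense in which the layer "trains" on its context as it reads it. Replacing the linear map with a small neural network gives a more expressive hidden state, which is the property the abstract highlights.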
Study specs
Experimentation with Test-Time Training layers embedded in pre-trained Transformer models, evaluated using a dataset curated from Tom and Jerry cartoons and compared against Mamba 2, Gated DeltaNet, and sliding-window attention layers.
- Authors: K Dalal, D Koceja, G Hussein, J Xu, Y Zhao, Y Song, S Han, KC Cheung, J Kautz, C Guestrin, T Hashimoto, S Koyejo, Y Choi, Y Sun, X Wang
- Institution: Nvidia, Stanford University, UT Austin, University of California Berkeley, University of California San Diego
- Discipline: Computer Science
- Sample Size: N=100
- Study Type: methodology
- Year: 2025
- Human Data Platform: Prolific
Measured Outcomes
Effectiveness of video generation methods in creating coherent multi-scene stories in one-minute videos.
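The abstract reports the TTT lead as 34 Elo points. As a rough guide to what a gap of that size means, the sketch below converts an Elo difference into an expected pairwise preference rate using the standard logistic Elo formula with a scale of 400. Whether the study's ratings use exactly this scale is an assumption, and the function name `expected_win_rate` is hypothetical.

```python
def expected_win_rate(elo_gap, scale=400.0):
    """Expected probability that the higher-rated method is preferred.

    Uses the standard Elo expectation 1 / (1 + 10^(-gap / scale)).
    Assumption: the conventional scale of 400 applies to the reported ratings.
    """
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / scale))


if __name__ == "__main__":
    # A 34-point gap, as reported in the abstract.
    print(round(expected_win_rate(34), 3))
```

Under that assumption, a 34-point lead corresponds to the leading method being preferred in roughly 55% of head-to-head comparisons, a consistent but modest margin.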
Peer Review & Critical Discussion
Potential Selection Bias in 2023 Cohort
The participant pool shows a concerning overrepresentation of users from high-income demographics. Looking at Table 3, we can see that 78% of respondents had annual incomes above $75k, which significantly limits the generalizability of these findings to broader populations.
Non-naive Participants Issue
I've noticed a methodological concern regarding participant naivety. Given that Prolific users often complete multiple studies, there's a real risk that participants had prior exposure to similar experimental paradigms, which could confound the results.
RLHF Applicability to This Study Design
The implications for RLHF training pipelines are understated. If we accept the authors' conclusions about preference stability, this has direct consequences for how we should structure reward model training. The temporal decay effect described in Section 4.2 is particularly relevant.