Authors: K Dalal, D Koceja, G Hussein, J Xu, Y Zhao, Y Song, S Han, KC Cheung, J Kautz, C Guestrin, T Hashimoto, S Koyejo, Y Choi, Y Sun, X Wang
Year: 2025
Published in: arXiv
Institution: NVIDIA, Stanford University, UT Austin, University of California Berkeley, University of California San Diego
Research Area: Video Generation, Diffusion Models, Test-Time Training
Discipline: Computer Science
The paper adds Test-Time Training (TTT) layers to a pre-trained Transformer to generate coherent one-minute videos from text storyboards; the approach outperforms baselines in storytelling coherence but still faces efficiency limitations and visual artifacts.
Methods: TTT layers are embedded in a pre-trained Transformer and fine-tuned on a dataset curated from Tom and Jerry cartoons; the approach is evaluated against Mamba 2, Gated DeltaNet, and sliding-window attention layers (a minimal TTT-layer sketch follows this entry).
Key Findings: TTT layers produce more coherent multi-scene stories over one-minute videos than the baseline sequence-modeling layers in human evaluation.
Citations: 52
Sample Size: 100
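To make the method description concrete, here is a minimal sketch of a TTT-Linear-style layer, assuming the hidden state is itself a small linear model updated by gradient steps on a self-supervised reconstruction loss as the sequence is processed. The projection names, learning rate, and per-token loop are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn


    class TTTLinearLayer(nn.Module):
        """Sketch of a TTT layer whose hidden state is a per-sequence linear model W."""

        def __init__(self, dim: int, lr: float = 0.1):
            super().__init__()
            self.lr = lr
            # Hypothetical projections producing the training view, target, and query.
            self.proj_k = nn.Linear(dim, dim)  # input to the inner model
            self.proj_v = nn.Linear(dim, dim)  # reconstruction target
            self.proj_q = nn.Linear(dim, dim)  # query used to produce the output

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, seq_len, dim)
            b, t, d = x.shape
            # The inner model's weights are the hidden state, reset per sequence.
            W = x.new_zeros(b, d, d)
            outputs = []
            for i in range(t):
                k = self.proj_k(x[:, i])  # (b, d)
                v = self.proj_v(x[:, i])  # (b, d)
                q = self.proj_q(x[:, i])  # (b, d)
                # Self-supervised loss: reconstruct v from k with the inner model W.
                pred = torch.bmm(k.unsqueeze(1), W).squeeze(1)      # (b, d)
                err = pred - v
                # One gradient step on 0.5 * ||kW - v||^2 w.r.t. W: grad = outer(k, err).
                grad = torch.bmm(k.unsqueeze(2), err.unsqueeze(1))  # (b, d, d)
                W = W - self.lr * grad
                # Output: apply the updated inner model to the query view.
                outputs.append(torch.bmm(q.unsqueeze(1), W).squeeze(1))
            return torch.stack(outputs, dim=1)                       # (b, t, d)

In the paper's setting, layers of this kind are interleaved with the attention blocks of the pre-trained Transformer so that context can be carried across the full one-minute sequence.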
Authors: Y Wu, C Huang, F Yang, F Wang
Year: 2025
Published in: arXiv
Institution: NVIDIA, National Taiwan University
Research Area: Motion Customization of Text-to-Video Diffusion Models
Discipline: Computer Vision, Pattern Recognition
MotionMatcher is a framework for motion customization of text-to-video (T2V) diffusion models that fine-tunes against high-level spatio-temporal motion features rather than pixel-level objectives, achieving state-of-the-art performance.
Methods: A pre-trained T2V diffusion model is fine-tuned at the feature level, matching spatio-temporal motion features extracted from the reference video instead of optimizing a pixel-level objective (see the loss sketch after this entry).
Key Findings: The feature-level objective captures complex motion accurately while avoiding content leakage from the reference videos.
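For illustration, the sketch below shows what a feature-level motion-matching objective can look like, assuming an extractor that returns spatio-temporal motion features (e.g., temporal-attention activations) from the diffusion backbone. The callable, layer choices, normalization, and weighting are assumptions, not MotionMatcher's exact formulation.

    import torch
    import torch.nn.functional as F


    def motion_feature_loss(
        extract_features,             # callable: latents -> list of feature tensors (assumed)
        pred_latents: torch.Tensor,   # latents for the video being fine-tuned
        ref_latents: torch.Tensor,    # latents of the reference (motion) video
    ) -> torch.Tensor:
        """Compare high-level motion features instead of pixel or noise values."""
        pred_feats = extract_features(pred_latents)
        with torch.no_grad():
            ref_feats = extract_features(ref_latents)
        loss = pred_latents.new_zeros(())
        for f_pred, f_ref in zip(pred_feats, ref_feats):
            # Normalize so the objective emphasizes motion structure over magnitude.
            loss = loss + F.mse_loss(
                F.normalize(f_pred, dim=-1),
                F.normalize(f_ref, dim=-1),
            )
        return loss / max(len(pred_feats), 1)

Matching in feature space, rather than on raw pixels or noise predictions, is what lets the fine-tuned model transfer the reference motion without copying its appearance.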