VaViM and VaVAM: Autonomous Driving through Video Generative Modeling
Authors
Core Contributors*
* Detailed contributions are listed at the end of the page.
Abstract
We explore the potential of large-scale generative video models for autonomous driving, introducing an open-source auto-regressive video model (VaViM) and its companion video-action model (VaVAM) to investigate how video pre-training transfers to real-world driving. VaViM is a simple auto-regressive video model that predicts frames using spatio-temporal token sequences. We show that it captures the semantics and dynamics of driving scenes. VaVAM, the video-action model, leverages the learned representations of VaViM to generate driving trajectories through imitation learning. Together, the models form a complete perception-to-action pipeline. We evaluate our models in open- and closed-loop driving scenarios, revealing that video-based pre-training holds promise for autonomous driving. Key insights include the semantic richness of the learned representations, the benefits of scaling for video synthesis, and the complex relationship between model size, data, and safety metrics in closed-loop evaluations.
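To make the two-stage recipe described above concrete, below is a minimal, self-contained PyTorch sketch of the idea: a small GPT-style model trained with next-token prediction over discretized spatio-temporal video tokens, whose hidden states feed an action head trained by imitation on expert waypoints. All module names, dimensions, and layer choices here are illustrative assumptions, not the released VaViM/VaVAM implementation.

# Minimal sketch of the video pre-training + imitation-learning recipe.
# Everything below (sizes, layers, loss choices) is an assumption for illustration.
import torch
import torch.nn as nn

class TinyAutoregressiveVideoModel(nn.Module):
    """Predicts the next spatio-temporal token from past tokens (VaViM-style pre-training)."""
    def __init__(self, vocab_size=1024, d_model=256, n_layers=4, n_heads=8, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):  # tokens: (B, T), frame tokens flattened over space and time
        B, T = tokens.shape
        pos = torch.arange(T, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        # Causal mask so each token attends only to the past.
        mask = torch.triu(torch.full((T, T), float("-inf"), device=tokens.device), diagonal=1)
        h = self.blocks(x, mask=mask)
        return self.head(h), h  # logits for next-token loss, hidden states for the action head

class TinyActionHead(nn.Module):
    """Maps the video model's representation to a future trajectory (VaVAM-style imitation)."""
    def __init__(self, d_model=256, horizon=6):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, horizon * 2))
        self.horizon = horizon

    def forward(self, hidden):
        return self.mlp(hidden[:, -1]).view(-1, self.horizon, 2)  # (x, y) waypoints

# Stage 1: next-token prediction on video tokens. Stage 2: L2 imitation loss on expert waypoints.
video_model, action_head = TinyAutoregressiveVideoModel(), TinyActionHead()
tokens = torch.randint(0, 1024, (2, 128))            # stand-in for tokenized past frames
logits, hidden = video_model(tokens)
pretrain_loss = nn.functional.cross_entropy(logits[:, :-1].reshape(-1, 1024), tokens[:, 1:].reshape(-1))
expert_traj = torch.randn(2, 6, 2)                   # stand-in for ground-truth future waypoints
imitation_loss = nn.functional.mse_loss(action_head(hidden), expert_traj)

The key design choice this sketch mirrors is that the action head reuses the representations learned during video pre-training rather than training perception from scratch.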
VaViM Video Generation
Driving Comparison: UniAD vs. VaVAM
1. Front camera, scenario 0013 -- gray BEV boxes show ground-truth vehicles, for visualization purposes only
UniAD #1
UniAD #2
VaVAM
Emergent behavior: avoiding an oncoming vehicle
Failure Cases
Critical Failure
Model ignores the command. Possible train/validation overlap on nuScenes?
Fails to brake
BibTeX
@article{vavam2025,
  title   = {VaViM and VaVAM: Autonomous Driving through Video Generative Modeling},
  author  = {Bartoccioni, Florent and Ramzi, Elias and Besnier, Victor and Venkataramanan, Shashanka and Vu, Tuan-Hung and Xu, Yihong and Chambon, Loick and Gidaris, Spyros and Odabas, Serkan and Hurych, David and Marlet, Renaud and Boulch, Alexandre and Chen, Mickael and Zablocki, Eloi and Bursuc, Andrei and Valle, Eduardo and Cord, Matthieu},
  journal = {arXiv preprint arXiv:XXXX.XXXXX},
  year    = {2025}
}
Detailed Contributions
Project Lead (Research direction, technical roadmap, project coordination)
Florent Bartoccioni
Core contributors (All aspects of the codebase, experiments, evaluations)
Florent Bartoccioni, Elias Ramzi
Contributors
Victor Besnier -- Visual tokenization codebase using a pre-trained VQGAN; FID metric code
Loick Chambon -- Data download, transfer, and extraction; visualization code development
Eduardo Valle -- OpenDV preprocessing
Shashanka Venkataramanan -- Depth Anything pseudo-ground-truth generation
Tuan-Hung Vu -- GPT adaptation from nanoGPT
Yihong Xu -- nuPlan preprocessing and initial dataloader development
Technical report (Manuscript preparation, design, visualization, figures)
Florent Bartoccioni, Elias Ramzi, Victor Besnier, Shashanka Venkataramanan, Eloi Zablocki, Yihong Xu, Tuan-Hung Vu
Grant Acquisition (Grant proposals for Adastra, EuroHPC, and Jean Zay Grand Challenges)
Florent Bartoccioni, Alexandre Boulch, Eduardo Valle, Spyros Gidaris, Eloi Zablocki, Matthieu Cord, Serkan Odabas, David Hurych
Advisory (Research and organization guidance)
Eloi Zablocki, Alexandre Boulch, Mickael Chen
Senior Advisory (Research and organization guidance)
Eduardo Valle, Andrei Bursuc, Renaud Marlet, Matthieu Cord