VaViM and VaVAM: Autonomous Driving through Video Generative Modeling

Authors

* Detailed contributions are listed at the end of the page.

Abstract

We explore the potential of large-scale generative video models for autonomous driving, introducing an open-source auto-regressive video model (VaViM) and its companion video-action model (VaVAM) to investigate how video pre-training transfers to real-world driving. VaViM is a simple auto-regressive video model that predicts frames using spatio-temporal token sequences. We show that it captures the semantics and dynamics of driving scenes. VaVAM, the video-action model, leverages the learned representations of VaViM to generate driving trajectories through imitation learning. Together, the models form a complete perception-to-action pipeline. We evaluate our models in open- and closed-loop driving scenarios, revealing that video-based pre-training holds promise for autonomous driving. Key insights include the semantic richness of the learned representations, the benefits of scaling for video synthesis, and the complex relationship between model size, data, and safety metrics in closed-loop evaluations.
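To make the two-stage recipe concrete, below is a minimal sketch of the pipeline described above: a GPT-style model pre-trained with next-token prediction over flattened spatio-temporal video tokens (VaViM), whose features are then fine-tuned with an imitation-learning head that regresses driving waypoints (VaVAM). All names, dimensions, the codebook size, and the losses are illustrative assumptions, not the released VaViM/VaVAM code.

# Minimal sketch of the perception-to-action recipe described above.
# All names, sizes, and objectives are illustrative assumptions,
# not the released VaViM/VaVAM API.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 8192      # assumed VQGAN codebook size
D_MODEL = 512     # illustrative transformer width
HORIZON = 6       # assumed number of future waypoints

class VideoGPT(nn.Module):
    """GPT-style causal decoder over flattened spatio-temporal token sequences."""
    def __init__(self, n_layers=6, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, tokens):  # tokens: (B, T) integer VQGAN codes
        h = self.embed(tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(
            tokens.size(1)).to(tokens.device)
        h = self.blocks(h, mask=mask)
        return self.head(h), h  # next-token logits and hidden features

def video_pretrain_loss(model, tokens):
    """VaViM-style objective: predict each video token from its predecessors."""
    logits, _ = model(tokens[:, :-1])
    return F.cross_entropy(logits.reshape(-1, VOCAB),
                           tokens[:, 1:].reshape(-1))

class ActionHead(nn.Module):
    """Maps pooled video features to a driving trajectory of (x, y) waypoints."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(D_MODEL, 256), nn.ReLU(),
                                 nn.Linear(256, HORIZON * 2))

    def forward(self, feats):  # feats: (B, T, D_MODEL)
        return self.mlp(feats.mean(dim=1)).view(-1, HORIZON, 2)

def imitation_loss(model, head, tokens, expert_traj):
    """VaVAM-style fine-tuning: regress expert waypoints from video features."""
    _, feats = model(tokens)
    return F.mse_loss(head(feats), expert_traj)  # expert_traj: (B, HORIZON, 2)

The key point of the design is that a single causal transformer backbone serves both stages: its logits drive video generation during pre-training, while its hidden features feed the action head during imitation learning.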

VaViM Video Generation

Driving Comparison: UniAD vs VaVAM

1. Front scenario 0013 -- gray BEV boxes show ground-truth vehicles, for visualization purposes only

UniAD #1

UniAD #2

VaVAM

Emergent behavior: avoiding an oncoming vehicle

Failure Cases

Critical Failure

The model ignores the command. Possible train/validation set overlap in nuScenes?

Fails to brake

BibTeX

@article{vavam2025,
  title={VaViM and VaVAM: Autonomous Driving through Video Generative Modeling},
  author={Bartoccioni, Florent and Ramzi, Elias and Besnier, Victor and Venkataramanan, Shashanka and Vu, Tuan-Hung and Xu, Yihong and Chambon, Loick and Gidaris, Spyros and Odabas, Serkan and Hurych, David and Marlet, Renaud and Boulch, Alexandre and Chen, Mickael and Zablocki, Eloi and Bursuc, Andrei and Valle, Eduardo and Cord, Matthieu},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2025}
}

Detailed Contributions

Project Lead (Research direction, technical roadmap, project coordination)

Florent Bartoccioni

Core Contributors (All aspects of the codebase, experiments, evaluations)

Florent Bartoccioni, Elias Ramzi

Contributors

Victor Besnier -- Visual tokenization codebase using a pre-trained VQGAN; FID metric code

Loick Chambon -- Data download, transfer, and extraction; visualization code development

Eduardo Valle -- OpenDV preprocessing

Shashanka Venkataramanan -- Depth Anything pseudo-GT generation

Tuan-Hung Vu -- GPT adaptation from nanoGPT

Yihong Xu -- nuPlan preprocessing and initial dataloader development

Technical Report (Manuscript preparation, design, visualization, figures)

Florent Bartoccioni, Elias Ramzi, Victor Besnier, Shashanka Venkataramanan, Eloi Zablocki, Yihong Xu, Tuan-Hung Vu

Grant Acquisition (Grant proposals for Adastra, EuroHPC, and Jean Zay Grand Challenges)

Florent Bartoccioni, Alexandre Boulch, Eduardo Valle, Spyros Gidaris, Eloi Zablocki, Matthieu Cord, Serkan Odabas, David Hurych

Advisory (Research and organization guidance)

Eloi Zablocki, Alexandre Boulch, Mickael Chen

Senior Advisory (Research and organization guidance)

Eduardo Valle, Andrei Bursuc, Renaud Marlet, Matthieu Cord