VaViM and VaVAM: Autonomous Driving through Video Generative Modeling

Authors

* Detailed contributions are listed at the end of the page.

Abstract

We explore the potential of large-scale generative video models to enhance autonomous driving capabilities, introducing an open-source autoregressive video model (VaViM) and a companion video-action model (VaVAM). VaViM is a simple autoregressive model that predicts frames using spatio-temporal token sequences, while VaVAM leverages the learned representations to generate driving trajectories through imitation learning. Together, they offer a complete perception-to-action pipeline.
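To make the two-stage design concrete, below is a minimal PyTorch sketch of the idea described in the abstract: an autoregressive transformer trained by next-token prediction over flattened spatio-temporal video tokens, followed by a small action head trained by imitation learning to regress future waypoints. All names and dimensions here (VideoGPT, ActionHead, the waypoint horizon) are illustrative assumptions, not the released VaViM/VaVAM code or its actual architecture.

import torch
import torch.nn as nn

class VideoGPT(nn.Module):
    """Toy autoregressive model over flattened spatio-temporal token sequences."""
    def __init__(self, vocab_size=1024, dim=256, layers=4, heads=4, max_len=2048):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.pos_emb = nn.Embedding(max_len, dim)
        block = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):                      # tokens: (B, L) ids from a frame tokenizer
        B, L = tokens.shape
        pos = torch.arange(L, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        mask = nn.Transformer.generate_square_subsequent_mask(L).to(tokens.device)
        x = self.encoder(x, mask=mask)              # causal attention: each token sees only the past
        return self.head(x)                         # next-token logits, (B, L, vocab_size)

class ActionHead(nn.Module):
    """Toy imitation-learning head: pooled video features -> future (x, y) waypoints."""
    def __init__(self, dim=256, horizon=6):
        super().__init__()
        self.horizon = horizon
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, horizon * 2))

    def forward(self, feats):                       # feats: (B, dim)
        return self.mlp(feats).view(-1, self.horizon, 2)

# Stage 1: train VideoGPT with cross-entropy on next-token prediction (video pre-training).
# Stage 2: pool its features and train ActionHead on expert trajectories (imitation learning).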

VaViM Video Generation

Emergent behavior: avoiding an oncoming vehicle

We now showcase several driving demonstrations extracted from NeuroNCAP simulations. The left panel displays a bird's-eye view where gray boxes represent objects in the scene (for visualization purposes only), the red curve indicates the intended guiding path (from which a high-level command [RIGHT, LEFT, STRAIGHT] is derived), and black dots show the trajectory decided by the model. The right panels show the corresponding camera views from the front, front-left, and front-right perspectives (note that VaVAM only uses the front camera). In this video, despite being instructed to follow the guiding path straight ahead (shown in red), VaVAM demonstrates emergent defensive driving behavior when encountering a hazardous situation: an oncoming vehicle has entered the ego lane, creating a potential head-on collision. Without explicit programming or supervision for such scenarios, VaVAM autonomously deviates from its prescribed path to safely maneuver around the oncoming vehicle.
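For illustration, here is one simple way a high-level command could be derived from the guiding path: threshold the path's lateral offset in the ego frame at a fixed lookahead distance. This is a sketch under assumed conventions (x forward, y to the left, illustrative metric thresholds); the exact rule used in NeuroNCAP and in our evaluation may differ.

import numpy as np

def high_level_command(path_xy, lateral_threshold=2.0, lookahead=20.0):
    """Map a guiding path of shape (N, 2) in the ego frame to RIGHT / LEFT / STRAIGHT.
    Assumed convention: x forward, y to the left; thresholds are illustrative."""
    path_xy = np.asarray(path_xy, dtype=float)
    ahead = path_xy[path_xy[:, 0] <= lookahead]     # keep points within the lookahead distance
    if len(ahead) == 0:
        return "STRAIGHT"
    lateral = ahead[-1, 1]                          # lateral offset of the farthest kept point
    if lateral > lateral_threshold:
        return "LEFT"
    if lateral < -lateral_threshold:
        return "RIGHT"
    return "STRAIGHT"

print(high_level_command([[0.0, 0.0], [10.0, 0.5], [20.0, 3.0]]))  # -> LEFT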

Driving Comparison: UniAD (Hu et al., CVPR 2023) vs VaVAM

1. Front scenario 0013

Although UniAD successfully detects and predicts the trajectory of the oncoming vehicle, it is unable to execute a safe evasive maneuver to avoid the hazardous situation. Gray BEV boxes show ground-truth vehicles for visualization purposes only and are not inputs to the model. More results and a detailed comparison are available in our paper.

UniAD #1

UniAD #2

VaVAM

Failure Cases

While our model demonstrates strong overall performance, analyzing failure cases provides crucial insights into its limitations and helps identify areas for future improvement. Below, we present three representative examples that highlight different types of challenges in our framework. In our paper, we propose future work directions to address the fundamental challenges exposed by these critical scenarios.

#1 Collision Course with Oncoming Vehicle

The model maintains its trajectory despite an oncoming white vehicle, making no attempt at evasive action. This is particularly intriguing because our model demonstrates collision avoidance capabilities in many similar scenarios, achieving state-of-the-art performance in frontal situations. This raises important questions about what scene elements trigger appropriate safety responses versus failures in visually similar situations.

#2 Command-Trajectory Mismatch

At this intersection, despite receiving a clear "turn right" command, the model executes a left turn instead. We hypothesize this behavior stems from overfitting to the training data: this specific intersection likely appears in the training set, but with left turns, leading to a failure to generalize to alternative commands during evaluation.

#3 Limited Emergency Braking Response

When encountering a bus positioned diagonally across the road, a situation that clearly requires a complete stop, the model maintains motion. We have observed that our model rarely initiates complete stops or emergency braking, even in scenarios where such actions would be the optimal safety response.

BibTeX

@article{vavam2025,
  title={VaViM and VaVAM: Autonomous Driving through Video Generative Modeling},
  author={Florent Bartoccioni and Elias Ramzi and Victor Besnier and Shashanka Venkataramanan and Tuan-Hung Vu and Yihong Xu and Loick Chambon and Spyros Gidaris and Serkan Odabas and David Hurych and Renaud Marlet and Alexandre Boulch and Mickael Chen and Éloi Zablocki and Andrei Bursuc and Eduardo Valle and Matthieu Cord},
  journal={arXiv preprint arXiv:2502.15672},
  year={2025}
}

Detailed Contributions

Project Lead (Research direction, technical roadmap, project coordination)

Florent Bartoccioni

Core contributors (All aspects of the codebase, experiments, evaluations)

Florent Bartoccioni, Elias Ramzi

Contributors

Victor Besnier -- Visual Tokenization codebase using pre-trained VQGAN; FID metric code
Loick Chambon -- Data download, transfer and extraction; visualization code development
Eduardo Valle -- OpenDV preprocessing
Shashanka Venkataramanan -- Depth anything pseudo-GT generation
Tuan-Hung Vu -- GPT adaptation from nanoGPT
Yihong Xu -- nuPlan preprocessing and initial dataloader development

Technical report (Manuscript preparation, design, visualization, figures)

Florent Bartoccioni, Elias Ramzi, Victor Besnier, Shashanka Venkataramanan, Eloi Zablocki, Yihong Xu, Tuan-Hung Vu

Grant Acquisition (Grant proposals for Adastra, EuroHPC, and Jean Zay Grand Challenges)

Florent Bartoccioni, Alexandre Boulch, Eduardo Valle, Spyros Gidaris, Eloi Zablocki, Matthieu Cord, Serkan Odabas, David Hurych

Advisory (Research and organization guidance)

Eloi Zablocki, Alexandre Boulch, Mickael Chen

Senior Advisory (Research and organization guidance)

Eduardo Valle, Andrei Bursuc, Renaud Marlet, Matthieu Cord