The proposed architecture is composed of three transformer blocks: one encoder (perception) and two decoders (trajectory and scoring). The perception encoder compresses perceptual information into camera-aware registers, enabling lightweight subsequent processing in the trajectory and scoring decoders.
We present DrivoR, a simple and efficient transformer-based architecture for end-to-end autonomous driving. Our approach builds on pretrained Vision Transformers (ViTs) and introduces camera-aware register tokens that compress multi-camera features into a compact scene representation, significantly reducing downstream computation without sacrificing accuracy. These tokens drive two lightweight transformer decoders that generate and then score candidate trajectories. The scoring decoder learns to mimic an oracle and predicts interpretable sub-scores representing aspects such as safety, comfort, and efficiency, enabling behavior-conditioned driving at inference. Despite its minimal design, DrivoR outperforms or matches strong contemporary baselines across NAVSIM-v1, NAVSIM-v2, and the photorealistic closed-loop HUGSIM benchmark. Our results show that a pure-transformer architecture, combined with targeted token compression, is sufficient for accurate, efficient, and adaptive end-to-end driving. Code and checkpoints will be made available.
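As a rough illustration of this layout, the sketch below mocks up the three blocks in PyTorch. All sizes, module choices, and names (DrivoRSketch, traj_queries, etc.) are our own placeholders rather than the paper's implementation; the real model builds on a pretrained ViT backbone.

import torch
import torch.nn as nn

class DrivoRSketch(nn.Module):
    """Minimal sketch of the three-block layout; hypothetical sizes."""

    def __init__(self, dim=256, n_cams=8, n_regs=16, n_traj=20, n_subscores=3):
        super().__init__()
        # Camera-aware registers: a distinct learned set of tokens per camera.
        self.registers = nn.Parameter(torch.randn(n_cams, n_regs, dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=4)
        self.traj_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
        self.score_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
        self.traj_queries = nn.Parameter(torch.randn(n_traj, dim))
        self.traj_head = nn.Linear(dim, 8 * 3)          # e.g. 8 waypoints x (x, y, yaw)
        self.score_head = nn.Linear(dim, n_subscores)   # e.g. safety / comfort / efficiency

    def forward(self, patch_tokens):
        # patch_tokens: (B, n_cams, n_patches, dim) from the pretrained ViT.
        B = patch_tokens.shape[0]
        patches = patch_tokens.flatten(1, 2)
        regs = self.registers.flatten(0, 1).expand(B, -1, -1)
        # Encode patches and registers jointly, then keep only the registers,
        # so the decoders operate on a compact scene representation.
        scene = self.encoder(torch.cat([patches, regs], dim=1))[:, patches.shape[1]:]
        traj_feats = self.traj_decoder(self.traj_queries.expand(B, -1, -1), scene)
        trajs = self.traj_head(traj_feats)   # candidate trajectories
        # Stop gradient so scoring does not alter trajectory generation.
        scores = self.score_head(self.score_decoder(traj_feats.detach(), scene))
        return trajs, scores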
DrivoR achieves state-of-the-art results on the NAVSIM-v1, NAVSIM-v2, and HUGSIM benchmarks.
Register Specialization: DrivoR compresses each camera into 16 scene tokens using DINO registers. We visualize the inter-token cosine similarity for each camera: as cameras become less important to the driving task, their registers collapse to increasingly similar representations, highlighting learned camera-specific compression. Cosine similarity is computed on the navval validation set.
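The similarity measure behind this figure is straightforward. A minimal sketch, using randomly initialized stand-in tokens in place of the model's actual registers:

import torch
import torch.nn.functional as F

# Hypothetical per-camera register tokens: (n_cams, 16, dim).
regs = torch.randn(8, 16, 256)

# Inter-token cosine similarity within each camera, as in the figure.
normed = F.normalize(regs, dim=-1)
sim = normed @ normed.transpose(-1, -2)     # (n_cams, 16, 16)

# A collapsed camera has high mean off-diagonal similarity.
off_diag = sim - torch.eye(16)
collapse = off_diag.sum(dim=(-1, -2)) / (16 * 15)
print(collapse)  # one scalar per camera; higher = more redundant tokens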
Attention Maps: Image-patch-to-camera-token cross-attention, taken from the final cross-attention layer. Front-camera tokens specialize to distinct regions (traffic light, lead vehicle, road edges), while back-camera tokens largely collapse onto the same features, aside from a single distinct token, further highlighting camera-specific compression.
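For intuition, heatmaps like these can be read out of a cross-attention layer roughly as follows. The shapes and the manual attention computation are illustrative assumptions, not the model's internals:

import torch

# Hypothetical readout: 16 register-token queries attending to a
# 32x64 grid of image patches for one camera (sizes are made up).
n_regs, H, W, dim = 16, 32, 64, 256
q = torch.randn(n_regs, dim)     # register tokens at the final cross-attn layer
k = torch.randn(H * W, dim)      # image-patch keys

attn = torch.softmax(q @ k.T / dim ** 0.5, dim=-1)   # (16, H*W)
maps = attn.reshape(n_regs, H, W)                    # one spatial heatmap per token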
Disentanglement: DrivoR disentangles trajectory generation and scoring by reprojecting decoded trajectories and applying a stop gradient. The cross-attention between trajectory queries and camera tokens shows why this matters: trajectory generation and scoring attend to different cameras.
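A minimal sketch of the stop-gradient step, with hypothetical module names (reproj, score_head) and sizes:

import torch
import torch.nn as nn

B, n_traj, T, dim = 2, 20, 8, 256
traj_xy = torch.randn(B, n_traj, T * 3, requires_grad=True)  # decoded candidates
scene = torch.randn(B, 128, dim)                             # camera register tokens

reproj = nn.Linear(T * 3, dim)   # re-embed decoded waypoints for the scorer
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
score_head = nn.Linear(dim, 3)

# Stop gradient: scoring sees the trajectories but cannot alter them.
traj_emb = reproj(traj_xy.detach())
sub_scores = score_head(decoder(traj_emb, scene))
sub_scores.sum().backward()
assert traj_xy.grad is None      # no gradient leaks into the trajectory branch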
Behavior Tuning: Results of safety-oriented fine-tuning, where scoring coefficients are adjusted to vary the behavior of the driving policy. Dark blue was tuned on warmup-two-stage; light blue is our NAVSIM-v1 model. Behavior tuning yields an agent that drives more safely, with fewer collisions, but less aggressively, with lower progress.
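Concretely, behavior tuning amounts to re-weighting the predicted sub-scores when selecting the trajectory to execute. The coefficients below are made-up values for illustration, not the tuned ones:

import torch

# Hypothetical predicted sub-scores for 20 candidates: [safety, comfort, efficiency].
sub_scores = torch.rand(20, 3)

# Default coefficients vs. a safety-oriented profile (illustrative values).
w_default = torch.tensor([1.0, 1.0, 1.0])
w_safe    = torch.tensor([3.0, 1.0, 0.5])   # up-weight safety, down-weight progress

print(sub_scores @ w_default)               # aggregate score per candidate
print((sub_scores @ w_default).argmax(),
      (sub_scores @ w_safe).argmax())       # the two profiles may pick different trajectories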
Zero-shot generalization of the DrivoR model to closed-loop driving in the photorealistic HUGSIM simulator.
nuScenes split
Waymo split
PandaSet split
KITTI-360 split
We thank Loick Chambon for constant support throughout the project and Lan Feng for helpful discussions. This work was granted access to the HPC resources of IDRIS under the allocations AD011016241 and AD011016239R1 made by GENCI. We acknowledge EuroHPC Joint Undertaking for awarding the project ID EHPC-REG-2024R02-210 access to Karolina, Czech Republic. Funded by the European Union. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Commission. Neither the European Union nor the European Commission can be held responsible for them. This work was supported by the European Union’s Horizon Europe research and innovation programme under grant agreement No 101214398 (ELLIOT).
@article{kirby2026drivor,
title = {Driving on Registers},
author = {Kirby, Ellington and Boulch, Alexandre and Xu, Yihong and Yin, Yuan and Puy, Gilles and Zablocki, Éloi and Bursuc, Andrei and Gidaris, Spyros and Marlet, Renaud and Bartoccioni, Florent and Cao, Anh-Quan and Samet, Nermin and Vu, Tuan-Hung and Cord, Matthieu},
journal = {preprint},
year = {2026}
}