<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.3.3">Jekyll</generator><link href="https://valeoai.github.io//feed.xml" rel="self" type="application/atom+xml"/><link href="https://valeoai.github.io//" rel="alternate" type="text/html" hreflang="en"/><updated>2026-04-27T18:38:42+00:00</updated><id>https://valeoai.github.io//feed.xml</id><title type="html">valeo.ai</title><subtitle>valeo.ai is an international research team based in Paris working at the intersection of computer vision, AI, and autonomous driving. We publish at top venues, release open-source code, and shape the AI future of Valeo. </subtitle><entry><title type="html">valeo.ai at CVPR 2026</title><link href="https://valeoai.github.io//posts/cvpr-2026" rel="alternate" type="text/html" title="valeo.ai at CVPR 2026"/><published>2026-04-16T00:00:00+00:00</published><updated>2026-04-16T00:00:00+00:00</updated><id>https://valeoai.github.io//posts/valeoai-at-cvpr-2026</id><content type="html" xml:base="https://valeoai.github.io//posts/cvpr-2026"><![CDATA[<p>The <a href="https://cvpr.thecvf.com/">IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR)</a> is a key event for researchers and engineers working on computer vision and machine learning. At the 2026 edition, the <a href="../">valeo.ai</a> team will present nine <a href="https://valeoai.github.io/publications/">papers</a> in the main conference and one in the Findings track.</p> <p>The team will be at CVPR to present these works, exchange ideas, and share our exciting ongoing research. We look forward to seeing you in Denver!</p> <hr/> <h3 id="naf-zero-shot-feature-upsampling-via-neighborhood-attention-filtering">NAF: Zero-Shot Feature Upsampling via Neighborhood Attention Filtering</h3> <h3 id="highlight">Highlight</h3> <h5 id="authors-loick-chambon-paul-couairon-eloi-zablocki-alexandre-boulch-nicolas-thome-matthieu-cord">Authors: <a href="https://loickch.github.io/">Loick Chambon</a>, <a href="https://scholar.google.fr/citations?user=yQRnP7YAAAAJ&amp;hl=fr">Paul Couairon</a>, <a href="https://eloiz.github.io">Eloi Zablocki</a>, <a href="https://www.boulch.eu/">Alexandre Boulch</a>, <a href="https://thome.isir.upmc.fr">Nicolas Thome</a>, <a href="https://cord.isir.upmc.fr/">Matthieu Cord</a></h5> <h5 align="center"> [<a href="https://arxiv.org/abs/2511.18452">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/valeoai/NAF">Code</a>] </h5> <p><img src="/assets/img/publications/naf.gif" alt="naf_teaser" height="100%" width="100%"/></p> <p>Vision Foundation Models produce spatially downsampled representations that create challenges for pixel-level tasks. We introduce Neighborhood Attention Filtering (NAF), a method that learns adaptive spatial-and-content weights through Cross-Scale Neighborhood Attention and Rotary Position Embeddings, guided solely by high-resolution input images. NAF operates as a zero-shot upsampler compatible with any VFM without retraining, achieving state-of-the-art performance across multiple downstream tasks while maintaining efficiency at 18 FPS for 2K feature maps. 
The method also demonstrates strong results in image restoration applications.</p> <hr/> <h3 id="driving-on-registers">Driving on Registers</h3> <h5 id="authors-ellington-kirby-alexandre-boulch-yihong-xu-yuan-yin-gilles-puy-eloi-zablocki-andrei-bursuc-spyros-gidaris-renaud-marlet-florent-bartoccioni-anh-quan-cao-nermin-samet-tuan-hung-vu-matthieu-cord">Authors: <a href="https://ellingtonkirby.github.io/">Ellington Kirby</a>, <a href="https://www.boulch.eu/">Alexandre Boulch</a>, <a href="https://scholar.google.fr/citations?user=vMLRRVkAAAAJ">Yihong Xu</a>, <a href="https://yuan-yin.github.io/">Yuan Yin</a>, <a href="https://sites.google.com/site/puygilles/home">Gilles Puy</a>, <a href="https://eloiz.github.io">Eloi Zablocki</a>, <a href="https://abursuc.github.io/">Andrei Bursuc</a>, <a href="https://gidariss.github.io/">Spyros Gidaris</a>, <a href="http://imagine.enpc.fr/~marletr/">Renaud Marlet</a>, <a href="https://f-barto.github.io/">Florent Bartoccioni</a>, <a href="https://anhquancao.github.io/">Anh-Quan Cao</a>, <a href="https://nerminsamet.github.io/">Nermin Samet</a>, <a href="https://tuanhungvu.github.io/">Tuan-Hung Vu</a>, <a href="https://cord.isir.upmc.fr/">Matthieu Cord</a></h5> <h5 align="center"> [<a href="https://arxiv.org/abs/2601.05083">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/valeoai/DrivoR">Code</a>] &nbsp;&nbsp; [<a href="https://valeoai.github.io/driving-on-registers/">Project page</a>] </h5> <p><img src="/assets/img/publications/drivor.gif" alt="drivor_teaser" height="100%" width="100%"/></p> <p>We present DrivoR, a simple and efficient transformer-based architecture for end-to-end autonomous driving. Our approach builds on pretrained Vision Transformers (ViTs) and introduces camera-aware register tokens that compress multi-camera features into a compact scene representation, significantly reducing downstream computation without sacrificing accuracy. These tokens drive two lightweight transformer decoders that generate and then score candidate trajectories. The scoring decoder learns to mimic an oracle and predicts interpretable sub-scores representing aspects such as safety, comfort, and efficiency, enabling behavior-conditioned driving at inference. Despite its minimal design, DrivoR outperforms or matches strong contemporary baselines across NAVSIM-v1, NAVSIM-v2, and the photorealistic closed-loop HUGSIM benchmark. Our results show that a pure-transformer architecture, combined with targeted token compression, is sufficient for accurate, efficient, and adaptive end-to-end driving.</p> <hr/> <h3 id="occany-generalized-unconstrained-urban-3d-occupancy">OccAny: Generalized Unconstrained Urban 3D Occupancy</h3> <h5 id="authors-anh-quan-cao-tuan-hung-vu">Authors: <a href="https://anhquancao.github.io/">Anh-Quan Cao</a>, <a href="https://tuanhungvu.github.io/">Tuan-Hung Vu</a></h5> <h5 align="center"> [<a href="https://arxiv.org/abs/2603.23502">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/valeoai/OccAny">Code</a>] &nbsp;&nbsp; [<a href="https://valeoai.github.io/OccAny/">Project page</a>] </h5> <p><img src="/assets/img/publications/teaser_occany.gif" alt="occany_teaser" height="100%" width="100%"/></p> <p>Relying on in-domain annotations and precise sensor-rig priors, existing 3D occupancy prediction methods are limited in both scalability and out-of-domain generalization. 
While recent visual geometry foundation models exhibit strong generalization capabilities, they were mainly designed for general purposes and lack one or more key ingredients required for urban occupancy prediction, namely metric prediction, geometry completion in cluttered scenes and adaptation to urban scenarios. We address this gap and present OccAny, the first unconstrained urban 3D occupancy model capable of operating on out-of-domain uncalibrated scenes to predict and complete metric occupancy coupled with segmentation features. OccAny is versatile and can predict occupancy from sequential, monocular, or surround-view images. Extensive experiments demonstrate that OccAny outperforms all visual geometry baselines on the 3D occupancy prediction task, while remaining competitive with in-domain self-supervised methods across three input settings on two established urban occupancy prediction datasets.</p> <hr/> <h3 id="franca-nested-matryoshka-clustering-for-scalable-visual-representation-learning">Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning</h3> <h5 id="authors-shashanka-venkataramanan-valentinos-pariza-mohammadreza-salehi-lukas-knobel-spyros-gidaris-elias-ramzi-andrei-bursuc-yuki-m-asano">Authors: <a href="https://shashankvkt.github.io/">Shashanka Venkataramanan</a>, Valentinos Pariza, <a href="https://scholar.google.com/citations?user=kpT3gcsAAAAJ&amp;hl=en">Mohammadreza Salehi</a>, Lukas Knobel, <a href="https://gidariss.github.io/">Spyros Gidaris</a>, <a href="https://elias-ramzi.github.io/">Elias Ramzi</a>, <a href="https://abursuc.github.io/">Andrei Bursuc</a>, <a href="https://yukimasano.github.io/">Yuki M. Asano</a></h5> <h5 align="center"> [<a href="https://arxiv.org/abs/2507.14137">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/valeoai/Franca">Code</a>] </h5> <p><img src="/assets/img/publications/2026_franca/franca.png" alt="franca_teaser" height="100%" width="100%"/></p> <p>We present Franca, the first fully open-source (data, code, weights) vision foundation model that matches and in many cases surpasses the performance of state-of-the-art proprietary models, e.g., DINOv2, CLIP, SigLIPv2. Our approach is grounded in a transparent training pipeline using publicly available data: ImageNet-21K and a subset of ReLAION-2B. We introduce a parameter-efficient, multi-head clustering projector based on nested Matryoshka representations that progressively refines features into increasingly fine-grained clusters without increasing model size. Additionally, we propose a novel positional disentanglement strategy that explicitly removes positional biases from dense representations, leading to consistent gains on several downstream benchmarks. 
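</p> <p>As a rough picture of what a nested clustering projector can look like, here is a minimal sketch in which prefixes of the same embedding are matched against prototype banks of increasing granularity. This is a simplified reading for illustration, not the Franca code; the class name, prefix sizes, and cluster counts are made up.</p> <pre><code class="language-python">
# Sketch of a Matryoshka-style clustering head: nested prefixes of one
# embedding are compared to prototype banks of growing size (coarse to fine).
import torch
import torch.nn as nn
import torch.nn.functional as F

class NestedClusteringHead(nn.Module):
    def __init__(self, dim=768, prefix_dims=(96, 192, 384, 768),
                 num_clusters=(64, 256, 1024, 4096)):
        super().__init__()
        self.prefix_dims = prefix_dims
        # One prototype bank per level, applied to a prefix of the embedding
        # instead of a separately projected vector (hence "parameter-efficient").
        self.prototypes = nn.ParameterList(
            [nn.Parameter(torch.randn(k, d) * d ** -0.5)
             for d, k in zip(prefix_dims, num_clusters)]
        )

    def forward(self, tokens):                      # tokens: (B, N, dim)
        logits = []
        for d, protos in zip(self.prefix_dims, self.prototypes):
            prefix = F.normalize(tokens[..., :d], dim=-1)
            bank = F.normalize(protos, dim=-1)
            logits.append(prefix @ bank.t())        # (B, N, k) soft assignments
        return logits

head = NestedClusteringHead()
patch_tokens = torch.randn(2, 196, 768)             # e.g. ViT patch tokens
for level, l in enumerate(head(patch_tokens)):
    print("level", level, tuple(l.shape))
</code></pre> <p>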
Our contributions establish a new standard for transparent, high-performance vision models and open a path toward more reproducible and generalizable foundation models.</p> <hr/> <h3 id="mad-motion-appearance-decoupling-for-efficient-driving-world-models">MAD: Motion Appearance Decoupling for efficient Driving World Models</h3> <h5 id="authors-ahmad-rahimi-valentin-gerard-eloi-zablocki-matthieu-cord-alexandre-alahi">Authors: <a href="https://scholar.google.com/citations?user=Lc1LR18AAAAJ&amp;hl=en">Ahmad Rahimi</a>, <a href="https://scholar.google.com/citations?user=zDeI0iEAAAAJ&amp;hl=en">Valentin Gerard</a>, <a href="https://eloiz.github.io">Eloi Zablocki</a>, <a href="https://cord.isir.upmc.fr/">Matthieu Cord</a>, Alexandre Alahi</h5> <h5 align="center"> [<a href="https://arxiv.org/abs/2601.09452">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/vita-epfl/MAD-World-Model-Code/">Code</a>] &nbsp;&nbsp; [<a href="https://vita-epfl.github.io/MAD-World-Model/">Project page</a>] </h5> <p><img src="/assets/img/publications/mad.gif" alt="mad_teaser" height="100%" width="100%"/></p> <p>Recent video diffusion models generate photorealistic, temporally coherent videos, yet they fall short as reliable world models for autonomous driving, where structured motion and physically consistent interactions are essential. We propose an efficient adaptation framework that converts generalist video diffusion models into controllable driving world models with minimal supervision. The key idea is to decouple motion learning from appearance synthesis. First, the model is adapted to predict structured motion in a simplified form: videos of skeletonized agents and scene elements, focusing learning on physical and social plausibility. Then, the same backbone is reused to synthesize realistic RGB videos conditioned on these motion sequences. This two-stage process mirrors a reasoning-rendering paradigm: first infer dynamics, then render appearance. Our experiments show this decoupled approach is exceptionally efficient: adapting SVD, we match prior SOTA models with less than 6% of their compute. Scaling to LTX, our MAD-LTX model outperforms all open-source competitors, and supports a comprehensive suite of text, ego, and object controls.</p> <hr/> <h3 id="stablemtl-repurposing-latent-diffusion-models-for-multi-task-learning-from-partially-annotated-synthetic-datasets">StableMTL: Repurposing Latent Diffusion Models for Multi-Task Learning from Partially Annotated Synthetic Datasets</h3> <h5 id="authors-anh-quan-cao-ivan-lopes-raoul-de-charette">Authors: <a href="https://anhquancao.github.io/">Anh-Quan Cao</a>, Ivan Lopes, <a href="https://team.inria.fr/rits/membres/raoul-de-charette/">Raoul de Charette</a></h5> <h5 align="center"> [<a href="https://arxiv.org/abs/2506.08013">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/astra-vision/StableMTL">Code</a>] </h5> <p><img src="/assets/img/publications/stablemtl.png" alt="stablemtl_teaser" height="100%" width="100%"/></p> <p>Multi-task learning for dense prediction is limited by the need for extensive annotation for every task, though recent works have explored training with partial task labels. Leveraging the generalization power of diffusion models, we extend the partial learning setup to a zero-shot setting, training a multi-task model on multiple synthetic datasets, each labeled for only a subset of tasks. 
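</p> <p>The partial-annotation setup itself is easy to picture: each training sample only contributes a loss on the tasks it is labeled for. The snippet below is a generic sketch of that masking, with invented task names and a single regression loss for simplicity; StableMTL's actual objective, described next, is a unified loss in the latent space of the diffusion model.</p> <pre><code class="language-python">
# Generic sketch of training with partial task labels: only the tasks that are
# annotated for a given sample contribute to the loss (invented task names).
import torch
import torch.nn.functional as F

TASKS = ("depth", "normals", "optical_flow")

def partial_multitask_loss(predictions, labels):
    """predictions: dict of per-task tensors; labels: dict covering a subset of TASKS."""
    losses = []
    for task in TASKS:
        if task in labels:                      # skip tasks without annotations
            losses.append(F.l1_loss(predictions[task], labels[task]))
    return torch.stack(losses).mean()

preds = {t: torch.randn(2, 1, 64, 64, requires_grad=True) for t in TASKS}
batch_labels = {"depth": torch.rand(2, 1, 64, 64)}   # this batch is only labeled for depth
loss = partial_multitask_loss(preds, batch_labels)
loss.backward()
print(float(loss))
</code></pre> <p>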
Our method, StableMTL, repurposes image generators for latent regression, adapting a denoising framework with task encoding, per-task conditioning and a tailored training scheme. Instead of per-task losses requiring careful balancing, a unified latent loss is adopted, enabling seamless scaling to more tasks. To encourage inter-task synergy, we introduce a multi-stream model with a task-attention mechanism that converts N-to-N task interactions into efficient 1-to-N attention, promoting effective cross-task sharing. StableMTL outperforms baselines on 7 tasks across 8 benchmarks.</p> <hr/> <h3 id="lidas-lighting-driven-dynamic-active-sensing-for-nighttime-perception">LiDAS: Lighting-driven Dynamic Active Sensing for Nighttime Perception</h3> <h5 id="authors-simon-de-moreau-andrei-bursuc-hafid-el-idrissi-fabien-moutarde">Authors: <a href="https://simondemoreau.github.io/">Simon de Moreau</a>, <a href="https://abursuc.github.io/">Andrei Bursuc</a>, Hafid El-Idrissi, Fabien Moutarde</h5> <h5 align="center"> [<a href="https://arxiv.org/abs/2512.08912">Paper</a>] &nbsp;&nbsp; [<a href="https://simondemoreau.github.io/LiDAS/">Project page</a>] </h5> <p><img src="/assets/img/publications/2026_lidas/lidas.png" alt="lidas_teaser" height="100%" width="100%"/></p> <p>Camera-based perception in autonomous driving suffers a steep performance drop at night, when low light degrades image quality and existing solutions either rely on costly hardware upgrades or post-hoc image enhancement. We introduce LiDAS, a closed-loop active illumination system that dynamically predicts the optimal lighting pattern for visual perception, concentrating light on objects of interest while reducing it in empty regions. LiDAS integrates seamlessly with standard perception models and high-definition headlights, enabling zero-shot nighttime generalization of daytime-trained networks. On a realistic nighttime simulator and on real driving sequences, LiDAS yields +18.7% mAP50 and +5.0% mIoU over standard low-beam at equal power, while also enabling up to 40% energy savings at matched performance. Our approach turns commodity headlights into active perception devices, paving the way for robust nighttime autonomous perception.</p> <hr/> <h3 id="lam3c-3d-sans-3d-scans-scalable-pre-training-from-video-generated-point-clouds">LAM3C: 3D sans 3D Scans: Scalable Pre-training from Video-Generated Point Clouds</h3> <h5 id="authors-ryosuke-yamada-kohsuke-ide-yoshihiro-fukuhara-hirokatsu-kataoka-gilles-puy-andrei-bursuc-yuki-m-asano">Authors: <a href="https://ryosuke-yamada.github.io/">Ryosuke Yamada</a>, Kohsuke Ide, Yoshihiro Fukuhara, Hirokatsu Kataoka, <a href="https://sites.google.com/site/puygilles/home">Gilles Puy</a>, <a href="https://abursuc.github.io/">Andrei Bursuc</a>, <a href="https://yukimasano.github.io/">Yuki M. Asano</a></h5> <h5 align="center"> [<a href="https://arxiv.org/abs/2512.23042">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/ryosuke-yamada/lam3c">Code</a>] &nbsp;&nbsp; [<a href="https://ryosuke-yamada.github.io/lam3c/">Project page</a>] </h5> <p><img src="/assets/img/publications/2026_lam3c/lam3c.png" alt="lam3c_teaser" height="100%" width="100%"/></p> <p>We investigate the use of unlabeled videos for learning 3D representations without ever resorting to a 3D sensor. To this end, we introduce LAM3C, a self-supervised learning framework that operates directly on point clouds reconstructed from video. 
To support this, we curate RoomTours, a new dataset of 49,219 scenes generated from web-collected room-walkthrough videos using a feed-forward reconstruction model. Pre-training on this video-derived data introduces challenges: noisy geometry and incomplete coverage destabilize standard SSL objectives. We address these by proposing a noise-regularized loss that stabilizes representation learning by enforcing local geometric smoothness and ensuring feature stability under noisy point clouds. Despite never seeing real 3D scans, LAM3C surpasses prior self-supervised methods on indoor semantic and instance segmentation, demonstrating that unlabeled videos are a powerful and scalable resource for 3D self-supervised learning.</p> <hr/> <h3 id="icm-attention-may-i-have-your-decision-localizing-generative-choices-in-diffusion-models">ICM: Attention, May I Have Your Decision? Localizing Generative Choices in Diffusion Models</h3> <h5 id="authors-katarzyna-zaleska-lukasz-popek-monika-wysoczanska-kamil-deja">Authors: Katarzyna Zaleska, Lukasz Popek, <a href="https://wysoczanska.github.io/">Monika Wysoczanska</a>, Kamil Deja</h5> <h5 align="center"> [<a href="https://arxiv.org/abs/2604.06052">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/kzaleskaa/icm">Code</a>] </h5> <p><img src="/assets/img/publications/2026_icm/icm.png" alt="icm_teaser" height="100%" width="100%"/></p> <p>Text-to-image diffusion models routinely face ambiguous prompts that under-specify visual details, forcing them to make implicit decisions about unspecified attributes. We posit that this decision-making is computationally localized within the model’s architecture rather than being uniformly distributed. We introduce a probing technique that identifies the layers exhibiting the highest attribute separability, and find that self-attention layers are the locus where these implicit choices are resolved. Building on this insight, we propose ICM (Implicit Choice-Modification), a targeted steering method that modifies only the identified self-attention layers to control the model’s implicit decisions. 
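</p> <p>The two ingredients, probing for the most attribute-separable layer and steering only that layer, can be pictured with the toy snippet below. It uses random activations and a simple Fisher-style separability score; it is a rough simplification for illustration, not the ICM probing or steering code.</p> <pre><code class="language-python">
# Toy sketch: (1) score each layer by how separable two attribute groups are in
# its activations, (2) build a steering direction from the best layer only.
import numpy as np

rng = np.random.default_rng(0)
num_layers, dim, n = 12, 64, 200

# Fake per-layer activations for prompts resolved as attribute A vs. attribute B;
# layer 7 is constructed to be the one where the "decision" is visible.
acts_a = [rng.normal(size=(n, dim)) for _ in range(num_layers)]
acts_b = [rng.normal(loc=(0.5 if layer == 7 else 0.0), size=(n, dim))
          for layer in range(num_layers)]

def separability(xa, xb):
    # Distance between group means, relative to the within-group spread.
    spread = xa.std(0).mean() + xb.std(0).mean()
    return np.linalg.norm(xa.mean(0) - xb.mean(0)) / spread

scores = [separability(a, b) for a, b in zip(acts_a, acts_b)]
best = int(np.argmax(scores))
steer = acts_b[best].mean(0) - acts_a[best].mean(0)   # direction "toward B"
print("most separable layer:", best)

# At generation time, one would add alpha * steer to the self-attention output
# of layer `best` only, leaving every other layer untouched.
</code></pre> <p>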
Experiments demonstrate that ICM achieves superior debiasing performance with fewer artifacts than existing approaches, providing both a new tool for controllable generation and a better understanding of where generative choices live in diffusion models.</p> <hr/> <h3 id="pom-a-linear-time-replacement-for-attention-with-the-polynomial-mixer">PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer</h3> <p class="page-description">CVPR 2026 Findings</p> <h5 id="authors-david-picard-nicolas-dufour-lucas-degeorge-arijit-ghosh-davide-allegro-tom-ravaud-yohann-perron-corentin-sautier-zeynep-sonat-baltaci-fei-meng-syrine-kalleli-marta-lopez-rauhut-thibaut-loiseau-segolene-albouy-raphael-baena-elliot-vincent-loic-landrieu">Authors: <a href="https://davidpicard.github.io/">David Picard</a>, <a href="https://nicolas-dufour.github.io/">Nicolas Dufour</a>, Lucas Degeorge, Arijit Ghosh, Davide Allegro, Tom Ravaud, Yohann Perron, <a href="https://csautier.github.io/">Corentin Sautier</a>, Zeynep Sonat Baltaci, Fei Meng, Syrine Kalleli, Marta Lopez-Rauhut, Thibaut Loiseau, Segolene Albouy, Raphael Baena, Elliot Vincent, <a href="https://loiclandrieu.com/">Loic Landrieu</a></h5> <h5 align="center"> [<a href="https://arxiv.org/abs/2604.06129">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/davidpicard/pom">Code</a>] </h5> <p><img src="/assets/img/publications/2026_pom/teaser.png" alt="pom_teaser" height="100%" width="100%"/></p> <p>This paper introduces the Polynomial Mixer (PoM), a novel token mixing mechanism with linear complexity that serves as a drop-in replacement for self-attention. PoM aggregates input tokens into a compact representation through a learned polynomial function, from which each token retrieves contextual information. We prove that PoM satisfies the contextual mapping property, ensuring that transformers equipped with PoM remain universal sequence-to-sequence approximators. We replace standard self-attention with PoM across five diverse domains: text generation, handwritten text recognition, image generation, 3D modeling, and Earth observation. PoM matches the performance of attention-based models while drastically reducing computational cost when working with long sequences.</p>]]></content><author><name></name></author><category term="3d-perception"/><category term="driving"/><category term="foundation"/><category term="deep-learning"/><category term="generalization"/><category term="generative-model"/><summary type="html"><![CDATA[The IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR) is a key event for researchers and engineers working on computer vision and machine learning. 
At the 2026 edition, the valeo.ai team will present nine papers in the main conference and one in the Findings track.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://valeoai.github.io//assets/img/posts/cvpr2026.png"/><media:content medium="image" url="https://valeoai.github.io//assets/img/posts/cvpr2026.png" xmlns:media="http://search.yahoo.com/mrss/"/></entry><entry><title type="html">valeo.ai at NeurIPS 2025</title><link href="https://valeoai.github.io//posts/neurips-2025" rel="alternate" type="text/html" title="valeo.ai at NeurIPS 2025"/><published>2025-12-02T00:00:00+00:00</published><updated>2025-12-02T00:00:00+00:00</updated><id>https://valeoai.github.io//posts/valeoai-at-neurips-2025</id><content type="html" xml:base="https://valeoai.github.io//posts/neurips-2025"><![CDATA[<p>The <a href="https://neurips.cc/">Neural Information Processing Systems Conference (NeurIPS)</a> is a major inter-disciplinary event that brings together researchers and practitioners in machine learning, computer vision, natural language processing, optimization, statistics, but also neuroscience, natural sciences, social sciences, etc. This year, at the 39th edition of NeurIPS, the <a href="../">valeo.ai</a> team will present 5 papers in the main conference and 1 in the workshops. We are honored to announce that our <a href="https://valeoai.github.io/publications/ipa/">IPA</a> paper on efficient foundation model adaptation has received the outstanding paper award at the <a href="https://sites.google.com/view/ccfm-neurips2025">CCFM workshop</a>. Our team contributed to the technical program committee with multiple reviewers, of whom one was recognized as a top reviewer and two as top area chairs.</p> <p>The team will be at NeurIPS to present these works, exchange ideas, and share our exciting ongoing research. We look forward to seeing you in San Diego!</p> <p><img src="/assets/img/posts/2025_neurips_valeoai_papers.png" alt="valeo.ai papers at NeurIPS 2025" height="100%" width="100%"/></p> <hr/> <h3 id="jafar-jack-up-any-feature-at-any-resolution">JAFAR: Jack up Any Feature at Any Resolution</h3> <h5 id="authors-paul-couairon-loick-chambon--louis-serrano-jean-emmanuel-haugeard-matthieu-cord-nicolas-thome">Authors: <a href="https://scholar.google.fr/citations?user=yQRnP7YAAAAJ&amp;hl=fr">Paul Couairon</a>, <a href="https://loickch.github.io/">Loick Chambon</a>, <a href="https://scholar.google.com/citations?user=fKlo-lUAAAAJ&amp;hl=fr">Louis Serrano</a>, <a>Jean-Emmanuel Haugeard</a>, <a href="https://cord.isir.upmc.fr/">Matthieu Cord</a>, <a href="https://thome.isir.upmc.fr">Nicolas Thome</a></h5> <h5 align="center"> [<a href="https://arxiv.org/abs/2506.11136">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/PaulCouairon/JAFAR">Code</a>] </h5> <p><img src="/assets/img/publications/2025_jafar/teaser.png" alt="jafar_teaser" height="100%" width="100%"/></p> <p>Foundation Vision Encoders have become essential for a wide range of dense vision tasks. However, their low-resolution spatial feature outputs necessitate feature upsampling to produce the high-resolution modalities required for downstream tasks. In this work, we introduce JAFAR—a lightweight and flexible feature upsampler that enhances the spatial resolution of visual features from any Foundation Vision Encoder to an arbitrary target resolution. 
JAFAR employs an attention-based module designed to promote semantic alignment between high-resolution queries—derived from low-level image features—and semantically enriched low-resolution keys, using Spatial Feature Transform (SFT) modulation. Notably, despite the absence of high-resolution supervision, we demonstrate that learning at low upsampling ratios and resolutions generalizes remarkably well to significantly higher output scales. Extensive experiments show that JAFAR effectively recovers fine-grained spatial details and consistently outperforms existing feature upsampling methods across a diverse set of downstream tasks.</p> <hr/> <h3 id="dino-foresight-looking-into-the-future-with-dino">DINO-Foresight: Looking into the Future with DINO</h3> <h5 id="authors-efstathios-karypidis-ioannis-kakogeorgiou-spyros-gidaris-nikos-komodakis">Authors: <a href="https://archimedesai.gr/en/researchers/stathis-karypidis">Efstathios Karypidis</a>, <a href="https://scholar.google.com/citations?user=B_dKcz4AAAAJ">Ioannis Kakogeorgiou</a>, <a href="https://gidariss.github.io/">Spyros Gidaris</a>, <a href="https://www.csd.uoc.gr/~komod/">Nikos Komodakis</a></h5> <h5 align="center"> [<a href="https://arxiv.org/abs/2412.11673">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/Sta8is/DINO-Foresight">Code</a>] </h5> <p><img src="/assets/img/publications/2025_dinoforesight/teaser.png" alt="foresight_teaser" height="100%" width="100%"/></p> <p>Predicting future dynamics is crucial for applications like autonomous driving and robotics, where understanding the environment is key. Existing pixel-level methods are computationally expensive and often focus on irrelevant details. To address these challenges, we introduce DINO-Foresight, a novel framework that operates in the semantic feature space of pretrained Vision Foundation Models (VFMs). Our approach trains a masked feature transformer in a self-supervised manner to predict the evolution of VFM features over time. By forecasting these features, we can apply off-the-shelf, task-specific heads for various scene understanding tasks. In this framework, VFM features are treated as a latent space, to which different heads attach to perform specific tasks for future-frame analysis. Extensive experiments show the very strong performance, robustness and scalability of our framework.</p> <hr/> <h3 id="learning-to-steer-input-dependent-steering-for-multimodal-llms">Learning to Steer: Input-dependent Steering for Multimodal LLMs</h3> <h5 id="authors-jayneel-parekh-pegah-khayatan-mustafa-shukor-arnaud-dapogny-alasdair-newson-matthieu-cord">Authors: <a href="https://jayneelparekh.github.io/">Jayneel Parekh</a>, <a href="https://pegah-kh.github.io/">Pegah Khayatan</a>, <a href="https://mustafashukor.github.io/">Mustafa Shukor</a>, <a href="https://www.linkedin.com/in/arnaud-dapogny-12653493/">Arnaud Dapogny</a>, <a href="https://sites.google.com/site/alasdairnewson/">Alasdair Newson</a>, <a href="https://cord.isir.upmc.fr/">Matthieu Cord</a></h5> <h5 align="center"> [<a href="https://arxiv.org/abs/2508.12815">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/jayneelparekh/learn-to-steer">Code</a>] </h5> <p><img src="/assets/img/publications/2025_l2s.png" alt="l2s_teaser" height="100%" width="100%"/></p> <p>Steering has emerged as a practical approach to enable post-hoc guidance of LLMs towards enforcing a specific behavior. 
However, it remains largely underexplored for multimodal LLMs (MLLMs); furthermore, existing steering techniques, such as mean steering, rely on a single steering vector, applied independently of the input query. This paradigm faces limitations when the desired behavior is dependent on the example at hand. For example, a safe answer may consist in abstaining from answering when asked for an illegal activity, or may point to external resources or consultation with an expert when asked about medical advice. In this paper, we investigate a fine-grained steering that uses an input-specific linear shift. This shift is computed using contrastive input-specific prompting. However, the input-specific prompts required for this approach are not known at test time. Therefore, we propose to train a small auxiliary module to predict the input-specific steering vector. Our approach, dubbed as L2S (Learn-to-Steer), demonstrates that it reduces hallucinations and enforces safety in MLLMs, outperforming other static baselines.</p> <hr/> <h3 id="boosting-generative-image-modeling-via-joint-image-feature-synthesis">Boosting Generative Image Modeling via Joint Image-Feature Synthesis</h3> <h5 id="authors-theodoros-kouzelis-efstathios-karypidis-ioannis-kakogeorgiou-spyros-gidaris-nikos-komodakis">Authors: <a href="https://scholar.google.com/citations?user=a5vkWc8AAAAJ&amp;hl=en">Theodoros Kouzelis</a>, <a href="https://archimedesai.gr/en/researchers/stathis-karypidis">Efstathios Karypidis</a>, <a href="https://scholar.google.com/citations?user=B_dKcz4AAAAJ">Ioannis Kakogeorgiou</a>, <a href="https://gidariss.github.io/">Spyros Gidaris</a>, <a href="https://www.csd.uoc.gr/~komod/">Nikos Komodakis</a></h5> <h5 align="center"> [<a href="https://arxiv.org/abs/2504.16064">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/zelaki/ReDi">Code</a>] </h5> <p><img src="/assets/img/publications/2025_redi/teaser.png" alt="redi_teaser" height="100%" width="100%"/></p> <p>Latent diffusion models (LDMs) dominate high-quality image generation, yet integrating representation learning with generative modeling remains a challenge. We introduce a novel generative image modeling framework that seamlessly bridges this gap by leveraging a diffusion model to jointly model low-level image latents (from a variational autoencoder) and high-level semantic features (from a pretrained self-supervised encoder like DINO). Our latent-semantic diffusion approach learns to generate coherent image-feature pairs from pure noise, significantly enhancing both generative quality and training efficiency, all while requiring only minimal modifications to standard Diffusion Transformer architectures. By eliminating the need for complex distillation objectives, our unified design simplifies training and unlocks a powerful new inference strategy: Representation Guidance, which leverages learned semantics to steer and refine image generation. 
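</p> <p>A bare-bones way to picture the joint modeling is a denoiser that operates on a concatenation of image latents and projected semantic features, trained with a single noise-prediction loss. The sketch below uses a tiny convolutional stand-in and a toy linear noising schedule purely for illustration; the actual method builds on Diffusion Transformers, and its Representation Guidance at sampling time is not shown here.</p> <pre><code class="language-python">
# Minimal sketch of joint image-feature denoising (not the paper's model):
# the denoiser sees [VAE latents ; projected semantic features] as one tensor.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointDenoiser(nn.Module):
    def __init__(self, latent_ch=4, feat_ch=16, hidden=64):
        super().__init__()
        ch = latent_ch + feat_ch
        self.net = nn.Sequential(
            nn.Conv2d(ch + 1, hidden, 3, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, ch, 3, padding=1),
        )

    def forward(self, x_t, t):
        # Broadcast the normalized timestep as one extra conditioning channel.
        t_map = t.view(-1, 1, 1, 1).expand(-1, 1, *x_t.shape[2:])
        return self.net(torch.cat([x_t, t_map], dim=1))     # predicted noise

B, H, W = 2, 32, 32
latents = torch.randn(B, 4, H, W)        # VAE image latents
feats = torch.randn(B, 16, H, W)         # semantic features projected to 16 channels
x0 = torch.cat([latents, feats], dim=1)

t = torch.rand(B)                        # toy linear noising schedule in [0, 1]
noise = torch.randn_like(x0)
x_t = (1 - t).view(-1, 1, 1, 1) * x0 + t.view(-1, 1, 1, 1) * noise
pred = JointDenoiser()(x_t, t)
loss = F.mse_loss(pred, noise)           # one loss covers both image and feature parts
print(float(loss))
</code></pre> <p>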
Evaluated in both conditional and unconditional settings, our method delivers substantial improvements in image quality and training convergence speed, establishing a new direction for representation-aware generative modeling.</p> <hr/> <h3 id="multi-token-prediction-needs-registers">Multi-Token Prediction Needs Registers</h3> <h5 id="authors-anastasios-gerontopoulos-spyros-gidaris-nikos-komodakis">Authors: <a href="https://scholar.google.com/citations?user=VPTaLcUAAAAJ&amp;hl=en">Anastasios Gerontopoulos</a>, <a href="https://gidariss.github.io/">Spyros Gidaris</a>, <a href="https://www.csd.uoc.gr/~komod/">Nikos Komodakis</a></h5> <h5 align="center"> [<a href="https://arxiv.org/abs/2505.10518">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/nasosger/MuToR">Code</a>] </h5> <p><img src="/assets/img/publications/2025_mutor/teaser.png" alt="mutor_teaser" height="100%" width="100%"/></p> <p>Multi-token prediction has emerged as a promising objective for improving language model pretraining, but its benefits have not consistently generalized to other settings such as fine-tuning. In this paper, we propose MuToR, a simple and effective approach to multi-token prediction that interleaves learnable register tokens into the input sequence, each tasked with predicting future targets. Compared to existing methods, MuToR offers several key advantages: it introduces only a negligible number of additional parameters, requires no architectural changes–ensuring compatibility with off-the-shelf pretrained language models–and remains aligned with the next-token pretraining objective, making it especially well-suited for supervised fine-tuning. Moreover, it naturally supports scalable prediction horizons. We demonstrate the effectiveness and versatility of MuToR across a range of use cases, including supervised fine-tuning, parameter-efficient fine-tuning (PEFT), and pretraining, on challenging generative tasks in both language and vision domains.</p> <hr/> <h3 id="ipa-an-information-preserving-input-projection-framework-for-efficient-foundation-model-adaptation">IPA: An Information-Preserving Input Projection Framework for Efficient Foundation Model Adaptation</h3> <p class="page-description"><a href="https://sites.google.com/view/ccfm-neurips2025">NeurIPS 2025 Workshop on Continual and Compatible Foundation Model Updates (CCFM)</a></p> <h5 id="authors-yuan-yin-shashanka-venkataramanan-tuan-hung-vu-andrei-bursuc-matthieu-cord">Authors: <a href="https://yuan-yin.github.io/">Yuan Yin</a>, <a href="https://shashankvkt.github.io/">Shashanka Venkataramanan</a>, <a href="https://tuanhungvu.github.io/">Tuan-Hung Vu</a>, <a href="https://abursuc.github.io/">Andrei Bursuc</a>, <a href="https://cord.isir.upmc.fr/">Matthieu Cord</a></h5> <h5 align="center"> [<a href="https://arxiv.org/abs/2509.04398">Paper</a>]</h5> <p><img src="/assets/img/publications/2025_ipa/ipa.png" alt="ipa_teaser" height="100%" width="100%"/></p> <p>Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, reduce adaptation cost by injecting low-rank updates into pretrained weights. However, LoRA’s down-projection is randomly initialized and data-agnostic, discarding potentially useful information. Prior analyses show that this projection changes little during training, while the up-projection carries most of the adaptation, making the random input compression a performance bottleneck. We propose IPA, a feature-aware projection framework that explicitly preserves information in the reduced hidden space. 
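</p> <p>One intuitive way to see what a feature-aware down-projection means is to initialize the LoRA-style down-projection with the top principal directions of the activations the adapted layer actually receives, instead of random values. The sketch below shows that idea in its plainest form; it is not the IPA algorithm itself, and the shapes and calibration pass are invented for illustration.</p> <pre><code class="language-python">
# Sketch of a feature-aware (PCA-based) down-projection for a LoRA-style
# adapter, instead of the usual random, data-agnostic initialization.
import torch

def pca_down_projection(activations, rank):
    """activations: (num_tokens, dim) inputs seen by the layer being adapted."""
    centered = activations - activations.mean(0, keepdim=True)
    # Right singular vectors are the principal directions of the input features.
    _, _, vh = torch.linalg.svd(centered, full_matrices=False)
    return vh[:rank]                              # (rank, dim)

dim, rank = 768, 16
acts = torch.randn(4096, dim)                     # activations from a calibration pass
A = pca_down_projection(acts, rank)               # informed down-projection
B = torch.zeros(dim, rank)                        # up-projection starts at zero, so the
                                                  # adapter is a no-op at initialization

def adapted_linear(x, weight, scale=1.0):
    return x @ weight.t() + scale * (x @ A.t()) @ B.t()

W = torch.randn(dim, dim)
x = torch.randn(2, dim)
print(adapted_linear(x, W).shape)                 # torch.Size([2, 768])
</code></pre> <p>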
In the linear case, we instantiate IPA with algorithms approximating top principal components, enabling efficient projector pretraining with negligible inference overhead. Across language and vision benchmarks, IPA consistently improves over LoRA and DoRA, achieving on average 1.5 points higher accuracy on commonsense reasoning and 2.3 points on VTAB-1k, while matching full LoRA performance with roughly half the trainable parameters when the projection is frozen.</p>]]></content><author><name></name></author><category term="foundation"/><category term="limited-supervision"/><category term="deep-learning"/><category term="generalization"/><category term="explainability"/><summary type="html"><![CDATA[The Neural Information Processing Systems Conference (NeurIPS) is a major inter-disciplinary event that brings together researchers and practitioners in machine learning, computer vision, natural language processing, optimization, statistics, but also neuroscience, natural sciences, social sciences, etc. This year, at the 39th edition of NeurIPS, the valeo.ai team will present 5 papers in the main conference and 1 in the workshops. We are honored to announce that our IPA paper on efficient foundation model adaptation has received the outstanding paper award at the CCFM workshop. Our team contributed to the technical program committee with multiple reviewers, of whom one was recognized as a top reviewer and two as top area chairs.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://valeoai.github.io//assets/img/posts/2025_neurips.jpg"/><media:content medium="image" url="https://valeoai.github.io//assets/img/posts/2025_neurips.jpg" xmlns:media="http://search.yahoo.com/mrss/"/></entry><entry><title type="html">valeo.ai at ICCV 2025</title><link href="https://valeoai.github.io//posts/iccv-2025" rel="alternate" type="text/html" title="valeo.ai at ICCV 2025"/><published>2025-10-08T00:00:00+00:00</published><updated>2025-10-08T00:00:00+00:00</updated><id>https://valeoai.github.io//posts/valeoai-at-iccv-2025</id><content type="html" xml:base="https://valeoai.github.io//posts/iccv-2025"><![CDATA[<p>The <a href="https://iccv.thecvf.com/">International Conference on Computer Vision (ICCV)</a> is a leading conference that brings together researchers and practitioners in computer vision and machine learning. At the 2025 edition, the <a href="../">valeo.ai</a> team will present five papers in the main conference. We are also co-organizing the <a href="https://opendrivelab.com/iccv2025/workshop/">Learning to See: Advancing Spatial Understanding for Embodied Intelligence</a> workshop, and contributing to the <a href="https://iccv2025-found-workshop.limitlab.xyz">Foundational Data for Industrial Tech Transfer</a> workshop with a keynote on <a href="https://iccv2025-found-workshop.limitlab.xyz/program"><em>Towards openness of vision foundation models</em></a>.</p> <p>The team will be at ICCV to present these works, exchange ideas, and share our exciting ongoing research. 
We look forward to seeing you in Honolulu!</p> <p><img src="/assets/img/posts/valeoai_iccv.png" alt="valeo.ai papers at ICCV 2025" height="100%" width="100%"/></p> <hr/> <p><img src="/assets/img/posts/vai_at_iccv25.jpg" alt="valeo.ai team at ICCV 2025" height="100%" width="100%"/></p> <hr/> <h3 id="dip-unsupervised-dense-in-context-post-training-of-visual-representations">DIP: Unsupervised Dense In-Context Post-training of Visual Representations</h3> <h5 id="authors-sophia-sirko-galouchenko-antonin-vobecky-andrei-bursuc-nicolas-thome-spyros-gidaris">Authors: <a href="https://scholar.google.com/citations?user=3ac3PQMAAAAJ&amp;hl=fr">Sophia Sirko-Galouchenko</a>, <a href="https://vobecant.github.io/">Antonin Vobecky</a>, <a href="https://abursuc.github.io/">Andrei Bursuc</a>, <a href="https://scholar.google.com/citations?user=3ac3PQMAAAAJ&amp;hl=fr">Nicolas Thome</a>, <a href="https://gidariss.github.io/&amp;hl=en">Spyros Gidaris</a></h5> <h5 align="center"> [<a href="https://arxiv.org/abs/2506.18463">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/sirkosophia/DIP">Code</a>] </h5> <p><img src="/assets/img/publications/2025_dip.png" alt="dip_teaser" height="100%" width="100%"/></p> <p>We introduce DIP, a novel unsupervised post-training method designed to enhance dense image representations in large-scale pretrained vision encoders for in-context scene understanding. Unlike prior approaches that rely on complex self-distillation architectures, our method trains the vision encoder using pseudo-tasks that explicitly simulate downstream in-context scenarios, inspired by meta-learning principles. To enable post-training on unlabeled data, we propose an automatic mechanism for generating in-context tasks that combines a pretrained diffusion model and the vision encoder itself. DIP is simple, unsupervised, and computationally efficient, requiring less than 9 hours on a single A100 GPU. By learning dense representations through pseudo in-context tasks, it achieves strong performance across a wide variety of downstream real-world in-context scene understanding tasks. It outperforms both the initial vision encoder and prior methods, offering a practical and effective solution for improving dense representations.</p> <hr/> <h3 id="gaussrender-learning-3d-occupancy-with-gaussian-rendering">GaussRender: Learning 3D Occupancy with Gaussian Rendering</h3> <h5 id="authors-loïck-chambon-éloi-zablocki-alexandre-boulch-mickaël-chen-matthieu-cord">Authors: <a href="https://loickch.github.io/">Loïck Chambon</a>, <a href="https://eloiz.github.io">Éloi Zablocki</a>, <a href="https://boulch.eu/">Alexandre Boulch</a>, <a href="https://scholar.google.com/citations?user=QnRpMJAAAAAJ">Mickaël Chen</a>, <a href="https://cord.isir.upmc.fr/">Matthieu Cord</a></h5> <h5 align="center"> [<a href="https://arxiv.org/abs/2502.05040">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/valeoai/GaussRender">Code</a>] </h5> <p><img src="/assets/img/publications/2025_gaussrender/teaser.png" alt="gaussrender_teaser" height="100%" width="100%"/></p> <p>Understanding the 3D geometry and semantics of driving scenes is critical for safe autonomous driving. Recent advances in 3D occupancy prediction improve scene representation but often suffer from spatial inconsistencies, leading to floating artifacts and poor surface localization. Existing voxel-wise losses (e.g., cross-entropy) fail to enforce geometric coherence. GaussRender is a module that improves 3D occupancy learning by enforcing projective consistency. 
The key idea is to project both predicted and ground-truth 3D occupancy into 2D camera views for supervision, penalizing inconsistent 3D configurations and enforcing coherent 3D structure. To achieve this efficiently, GaussRender leverages differentiable rendering with Gaussian splatting. It integrates seamlessly with existing architectures, requires no inference-time modifications, and significantly improves geometric fidelity across multiple benchmarks (SurroundOcc-nuScenes, Occ3D-nuScenes, SSCBench-KITTI360) and 3D occupancy models (TPVFormer, SurroundOcc, Symphonies).</p> <hr/> <div class="row"> <div class="col-sm mt-3 mt-md-0" style="padding-right: 5px;padding-left: 5px;"> <img src="../../assets/img/publications/2025_gaussrender/demo_scene_0003.gif" class="img-fluid rounded z-depth-1"/> <div class="caption"> Scene visualization 1 </div> </div> <div class="col-sm mt-3 mt-md-0" style="padding-right: 5px;padding-left: 5px;"> <img src="../../assets/img/publications/2025_gaussrender/demo_scene_0013.gif" class="img-fluid rounded z-depth-1"/> <div class="caption"> Scene visualization 2 </div> </div> </div> <div class="caption"> GaussRender is a 3D Occupancy module that can be plugged into any 3D Occupancy model to enhance its predictions and ensure 2D-3D consistency while improving mIoU, IoU, and RayIoU. </div> <hr/> <div class="row"> <div class="col-sm mt-3 mt-md-0"> <img src="../../assets/img/publications/2025_gaussrender/pipeline.png" class="img-fluid rounded z-depth-1"/> <div class="caption"> GaussRender can be plugged into any model. The core idea is to transform voxels into Gaussians before performing a depth and semantic rendering. </div> </div> </div> <h4 align="center">Results</h4> <p>GaussRender can be plugged into any 3D model. Dedicated experiments on multiple 3D benchmarks (SurroundOcc-nuScenes, Occ3D-nuScenes, SSCBench-KITTI360) and models (TPVFormer, SurroundOcc, Symphonies) demonstrate its performance.</p> <h5>Occ3D-nuScenes</h5> <table> <thead> <tr> <th>Models</th> <th><a href="https://arxiv.org/abs/2502.05040">TPVFormer (ours)</a></th> <th><a href="https://arxiv.org/abs/2302.07817">TPVFormer</a></th> <th><a href="https://arxiv.org/abs/2502.05040">SurroundOcc (ours)</a></th> <th><a href="https://arxiv.org/abs/2303.09551">SurroundOcc</a></th> <th><a href="https://arxiv.org/abs/2304.05316">OccFormer</a></th> <th><a href="https://arxiv.org/abs/2309.09502">RenderOcc</a></th> </tr> <tr> <th>Type</th> <th>w/ GaussRender</th> <th>base</th> <th>w/ GaussRender</th> <th>base</th> <th>base</th> <th>base</th> </tr> </thead> <tbody> <tr> <td>mIoU</td> <td><strong>30.48 🥇</strong> <span style="color: green;">(+2.65)</span></td> <td>27.83</td> <td>30.38 🥈 <span style="color: green;">(+1.17)</span></td> <td>29.21</td> <td>21.93</td> <td>26.11</td> </tr> <tr> <td>RayIoU</td> <td><strong>38.3 🥇</strong> <span style="color: green;">(+1.1)</span></td> <td>37.2</td> <td>37.5 🥈 <span style="color: green;">(+2.0)</span></td> <td>35.5</td> <td>-</td> <td>19.5</td> </tr> </tbody> </table> <h5>SurroundOcc-nuScenes</h5> <table> <thead> <tr> <th>Models</th> <th><a href="https://arxiv.org/abs/2502.05040">TPVFormer (ours)</a></th> <th><a href="https://arxiv.org/abs/2302.07817">TPVFormer</a></th> <th><a href="https://arxiv.org/abs/2502.05040">SurroundOcc (ours)</a></th> <th><a href="https://arxiv.org/abs/2303.09551">SurroundOcc</a></th> <th><a href="https://arxiv.org/abs/2304.05316">OccFormer</a></th> <th><a href="https://arxiv.org/abs/2412.04384">GaussianFormerv2</a></th> </tr> <tr> <th>Type</th> <th>w/ 
GaussRender</th> <th>base</th> <th>w/ GaussRender</th> <th>base</th> <th>base</th> <th>base</th> </tr> </thead> <tbody> <tr> <td>IoU</td> <td>32.05 🥈 <span style="color: green;">(+1.19)</span></td> <td>30.86</td> <td><strong>32.61 🥇</strong> <span style="color: green;">(+1.12)</span></td> <td>31.49</td> <td>31.39</td> <td>30.56</td> </tr> <tr> <td>mIoU</td> <td>20.58 🥈 <span style="color: green;">(+3.48)</span></td> <td>17.10</td> <td><strong>20.82 🥇</strong> <span style="color: green;">(+0.52)</span></td> <td>20.30</td> <td>19.03</td> <td>20.02</td> </tr> </tbody> </table> <h5>SSCBench-KITTI360</h5> <table> <thead> <tr> <th>Models</th> <th><a href="https://arxiv.org/abs/2502.05040">SurroundOcc (ours)</a></th> <th><a href="https://arxiv.org/abs/2303.09551">SurroundOcc</a></th> <th><a href="https://arxiv.org/abs/2502.05040">Symphonies (ours)</a></th> <th><a href="https://arxiv.org/abs/2306.15670">Symphonies</a></th> <th><a href="https://arxiv.org/abs/2304.05316">OccFormer</a></th> <th><a href="https://arxiv.org/abs/2112.00726">MonoScene</a></th> </tr> <tr> <th>Type</th> <th>w/ GaussRender</th> <th>base</th> <th>w/ GaussRender</th> <th>base</th> <th>base</th> <th>base</th> </tr> </thead> <tbody> <tr> <td>IoU</td> <td>38.62 <span style="color: green;">(+0.11)</span></td> <td>38.51</td> <td><strong>44.08 🥇</strong> <span style="color: green;">(+0.68)</span></td> <td>43.40 🥈</td> <td>40.27</td> <td>37.87</td> </tr> <tr> <td>mIoU</td> <td>13.34 <span style="color: green;">(+0.26)</span></td> <td>13.08</td> <td><strong>18.11 🥇</strong> <span style="color: green;">(+0.29)</span></td> <td>17.82 🥈</td> <td>13.81</td> <td>12.31</td> </tr> </tbody> </table> <hr/> <h3 id="mosic-optimal-transport-motion-trajectories-for-dense-self-supervised-learning">MoSiC: Optimal-Transport Motion Trajectories for Dense Self-Supervised Learning</h3> <h5 id="authors-mohammadreza-salehi-shashanka-venkataramanan-ioana-simion-efstratios-gavves-cees-snoek-yuki-asano">Authors: <a href="https://scholar.google.com/citations?user=kpT3gcsAAAAJ&amp;hl=en">Mohammadreza Salehi</a>, <a href="https://shashankvkt.github.io/">Shashanka Venkataramanan</a>, Ioana Simion, <a href="https://www.egavves.com/">Efstratios Gavves</a>, <a href="https://www.ceessnoek.info/">Cees Snoek</a>, <a href="https://yukimasano.github.io/">Yuki Asano</a></h5> <h5 align="center"> [<a href="https://arxiv.org/abs/2506.08694">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/SMSD75/MoSiC/tree/main">Code</a>] </h5> <p><img src="/assets/img/publications/2025_mosic.png" alt="mosic_teaser" height="100%" width="100%"/></p> <p>Dense self-supervised learning has shown great promise for learning pixel- and patch-level representations, but extending it to videos remains challenging due to complex motion dynamics. Existing approaches struggle under object deformations, occlusions, and camera movement, leading to inconsistent feature learning over time. In this work, we introduce MoSiC, a motion-guided self-supervised framework that clusters dense point tracks to learn spatiotemporally consistent representations. Using an off-the-shelf point tracker, we extract long-range motion trajectories and optimize feature clustering with a momentum-encoder-based optimal transport mechanism. Temporal coherence is enforced by propagating cluster assignments along tracked points, ensuring feature consistency across views despite viewpoint changes. 
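</p> <p>The consistency idea can be sketched very compactly: features sampled along the same point track should fall into the same cluster across frames. The toy snippet below uses a hard assignment from the first frame and a cross-entropy loss; MoSiC instead relies on optimal transport with a momentum encoder, so treat the code as an illustration of the principle only, with made-up shapes.</p> <pre><code class="language-python">
# Simplified sketch of track-level consistency (not the MoSiC training code):
# cluster assignments from a reference frame supervise the features sampled at
# the same tracked points in every other frame.
import torch
import torch.nn.functional as F

T, C, H, W = 4, 32, 28, 28           # frames, feature dim, feature map size
K, P = 64, 100                       # clusters, tracked points

feats = torch.randn(T, C, H, W, requires_grad=True)   # per-frame dense features
prototypes = F.normalize(torch.randn(K, C), dim=1)    # cluster centers
tracks = torch.rand(T, P, 2) * 2 - 1                  # (x, y) per frame, in [-1, 1]

# Sample each frame's feature map at its tracked point locations.
pt_feats = F.grid_sample(feats, tracks.view(T, P, 1, 2), align_corners=False)
pt_feats = F.normalize(pt_feats.squeeze(-1).permute(0, 2, 1), dim=-1)   # (T, P, C)

logits = pt_feats @ prototypes.t() / 0.1              # (T, P, K)
with torch.no_grad():
    targets = logits[0].argmax(dim=-1)                # assignments in frame 0
# Later frames are pushed toward the frame-0 assignment of their own track.
loss = F.cross_entropy(logits[1:].reshape(-1, K), targets.repeat(T - 1))
loss.backward()
print(float(loss))
</code></pre> <p>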
By leveraging motion as an implicit supervisory signal and initializing from strong image-pretrained models, MoSiC learns robust representations that generalize across frames. Our approach improves state-of-the-art performance by 1% to 6% across six image and video datasets and four evaluation benchmarks.</p> <hr/> <h3 id="floss-free-lunch-in-open-vocabulary-semantic-segmentation">FLOSS: Free Lunch in Open-vocabulary Semantic Segmentation</h3> <h5 id="authors-yasser-benigmim-mohammad-fahes-tuan-hung-vu-andrei-bursuc-raoul-de-charette">Authors: <a href="https://yasserben.github.io/">Yasser Benigmim</a>, <a href="https://mfahes.github.io/">Mohammad Fahes</a>, <a href="https://tuanhungvu.github.io/">Tuan-Hung Vu</a>, <a href="https://abursuc.github.io/">Andrei Bursuc</a>, <a href="https://team.inria.fr/rits/membres/raoul-de-charette/">Raoul de Charette</a></h5> <h5 align="center"> [<a href="https://arxiv.org/abs/2504.10487">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/yasserben/FLOSS">Code</a>] </h5> <p><img src="/assets/img/publications/2025_floss.jpg" alt="floss_teaser" height="100%" width="100%"/></p> <p>In Open-Vocabulary Semantic Segmentation (OVSS), class-wise text embeddings are usually averaged over multiple templates (e.g., "a photo of &lt;class&gt;", "a sketch of &lt;class&gt;") to form classifiers. We show that for each class, there exist single-template classifiers—termed class-experts—that outperform conventional averaged classifiers. To identify these class-experts without labeled data, we leverage class-wise prediction entropy and select the lowest-entropy classifiers as the most reliable. We then fuse the outputs of these class-experts using a novel plug-and-play process called FLOSS. FLOSS is complementary to existing OVSS methods, requiring no additional labels or training, yet consistently improves performance. Extensive experiments demonstrate that FLOSS enhances state-of-the-art OVSS models, generalizes across datasets with distribution shifts, and yields strong improvements in low-data scenarios with only a few unlabeled images.</p> <hr/> <h3 id="analyzing-fine-tuning-representation-shift-for-multimodal-llms-steering-alignment">Analyzing Fine-tuning Representation Shift for Multimodal LLMs Steering Alignment</h3> <h5 id="authors-pegah-khayatan-mustafa-shukor-jayneel-parekh-matthieu-cord">Authors: <a href="https://pegah-kh.github.io/">Pegah Khayatan</a>, <a href="https://mustafashukor.github.io/">Mustafa Shukor</a>, <a href="https://jayneelparekh.github.io/">Jayneel Parekh</a>, <a href="https://cord.isir.upmc.fr/">Matthieu Cord</a></h5> <h5 align="center"> [<a href="https://arxiv.org/abs/2501.03012">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/mshukor/xl-vlms/">Code</a>] </h5> <p><img src="/assets/img/publications/2025_xlxvlm.png" alt="xlvlm_teaser" height="100%" width="100%"/></p> <p>Multimodal LLMs have achieved remarkable proficiency in understanding multimodal inputs, yet little attention has been given to explaining how their internal representations evolve during training. Most explainability research focuses only on final model states, ignoring dynamic representational shifts. In this work, we systematically analyze the evolution of hidden state representations during fine-tuning, revealing how models adapt to new multimodal tasks. Using a concept-based approach, we map hidden states to interpretable visual and textual concepts, enabling us to trace concept changes across modalities as training progresses. 
We introduce shift vectors to capture these changes, allowing recovery of fine-tuned concepts from the original model. Furthermore, we demonstrate practical applications in model steering, such as adjusting answer types, caption styles, or biasing responses without additional training. This work provides novel insights into multimodal representation adaptation and offers tools for interpreting and controlling fine-tuned multimodal LLMs.</p>]]></content><author><name></name></author><category term="multi-sensor"/><category term="3d-perception"/><category term="foundation"/><category term="limited-supervision"/><category term="zero-shot"/><category term="deep-learning"/><category term="generalization"/><category term="explainability"/><summary type="html"><![CDATA[The International Conference on Computer Vision (ICCV) is a leading conference that brings together researchers and practitioners in computer vision and machine learning. At the 2025 edition, the valeo.ai team will present five papers in the main conference. We are also co-organizing the Learning to See: Advancing Spatial Understanding for Embodied Intelligence workshop, and contributing to the Foundational Data for Industrial Tech Transfer workshop with a keynote on Towards openness of vision foundation models.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://valeoai.github.io//assets/img/posts/ICCV_2025.png"/><media:content medium="image" url="https://valeoai.github.io//assets/img/posts/ICCV_2025.png" xmlns:media="http://search.yahoo.com/mrss/"/></entry><entry><title type="html">valeo.ai at ICLR 2025</title><link href="https://valeoai.github.io//posts/2025-04-01-valeoai-at-iclr-2025/" rel="alternate" type="text/html" title="valeo.ai at ICLR 2025"/><published>2025-04-01T00:00:00+00:00</published><updated>2025-04-01T00:00:00+00:00</updated><id>https://valeoai.github.io//posts/valeoai-at-iclr-2025</id><content type="html" xml:base="https://valeoai.github.io//posts/2025-04-01-valeoai-at-iclr-2025/"><![CDATA[<p>The <a href="https://iclr.cc/">International Conference on Learning Representations (ICLR)</a> is a leading conference that brings together researchers and practitioners in deep learning, representation learning, and artificial intelligence. It covers a wide range of topics, including optimization, generative models, interpretability, and robustness. This year, at the thirteenth edition of ICLR, the <a href="../../">valeo.ai</a> team will present 5 papers in the main conference.</p> <p>We will be happy to discuss these projects and ideas, and share our exciting ongoing research. 
Take a quick look at our papers below and come meet us at the posters or catch us for a coffee in the hallways.</p> <hr/> <h2 id="halton-scheduler-for-masked-generative-image-transformer">Halton Scheduler For Masked Generative Image Transformer</h2> <h3 id="authors-victor-besnier--mickael-chen--david-hurych--eduardo-valle--matthieu-cord">Authors: <a href="https://scholar.google.com/citations?hl=fr&amp;user=n_C2h-QAAAAJ">Victor Besnier</a>   <a href="https://www.linkedin.com/in/mickael-chen-ml/">Mickael Chen</a>   <a href="https://scholar.google.com/citations?user=XY1PVwYAAAAJ&amp;hl=fr&amp;oi=ao">David Hurych</a>   <a href="https://eduardovalle.com/">Eduardo Valle</a>   <a href="https://cord.isir.upmc.fr/">Matthieu Cord</a></h3> <h4 align="center"> [<a href="https://arxiv.org/abs/2503.17076">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/valeoai/Halton-MaskGIT">Code</a>] &nbsp;&nbsp; [<a href="https://valeoai.github.io/publications/2025_halton_maskgit">Project page</a>]</h4> <p><img src="/assets/img/publications/2025_halton_maskgit/schematics.png" alt="halton_overview" height="100%" width="100%"/></p> <p>Masked Generative Image Transformers (MaskGIT) have gained popularity for their fast and efficient image generation capabilities. However, the sampling strategy used to progressively <em>"unmask"</em> tokens in these models plays a crucial role in determining image quality and diversity. Our new research paper introduces the <strong>Halton Scheduler</strong>—a novel approach that significantly enhances MaskGIT's image generation performance.</p> <h3>From Confidence to Halton: What’s New?</h3> <p>Traditional MaskGIT uses a Confidence scheduler, which selects tokens based on logit distribution but tends to cluster token selection, leading to reduced image diversity. 
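</p> <p>As background for the scheduler discussed below, a Halton sequence is a standard low-discrepancy construction that can be used to visit the positions of a token grid in a well-spread order. The snippet is a textbook implementation shown for illustration, not the paper's code.</p> <pre><code class="language-python">
# Standard 2D Halton sequence (bases 2 and 3), used here to order the cells of
# a token grid so that consecutive positions are spread out over the image.
def halton(index, base):
    """index-th element (1-based) of the van der Corput sequence in `base`."""
    result, f = 0.0, 1.0
    while index:
        f = f / base
        result = result + f * (index % base)
        index = index // base
    return result

def halton_token_order(grid=16):
    """Visit every cell of a grid x grid token map following the Halton sequence."""
    order, seen, i = [], set(), 1
    while len(order) != grid * grid:
        x = int(halton(i, 2) * grid)
        y = int(halton(i, 3) * grid)
        if (x, y) not in seen:
            seen.add((x, y))
            order.append((x, y))
        i = i + 1
    return order

print(halton_token_order(16)[:8])   # the first positions already cover the grid
</code></pre> <p>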
The Halton Scheduler addresses this by leveraging a <strong>low-discrepancy sequence</strong>, the Halton sequence, to distribute token selection more uniformly across the image.</p> <div style="text-align: center;"> <img src="../../assets/img/publications/2025_halton_maskgit/imagenet_quali.png" alt="Halton example on ImageNet" style="max-width: 100%; height: auto; border-radius: 5px;"/> <p style="font-size: 14px; color: #555;">Figure 1: MaskGIT using our Halton scheduler on ImageNet 256.</p> </div> <h3>Key Insights and Benefits</h3> <ul> <li><strong>Improved Image Quality and Diversity:</strong> The Halton scheduler reduces clustering of sampled tokens, enhancing image sharpness and background richness.</li> <li><strong>No Retraining Required:</strong> This scheduler can be integrated as a drop-in replacement for the existing MaskGIT sampling strategy.</li> <li><strong>Faster and More Balanced Sampling:</strong> By reducing token correlation, the Halton Scheduler allows MaskGIT to progressively add fine details while avoiding local sampling errors.</li> </ul> <div style="text-align: center;"> <img src="../../assets/img/publications/2025_halton_maskgit/txt2img_halton.jpg" alt="Halton example" style="max-width: 100%; height: auto; border-radius: 5px;"/> <p style="font-size: 14px; color: #555;">Figure 2: MaskGIT using our Halton scheduler for text-to-image.</p> </div> <div style="text-align: center;"> <img src="../../assets/img/publications/2025_halton_maskgit/txt2img_conf.jpg" alt="Confidence example" style="max-width: 100%; height: auto; border-radius: 5px;"/> <p style="font-size: 14px; color: #555;">Figure 3: MaskGIT using the Confidence scheduler for text-to-image.</p> </div> <h3>Results: ImageNet and COCO Benchmarks</h3> <p>On benchmark datasets like ImageNet (256×256) and COCO, the Halton Scheduler outperforms the baseline Confidence scheduler:</p> <ul> <li><strong>Reduced Fréchet Inception Distance (FID):</strong> Indicating better image realism.</li> <li><strong>Improved Precision and Recall:</strong> Reflecting a more diverse image generation.</li> </ul> <hr/> <h2 id="llm-wrapper-black-box-semantic-aware-adaptation-of-vision-language-models-for-referring-expression-comprehension">LLM-wrapper: Black-Box Semantic-Aware Adaptation of Vision-Language Models for Referring Expression Comprehension</h2> <h3 id="authors-amaia-cardiel--éloi-zablocki--elias-ramzi--oriane-siméoni--matthieu-cord">Authors: Amaia Cardiel    <a href="https://eloiz.github.io">Éloi Zablocki</a>    <a href="https://elias-ramzi.github.io/">Elias Ramzi</a>    <a href="https://osimeoni.github.io/">Oriane Siméoni</a>    <a href="https://cord.isir.upmc.fr/">Matthieu Cord</a></h3> <h4 align="center"> [<a href="https://arxiv.org/abs/2409.11919">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/valeoai/LLM_wrapper">Code</a>] &nbsp;&nbsp; [<a href="https://valeoai.github.io/publications/llm_wrapper/">Project page</a>]</h4> <p><img src="/assets/img/publications/2024_llm_wrapper/llm_wrapper_pipeline.png" alt="llm_wrapper_overview" height="100%" width="100%"/></p> <p>Vision Language Models (VLMs) have demonstrated remarkable capabilities in various open-vocabulary tasks, yet their zero-shot performance lags behind task-specific fine-tuned models, particularly in complex tasks like Referring Expression Comprehension (REC). Fine-tuning usually requires “white-box” access to the model’s architecture and weights, which is not always feasible due to proprietary or privacy concerns. 
In this work, we propose LLM-wrapper, a method for “black-box” adaptation of VLMs for the REC task using Large Language Models (LLMs). LLM-wrapper capitalizes on the reasoning abilities of LLMs, improved with a light fine-tuning, to select the most relevant bounding box matching the referring expression, from candidates generated by a zero-shot black-box VLM. Our approach offers several advantages: it enables the adaptation of closed-source models without needing access to their internal workings, it is versatile as it works with any VLM, it transfers to new VLMs and datasets, and it allows for the adaptation of an ensemble of VLMs. We evaluate LLM-wrapper on multiple datasets using different VLMs and LLMs, demonstrating significant performance improvements and highlighting the versatility of our method. While LLM-wrapper is not meant to directly compete with standard white-box fine-tuning, it offers a practical and effective alternative for black-box VLM adaptation.</p> <p><img src="/assets/img/publications/2024_llm_wrapper/llm_wrapper_talk2car.png" alt="llm_wrapper_results" height="100%" width="100%"/></p> <hr/> <h2 id="moca-self-supervised-representation-learning-by-predicting-masked-online-codebook-assignments">MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments</h2> <h3 id="authors-spyros-gidaris-andrei-bursuc--oriane-siméoni--antonin-vobecky--nikos-komodakis--matthieu-cord--patrick-pérez">Authors: <a href="https://gidariss.github.io/&amp;hl=en">Spyros Gidaris</a>   <a href="https://abursuc.github.io/">Andrei Bursuc</a>   <a href="https://osimeoni.github.io/">Oriane Siméoni</a>    <a href="https://vobecant.github.io/">Antonin Vobecky</a>    <a href="https://www.csd.uoc.gr/~komod/">Nikos Komodakis</a>    <a href="https://cord.isir.upmc.fr/">Matthieu Cord</a>    <a href="https://ptrckprz.github.io/">Patrick Pérez</a></h3> <h4 align="center"> [<a href="https://arxiv.org/abs/2312.15297">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/valeoai/MOCA">Code</a>] &nbsp;&nbsp; [<a href="https://valeoai.github.io/publications/moca/">Project page</a>]</h4> <p><img src="/assets/img/publications/2024_moca/moca-teaser.png" alt="moca_teaser" height="100%" width="100%"/></p> <p>Self-supervised learning can be used for mitigating the greedy needs of Vision Transformer networks for very large fully-annotated datasets. Different classes of self-supervised learning offer representations with either good contextual reasoning properties, e.g., using masked image modeling strategies, or invariance to image perturbations, e.g., with contrastive methods. In this work, we propose a single-stage and standalone method, MOCA, which unifies both desired properties using novel mask-and-predict objectives defined with high-level features (instead of pixel-level details). Moreover, we show how to effectively employ both learning paradigms in a synergistic and computation-efficient way. 
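</p>

<p>To give a rough sense of the mask-and-predict part of such an objective, here is a small, hypothetical sketch rather than MOCA itself: a student is trained to predict, for masked patches only, which codebook entry the teacher’s high-level feature is assigned to. In MOCA the assignments come from a codebook maintained online during training; the sketch keeps a fixed random codebook for brevity.</p>

<pre><code class="language-python">import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

dim, n_codes, n_patches = 64, 32, 49
codebook = F.normalize(torch.randn(n_codes, dim), dim=-1)    # fixed here; maintained online in the real method
student = nn.Linear(dim, dim)                                # stand-in for the student prediction head

def masked_assignment_loss(teacher_feats, student_feats, mask):
    """Cross-entropy between student predictions and the teacher's codebook assignments, on masked patches."""
    with torch.no_grad():
        targets = (F.normalize(teacher_feats, dim=-1) @ codebook.t()).argmax(-1)  # high-level assignments
    logits = student(student_feats) @ codebook.t()
    return F.cross_entropy(logits[mask], targets[mask])       # predict only where the input was masked

teacher_feats = torch.randn(n_patches, dim)                   # features of the unmasked view (teacher)
student_feats = torch.randn(n_patches, dim)                   # features computed from the masked view (student)
mask = torch.bernoulli(torch.full((n_patches,), 0.6)).bool()  # which patches were masked out
print(float(masked_assignment_loss(teacher_feats, student_feats, mask)))
</code></pre>

<p>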
Doing so, we achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols with a training that is at least 3 times faster than prior methods.</p> <hr/> <h2 id="learning-a-neural-solver-for-parametric-pdes-to-enhance-physics-informed-methods">Learning a Neural Solver for Parametric PDEs to Enhance Physics-Informed Methods</h2> <h3 id="authors-lise-le-boudec--emmanuel-de-bezenac--louis-serrano--ramon-daniel-regueiro-espino--yuan-yin--patrick-gallinari">Authors: <a href="https://2ailesb.github.io/">Lise Le Boudec</a>    <a href="https://scholar.google.fr/citations?user=KvZw5gYAAAAJ">Emmanuel de Bezenac</a>    <a href="https://scholar.google.com/citations?user=fKlo-lUAAAAJ">Louis Serrano</a>    <a href="https://rd-regueiroespino.github.io/">Ramon Daniel Regueiro-Espino</a>    <a href="https://yuan-yin.github.io">Yuan Yin</a>    <a href="https://pages.isir.upmc.fr/gallinari/">Patrick Gallinari</a></h3> <h4 align="center"> [<a href="https://arxiv.org/abs/2410.06820">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/2ailesB/neural-parametric-solver">Code</a>] &nbsp;&nbsp; [<a href="https://valeoai.github.io/publications/neural-parametric-solver/">Project page</a>]</h4> <p><img src="/assets/img/publications/2025_neural_parametric_solver.png" alt="neural_solver_pde" height="100%" width="100%"/></p> <p>Physics-informed deep learning often faces optimization challenges due to the complexity of solving partial differential equations (PDEs), which involve exploring large solution spaces, require numerous iterations, and can lead to unstable training. These challenges arise particularly from the ill-conditioning of the optimization problem, caused by the differential terms in the loss function. To address these issues, we propose learning a solver, i.e., solving PDEs using a physics-informed iterative algorithm trained on data. Our method learns to condition a gradient descent algorithm that automatically adapts to each PDE instance, significantly accelerating and stabilizing the optimization process and enabling faster convergence of physics-aware models. Furthermore, while traditional physics-informed methods solve for a single PDE instance, our approach addresses parametric PDEs. Specifically, our method integrates the physical loss gradient with the PDE parameters to solve over a distribution of PDE parameters, including coefficients, initial conditions, or boundary conditions. 
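</p>

<p>As a toy illustration of this learning-to-optimize idea (a deliberately simplified sketch with a scalar “physics” residual, not the paper’s architecture), the snippet below trains a tiny network that maps the physics-loss gradient and the problem parameter to an update step, and optimizes it over a distribution of parameters:</p>

<pre><code class="language-python">import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy parametric problem: find u minimizing the "physics" residual ||a * u - 1||^2 for a coefficient a.
def physics_loss(u, a):
    return ((a * u - 1.0) ** 2).mean()

# A tiny learned solver: it maps (gradient, parameter) to an update, replacing a fixed step size.
step_net = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 1))

def solve(a, n_steps=10):
    u = torch.zeros(1)
    for _ in range(n_steps):
        u = u.detach().requires_grad_(True)
        g = torch.autograd.grad(physics_loss(u, a), u)[0]
        u = u - step_net(torch.cat([g, a]))     # learned, parameter-conditioned update
    return u

opt = torch.optim.Adam(step_net.parameters(), lr=1e-3)
for _ in range(200):
    a = torch.rand(1) * 2 + 0.5                 # sample a problem parameter
    final_residual = physics_loss(solve(a), a)  # residual left after the learned iterations
    opt.zero_grad()
    final_residual.backward()
    opt.step()

print(float(physics_loss(solve(torch.tensor([1.5])), torch.tensor([1.5]))))
</code></pre>

<p>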
We demonstrate the effectiveness of our method through empirical experiments on multiple datasets, comparing training and test-time optimization performance.</p> <hr/> <h2 id="toddlerdiffusion-interactive-structured-image-generation-with-cascaded-schrödinger-bridge">ToddlerDiffusion: Interactive Structured Image Generation with Cascaded Schrödinger Bridge</h2> <h3 id="authors-eslam-abdelrahman--liangbing-zhao--vincent-tao-hu--matthieu-cord--patrick-perez--mohamed-elhoseiny">Authors: Eslam Abdelrahman    Liangbing Zhao    Vincent Tao Hu    <a href="https://cord.isir.upmc.fr/">Matthieu Cord</a>    <a href="https://ptrckprz.github.io/">Patrick Perez</a>    Mohamed Elhoseiny</h3> <h4 align="center"> [<a href="https://arxiv.org/abs/2311.14542">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/toddlerdiffusion/code">Code</a>] &nbsp;&nbsp; [<a href="https://toddlerdiffusion.github.io/website/">Project page</a>]</h4> <p><img src="/assets/img/publications/2025_toddler.PNG" alt="toddlerdiffusion" height="100%" width="100%"/></p> <p>Diffusion models break down the challenging task of generating data from high-dimensional distributions into a series of easier denoising steps. Inspired by this paradigm, we propose a novel approach that extends the diffusion framework into modality space, decomposing the complex task of RGB image generation into simpler, interpretable stages. Our method, termed ToddlerDiffusion, cascades modality-specific models, each responsible for generating an intermediate representation, such as contours, palettes, and detailed textures, ultimately culminating in a high-quality RGB image. Instead of relying on the naive LDM concatenation conditioning mechanism to connect the different stages together, we employ a Schrödinger Bridge to determine the optimal transport between different modalities. Although employing a cascaded pipeline introduces more stages, which could lead to a more complex architecture, each stage is meticulously formulated for efficiency and accuracy, surpassing Stable-Diffusion (LDM) performance. Modality composition not only enhances overall performance but enables emerging properties such as consistent editing, interaction capabilities, high-level interpretability, and faster convergence and sampling rate. Extensive experiments on diverse datasets, including LSUN-Churches, ImageNet, CelebHQ, and LAION-Art, demonstrate the efficacy of our approach, consistently outperforming state-of-the-art methods. For instance, ToddlerDiffusion achieves notable efficiency, matching LDM performance on LSUN-Churches while operating 2× faster with a 3× smaller architecture.</p>]]></content><author><name></name></author><category term="limited-supervision"/><category term="reliability"/><category term="foundation"/><category term="robustness"/><category term="generalization"/><category term="deep-learning"/><summary type="html"><![CDATA[The International Conference on Learning Representations (ICLR) is a leading conference that brings together researchers and practitioners in deep learning, representation learning, and artificial intelligence. It covers a wide range of topics, including optimization, generative models, interpretability, and robustness.
This year, at the thirteenth edition of ICLR, the valeo.ai team will present 5 papers in the main conference.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://valeoai.github.io//assets/img/posts/2025_iclr.jpg"/><media:content medium="image" url="https://valeoai.github.io//assets/img/posts/2025_iclr.jpg" xmlns:media="http://search.yahoo.com/mrss/"/></entry><entry><title type="html">valeo.ai at NeurIPS 2024</title><link href="https://valeoai.github.io//posts/2024-12-04-valeoai-at-neurips-2024/" rel="alternate" type="text/html" title="valeo.ai at NeurIPS 2024"/><published>2024-12-04T00:00:00+00:00</published><updated>2024-12-04T00:00:00+00:00</updated><id>https://valeoai.github.io//posts/valeoai-at-neurips-2024</id><content type="html" xml:base="https://valeoai.github.io//posts/2024-12-04-valeoai-at-neurips-2024/"><![CDATA[<p>The <a href="https://neurips.cc/">Neural Information Processing Systems Conference (NeurIPS)</a> is a major inter-disciplinary event that brings together researchers and practitioners in machine learning, computer vision, natural language processing, optimization, statistics, but also neuroscience, natural sciences, social sciences, etc. This year, at the thirty-eighth edition of NeurIPS, the <a href="../">valeo.ai</a> team will present 7 papers in the main conference.</p> <p>We will be happy to discuss more about these projects and ideas, and share our exciting ongoing research. Take a quick view of our papers below and come meet us at the posters or catch us for a coffee in the hallways.</p> <hr/> <h2 id="no-train-all-gain-self-supervised-gradients-improve-deep-frozen-representations">No Train, all Gain: Self-Supervised Gradients Improve Deep Frozen Representations</h2> <h3 id="authors-walter-simoncini---spyros-gidaris----andrei-bursuc--yuki-m-asano">Authors: <a href="https://walter.ashita.nl/">Walter Simoncini</a>    <a href="https://gidariss.github.io/&amp;hl=en">Spyros Gidaris</a>    <a href="https://abursuc.github.io/">Andrei Bursuc</a>    <a href="https://yukimasano.github.io/">Yuki M. Asano</a></h3> <h4 align="center"> [<a href="https://arxiv.org/abs/2407.10964">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/WalterSimoncini/fungivision">Code</a>] &nbsp;&nbsp; [<a href="https://valeoai.github.io/publications/fungi">Project page</a>]</h4> <p><img src="/assets/img/publications/2024_fungi/overview.jpg" alt="fungi_overview" height="100%" width="100%"/></p> <p>This paper introduces FUNGI: Features from UNsupervised GradIents, a method to enhance the features of vision encoders by leveraging self-supervised gradients. Our method is simple: given any pretrained model, we first compute gradients from various self-supervised objectives for each input. These are projected to a lower dimension and then concatenated with the model’s embedding. The resulting features are evaluated on k-nearest neighbor classification over 11 datasets from vision, 5 from natural language processing, and 2 from audio. Across backbones spanning various sizes and pretraining strategies, FUNGI features provide consistent performance improvements over the embeddings.
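</p>

<p>In code, the recipe is short. The sketch below is a minimal, self-contained caricature with a toy encoder and a toy objective (all names and sizes are hypothetical, and it is not the released FUNGI implementation): it computes the gradient of a simple self-supervised loss with respect to a small head, projects it to a low dimension with a fixed random matrix, and concatenates it to the embedding.</p>

<pre><code class="language-python">import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256))   # toy stand-in for a pretrained encoder
head = nn.Linear(256, 128)                                             # auxiliary head for the SSL objective
n_grad = sum(p.numel() for p in head.parameters())
projection = torch.randn(n_grad, 64) / 64 ** 0.5                       # fixed random down-projection

def fungi_style_feature(x):
    """Return [embedding ; projected SSL gradient] for an input batch of one image."""
    emb = backbone(x)
    z1 = head(emb)
    z2 = head(emb + 0.1 * torch.randn_like(emb))    # crude second "view", perturbed in feature space
    loss = 1 - F.cosine_similarity(z1, z2).mean()   # toy self-supervised objective
    grads = torch.autograd.grad(loss, tuple(head.parameters()))
    g = torch.cat([p.flatten() for p in grads])     # flatten the head gradients
    return torch.cat([emb.squeeze(0), g @ projection]).detach()

print(fungi_style_feature(torch.randn(1, 3, 32, 32)).shape)   # 256 + 64 = 320 dimensions
</code></pre>

<p>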
We also show that using FUNGI features can benefit linear classification and image retrieval, and that they significantly improve the retrieval-based in-context scene understanding abilities of pretrained models, for example improving upon DINO by +17% for semantic segmentation — without any training.</p> <hr/> <h2 id="manipose-manifold-constrained-multi-hypothesis-3d-human-pose-estimation">ManiPose: Manifold-Constrained Multi-Hypothesis 3D Human Pose Estimation</h2> <h3 id="authors-cédric-rommel--victor-letzelter--nermin-samet--renaud-marlet---matthieu-cord--patrick-pérez--eduardo-valle">Authors: <a href="https://cedricrommel.github.io/">Cédric Rommel</a>    <a href="https://scholar.google.com/citations?user=YhTdZh8AAAAJ&amp;hl=en&amp;oi=ao">Victor Letzelter</a>    <a href="https://nerminsamet.github.io/">Nermin Samet</a>    <a href="http://imagine.enpc.fr/~marletr/">Renaud Marlet</a>    <a href="https://cord.isir.upmc.fr/">Matthieu Cord</a>    <a href="https://ptrckprz.github.io/">Patrick Pérez</a>    <a href="https://eduardovalle.com/">Eduardo Valle</a></h3> <h4 align="center"> [<a href="https://arxiv.org/abs/2312.06386">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/cedricrommel/manipose">Code</a>] &nbsp;&nbsp; [<a href="https://valeoai.github.io/publications/manipose/">Project page</a>]</h4> <p>We propose ManiPose, a manifold-constrained multi-hypothesis model for human-pose 2D-to-3D lifting. We provide theoretical and empirical evidence that, due to the depth ambiguity inherent to monocular 3D human pose estimation, traditional regression models suffer from pose-topology consistency issues, which standard evaluation metrics (MPJPE, P-MPJPE and PCK) fail to assess. ManiPose addresses depth ambiguity by proposing multiple candidate 3D poses for each 2D input, each with its estimated plausibility. Unlike previous multi-hypothesis approaches, ManiPose forgoes generative models, greatly facilitating its training and usage. By constraining the outputs to lie on the human pose manifold, ManiPose guarantees the consistency of all hypothetical poses, in contrast to previous works. 
We showcase the performance of ManiPose on real-world datasets, where it outperforms state-of-the-art models in pose consistency by a large margin while being very competitive on the MPJPE metric.</p> <p><img src="/assets/img/publications/2024_manipose/ManiPose_teaser.png" alt="manipose_overview" height="90%" width="90%"/></p> <hr/> <h2 id="annealed-multiple-choice-learning-overcoming-limitations-of-winner-takes-all-with-annealing">Annealed Multiple Choice Learning: Overcoming limitations of Winner-takes-all with annealing</h2> <h3 id="authors-david-perera--victor-letzelter--théo-mariotte---adrien-cortés--mickael-chen--slim-essid--gaël-richard">Authors: <a href="https://adasp.telecom-paris.fr/members/">David Perera</a>    <a href="https://scholar.google.com/citations?user=YhTdZh8AAAAJ&amp;hl=en&amp;oi=ao">Victor Letzelter</a>    <a href="https://scholar.google.com/citations?user=q3HZFcwAAAAJ">Théo Mariotte </a>    <a href="https://www.linkedin.com/in/c1adrien/">Adrien Cortés</a>    <a href="https://www.linkedin.com/in/mickael-chen-ml/">Mickael Chen</a>    <a href="https://slimessid.github.io/research/">Slim Essid</a>    <a href="https://www.telecom-paris.fr/gael-richard">Gaël Richard</a></h3> <h4 align="center"> [<a href="https://arxiv.org/abs/2407.15580">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/Victorletzelter/annealed_mcl">Code</a>] &nbsp;&nbsp; [<a href="https://valeoai.github.io/publications/annealing/">Project page</a>]</h4> <p>We introduce Annealed Multiple Choice Learning (aMCL) which combines simulated annealing with MCL. MCL is a learning framework handling ambiguous tasks by predicting a small set of plausible hypotheses. These hypotheses are trained using the Winner-takes-all (WTA) scheme, which promotes the diversity of the predictions. However, this scheme may converge toward an arbitrarily suboptimal local minimum, due to the greedy nature of WTA. We overcome this limitation using annealing, which enhances the exploration of the hypothesis space during training. We leverage insights from statistical physics and information theory to provide a detailed description of the model training trajectory. Additionally, we validate our algorithm by extensive experiments on synthetic datasets, on the standard UCI benchmark, and on speech separation.</p> <p><img src="/assets/img/publications/2024_annealing/amcl_gif.gif" alt="annealing_overview" height="100%" width="100%"/></p> <hr/> <h2 id="geps-boosting-generalization-in-parametric-pde-neural-solvers-through-adaptive-conditioning">GEPS: Boosting Generalization in Parametric PDE Neural Solvers through Adaptive Conditioning</h2> <h3 id="authors-armand-kassaï-koupaï--jorge-mifsut-benet--yuan-yin--jean-noël-vittaut--patrick-gallinari">Authors: <a href="https://itsakk.github.io/">Armand Kassaï Koupaï</a>    <a href="https://www.isir.upmc.fr/personnel/mifsutbenet/">Jorge Mifsut-Benet</a>    <a href="https://yuan-yin.github.io">Yuan Yin</a>    <a href="https://webia.lip6.fr/~vittaut/">Jean-Noël Vittaut</a>    <a href="https://pages.isir.upmc.fr/gallinari/">Patrick Gallinari</a></h3> <h4 align="center"> [<a href="https://arxiv.org/abs/2410.23889">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/itsakk/geps">Code</a>] &nbsp;&nbsp; [<a href="https://valeoai.github.io/publications/geps/">Project page</a>]</h4> <p>Solving parametric partial differential equations (PDEs) presents significant challenges for data-driven methods due to the sensitivity of spatio-temporal dynamics to variations in PDE parameters. 
Machine learning approaches often struggle to capture this variability. To address this, data-driven approaches learn parametric PDEs by sampling a very large variety of trajectories with varying PDE parameters. We first show that incorporating conditioning mechanisms for learning parametric PDEs is essential and that, among them, <em>adaptive conditioning</em> allows stronger generalization. As existing adaptive conditioning methods do not scale well with respect to the number of PDE parameters, we propose GEPS, a simple adaptation mechanism to boost GEneralization in Pde Solvers via a first-order optimization and low-rank rapid adaptation of a small set of context parameters. We demonstrate the versatility of our approach for both fully data-driven and for physics-aware neural solvers. Validation performed on a whole range of spatio-temporal forecasting problems demonstrates excellent performance for generalizing to unseen conditions including initial conditions, PDE coefficients, forcing terms and solution domain.</p> <p><img src="/assets/img/publications/2024_geps/geps.png" alt="geps_example" height="100%" width="100%"/></p> <hr/> <h2 id="a-concept-based-explainability-framework-for-large-multimodal-models">A Concept-Based Explainability Framework for Large Multimodal Models</h2> <h3 id="authors-jayneel-parekh--pegah-khayatan--mustafa-shukor--alasdair-newson--matthieu-cord">Authors: <a href="https://jayneelparekh.github.io/">Jayneel Parekh</a>    <a href="https://pegah-kh.github.io/">Pegah Khayatan</a>    <a href="https://mustafashukor.github.io/">Mustafa Shukor</a>    <a href="https://sites.google.com/site/alasdairnewson/">Alasdair Newson</a>    <a href="https://cord.isir.upmc.fr/">Matthieu Cord</a></h3> <h4 align="center"> [<a href="https://arxiv.org/abs/2406.08074">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/mshukor/xl-vlms">Code</a>] &nbsp;&nbsp; [<a href="https://jayneelparekh.github.io/LMM_Concept_Explainability/">Project page</a>]</h4> <p>Large multimodal models (LMMs) combine unimodal encoders and large language models (LLMs) to perform multimodal tasks. Despite recent advancements towards the interpretability of these models, understanding internal representations of LMMs remains largely a mystery. In this paper, we present a novel framework for the interpretation of LMMs. We propose a dictionary learning based approach, applied to the representation of tokens. The elements of the learned dictionary correspond to our proposed concepts. We show that these concepts are well semantically grounded in both vision and text. Thus we refer to these as “multi-modal concepts”. We qualitatively and quantitatively evaluate the results of the learnt concepts. We show that the extracted multimodal concepts are useful to interpret representations of test samples.
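</p>

<p>To illustrate the flavor of this dictionary-learning step, the toy sketch below factorizes a random stand-in for token representations with an off-the-shelf non-negative matrix factorization and reads off the concept activations of a test token; the actual framework operates on real LMM token representations and a more careful decomposition:</p>

<pre><code class="language-python">import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

# Hypothetical stand-in for token representations extracted from a large multimodal model:
# 1000 tokens, 64 dimensions, kept non-negative for this toy factorization.
token_reprs = rng.random((1000, 64))

K = 10                                                   # number of "concepts"
nmf = NMF(n_components=K, init="nndsvda", max_iter=500, random_state=0)
activations = nmf.fit_transform(token_reprs)             # (1000, K): how much each token uses each concept
concepts = nmf.components_                               # (K, 64): the learned dictionary elements

# A new test token is interpreted by projecting it onto the learned dictionary.
test_token = rng.random((1, 64))
test_activations = nmf.transform(test_token)
print("most activated concept:", int(test_activations.argmax()))
</code></pre>

<p>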
Finally, we evaluate the disentanglement between different concepts and the quality of grounding concepts visually and textually.</p> <p><img src="/assets/img/publications/2024_neurips/concept_xai.PNG" alt="concept_overview" height="100%" width="100%"/></p> <hr/> <h2 id="diffcut-catalyzing-zero-shot-semantic-segmentation-with-diffusion-features-and-recursive-normalized-cut">DiffCut: Catalyzing Zero-Shot Semantic Segmentation with Diffusion Features and Recursive Normalized Cut</h2> <h3 id="authors-paul-couairon--mustafa-shukor--jean-emmanuel-haugeard--matthieu-cord--nicolas-thomea">Authors: <a href="https://scholar.google.fr/citations?user=yQRnP7YAAAAJ">Paul Couairon</a>    <a href="https://mustafashukor.github.io/">Mustafa Shukor</a>    <a href="https://dblp.org/pid/92/6849.html">Jean-Emmanuel Haugeard</a>    <a href="https://cord.isir.upmc.fr/">Matthieu Cord</a>    <a href="https://thome.isir.upmc.fr/">Nicolas Thome</a></h3> <h4 align="center"> [<a href="https://arxiv.org/abs/2406.02842">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/PaulCouairon/DiffCut">Code</a>] &nbsp;&nbsp; [<a href="https://diffcut-segmentation.github.io/">Project page</a>]</h4> <p>Foundation models have emerged as powerful tools across various domains including language, vision, and multimodal tasks. While prior works have addressed unsupervised image segmentation, they significantly lag behind supervised models. In this paper, we use a diffusion UNet encoder as a foundation vision encoder and introduce DiffCut, an unsupervised zero-shot segmentation method that solely harnesses the output features from the final self-attention block. Through extensive experimentation, we demonstrate that the utilization of these diffusion features in a graph-based segmentation algorithm significantly outperforms previous state-of-the-art methods on zero-shot segmentation. Specifically, we leverage a recursive Normalized Cut algorithm that softly regulates the granularity of detected objects and produces well-defined segmentation maps that precisely capture intricate image details. Our work highlights the remarkably accurate semantic knowledge embedded within diffusion UNet encoders that could then serve as foundation vision encoders for downstream tasks.</p> <p><img src="/assets/img/publications/2024_neurips/diffcut.PNG" alt="diffcut_overview" height="100%" width="100%"/></p> <hr/> <h2 id="implicit-multimodal-alignment-on-the-generalization-of-frozen-llms-to-multimodal-inputs">Implicit Multimodal Alignment: On the Generalization of Frozen LLMs to Multimodal Inputs</h2> <h3 id="authors-mustafa-shukor--matthieu-cord">Authors: <a href="https://mustafashukor.github.io/">Mustafa Shukor</a>    <a href="https://cord.isir.upmc.fr/">Matthieu Cord</a></h3> <h4 align="center"> [<a href="https://arxiv.org/abs/2405.16700">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/mshukor/ima-lmms">Code</a>] &nbsp;&nbsp; [<a href="https://ima-lmms.github.io/">Project page</a>]</h4> <p>Large Language Models (LLMs) have demonstrated impressive performance on multimodal tasks, without any multimodal finetuning. They are the building block for Large Multimodal Models, yet we still lack a proper understanding of their success. In this work, we expose frozen LLMs to image, video, audio and text inputs and analyse their internal representation aiming to understand their generalization beyond textual inputs. Findings.
Perceptual tokens (1) are easily distinguishable from textual ones inside LLMs, with significantly different representations, and a complete translation to textual tokens does not exist. Yet, (2) both perceptual and textual tokens activate similar LLM weights. Despite being different, (3) perceptual and textual tokens are implicitly aligned inside LLMs; we call this the implicit multimodal alignment (IMA), and argue that this is linked to architectural design, helping LLMs to generalize. This provides more evidence to believe that the generalization of LLMs to multimodal inputs is mainly due to their architecture. Implications. (1) We find a positive correlation between the implicit alignment score and the task performance, suggesting that this could act as a proxy metric for model evaluation and selection. (2) A negative correlation exists regarding hallucinations, revealing that this problem is mainly due to misalignment between the internal perceptual and textual representations. (3) Perceptual tokens change slightly throughout the model, thus, we propose different approaches to skip computations (e.g. in FFN layers), and significantly reduce the inference cost. (4) Due to the slowly changing embeddings across layers, and the high overlap between textual and multimodal activated weights, we compress LLMs by keeping only 1 subnetwork that works well across a wide range of multimodal tasks.</p> <p><img src="/assets/img/publications/2024_neurips/ima_llm.PNG" alt="implicit_overview" height="100%" width="100%"/></p> <hr/>]]></content><author><name></name></author><category term="3d-perception"/><category term="multi-sensor"/><category term="limited-supervision"/><category term="reliability"/><category term="motion-forecasting"/><category term="robustness"/><category term="generalization"/><category term="driving"/><category term="deep-learning"/><summary type="html"><![CDATA[The Neural Information Processing Systems Conference (NeurIPS) is a major inter-disciplinary event that brings together researchers and practitioners in machine learning, computer vision, natural language processing, optimization, statistics, but also neuroscience, natural sciences, social sciences, etc. This year, at the thirty-eighth edition of NeurIPS, the valeo.ai team will present 7 papers in the main conference.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://valeoai.github.io//assets/img/posts/2023_neurips/logo_neurips.svg"/><media:content medium="image" url="https://valeoai.github.io//assets/img/posts/2023_neurips/logo_neurips.svg" xmlns:media="http://search.yahoo.com/mrss/"/></entry><entry><title type="html">valeo.ai at ECCV 2024</title><link href="https://valeoai.github.io//posts/2024-09-25-valeoai-at-eccv-2024/" rel="alternate" type="text/html" title="valeo.ai at ECCV 2024"/><published>2024-09-25T00:00:00+00:00</published><updated>2024-09-25T00:00:00+00:00</updated><id>https://valeoai.github.io//posts/valeoai-at-eccv-2024</id><content type="html" xml:base="https://valeoai.github.io//posts/2024-09-25-valeoai-at-eccv-2024/"><![CDATA[<p>The <a href="https://eccv.ecva.net/">European Conference on Computer Vision (ECCV)</a> is a biennial landmark conference for the increasingly large community of researchers in computer vision and machine learning from both academia and industry. At the 2024 edition the valeo.ai team will present five papers in the main conference and four in the workshops.
We are also organizing two tutorials (<a href="https://uqtutorial.github.io/">Bayesian Odyssey</a> and <a href="https://shashankvkt.github.io/eccv2024-SSLBIG-tutorial.github.io/">Time is precious: Self-Supervised Learning Beyond Images</a>), the <a href="https://uncertainty-cv.github.io/2024/">Uncertainty Quantification for Computer Vision</a> workshop, a talk at the <a href="https://sites.google.com/view/omnilabel-workshop-eccv24/program">OmniLabel workshop</a>, and the <a href="https://github.com/valeoai/bravo_challenge">BRAVO challenge</a>. Our team has also contributed to the reviewing process, with seven reviewers, three area chairs, and two outstanding reviewer awards. The team will be at ECCV to present these works and will be happy to discuss more about these projects and ideas, and share our exciting ongoing research.</p> <p><img src="/assets/img/posts/2024_eccv/valeoai_eccv.jfif" alt="valeo.ai team at ECCV 2024" height="100%" width="100%"/></p> <h2 id="train-till-you-drop-towards-stable-and-robust-source-free-unsupervised-3d-domain-adaptation">Train Till You Drop: Towards Stable and Robust Source-free Unsupervised 3D Domain Adaptation</h2> <h3 id="authors-bjoern-michele--alexandre-boulch--tuan-hung-vu--gilles-puy--renaud-marlet--nicolas-courty">Authors: <a href="https://bjoernmichele.com">Bjoern Michele</a>    <a href="https://boulch.eu/">Alexandre Boulch</a>    <a href="https://tuanhungvu.github.io/">Tuan-Hung Vu</a>    <a href="https://sites.google.com/site/puygilles/">Gilles Puy</a>    <a href="http://imagine.enpc.fr/~marletr/">Renaud Marlet</a>    <a href="https://people.irisa.fr/Nicolas.Courty/">Nicolas Courty</a></h3> <h4 align="center"> [<a href="https://arxiv.org/abs/2409.04409">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/valeoai/TTYD">Code</a>] &nbsp;&nbsp; [<a href="https://valeoai.github.io/publications/ttyd/">Project page</a>]</h4> <p><img src="/assets/img/publications/2024_ttyd/featured.png" alt="ttyd_overview" height="100%" width="100%"/></p> <div class="caption"><b>Evolution of the performance of baselines without degradation prevention strategies as they train over 20k iterations.</b> Our method (TTYDcore) uses an unsupervised criterion to stop training. The horizontal dotted line illustrates that we keep the model obtained at the stopping point (marked with a cross). </div> <p>We tackle the challenging problem of source-free unsupervised domain adaptation (SFUDA) for 3D semantic segmentation. It amounts to performing domain adaptation on an unlabeled target domain without any access to source data; the available information is a model trained to achieve good performance on the source domain. A common issue with existing SFUDA approaches is that performance degrades after some training time, which is a by-product of an under-constrained and ill-posed problem. We discuss two strategies to alleviate this issue. First, we propose a sensible way to regularize the learning problem. Second, we introduce a novel criterion based on agreement with a reference model. It is used (1) to stop the training when appropriate and (2) as a validator to select hyperparameters without any knowledge of the target domain. Our contributions are easy to implement and readily amenable to all SFUDA methods, ensuring stable improvements over all baselines.
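</p>

<p>The stopping idea can be illustrated with a toy simulation (hypothetical labels, drift model, and threshold, not the paper’s actual criterion): adaptation continues only while the adapted model still agrees with a frozen reference model on a large enough fraction of points, and the last model satisfying this condition is kept.</p>

<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)
reference = rng.integers(0, 10, size=5000)           # per-point labels from the frozen reference model

def agreement(a, b):
    """Fraction of points on which two label maps agree."""
    return float((a == b).mean())

current = reference.copy()                            # predictions of the model being adapted
kept, stopped_at = current.copy(), None
for step in range(200):
    flip = rng.random(current.shape) < 0.005          # simulate drift: a few points change label each step
    current = np.where(flip, rng.integers(0, 10, size=current.shape), current)
    if agreement(current, reference) < 0.8:           # unsupervised stopping criterion
        stopped_at = step
        break
    kept = current.copy()                             # last model that still agrees enough

print("stopped at step:", stopped_at)
</code></pre>

<p>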
We validate our findings on various 3D lidar settings, achieving state-of-the-art performance.</p> <p><img src="/assets/img/posts/2024_eccv/ttyd_results.PNG" alt="ttyd_results" height="100%" width="100%"/></p> <div class="caption"><b>Examples of results with TTYDstop</b>: ground truth (GT), initial model trained only on source data, training with our training scheme when using our stopping criterion, and “full” training for 20k iterations. Notable errors due to degradation are marked with a dashed rectangle. </div> <hr/> <h2 id="unitraj-a-unified-framework-for-scalable-vehicle-trajectory-prediction">UniTraj: A Unified Framework for Scalable Vehicle Trajectory Prediction</h2> <h4 id="authors-lan-feng--mohammadhossein-bahari--kaouther-messaoud-ben-amor--éloi-zablocki--matthieu-cord--alexandre-alahi">Authors: <a href="https://alan-lanfeng.github.io/">Lan Feng</a>    <a href="https://mohammadhossein-bahari.github.io/">Mohammadhossein Bahari</a>    <a href="https://scholar.google.com/citations?user=X0teZIAAAAAJ">Kaouther Messaoud Ben Amor</a>    <a href="https://scholar.google.fr/citations?user=dOkbUmEAAAAJ">Éloi Zablocki</a>    <a href="https://cord.isir.upmc.fr/">Matthieu Cord</a>    <a href="https://people.epfl.ch/alexandre.alahi">Alexandre Alahi</a></h4> <h4 align="center"> [<a href="https://arxiv.org/abs/2403.15098">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/vita-epfl/unitraj">Code</a>] &nbsp;&nbsp; [<a href="https://vita-epfl.github.io/UniTraj/">Project page</a>] &nbsp;&nbsp; [<a href="https://www.youtube.com/watch?v=2IzuUtiNA_4">Video</a>]</h4> <p>Vehicle trajectory prediction has increasingly relied on data-driven solutions, but their ability to scale to different data domains and the impact of larger dataset sizes on their generalization remain under-explored. While these questions can be studied by employing multiple datasets, it is challenging due to several discrepancies, e.g., in data formats, map resolution, and semantic annotation types. To address these challenges, we introduce UniTraj, a comprehensive framework that unifies various datasets, models, and evaluation criteria, presenting new opportunities for the vehicle trajectory prediction field. In particular, using UniTraj, we conduct extensive experiments and find that model performance significantly drops when transferred to other datasets. However, enlarging data size and diversity can substantially improve performance, leading to a new state-of-the-art result for the nuScenes dataset. We provide insights into dataset characteristics to explain these findings.</p> <p><img src="/assets/img/publications/2024_unitraj/unitraj.PNG" alt="unitraj_overview" height="100%" width="100%"/></p> <div class="caption"><b>Overview of UniTraj: a unified platform for comprehensive research in trajectory prediction.</b> </div> <hr/> <h2 id="lost-and-found-overcoming-detector-failures-in-online-multi-object-tracking">Lost and Found: Overcoming Detector Failures in Online Multi-Object Tracking</h2> <h4 id="authors-lorenzo-vaquero--yihong-xu--xavier-alameda-pineda--víctor-m-brea--manuel-mucientes">Authors: <a href="https://citius.gal/team/lorenzo-vaquero-otal/">Lorenzo Vaquero</a>    <a href="https://github.com/yihongXU">Yihong Xu</a>    <a href="https://xavirema.eu/">Xavier Alameda-Pineda</a>    <a href="https://citius.gal/team/victor-manuel-brea-sanchez/">Víctor M. 
Brea</a>    <a href="https://persoal.citius.usc.es/manuel.mucientes/">Manuel Mucientes</a></h4> <h4 align="center"> [<a href="https://arxiv.org/abs/2407.10151">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/lorenzovaquero/BUSCA">Code</a>] &nbsp;&nbsp; [<a href="https://valeoai.github.io/publications/busca/">Project page</a>]</h4> <p>Multi-object tracking (MOT) endeavors to precisely estimate the positions and identities of multiple objects over time. The prevailing approach, tracking-by-detection (TbD), first detects objects and then links detections, resulting in a simple yet effective method. However, contemporary detectors may occasionally miss some objects in certain frames, causing trackers to cease tracking prematurely. To tackle this issue, we propose BUSCA, meaning ‘to search’, a versatile framework compatible with any online TbD system, enhancing its ability to persistently track those objects missed by the detector, primarily due to occlusions. Remarkably, this is accomplished without modifying past tracking results or accessing future frames, i.e., in a fully online manner. BUSCA generates proposals based on neighboring tracks, motion, and learned tokens. Utilizing a decision Transformer that integrates multimodal visual and spatiotemporal information, it addresses the object-proposal association as a multi-choice question-answering task. BUSCA is trained independently of the underlying tracker, solely on synthetic data, without requiring fine-tuning. Through BUSCA, we showcase consistent performance enhancements across five different trackers and establish a new state-of-the-art baseline across three different benchmarks.</p> <p><img src="/assets/img/publications/2024_busca/busca.png" alt="busca_overview" height="90%" width="90%"/></p> <div class="caption"><b>Overview of BUSCA</b>: Enhancing multi-object trackers by finding undetected objects. </div> <hr/> <h2 id="clip-dinoiser-teaching-clip-a-few-dino-tricks-for-open-vocabulary-semantic-segmentation">CLIP-DINOiser: Teaching CLIP a few DINO tricks for open-vocabulary semantic segmentation</h2> <h4 id="authors-monika-wysoczańska-oriane-siméoni--michaël-ramamonjisoa-andrei-bursuc-tomasz-trzciński--patrick-pérez">Authors: <a href="https://wysoczanska.github.io/">Monika Wysoczańska</a>  <a href="https://osimeoni.github.io/">Oriane Siméoni</a>   <a href="https://michaelramamonjisoa.github.io/">Michaël Ramamonjisoa</a>  <a href="https://abursuc.github.io/">Andrei Bursuc</a>  <a href="https://scholar.google.com/citations?hl=en&amp;user=bJMRBFoAAAAJ">Tomasz Trzciński</a>   <a href="https://ptrckprz.github.io/">Patrick Pérez</a></h4> <h4 align="center"> [<a href="https://arxiv.org/abs/2312.12359">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/wysoczanska/clip_dinoiser/">Code</a>] &nbsp;&nbsp; [<a href="https://wysoczanska.github.io/CLIP_DINOiser/">Project page</a>]</h4> <p>The popular CLIP model displays impressive zero-shot capabilities thanks to its seamless interaction with arbitrary text prompts. However, its lack of spatial awareness makes it unsuitable for dense computer vision tasks, e.g., semantic segmentation, without an additional fine-tuning step that often uses annotations and can potentially suppress its original open-vocabulary properties. Meanwhile, self-supervised representation methods have demonstrated good localization properties without human-made annotations nor explicit supervision. 
In this work, we take the best of both worlds and propose an open-vocabulary semantic segmentation method, which does not require any annotations.</p> <p><img src="/assets/img/publications/2024_dinoiser/dinoiser-examples.png" alt="dinoiser_example" height="100%" width="100%"/></p> <div class="caption"><b>Examples of open-vocabulary semantic segmentation results obtained with our method CLIP-DINOiser on ‘in-the-wild’ images vs. those of MaskCLIP.</b> </div> <p>We propose to locally improve dense MaskCLIP features, which are computed with a simple modification of CLIP’s last pooling layer, by integrating localization priors extracted from self-supervised features from DINO. By doing so, we greatly improve the performance of MaskCLIP and produce smooth outputs. Moreover, we show that the used self-supervised feature properties can directly be learnt from CLIP features. Our method CLIP-DINOiser needs only a single forward pass of CLIP and two light convolutional layers at inference, no extra supervision nor extra memory and reaches state-of-the-art results on challenging and fine-grained benchmarks such as COCO, Pascal Context, Cityscapes and ADE20k.</p> <p><img src="/assets/img/posts/2024_eccv/dinoiser_overview.PNG" alt="dinoiser_overview" height="100%" width="100%"/></p> <div class="caption"><b>Overview of CLIP-DINOiser.</b> We use DINO as a teacher which ‘teaches’ CLIP how to extract localization information with similar patch correlations. At inference, an input image is forwarded through the frozen CLIP image backbone and MaskCLIP projection. The produced features are then improved with our pooling strategy which is guided by correlations predicted with a trained convolutional layer applied on CLIP.</div> <hr/> <h2 id="reliability-in-semantic-segmentation-can-we-use-synthetic-data">Reliability in Semantic Segmentation: Can We Use Synthetic Data?</h2> <h4 id="authors-thibaut-loiseau---tuan-hung-vu---mickael-chen--patrick-pérez--matthieu-cord">Authors: <a href="https://imagine-lab.enpc.fr/staff-members/thibaut-loiseau/">Thibaut Loiseau</a>    <a href="https://tuanhungvu.github.io/">Tuan-Hung Vu</a>    <a href="https://scholar.google.fr/citations?user=QnRpMJAAAAAJ">Mickael Chen</a>    <a href="https://ptrckprz.github.io/">Patrick Pérez</a>    <a href="https://cord.isir.upmc.fr/">Matthieu Cord</a></h4> <h4 align="center"> [<a href="https://arxiv.org/abs/2312.09231">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/valeoai/GenVal">Code</a>] &nbsp;&nbsp; [<a href="https://valeoai.github.io/publications/genval">Project page</a>]</h4> <p>Assessing the reliability of perception models to covariate shifts and out-of-distribution (OOD) detection is crucial for safety-critical applications such as autonomous vehicles. By nature of the task, however, the relevant data is difficult to collect and annotate. In this paper, we challenge cutting-edge generative models to automatically synthesize data for assessing reliability in semantic segmentation. By fine-tuning Stable Diffusion, we perform zero-shot generation of synthetic data in OOD domains or inpainted with OOD objects. Synthetic data is employed to provide an initial assessment of pretrained segmenters, thereby offering insights into their performance when confronted with real edge cases. Through extensive experiments, we demonstrate a high correlation between the performance on synthetic data and the performance on real OOD data, showing the validity of the approach.
Furthermore, we illustrate how synthetic data can be utilized to enhance the calibration and OOD detection capabilities of segmenters.</p> <p><img src="/assets/img/publications/2024_genval/genval-overview.PNG" alt="genval_overview" height="100%" width="100%"/></p> <div class="caption"><b>Assessing 40 pretrained segmenters under covariate shifts.</b> Segmentation models under scrutiny were trained on Cityscapes train set only (in-domain data). They are evaluated on (i) Cityscapes validation set, (ii) real OOD data, and (iii) proposed synthetic data. We observe a strong correlation between results on (ii) and (iii). </div> <hr/> <h2 id="valeo4cast-a-modular-approach-to-end-to-end-forecasting">Valeo4Cast: A Modular Approach to End-to-End Forecasting</h2> <p class="page-description"><a href="https://sites.google.com/view/road-eccv2024/home">ECCV 2024 ROAD++ Workshop</a></p> <p class="page-description"><a href="https://www.argoverse.org/E2EForecasting.html">Winning solution in Argoverse 2 Unified Detection, Tracking and Forecasting Challenge</a></p> <h4 id="authors-yihong-xu-éloi-zablocki--alexandre-boulch-gilles-puy---mickaël-chen-florent-bartoccioni--nermin-samet---oriane-siméoni---spyros-gidaris---tuan-hung-vu-andrei-bursuc--eduardo-valle-renaud-marlet--matthieu-cord">Authors: <a href="https://scholar.google.fr/citations?user=vMLRRVkAAAAJ">Yihong Xu</a>, <a href="https://scholar.google.fr/citations?user=dOkbUmEAAAAJ">Éloi Zablocki</a>, <a href="https://www.boulch.eu/">Alexandre Boulch</a>, <a href="https://sites.google.com/site/puygilles/home">Gilles Puy</a>, <a href="https://scholar.google.com/citations?user=QnRpMJAAAAAJ">Mickaël Chen</a>, <a href="https://f-barto.github.io/">Florent Bartoccioni</a>, <a href="https://nerminsamet.github.io/">Nermin Samet</a>, <a href="https://osimeoni.github.io/">Oriane Siméoni</a>, <a href="https://gidariss.github.io/">Spyros Gidaris</a>, <a href="https://tuanhungvu.github.io/">Tuan-Hung Vu</a>, <a href="https://abursuc.github.io/">Andrei Bursuc</a>, <a href="https://scholar.google.com/citations?user=lxWPqWAAAAAJ">Eduardo Valle</a>, <a href="http://imagine.enpc.fr/~marletr/">Renaud Marlet</a>, <a href="https://cord.isir.upmc.fr/">Matthieu Cord</a></h4> <h4 align="center"> [<a href="https://arxiv.org/abs/2406.08113">Paper</a>] &nbsp;&nbsp; [<a href="https://eval.ai/web/challenges/challenge-page/2006/leaderboard/4752">leaderboard</a>] &nbsp;&nbsp; [<a href="https://valeoai.github.io/publications/valeo4cast/">page</a>]</h4> <p>Motion forecasting is crucial in autonomous driving systems to anticipate the future trajectories of surrounding agents such as pedestrians, vehicles, and traffic signals. In end-to-end forecasting, the model must jointly detect from sensor data (cameras or LiDARs) the position and past trajectories of the different elements of the scene and predict their future location. We depart from the current trend of tackling this task via end-to-end training from perception to forecasting and we use a modular approach instead. Following a recent study, we individually build and train detection, tracking, and forecasting modules. We then only use consecutive finetuning steps to integrate the modules better and alleviate compounding errors. Our study reveals that this simple yet effective approach significantly improves performance on the end-to-end forecasting benchmark. Consequently, our solution ranks first in the Argoverse 2 end-to-end Forecasting Challenge held at CVPR 2024 Workshop on Autonomous Driving (WAD), with 63.82 mAPf. 
We surpass forecasting results by +17.1 points over last year’s winner and by +13.3 points over this year’s runner-up. This remarkable performance in forecasting can be explained by our modular paradigm, which integrates finetuning strategies and significantly outperforms the end-to-end-trained counterparts.</p> <p><img src="/assets/img/publications/2024_valeo4cast/valeo4cast.PNG" alt="valeo4cast_overview" height="80%" width="80%"/></p> <div class="caption"><b>Valeo4Cast overview.</b> </div> <hr/> <h2 id="pafuse-part-based-diffusion-for-3d-whole-body-pose-estimation">PAFUSE: Part-based Diffusion for 3D Whole-Body Pose Estimation</h2> <p class="page-description"><a href="https://sites.google.com/view/t-cap-2024/home">ECCV 2024 Workshop Towards a Complete Analysis of People (T-CAP)</a></p> <h4 id="authors-nermin-samet--cédric-rommel--david-picard--eduardo-valle">Authors: <a href="https://nerminsamet.github.io/">Nermin Samet</a>    <a href="https://cedricrommel.github.io/">Cédric Rommel</a>    <a href="https://davidpicard.github.io/">David Picard</a>    <a href="https://eduardovalle.com/">Eduardo Valle</a></h4> <h4 align="center"> [<a href="https://arxiv.org/abs/2407.10220">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/valeoai/PAFUSE">Code</a>] &nbsp;&nbsp; [<a href="https://valeoai.github.io/publications/pafuse">Project page</a>]</h4> <p>We introduce a novel approach for 3D whole-body pose estimation, addressing the scale and deformability variance across body parts that arises when extending the 17 major joints on the human body to fine-grained keypoints on the face and hands. In addition to addressing the challenge of exploiting motion in unevenly sampled data, we combine stable diffusion with a hierarchical part representation which predicts the relative locations of fine-grained keypoints within each part (e.g., face) with respect to the part’s local reference frame. On the H3WB dataset, our method greatly outperforms the current state of the art, which fails to exploit the temporal information. We also show considerable improvements compared to other spatiotemporal 3D human-pose estimation approaches that fail to account for the body part specificities.</p> <p><img src="/assets/img/posts/2024_eccv/pafuse.PNG" alt="pafuse_overview" height="100%" width="100%"/></p> <div class="caption"><b>Overview of PAFUSE.</b> </div> <hr/> <h2 id="llm-wrapper-black-box-semantic-aware-adaptation-of-vision-language-foundation-models">LLM-wrapper: Black-Box Semantic-Aware Adaptation of Vision-Language Foundation Models</h2> <p class="page-description"><a href="https://sites.google.com/view/eval-fomo-24/home">ECCV 2024 Workshop on Emergent Visual Abilities and Limits of Foundation Models (Eval-FoMo)</a></p> <h4 id="authors-amaia-cardiel--éloi-zablocki--oriane-siméoni--elias-ramzi--matthieu-cord">Authors: Amaia Cardiel    <a href="https://scholar.google.fr/citations?user=dOkbUmEAAAAJ">Éloi Zablocki</a>    <a href="https://osimeoni.github.io/">Oriane Siméoni</a>    <a href="https://elias-ramzi.github.io/">Elias Ramzi</a>    <a href="https://cord.isir.upmc.fr/">Matthieu Cord</a></h4> <h4 align="center"> [<a href="https://arxiv.org/abs/2409.11919">Paper</a>] &nbsp;&nbsp; [<a href="https://valeoai.github.io/publications/llm_wrapper">Project page</a>]</h4> <p>Vision Language Models (VLMs) have shown impressive performance on numerous tasks but their zero-shot capabilities can be limited compared to dedicated or fine-tuned models.
Yet, fine-tuning VLMs comes with strong limitations as it requires ‘white-box’ access to the model’s architecture and weights while some recent models are proprietary (e.g., Grounding DINO 1.5). It also requires expertise to design the fine-tuning objectives and optimize the hyper-parameters, which are specific to each VLM and downstream task. In this work, we propose LLM-wrapper, a novel approach to adapt VLMs in a ‘black-box’ and semantic-aware manner by leveraging large language models (LLMs) so as to reason on their outputs.</p> <p><img src="/assets/img/publications/2024_llm_wrapper/llm_wrapper.PNG" alt="llm-wrapper_overview" height="80%" width="80%"/></p> <div class="caption"><b>Overview of LLM-Wrapper.</b> </div> <p>We demonstrate the effectiveness of LLM-wrapper on Referring Expression Comprehension (REC), a challenging open-vocabulary task that requires spatial and semantic reasoning. Our approach significantly boosts the performance of off-the-shelf models, yielding results that are on par or competitive when compared with classic VLM fine-tuning (cf ‘FT VLM’ in our main results). Despite a few failure cases due to the LLM ‘blindness’ (cf Qualitative results, bottom right), LLM-wrapper shows better semantic, spatial and relational reasoning, as illustrated in our qualitative results below.</p> <p><img src="/assets/img/posts/2024_eccv/llm_wrapper_results.PNG" alt="llm-wrapper_results" height="90%" width="90%"/></p> <div class="caption"><b>LLM-Wrapper results.</b> </div> <hr/> <h2 id="regents-real-world-safety-critical-driving-scenario-generation-made-stable">ReGentS: Real-World Safety-Critical Driving Scenario Generation Made Stable</h2> <p class="page-description"><a href="https://coda-dataset.github.io/w-coda2024/">ECCV 2024 Workshop on Multimodal Perception and Comprehension of Corner Cases in Autonomous Driving (W-CODA)</a></p> <h4 id="authors-yuan-yin--pegah-khayatan--éloi-zablocki--alexandre-boulch--matthieu-cord">Authors: <a href="https://yuan-yin.github.io/">Yuan Yin</a>    <a href="https://pegah-kh.github.io/">Pegah Khayatan</a>    <a href="https://scholar.google.fr/citations?user=dOkbUmEAAAAJ">Éloi Zablocki</a>    <a href="https://www.boulch.eu/">Alexandre Boulch</a>    <a href="https://cord.isir.upmc.fr/">Matthieu Cord</a></h4> <h4 align="center"> [<a href="https://arxiv.org/abs/2409.07830">Paper</a>] &nbsp;&nbsp; [<a href="https://valeoai.github.io/publications/regents">Project page</a>]</h4> <p>Machine learning based autonomous driving systems often face challenges with safety-critical scenarios that are rare in real-world data, hindering their large-scale deployment. While increasing real-world training data coverage could address this issue, it is costly and dangerous. This work explores generating safety-critical driving scenarios by modifying complex real-world regular scenarios through trajectory optimization. We propose ReGentS, which stabilizes generated trajectories and introduces heuristics to avoid obvious collisions and optimization problems. Our approach addresses unrealistic diverging trajectories and unavoidable collision scenarios that are not useful for training a robust planner. We also extend the scenario generation framework to handle real-world data with up to 32 agents.
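</p>

<p>As a minimal illustration of scenario generation by trajectory optimization (a toy example with made-up straight-line dynamics and weights, not the ReGentS pipeline), one can perturb a recorded adversary trajectory by gradient descent so that it comes close to the ego vehicle while staying near its original, realistic behavior:</p>

<pre><code class="language-python">import torch

torch.manual_seed(0)

# Toy "scenario": 20 time steps of 2D positions; the ego vehicle drives straight ahead.
ego = torch.cumsum(torch.tensor([[1.0, 0.0]]).repeat(20, 1), dim=0)
adv_init = ego + torch.tensor([5.0, 5.0])               # recorded adversary trajectory, offset from the ego
delta = torch.zeros_like(adv_init, requires_grad=True)  # perturbation to optimize

opt = torch.optim.Adam([delta], lr=0.05)
for _ in range(300):
    adv = adv_init + delta
    closeness = ((adv - ego) ** 2).sum(-1).min()        # pull the closest time step towards the ego
    realism = (delta ** 2).mean()                       # stay close to the recorded behavior
    loss = closeness + 10.0 * realism
    opt.zero_grad()
    loss.backward()
    opt.step()

min_dist = ((adv_init + delta - ego) ** 2).sum(-1).min().sqrt()
print("closest approach to the ego:", float(min_dist))
</code></pre>

<p>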
Additionally, by using a differentiable simulator, our approach simplifies gradient descent-based optimization involving a simulator, paving the way for future advancements.</p> <p><img src="/assets/img/publications/2024_regents/regents_page.png" alt="regents_overview" height="100%" width="100%"/></p> <hr/> <h2 id="the-bravo-semantic-segmentation-challenge-results-in-uncv2024">The BRAVO Semantic Segmentation Challenge Results in UNCV2024</h2> <p class="page-description"><a href="https://uncertainty-cv.github.io/2024/challenge/">ECCV 2024 Workshop on Uncertainty Quantification for Computer Vision</a></p> <h4 id="authors-tuan-hung-vu---eduardo-valle--andrei-bursuc--tommie-kerssies--daan-de-geus--gijs-dubbelman--long-qian--bingke-zhu--yingying-chen--ming-tang--jinqiao-wang--tomáš-vojíř--jan-šochman--jiří-matas--michael-smith--frank-ferrie--shamik-basu--christos-sakaridis--luc-van-gool">Authors: <a href="https://tuanhungvu.github.io/">Tuan-Hung Vu</a>    <a href="https://eduardovalle.com/">Eduardo Valle</a>    <a href="https://abursuc.github.io/">Andrei Bursuc</a>    <a href="">Tommie Kerssies</a>    <a href="">Daan de Geus</a>    <a href="">Gijs Dubbelman</a>    <a href="">Long Qian</a>    <a href="">Bingke Zhu</a>    <a href="">Yingying Chen</a>    <a href="">Ming Tang</a>    <a href="">Jinqiao Wang</a>    <a href="">Tomáš Vojíř</a>    <a href="">Jan Šochman</a>    <a href="">Jiří Matas</a>    <a href="">Michael Smith</a>    <a href="">Frank Ferrie</a>    <a href="">Shamik Basu</a>    <a href="">Christos Sakaridis</a>    <a href="">Luc Van Gool</a></h4> <h4 align="center"> [<a href="https://arxiv.org/abs/2409.15107">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/valeoai/bravo_challenge">Code</a>] &nbsp;&nbsp; [<a href="https://benchmarks.elsa-ai.eu/?ch=1&amp;com=introduction">Project page</a>]</h4> <p>We propose the unified BRAVO challenge to benchmark the reliability of semantic segmentation models under realistic perturbations and unknown out-of-distribution (OOD) scenarios. We define two categories of reliability: (1) semantic reliability, which reflects the model’s accuracy and calibration when exposed to various perturbations; and (2) OOD reliability, which measures the model’s ability to detect object classes that are unknown during training. The challenge attracted nearly 100 submissions from international teams representing notable research institutions. The results reveal interesting insights into the importance of large-scale pre-training and minimal architectural design in developing robust and reliable semantic segmentation models.</p> <p><img src="/assets/img/publications/2024_bravo/bravo-overview.PNG" alt="bravo_overview" height="80%" width="80%"/></p> <div class="caption"><b>All submissions.</b> Aggregated metrics (out-of-distribution and semantic) on axes, ranking metric (BRAVO Index) on level set. More freedom on the training dataset (Task 2, in orange) did not translate into better results. </div>]]></content><author><name></name></author><category term="3d-perception"/><category term="multi-sensor"/><category term="limited-supervision"/><category term="reliability"/><category term="motion-forecasting"/><category term="robustness"/><category term="generalization"/><category term="driving"/><summary type="html"><![CDATA[The European Conference on Computer Vision (ECCV) is a biennial landmark conference for the increasingly large community of researchers in computer vision and machine learning from both academia and industry. 
At the 2024 edition the valeo.ai team will present five papers in the main conference and four in the workshops. We are also organizing two tutorials (Bayesian Odyssey and Time is precious: Self-Supervised Learning Beyond Images), the Uncertainty Quantification for Computer Vision workshop, a talk at the OmniLabel workshop, and the BRAVO challenge. Our team has also contributed to the reviewing process, with seven reviewers, three area chairs, and two outstanding reviewer awards. The team will be at ECCV to present these works and will be happy to discuss more about these projects and ideas, and share our exciting ongoing research.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://valeoai.github.io//assets/img/posts/2024_eccv/eccv_banner.png"/><media:content medium="image" url="https://valeoai.github.io//assets/img/posts/2024_eccv/eccv_banner.png" xmlns:media="http://search.yahoo.com/mrss/"/></entry><entry><title type="html">valeo.ai at CVPR 2024</title><link href="https://valeoai.github.io//posts/2024-06-13-valeoai-at-cvpr-2024/" rel="alternate" type="text/html" title="valeo.ai at CVPR 2024"/><published>2024-06-13T00:00:00+00:00</published><updated>2024-06-13T00:00:00+00:00</updated><id>https://valeoai.github.io//posts/valeoai-at-cvpr-2024</id><content type="html" xml:base="https://valeoai.github.io//posts/2024-06-13-valeoai-at-cvpr-2024/"><![CDATA[<p>The <a href="https://cvpr.thecvf.com/">IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR)</a> is a key event for researchers and engineers working on computer vision and machine learning. At the 2024 edition the <a href="https://ptrckprz.github.io/valeoai/">valeo.ai</a> team will present eight <a href="https://valeoai.github.io/publications/">papers</a> in the main conference, two <a href="https://valeoai.github.io/publications/">papers</a> in workshops, and one workshop <a href="https://opendrivelab.com/cvpr2024/workshop/">keynote</a>. Also, the team will present its <a href="https://valeoai.github.io/publications/valeo4cast/">winning solution</a> to the Argoverse 2 <a href="https://www.argoverse.org/E2EForecasting.html">“Unified Detection, Tracking and Forecasting”</a> challenge held at the Workshop on Autonomous Driving. The team will be at CVPR to present these works and will be happy to discuss more about these projects and ideas, and share our exciting ongoing research. 
We outline our team papers below.</p> <p><img src="/assets/img/posts/2024_cvpr/valeoai_cvpr.jfif" alt="valeo.ai team at CVPR 2024" height="100%" width="100%"/></p> <h2 id="three-pillars-improving-vision-foundation-model-distillation-for-lidar">Three Pillars Improving Vision Foundation Model Distillation for Lidar</h2> <h4 id="authors-gilles-puy-spyros-gidaris-alexandre-boulch-oriane-siméoni--corentin-sautier--andrei-bursuc--patrick-pérez-renaud-marlet">Authors: <a href="https://sites.google.com/site/puygilles/home">Gilles Puy</a>, <a href="https://gidariss.github.io/&amp;hl=en">Spyros Gidaris</a>, <a href="https://www.boulch.eu/">Alexandre Boulch</a>, <a href="https://osimeoni.github.io/">Oriane Siméoni</a>, <a href="https://csautier.github.io/">Corentin Sautier</a>, <a href="https://abursuc.github.io/">Andrei Bursuc</a>, <a href="https://ptrckprz.github.io/">Patrick Pérez</a>, <a href="http://imagine.enpc.fr/~marletr/">Renaud Marlet</a></h4> <h4 align="center"> [<a href="https://arxiv.org/abs/2310.17504">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/valeoai/ScalR">Code</a>] &nbsp;&nbsp; [<a href="https://youtu.be/yksj5WuJY4I">Video</a>] &nbsp;&nbsp; [<a href="https://valeoai.github.io/publications/scalr/">Project page</a>]</h4> <p>Self-supervised image backbones can be used to address complex 2D tasks (e.g., semantic segmentation, object discovery) very efficiently and with little or no downstream supervision. Ideally, 3D backbones for lidar should be able to inherit these properties after distillation of these powerful 2D features. The most recent methods for image-to-lidar distillation on autonomous driving data show promising results, obtained thanks to distillation methods that keep improving. Yet, we still notice a large performance gap when measuring by linear probing the quality of distilled vs fully supervised features.</p> <p>In this work, instead of focusing only on the distillation method, we study the effect of three pillars for distillation: the 3D backbone, the pretrained 2D backbone, and the pretraining 2D+3D dataset. In particular, thanks to our scalable distillation method named ScaLR, we show that scaling the 2D and 3D backbones and pretraining on diverse datasets leads to a substantial improvement of the feature quality. This allows us to significantly reduce the gap between the quality of distilled and fully-supervised 3D features, and to improve the robustness of the pretrained backbones to domain gaps and perturbations.
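</p> <p>To give a concrete flavour of what image-to-lidar distillation optimizes, here is a deliberately minimal PyTorch-style sketch of a point-to-pixel feature alignment loss. The tensor names and shapes are placeholders for illustration only; this is not our released ScaLR implementation (see the code repository for that).</p> <pre><code class="language-python">
import torch.nn.functional as F

def point_to_pixel_distillation_loss(point_feats, pixel_feats):
    """Align 3D point features with the 2D features of the pixels they project onto.

    point_feats: (N, C) features from the 3D lidar backbone, one per lidar point.
    pixel_feats: (N, C) features from the frozen 2D foundation model, sampled at the
        image locations where the N points project. Both are placeholder tensors.
    """
    point_feats = F.normalize(point_feats, dim=-1)
    pixel_feats = F.normalize(pixel_feats.detach(), dim=-1)  # the 2D teacher stays frozen
    # Cosine distance between paired 3D and 2D features, averaged over points.
    return (1.0 - (point_feats * pixel_feats).sum(dim=-1)).mean()
</code></pre> <p>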
The role of these pillars is actually more important than the distillation method itself, which we simplify for easier scaling.</p> <p><img src="/assets/img/posts/2024_cvpr/scalr_overview.PNG" alt="scalr_overview" height="100%" width="100%"/></p> <div class="caption"><b>ScaLR image-to-lidar distillation method with the three pillars studied in this work.</b> </div> <p>In this work, after proposing and studying a scalable distillation method, which we call ScaLR for Scalable Lidar Representation (see Figure above), we make the following contributions.</p> <p>First, we are able to significantly reduce the gap between distilled and supervised lidar representations: on nuScenes, we increase the performance by 22.8 mIoU percentage points compared to the former best distillation method.</p> <p>Second, we show it is possible to pretrain a single backbone on a mixture of datasets, performing similarly or better than separate backbones specialized on each dataset individually. The capacity of this backbone in providing good features across multiple datasets is illustrated in the figure below. For each scene in this figure, we pick a point located on a car and present the feature correlation map with respect to this point. We notice that the most correlated points also belong to cars on all datasets, illustrating the capacity of our single pretrained backbone to correctly distinguish objects on multiple datasets.</p> <p><img src="/assets/img/posts/2024_cvpr/scalr_results.PNG" alt="scalr_results" height="100%" width="100%"/></p> <div class="caption"><b>Correlation maps with a point located on a car</b> on four different scenes extracted from nuScenes, SemanticKITTI, PandaSet-64 and PandaSet-GT, respectively. The features used to compute these maps are extracted from a single pretrained backbone on all four datasets with ScaLR. Color goes from blue to red for low and high values. </div> <p>Third, we thoroughly study the properties of our distilled features. We show that they are robust to both domain gaps and perturbations. We also show that pretraining on diverse datasets improves robustness.</p> <p>Finally, we show that a possible way to get even better features is to distill the knowledge from multiple vision foundation models at the same time, which can be easily done with our scalable distillation strategy</p> <hr/> <h2 id="pointbev-a-sparse-approach-to-bev-predictions">PointBeV: A Sparse Approach to BeV Predictions</h2> <h4 id="authors-loïck-chambon-éloi-zablocki-mickaël-chen-florent-bartoccioni--patrick-pérez---matthieu-cord">Authors: <a href="https://loickch.github.io/">Loïck Chambon</a>, <a href="https://scholar.google.fr/citations?user=dOkbUmEAAAAJ">Éloi Zablocki</a>, <a href="https://scholar.google.com/citations?user=QnRpMJAAAAAJ">Mickaël Chen</a>, <a href="https://f-barto.github.io/">Florent Bartoccioni</a>, <a href="https://ptrckprz.github.io/">Patrick Pérez</a>, <a href="https://cord.isir.upmc.fr/">Matthieu Cord</a></h4> <h4 align="center"> [<a href="https://arxiv.org/abs/2312.00703">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/valeoai/pointbev">Code</a>] &nbsp;&nbsp; [<a href="https://valeoai.github.io/publications/pointbev/">Project page</a>]</h4> <p>Bird’s-eye View (BeV) representations have emerged as the de-facto shared space in driving applications, offering a unified space for sensor data fusion and supporting various downstream tasks. 
However, conventional models use grids with fixed resolution and range and face computational inefficiencies due to the uniform allocation of resources across all cells. To address this, we propose PointBeV, a novel sparse BeV segmentation model operating on sparse BeV cells instead of dense grids. This approach offers precise control over memory usage, enabling the use of long temporal contexts and accommodating memory-constrained platforms. PointBeV employs an efficient two-pass strategy for training, enabling focused computation on regions of interest. At inference time, it can be used with various memory/performance trade-offs and flexibly adjusts to new specific use cases.</p> <p><img src="/assets/img/publications/2024_pointbev/pointbev.PNG" alt="pointbev_overview" height="100%" width="100%"/></p> <div class="caption"><b>PointBeV overview.</b> As a sparse method, PointBeV is trained using local predictions, only for sampled 2D points provided as inputs. The points of interest are lifted to form 3D pillars, with each 3D point pulling visual features. To achieve this, PointBeV incorporates an efficient feature extraction process through a Sparse Feature Pulling module, illustrated in the ‘efficient feature extraction’ block. The obtained 3D BeV features are then flattened onto the 2D BeV plane and processed using a sparse U-Net with task-dependent final heads, generating local BeV predictions. For training, we only need sparse signals. At test time, points that have not been sampled are set to zero. </div> <p>PointBeV achieves state-of-the-art results on the nuScenes dataset for vehicle, pedestrian, and lane segmentation, showcasing superior performance in static and temporal settings despite being trained solely with sparse signals. We will release our code along with two new efficient modules used in the architecture: Sparse Feature Pulling, designed for the effective extraction of features from images to BeV, and Submanifold Attention, which enables efficient temporal modeling.</p> <p><img src="/assets/img/posts/2024_cvpr/pointbev_results.PNG" alt="pointbev_results" height="50%" width="50%"/></p> <div class="caption"><b>BeV vehicle IoU vs. memory footprint on nuScenes.</b> The size of a dot represents the number of BeV points being evaluated, the smaller the better. PointBeV has the capacity to explore various trade-offs between efficiency and performance by varying the number of points being considered. The remaining points are considered as zeros in the final prediction. Using PointBeV we can achieve state-of-the-art performance with only a small portion of the points and without losing performance. The memory consumption is calculated using a 40GB A100 GPU. </div> <hr/> <h2 id="dont-drop-your-samples-coherence-aware-training-benefits-conditional-diffusion">Don’t drop your samples! 
Coherence-aware training benefits Conditional diffusion</h2> <h3 id="highlight">Highlight</h3> <h4 id="authors--nicolas-dufour---victor-besnier-vicky-kalogeiton-david-picard">Authors: <a href="https://nicolas-dufour.github.io/"> Nicolas Dufour </a>, <a href="https://scholar.google.com/citations?hl=fr&amp;user=n_C2h-QAAAAJ">Victor Besnier</a>, <a href="https://vicky.kalogeiton.info/">Vicky Kalogeiton</a>, <a href="https://davidpicard.github.io/">David Picard</a></h4> <h4 align="center"> [<a href="https://arxiv.org/abs/2405.20324">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/nicolas-dufour/CAD">Code</a>] &nbsp;&nbsp; [<a href="https://www.youtube.com/watch?v=4Tu-x2-Zcxs">Video</a>] &nbsp;&nbsp; [<a href="https://nicolas-dufour.github.io/cad.html">Project page</a>]</h4> <p>Conditional diffusion models are powerful generative models that can leverage various types of conditional information, such as class labels, segmentation masks, or text captions. However, in many real-world scenarios, conditional information may be noisy or unreliable due to human annotation errors or weak alignment. In this paper, we propose the Coherence-Aware Diffusion (CAD), a novel method that integrates coherence in conditional information into diffusion models, allowing them to learn from noisy annotations without discarding data. We assume that each data point has an associated coherence score that reflects the quality of the conditional information. We then condition the diffusion model on both the conditional information and the coherence score. In this way, the model learns to ignore or discount the conditioning when the coherence is low. We show that CAD is theoretically sound and empirically effective on various conditional generation tasks. Moreover, we show that leveraging coherence generates realistic and diverse samples that respect conditional information better than models trained on cleaned datasets where samples with low coherence have been discarded.</p> <p><img src="/assets/img/publications/2024_dont_drop/teaser.png" alt="dont_drop_overview" height="90%" width="90%"/></p> <div class="caption"><b>Overview of Don't Drop your Samples.</b> </div> <hr/> <h2 id="supervised-anomaly-detection-for-complex-industrial-images">Supervised Anomaly Detection for Complex Industrial Images</h2> <h4 id="authors--aimira-baitieva--david-hurych--victor-besnier--olivier-bernard">Authors: <a href=""> Aimira Baitieva </a>, <a href="https://scholar.google.com/citations?user=XY1PVwYAAAAJ&amp;hl=fr&amp;oi=ao">David Hurych</a>, <a href="https://scholar.google.com/citations?hl=fr&amp;user=n_C2h-QAAAAJ">Victor Besnier</a>, <a href="">Olivier Bernard</a></h4> <h4 align="center"> [<a href="https://arxiv.org/abs/2405.04953">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/abc-125/segad">Code</a>] &nbsp;&nbsp; [<a href="https://valeoai.github.io/publications/segad/">Project page</a>]</h4> <p>Automating visual inspection in industrial production lines is essential for increasing product quality across various industries. Anomaly detection (AD) methods serve as robust tools for this purpose. However, existing public datasets primarily consist of images without anomalies, limiting the practical application of AD methods in production settings. To address this challenge, we present (1) the Valeo Anomaly Dataset (VAD), a novel real-world industrial dataset comprising 5000 images, including 2000 instances of challenging real defects across more than 20 subclasses. 
Acknowledging that traditional AD methods struggle with this dataset, we introduce (2) Segmentation based Anomaly Detector (SegAD). First, SegAD leverages anomaly maps as well as segmentation maps to compute local statistics. Next, SegAD uses these statistics and an optional supervised classifier score as input features for a Boosted Random Forest (BRF) classifier, yielding the final anomaly score. Our SegAD achieves state-of-the-art performance on both VAD (+2.1% AUROC) and the VisA dataset (+0.4% AUROC). The code and the models are publicly available</p> <p><img src="/assets/img/publications/2024_segad/teaser.png" alt="segad_overview" height="100%" width="100%"/></p> <div class="caption"><b>Overview of Supervised Anomaly Detection for Complex Industrial Images</b> </div> <hr/> <h2 id="a-simple-recipe-for-language-guided-domain-generalized-segmentation">A Simple Recipe for Language-guided Domain Generalized Segmentation</h2> <h4 id="authors-mohammad-fahes--tuan-hung-vu--andrei-bursuc--patrick-pérez--raoul-de-charette">Authors: <a href="https://mfahes.github.io/">Mohammad Fahes</a>, <a href="https://tuanhungvu.github.io/">Tuan-Hung Vu</a>, <a href="https://abursuc.github.io/">Andrei Bursuc</a>, <a href="https://ptrckprz.github.io/">Patrick Pérez</a>, <a href="https://team.inria.fr/rits/membres/raoul-de-charette/">Raoul de Charette</a></h4> <h4 align="center"> [<a href="https://arxiv.org/abs/2311.17922">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/astra-vision/FAMix">Code</a>] &nbsp;&nbsp; [<a href="https://www.youtube.com/watch?v=vyjtvx2El9Q">Video</a>] &nbsp;&nbsp; [<a href="https://astra-vision.github.io/FAMix/">page</a>]</h4> <p>Generalization to new domains not seen during training is one of the long-standing goals and challenges in deploying neural networks in real-world applications. Existing generalization techniques necessitate substantial data augmentation, potentially sourced from external datasets, and aim at learning invariant representations by imposing various alignment constraints. Large-scale pretraining has recently shown promising generalization capabilities, along with the potential of bridging different modalities. For instance, the recent advent of vision-language models like CLIP has opened the doorway for vision models to exploit the textual modality. In this paper, we introduce a simple framework for generalizing semantic segmentation networks by employing language as the source of randomization. Our recipe comprises three key ingredients: i) the preservation of the intrinsic CLIP robustness through minimal fine-tuning, ii) language-driven local style augmentation, and iii) randomization by locally mixing the source and augmented styles during training. Extensive experiments report state-of-the-art results on various generalization benchmarks.</p> <p><img src="/assets/img/publications/2024_famix/famix-overview.png" alt="famix_overview" height="100%" width="100%"/></p> <div class="caption"><b>Overall process of FAMix.</b> FAMix consists of two steps. (Left) Local style mining consists of dividing the low-level feature activations into patches, which are used for style mining using Prompt-driven Instance Normalization (PIN). Specifically, for each patch, the dominant class is queried from the ground truth, and the mined style is added to corresponding class-specific style bank. (Right) Training the segmentation network is performed with minimal fine-tuning of the backbone. At each iteration, the low-level feature activations are viewed as grids of patches. 
For each patch, the dominant class is queried using the ground truth, then a style is sampled from the corresponding style bank. Style randomization is performed by normalizing each patch in the grid by its statistics, and transferring the new style which is a mixing between the original style and the sampled one. The network is trained using only a cross-entropy loss. </div> <p><img src="/assets/img/posts/2024_cvpr/famix_results.PNG" alt="famix_results" height="100%" width="100%"/></p> <div class="caption"><b>Qualitative results.</b> Columns 1-2: Image and ground truth (GT), Columns 3-4-5: Different domain generalization methods, Column 6: Our results. </div> <hr/> <h2 id="make-me-a-bnn-a-simple-strategy-for-estimating-bayesian-uncertainty-from-pre-trained-models">Make Me a BNN: A Simple Strategy for Estimating Bayesian Uncertainty from Pre-trained Models</h2> <h4 id="authors-gianni-franchi--olivier-laurent--maxence-leguéry--andrei-bursuc-andrea-pilzer--angela-yao">Authors: <a href="https://www.ensta-paris.fr/fr/gianni-franchi">Gianni Franchi</a>, <a href="https://scholar.google.com/citations?user=RW4CQ68AAAAJ">Olivier Laurent</a>, <a href="https://scholar.google.com/citations?user=RCUoocYAAAAJ&amp;hl=en">Maxence Leguéry</a>, <a href="https://abursuc.github.io/">Andrei Bursuc</a>, <a href="https://scholar.google.it/citations?user=zooORRsAAAAJ&amp;hl=en">Andrea Pilzer</a>, <a href="https://www.comp.nus.edu.sg/~ayao/">Angela Yao</a></h4> <h4 align="center"> [<a href="https://arxiv.org/abs/2312.15297">Paper</a>] &nbsp;&nbsp; [<a href="https://torch-uncertainty.github.io/">Code</a>] &nbsp;&nbsp; [<a href="https://www.youtube.com/watch?v=aXqVBAOXc0o">Video</a>] &nbsp;&nbsp; [<a href="https://ensta-u2is-ai.github.io/ABNN-Make-me-a-BNN/">page</a>]</h4> <p>Deep Neural Networks (DNNs) are powerful tools for various computer vision tasks, yet they often struggle with reliable uncertainty quantification — a critical requirement for real-world applications. Bayesian Neural Networks (BNN) are equipped for uncertainty estimation but cannot scale to large DNNs where they are highly unstable to train. To address this challenge, we introduce the Adaptable Bayesian Neural Network (ABNN), a simple and scalable strategy to seamlessly transform DNNs into BNNs in a post-hoc manner with minimal computational and training overheads. ABNN preserves the main predictive properties of DNNs while enhancing their uncertainty quantification abilities through simple BNN adaptation layers (attached to normalization layers) and a few fine-tuning steps on pre-trained models. We conduct extensive experiments across multiple datasets for image classification and semantic segmentation tasks, and our results demonstrate that ABNN achieves state-of-the-art performance without the computational budget typically associated with ensemble methods.</p> <p><img src="/assets/img/posts/2024_cvpr/abnn_overview.PNG" alt="abnn_overview" height="100%" width="100%"/></p> <div class="caption"><b>Illustration of the training process for the ABNN.</b> The procedure begins with training a single DNN $\omega_{\text{MAP}}$, followed by architectural adjustments on the normalization layers to transform it into an ABNN. The final step involves fine-tuning the ABNN model. 
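</div> <p>To give a flavour of what a BNN adaptation layer attached to a normalization layer can look like, here is a small, hypothetical PyTorch sketch; it only illustrates the idea and is not the exact ABNN parameterization or training recipe from the paper.</p> <pre><code class="language-python">
import torch
import torch.nn as nn

class BayesianNormAdapter(nn.Module):
    """Hypothetical sketch: wrap a normalization layer of a pre-trained network and
    perturb its output with a learnable per-channel Gaussian noise scale."""

    def __init__(self, norm_layer, num_features):
        super().__init__()
        self.norm = norm_layer
        self.log_sigma = nn.Parameter(torch.full((num_features,), -3.0))

    def forward(self, x):
        out = self.norm(x)
        sigma = self.log_sigma.exp().view(1, -1, 1, 1)       # assumes 4D conv feature maps
        return out * (1.0 + sigma * torch.randn_like(out))   # stochastic forward pass
</code></pre> <p>At inference, a few stochastic forward passes can then be averaged, with their spread serving as an uncertainty estimate.</p> <div class="caption">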
</div> <hr/> <h2 id="spot-self-training-with-patch-order-permutation-for-object-centric-learning-with-autoregressive-transformers">SPOT: Self-Training with Patch-Order Permutation for Object-Centric Learning with Autoregressive Transformers</h2> <h3 id="highlight-1">Highlight</h3> <h4 id="authors-ioannis-kakogeorgiou-spyros-gidaris-konstantinos-karantzalos---nikos-komodakis">Authors: <a href="https://scholar.google.com/citations?user=B_dKcz4AAAAJ">Ioannis Kakogeorgiou</a>, <a href="https://gidariss.github.io/&amp;hl=en">Spyros Gidaris</a>, <a href="http://users.ntua.gr/karank/">Konstantinos Karantzalos</a>, <a href="https://www.csd.uoc.gr/~komod/">Nikos Komodakis</a></h4> <h4 align="center"> [<a href="https://arxiv.org/abs/2312.00648">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/gkakogeorgiou/spot">Code</a>] &nbsp;&nbsp; [<a href="https://valeoai.github.io/publications/spot/">page</a>]</h4> <p>Unsupervised object-centric learning aims to decompose scenes into interpretable object entities, termed slots. Slot-based auto-encoders stand out as a prominent method for this task. Within them, crucial aspects include guiding the encoder to generate object-specific slots and ensuring the decoder utilizes them during reconstruction. This work introduces two novel techniques, (i) an attention-based self-training approach, which distills superior slot-based attention masks from the decoder to the encoder, enhancing object segmentation , and (ii) an innovative patch-order permutation strategy for autoregressive transformers that strengthens the role of slot vectors in reconstruction.</p> <p><img src="/assets/img/posts/2024_cvpr/spot_archi.PNG" alt="spot_archi" height="90%" width="90%"/></p> <div class="caption"> <b>Enhancing unsupervised object-centric learning via self-training.</b> Our two-stage approach starts with exclusive training in the initial stage (not depicted) using the reconstruction loss. In the following stage, shown here, a teacher-student framework is applied. The teacher model, trained in the first stage, guides the student model with an additional loss, distilling attention masks from the teacher’s decoder to the slot-attention masks in the student’s encoder. </div> <p>The effectiveness of these strategies is showcased experimentally. The combined approach significantly surpasses prior slot-based autoencoder methods in unsupervised object segmentation, especially with complex real-world images.</p> <p><img src="/assets/img/publications/2024_spot/spot_visualizations.png" alt="spot_overview" height="100%" width="100%"/></p> <div class="caption"><b>SPOT visualizations.</b> Our novel framework enhances unsupervised object-centric learning in slot-based autoencoders using self-training and sequence permutations in the transformer decoder. It improves object-specific slot generation, excelling in complex real-world images. 
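</div> <p>The self-training signal at the core of this approach can be sketched as a simple distillation loss between attention masks; the formulation below is a simplified placeholder with hypothetical tensors, not the exact objective or mask-extraction procedure from the paper.</p> <pre><code class="language-python">
import torch

def slot_mask_distillation(student_attn, teacher_attn, eps=1e-8):
    """Simplified sketch: align the student encoder's slot-attention masks with
    (detached) attention masks taken from the teacher's decoder.

    student_attn, teacher_attn: (K, P) attention of K slots over P image patches,
    assumed normalized over slots; placeholder tensors for this illustration.
    """
    teacher = teacher_attn.detach()  # the teacher from the first training stage is kept fixed
    return -(teacher * (student_attn + eps).log()).sum(dim=0).mean()  # cross-entropy per patch
</code></pre> <div class="caption">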
</div> <hr/> <h2 id="nope-novel-object-pose-estimation-from-a-single-image">NOPE: Novel Object Pose Estimation from a Single Image</h2> <h4 id="authors-van-nguyen-nguyen-thibault-groueix--georgy-ponimatkin--yinlin-hu--renaud-marlet-mathieu-salzmann-vincent-lepetit">Authors: <a href="https://nv-nguyen.github.io/">Van Nguyen Nguyen</a>, <a href="https://imagine.enpc.fr/~groueixt/">Thibault Groueix</a>, <a href="https://scholar.google.co.kr/citations?hl=en&amp;user=5G-6ubcAAAAJ">Georgy Ponimatkin</a>, <a href="https://yinlinhu.github.io/">Yinlin Hu</a>, <a href="http://imagine.enpc.fr/~marletr/">Renaud Marlet</a>, <a href="https://people.epfl.ch/mathieu.salzmann">Mathieu Salzmann</a>, <a href="https://vincentlepetit.github.io/">Vincent Lepetit</a></h4> <h4 align="center"> [<a href="https://arxiv.org/abs/2303.13612">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/nv-nguyen/nope">Code</a>] &nbsp;&nbsp; [<a href="https://nv-nguyen.github.io/nope/">page</a>]</h4> <p>TL;DR: We introduce NOPE, a simple approach to estimate relative pose of unseen objects given only a single reference image. NOPE also predicts 3D pose distribution which can be used to address pose ambiguities due to symmetries.</p> <p>The practicality of 3D object pose estimation remains limited for many applications due to the need for prior knowledge of a 3D model and a training period for new objects. To address this limitation, we propose an approach that takes a single image of a new object as input and predicts the relative pose of this object in new images without prior knowledge of the object’s 3D model and without requiring training time for new objects and categories. We achieve this by training a model to directly predict discriminative embeddings for viewpoints surrounding the object. This prediction is done using a simple U-Net architecture with attention and conditioned on the desired pose, which yields extremely fast inference. 
We compare our approach to state-of-the-art methods and show it outperforms them both in terms of accuracy and robustness.</p> <p><img src="/assets/img/publications/2024_nope/nope.gif" alt="nope_overview" height="55%" width="55%"/></p> <div class="caption"><b>NOPE qualitative results.</b> </div> <hr/> <h2 id="valeo4cast-a-modular-approach-to-end-to-end-forecasting">Valeo4Cast: A Modular Approach to End-to-End Forecasting</h2> <p class="page-description"><a href="https://www.argoverse.org/E2EForecasting.html">Winning solution in Argoverse 2 Unified Detection, Tracking and Forecasting Challenge, at CVPR WAD 2024</a></p> <h4 id="authors-yihong-xu-éloi-zablocki--alexandre-boulch-gilles-puy---mickaël-chen-florent-bartoccioni--nermin-samet---oriane-siméoni---spyros-gidaris---tuan-hung-vu-andrei-bursuc--eduardo-valle-renaud-marlet--matthieu-cord">Authors: <a href="https://scholar.google.fr/citations?user=vMLRRVkAAAAJ">Yihong Xu</a>, <a href="https://scholar.google.fr/citations?user=dOkbUmEAAAAJ">Éloi Zablocki</a>, <a href="https://www.boulch.eu/">Alexandre Boulch</a>, <a href="https://sites.google.com/site/puygilles/home">Gilles Puy</a>, <a href="https://scholar.google.com/citations?user=QnRpMJAAAAAJ">Mickaël Chen</a>, <a href="https://f-barto.github.io/">Florent Bartoccioni</a>, <a href="https://nerminsamet.github.io/">Nermin Samet</a>, <a href="https://osimeoni.github.io/">Oriane Siméoni</a>, <a href="https://gidariss.github.io/">Spyros Gidaris</a>, <a href="https://tuanhungvu.github.io/">Tuan-Hung Vu</a>, <a href="https://abursuc.github.io/">Andrei Bursuc</a>, <a href="https://scholar.google.com/citations?user=lxWPqWAAAAAJ">Eduardo Valle</a>, <a href="http://imagine.enpc.fr/~marletr/">Renaud Marlet</a>, <a href="https://cord.isir.upmc.fr/">Matthieu Cord</a></h4> <h4 align="center"> [<a href="https://arxiv.org/abs/2406.08113">Paper</a>] &nbsp;&nbsp; [<a href="https://eval.ai/web/challenges/challenge-page/2006/leaderboard/4752">leaderboard</a>] &nbsp;&nbsp; [<a href="https://valeoai.github.io/publications/valeo4cast/">page</a>]</h4> <p>Motion forecasting is crucial in autonomous driving systems to anticipate the future trajectories of surrounding agents such as pedestrians, vehicles, and traffic signals. In end-to-end forecasting, the model must jointly detect from sensor data (cameras or LiDARs) the position and past trajectories of the different elements of the scene and predict their future location. We depart from the current trend of tackling this task via end-to-end training from perception to forecasting and we use a modular approach instead. Following a recent study, we individually build and train detection, tracking, and forecasting modules. We then only use consecutive finetuning steps to integrate the modules better and alleviate compounding errors. Our study reveals that this simple yet effective approach significantly improves performance on the end-to-end forecasting benchmark. Consequently, our solution ranks first in the Argoverse 2 end-to-end Forecasting Challenge held at CVPR 2024 Workshop on Autonomous Driving (WAD), with 63.82 mAPf. We surpass forecasting results by +17.1 points over last year’s winner and by +13.3 points over this year’s runner-up. 
This remarkable performance in forecasting can be explained by our modular paradigm, which integrates finetuning strategies and significantly outperforms the end-to-end-trained counterparts.</p> <p><img src="/assets/img/publications/2024_valeo4cast/valeo4cast.PNG" alt="valeo4cast_overview" height="100%" width="100%"/></p> <div class="caption"><b>Valeo4Cast overview.</b> </div> <hr/> <h2 id="occfeat-self-supervised-occupancy-feature-prediction-for-pretraining-bev-segmentation-networks">OccFeat: Self-supervised Occupancy Feature Prediction for Pretraining BEV Segmentation Networks</h2> <p class="page-description"><a href="https://cvpr2024.wad.vision/">CVPR 2024 Workshop on Autonomous Driving (WAD)</a></p> <h4 id="authors-sophia-sirko-galouchenko-alexandre-boulch--spyros-gidaris--andrei-bursuc---antonin-vobecky---renaud-marlet--patrick-pérez">Authors: Sophia Sirko-Galouchenko, <a href="https://www.boulch.eu/">Alexandre Boulch</a>, <a href="https://gidariss.github.io/&amp;hl=en">Spyros Gidaris</a>, <a href="https://abursuc.github.io/">Andrei Bursuc</a>, <a href="https://vobecant.github.io/">Antonin Vobecky</a>, <a href="http://imagine.enpc.fr/~marletr/">Renaud Marlet</a>, <a href="https://ptrckprz.github.io/">Patrick Pérez</a></h4> <h4 align="center"> [<a href="https://arxiv.org/abs/2404.14027">Paper</a>] &nbsp;&nbsp; [<a href="https://valeoai.github.io/publications/occfeat/">page</a>]</h4> <p>We introduce a self-supervised pretraining method, called OccFeat, for camera-only Bird’s-Eye-View (BEV) segmentation networks. With OccFeat, we pretrain a BEV network via occupancy prediction and feature distillation tasks. Occupancy prediction provides a 3D geometric understanding of the scene to the model. However, the geometry learned is class-agnostic. Hence, we add semantic information to the model in the 3D space through distillation from a self-supervised pretrained image foundation model. Models pretrained with our method exhibit improved BEV semantic segmentation performance, particularly in low-data scenarios. Moreover, empirical results affirm the efficacy of integrating feature distillation with 3D occupancy prediction in our pretraining approach.</p> <p><img src="/assets/img/publications/2024_occfeat/occfeat_teaser.png" alt="occfeat_overview" height="100%" width="100%"/></p> <div class="caption"><b>Overview of OccFeat’s self-supervised BEV pretraining approach.</b> OccFeat attaches an auxiliary pretraining head on top of the BEV network. This head “unsplats” the BEV features to a 3D feature volume and predicts with it (a) the 3D occupancy of the scene (occupancy reconstruction loss) and (b) high-level self-supervised image features characterizing the occupied voxels (occupancy-guided distillation loss). The occupancy targets are produced by “voxelizing” Lidar points, while the self-supervised image foundation model DINOv2 provides the feature targets for the occupied voxels. The pretraining head is removed after the pretraining. </div> <p>The results show the benefit of our pretraining method, especially in low-shot regimes, e.g., when using annotations only for 1% or 10% of nuScene’s training data. Additionally, our OccFeat pretraining improves the robustness, as evaluated on the nuScenes-C benchmark.</p> <p><img src="/assets/img/posts/2024_cvpr/occfeat_results.PNG" alt="occfeat_results" height="100%" width="100%"/></p> <div class="caption"> Performance comparison in low data regime 1% annotated data of nuScenes (Left). Study on robustness. 
Segmentation results on nuScenes-C dataset for Vehicle classes using BEVFormer network with EN-B0 image backbone on 100% annotated data. Comparison of our OccFeat against no BEV pretraining (Right). </div> <hr/> <h2 id="what-makes-multimodal-in-context-learning-work">What Makes Multimodal In-Context Learning Work?</h2> <p class="page-description"><a href="https://prompting-in-vision.github.io/index_cvpr24.html">CVPR 2024 Workshop on Prompting in Vision</a></p> <h4 id="authors-folco-bertini-baldassini--mustafa-shukor-matthieu-cord-laure-soulier--benjamin-piwowarski">Authors: <a href="https://www.folbaeni.com/">Folco Bertini Baldassini</a>, <a href="https://scholar.google.com/citations?user=lhp9mRgAAAAJ&amp;hl=en">Mustafa Shukor</a>, <a href="https://cord.isir.upmc.fr/">Matthieu Cord</a>, <a href="https://pages.isir.upmc.fr/soulier/">Laure Soulier</a>, <a href="https://www.piwowarski.fr/">Benjamin Piwowarski</a></h4> <h4 align="center"> [<a href="https://arxiv.org/abs/2404.15736">Paper</a>] &nbsp;&nbsp; [<a href="https://gitlab.com/folbaeni/multimodal-icl">code</a>] &nbsp;&nbsp; [<a href="https://multimodal-icl-folbaeni-1988a753e0abbbc71bb3967331bb69edafda92e.gitlab.io/">page</a>]</h4> <p>Large Language Models have demonstrated remarkable performance across various tasks, exhibiting the capacity to swiftly acquire new skills, such as through In-Context Learning (ICL) with minimal demonstration examples. In this work, we present a comprehensive framework for investigating Multimodal ICL (M-ICL) in the context of Large Multimodal Models. We consider the best open-source multimodal models (e.g., IDEFICS, OpenFlamingo) and a wide range of multimodal tasks. Our study unveils several noteworthy findings: (1) M-ICL primarily relies on text-driven mechanisms, showing little to no influence from the image modality. (2) When used with advanced-ICL strategy (like RICES), M-ICL is not better than a simple strategy based on majority voting over context examples. Moreover, we identify several biases and limitations of M-ICL that warrant consideration prior to deployment.</p> <p><img src="/assets/img/publications/2024_multimodal_icl/multimodal-icl.PNG" alt="icl_overview" height="60%" width="60%"/></p> <div class="caption"><b>Empirical analysis of Multimodal In-Context Learning (M-ICL) behavior.</b> </div>]]></content><author><name></name></author><category term="3d-perception"/><category term="multi-sensor"/><category term="limited-supervision"/><category term="reliability"/><category term="motion-forecasting"/><category term="driving"/><summary type="html"><![CDATA[The IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR) is a key event for researchers and engineers working on computer vision and machine learning. At the 2024 edition the valeo.ai team will present eight papers in the main conference, two papers in workshops, and one workshop keynote. Also, the team will present its winning solution to the Argoverse 2 “Unified Detection, Tracking and Forecasting” challenge held at the Workshop on Autonomous Driving. The team will be at CVPR to present these works and will be happy to discuss more about these projects and ideas, and share our exciting ongoing research. 
We outline our team papers below.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://valeoai.github.io//assets/img/posts/2024_cvpr/cvpr_banner.PNG"/><media:content medium="image" url="https://valeoai.github.io//assets/img/posts/2024_cvpr/cvpr_banner.PNG" xmlns:media="http://search.yahoo.com/mrss/"/></entry><entry><title type="html">valeo.ai at NeurIPS 2023</title><link href="https://valeoai.github.io//posts/neurips-2023" rel="alternate" type="text/html" title="valeo.ai at NeurIPS 2023"/><published>2023-12-08T00:00:00+00:00</published><updated>2023-12-08T00:00:00+00:00</updated><id>https://valeoai.github.io//posts/valeoai-at-neurips-2023</id><content type="html" xml:base="https://valeoai.github.io//posts/neurips-2023"><![CDATA[<p>The <a href="https://neurips.cc/">Neural Information Processing Systems Conference (NeurIPS)</a> is a major inter-disciplinary event that brings together researchers and practicioners in machine learning, computer vision, natural language processing, optimization, statistics, but also neuroscience, natural sciences, social sciences, etc. This year, at the thirty-seventh edition of NeurIPS, the <a href="../">valeo.ai</a> team will present 4 papers in the main conference and 1 in the workshops.</p> <p>Notably, we explore perception via different sensors, e.g., audio, on the path towards increasingly autonomous systems. We also study the interaction between different sensing modalities (images, language, Lidar point clouds) and advance a tri-modal self-supervised learning algorithm for 3D semantic voxel occupancy prediction from a rig of cameras mounted on a vehicle. We further show how to obtain robust deep models starting from pre-trained foundation models finetuned with reinforcement learning from human feedback. Finally, we analyze different generative models (diffusion models, GANs) and advance a unification framework considering them as instances of Particle Models.</p> <p>We will be happy to discuss more about these projects and ideas, and share our exciting ongoing research. Take a quick view of our papers below and come meet us at the posters or catch us for a coffee in the hallways.</p> <h2 id="resilient-multiple-choice-learning-a-learned-scoring-scheme-with-application-to-audio-scene-analysis">Resilient Multiple Choice Learning: A learned scoring scheme with application to audio scene analysis</h2> <h4 id="authors-victor-letzelter-mathieu-fontaine-mickaël-chen-patrick-pérez-slim-essid-gaël-richard">Authors: Victor Letzelter, Mathieu Fontaine, Mickaël Chen, Patrick Pérez, Slim Essid, Gaël Richard</h4> <h4 align="center"> [<a href="https://arxiv.org/abs/2311.01052">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/Victorletzelter/code-rMCL">Code</a>] &nbsp;&nbsp; [<a href="../publications/rmcl/">Project page</a>]</h4> <p>In this work, we tackle ambiguous machine learning tasks, where single predictions don’t suffice due to the task’s nature or inherent uncertainties. We introduce a robust multi-hypotheses framework that is capable of deterministically offering a range of plausible predictions at inference time. Our experiments on both synthetic data and real-world audio data affirm the potential and versatility of our method. Check out the paper and the code for more details.</p> <p><img src="/assets/img/posts/2023_neurips/training_dynamics.gif" alt="rmcl_overview" height="100%" width="100%"/></p> <p>This problem involves estimating a conditional distribution that is dependent on the input. 
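</p> <p>The starting point of such multi-hypothesis training is the classic winner-takes-all objective, sketched below in a minimal form with placeholder tensors; rMCL builds on it by additionally learning a score for each hypothesis (see the paper for the full formulation).</p> <pre><code class="language-python">
import torch

def winner_takes_all_loss(hypotheses, target):
    """Minimal sketch of the multiple-choice (winner-takes-all) objective.

    hypotheses: (K, D) the K predictions produced by the model for one input.
    target:     (D,)   the observed output. Only the closest hypothesis is penalized.
    """
    errors = ((hypotheses - target) ** 2).sum(dim=-1)  # squared error of each hypothesis
    return errors.min()                                # gradients flow through the winner only
</code></pre> <p>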
The accompanying animation illustrates the early stages in the evolution of our model’s learning process, highlighting how it progressively refines its predictions (represented by shaded blue points) to the actual data distribution (indicated by green points), which varies with the input ‘t’.</p> <hr/> <h2 id="pop-3d-open-vocabulary-3d-occupancy-prediction-from-images">POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images</h2> <h4 id="authors-antonin-vobecky-oriane-siméoni-david-hurych-spyros-gidaris-andrei-bursuc-patrick-pérez-josef-sivic">Authors: Antonin Vobecky, Oriane Siméoni, David Hurych, Spyros Gidaris, Andrei Bursuc, Patrick Pérez, Josef Sivic</h4> <h4 align="center"> [<a href="https://openreview.net/forum?id=eBXM62SqKY">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/vobecant/POP3D">Code</a>] &nbsp;&nbsp; [<a href="https://vobecant.github.io/POP3D">Project page</a>]</h4> <p>POP-3D is an approach to predict open-vocabulary 3D semantic voxel occupancy map from input 2D images to enable 3D grounding, segmentation, and retrieval of free-form language queries.</p> <p><img src="/assets/img/posts/2023_neurips/pop3d-overview.png" alt="pop3d_overview" height="100%" width="100%"/></p> <div class="caption">Given surround-view images on the input, our POP-3D outputs voxel occupancy with 3D-language features, which one can query using text, e.g., to obtain zero-shot semantic segmentation. </div> <p>We design a new model architecture for open-vocabulary 3D semantic occupancy prediction. The architecture consists of a 2D-3D encoder together with occupancy prediction and 3D-language heads. The output is a dense voxel map of 3D grounded language embeddings enabling a range of open-vocabulary tasks. Next, we develop a tri-modal self-supervised learning algorithm that leverages three modalities: (i) images, (ii) language, and (iii) LiDAR point clouds and enables training the proposed architecture using a strong pre-trained vision-language model without the need for any 3D manual language annotations.</p> <p><img src="/assets/img/posts/2023_neurips/pop3d-model.png" alt="pop3d_model" height="100%" width="100%"/></p> <div class="caption">Overview of POP-3D architecture and training approach.</div> <p>Finally, we demonstrate the strengths of the proposed model quantitatively on several open-vocabulary tasks: Zero-shot 3D semantic segmentation using existing datasets; 3D grounding, and retrieval of free-form language queries, using a small dataset that we propose as an extension of nuScenes.</p> <p><img src="/assets/img/posts/2023_neurips/pop3d-qualitative.png" alt="pop3d_example" height="100%" width="100%"/></p> <hr/> <h2 id="rewarded-soups-towards-pareto-optimal-alignment-by-interpolating-weights-fine-tuned-on-diverse-rewards">Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards</h2> <h4 id="authors-alexandre-ramé-guillaume-couairon-mustafa-shukor-corentin-dancette-jean-baptiste-gaya-laure-soulier-matthieu-cord">Authors: Alexandre Ramé, Guillaume Couairon, Mustafa Shukor, Corentin Dancette, Jean-Baptiste Gaya, Laure Soulier, Matthieu Cord</h4> <h4 align="center"> [<a href="https://arxiv.org/abs/2306.04488">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/alexrame/rewardedsoups">Code</a>] &nbsp;&nbsp; [<a href="https://huggingface.co/spaces/alexrame/rewardedsoups">Project page</a>]</h4> <p>Foundation models are first pre-trained on vast unsupervised datasets and then fine-tuned on labeled data. 
Reinforcement learning, notably from human feedback (RLHF), can further align the network with the intended usage. Yet the imperfections in the proxy reward may hinder the training and lead to suboptimal results; the diversity of objectives in real-world tasks and human opinions exacerbate the issue. This paper proposes embracing the heterogeneity of diverse rewards by following a multi-policy strategy. Rather than focusing on a single a priori reward, we aim for Pareto-optimal generalization across the entire space of preferences. To this end, we propose rewarded soup, first specializing multiple networks independently (one for each proxy reward) and then interpolating their weights linearly. This succeeds empirically because we show that the weights remain linearly connected when fine-tuned on diverse rewards from a shared pre-trained initialization. We demonstrate the effectiveness of our approach for text-to-text (summarization, Q&amp;A, helpful assistant, review), text-image (image captioning, text-to-image generation, visual grounding, VQA), and control (locomotion) tasks. We hope to enhance the alignment of deep models, and how they interact with the world in all its diversity.</p> <p><img src="/assets/img/posts/2023_neurips/rewarded-soups.png" alt="rs_overview" height="100%" width="100%"/></p> <div class="caption"><b>Illustration of the different steps of our proposed rewarded soup (RS).</b> After unsupervised pre-training and supervised fine-tuning, we launch $N$ independent RL fine-tunings on the proxy rewards $\{R_i\}^{N}_{i=1}$. Then we combine the trained networks by interpolation in the weight space. The final weights are adapted at test time by selecting the coefficient $\lambda$.</div> <hr/> <h2 id="unifying-gans-and-score-based-diffusion-as-generative-particle-models">Unifying GANs and Score-Based Diffusion as Generative Particle Models</h2> <h4 id="authors-jean-yves-franceschi-mike-gartrell-ludovic-dos-santos-thibaut-issenhuth-emmanuel-de-bézenac-mickaël-chen-alain-rakotomamonjy">Authors: Jean-Yves Franceschi, Mike Gartrell, Ludovic Dos Santos, Thibaut Issenhuth, Emmanuel de Bézenac, Mickaël Chen, Alain Rakotomamonjy</h4> <h4 align="center"> [<a href="https://arxiv.org/abs/2305.16150">Paper</a>] &nbsp;&nbsp; [Code (coming soon)]</h4> <p>By describing the trajectories of GAN outputs during training with particle evolution equations, we propose a unifying framework for GANs and diffusion models. We provide new insights on the role of the generator network and, as a proof of concept validating our theories, we propose methods to train a generator with score-based gradients instead of a discriminator, or to use a discriminator’s gradient flow to generate samples instead of training a generator.</p> <p><img src="/assets/img/posts/2023_neurips/unify-gan.png" alt="unigan_overview" height="70%" width="70%"/></p> <hr/> <h2 id="evaluating-the-structure-of-cognitive-tasks-with-transfer-learning">Evaluating the structure of cognitive tasks with transfer learning</h2> <p class="page-description"><a href="https://ai4sciencecommunity.github.io/neurips23.html">NeurIPS Workshop on AI for Scientific Discovery: From Theory to Practice</a></p> <h4 id="authors-bruno-aristimunha-raphael-y-de-camargo-walter-h-lopez-pinaya-sylvain-chevallier-alexandre-gramfort-cedric-rommel">Authors: Bruno Aristimunha, Raphael Y. de Camargo, Walter H.
Lopez Pinaya, Sylvain Chevallier, Alexandre Gramfort, Cedric Rommel</h4> <h4 align="center"> [<a href="https://cedricrommel.github.io/assets/pdfs/NeurIPS_2023_AI_for_Science_Workshop.pdf">Paper</a>] &nbsp;&nbsp; [Code (coming soon)]</h4> <p>Electroencephalography (EEG) decoding is a challenging task due to the limited availability of labeled data. While transfer learning is a promising technique to address this challenge, it assumes that transferable data domains and tasks are known, which is not the case in this setting. This work investigates the transferability of deep learning representations between different EEG decoding tasks.</p> <p><img src="/assets/img/posts/2023_neurips/eval-cog-tasks.png" alt="cog_overview" height="90%" width="90%"/></p> <div class="caption"><b>Learned transferability maps for both datasets.</b> Each node corresponds to a distinct cognitive task. Arrow width represents the average transfer performance when using the representations learned from a source task to decode a target task.</div> <p>We conduct extensive experiments using state-of-the-art decoding models on two recently released EEG datasets, ERPCore and M3CV, containing over 140 subjects and 11 distinct cognitive tasks.</p> <p>From an EEG processing perspective, our results can be used to leverage related datasets for alleviating EEG data scarcity with transfer learning. We show that even with a linear probing transfer method, we are able to boost by up to 28% the performance of some tasks. From a neuroscientific standpoint, our transfer maps provide insights into the hierarchical relations between cognitive tasks, hence enhancing our understanding of how these tasks are connected. We discover for example evidence that certain decoding paradigms elicit specific and narrow brain activities, while others benefit from pre-training on a broad range of representations.</p>]]></content><author><name></name></author><category term="multi-sensor"/><category term="limited-supervision"/><category term="reliability"/><category term="deep-learning"/><summary type="html"><![CDATA[The Neural Information Processing Systems Conference (NeurIPS) is a major inter-disciplinary event that brings together researchers and practicioners in machine learning, computer vision, natural language processing, optimization, statistics, but also neuroscience, natural sciences, social sciences, etc. This year, at the thirty-seventh edition of NeurIPS, the valeo.ai team will present 4 papers in the main conference and 1 in the workshops.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://valeoai.github.io//assets/img/posts/2023_neurips/logo_neurips.svg"/><media:content medium="image" url="https://valeoai.github.io//assets/img/posts/2023_neurips/logo_neurips.svg" xmlns:media="http://search.yahoo.com/mrss/"/></entry><entry><title type="html">valeo.ai at ICCV 2023</title><link href="https://valeoai.github.io//posts/iccv-2023" rel="alternate" type="text/html" title="valeo.ai at ICCV 2023"/><published>2023-09-26T00:00:00+00:00</published><updated>2023-09-26T00:00:00+00:00</updated><id>https://valeoai.github.io//posts/valeoai-at-iccv-2023</id><content type="html" xml:base="https://valeoai.github.io//posts/iccv-2023"><![CDATA[<p>The <a href="https://iccv2023.thecvf.com/">IEEE / CVF International Conference on Computer Vision (ICCV)</a> is a landmark event for the increasingly large and diverse community of researchers in computer vision and machine learning. 
This year, ICCV takes place in Paris, home of the <a href="../">valeo.ai</a> team. From interns to senior researchers, the valeo.ai team will participate en masse at ICCV and will be looking forward to welcoming you and talking about the exciting progress and ideas in the field.</p> <p>At ICCV 2023 we will present 5 papers in the main conference and 3 in the workshops. We are also organizing 2 workshops with 2 challenges (<a href="https://valeoai.github.io/bravo/">BRAVO</a> and <a href="https://uncv2023.github.io/">UNCV</a>) and a tutorial (<a href="https://abursuc.github.io/many-faces-reliability/">Many Faces of Reliability</a>). Take a quick view of our papers in the conference and come meet us at the posters, at our booth or in the hallway.</p> <h2 id="using-a-waffle-iron-for-automotive-point-cloud-semantic-segmentation">Using a Waffle Iron for Automotive Point Cloud Semantic Segmentation</h2> <h4 id="authors-gilles-puy-alexandre-boulch-renaud-marlet">Authors: Gilles Puy, Alexandre Boulch, Renaud Marlet</h4> <h4 align="center"> [<a href="https://arxiv.org/abs/2301.10100">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/valeoai/WaffleIron">Code</a>] &nbsp;&nbsp; [<a href="../publications/waffleiron/">Project page</a>]</h4> <p>Semantic segmentation of point clouds delivered by lidars permits autonomous vehicles to make sense of their 3D surrounding environment. Sparse convolutions have become a de-facto tool to process these large outdoor point clouds. The top performing methods on public benchmarks, such as SemanticKITTI or nuScenes, all leverage sparse convolutions. Nevertheless, despite their undeniable success and efficiency, these convolutions remain available in a limited number of deep learning frameworks and hardware platforms. In this work, we propose an alternative backbone built with tools broadly available (such as 2D and 1D convolutions) but that still reaches the level of performance of the top methods on automotive datasets.</p> <p>We propose a point-based backbone, called WaffleIron, which is essentially built using standard MLPs and dense 2D convolutions, both readily available in all deep learning frameworks thanks to their wide use in the field of computer vision. The architecture of this backbone is illustrated in the figure below. It is inspired by the recent MLP-Mixer. It takes as input a point cloud with a token associated to each point. All these point tokens are then updated by a sequence of layers, each containing a token-mixing step (made of dense 2D convolutions) and a channel-mixing step (made of a MLP shared across points).</p> <p><img src="/assets/img/posts/2023_iccv/waffleiron.png" alt="waffle_overview" height="70%" width="70%"/></p> <div class="caption">The WaffleIron backbone takes as input point tokens, provided by an embedding layer (not represented), and updates these point representations L times via a point token-mixing layer (containing the WI block) followed by a channel-mixing layer. The WI block consists of a 2D projection along one of the main axes, a feed-forward network (FFN) with two dense channel-wise 2D convolutions with a ReLU activation in the hidden layer, and a simple copy of the 2D features to the 3D points. The channel-mixing layer contains a batch-norm, a MLP shared across each point, and a residual connection. The WaffleIron backbone is free of any point downsampling or upsampling layer, farthest point sampling, nearest neighbor search, or sparse convolution.
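</div> <p>As a rough illustration of this token-mixing idea, here is a toy, self-contained layer in the spirit of WaffleIron, using a single projection plane, an arbitrary grid size and a hypothetical precomputed cell index per point; it is a sketch for intuition only, not the released WaffleIron code.</p> <pre><code class="language-python">
import torch
import torch.nn as nn

class ToyWaffleLayer(nn.Module):
    """Toy sketch: mix point tokens through a dense 2D plane with standard 2D
    convolutions, then mix channels with an MLP shared across points."""

    def __init__(self, channels, grid=64):
        super().__init__()
        self.grid = grid
        self.token_mix = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.channel_mix = nn.Sequential(nn.BatchNorm1d(channels), nn.Linear(channels, channels))

    def forward(self, feats, cell_idx):
        # feats: (N, C) point tokens; cell_idx: (N,) flat index of the 2D cell each point
        # projects to (hypothetical input, precomputed from the point coordinates).
        c = feats.shape[1]
        plane = feats.new_zeros(self.grid * self.grid, c)
        plane.index_add_(0, cell_idx, feats)                  # scatter point tokens to the 2D grid
        plane = plane.t().reshape(1, c, self.grid, self.grid)
        plane = self.token_mix(plane).reshape(c, -1).t()      # dense 2D token mixing
        feats = feats + plane[cell_idx]                       # copy 2D features back to the 3D points
        return feats + self.channel_mix(feats)                # shared channel-mixing MLP + residual
</code></pre> <div class="caption">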
</div> <p>WaffleIron has three main hyperparameters to tune: the depth L, the width F and the resolution of the 2D grid. We show that these parameters are easy to tune: the performance increases with the network width F and depth L, until an eventual saturation; we observe stable results over a wide range of values for the resolution of the 2D grid.</p> <p>In our paper, we also provide many details on how to train WaffleIron to reach the performance of top-entries on two autonomous driving benchmarks: SemanticKITTI and nuScenes.</p> <hr/> <h2 id="pøda-prompt-driven-zero-shot-domain-adaptation">PØDA: Prompt-driven Zero-shot Domain Adaptation</h2> <h4 id="authors-mohammad-fahes-tuan-hung-vu-andrei-bursuc-patrick-pérez-raoul-de-charette">Authors: Mohammad Fahes, Tuan-Hung Vu, Andrei Bursuc, Patrick Pérez, Raoul de Charette</h4> <h4 align="center"> [<a href="https://arxiv.org/abs/2212.03241">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/astra-vision/PODA">Code</a>] &nbsp;&nbsp; [<a href="https://www.youtube.com/watch?v=kataxQoPuSE">Video</a>] &nbsp;&nbsp; [<a href="https://astra-vision.github.io/PODA/">Project page</a>]</h4> <p>Domain adaptation has been vastly investigated in computer vision but still requires access to target images at train time, which might be intractable in some uncommon conditions. In this paper, we propose the task of ‘Prompt-driven Zero-shot Domain Adaptation’, where we adapt a model trained on a source domain using only a general description in natural language of the target domain, i.e., a prompt. First, we leverage a pre-trained contrastive vision-language model (CLIP) to optimize affine transformations of source features, steering them towards the target text embedding while preserving their content and semantics. To achieve this, we propose Prompt-driven Instance Normalization (PIN). Second, we show that these prompt-driven augmentations can be used to perform zero-shot domain adaptation for semantic segmentation. Experiments demonstrate that our method significantly outperforms CLIP-based style transfer baselines on several datasets for the downstream task at hand, even surpassing one-shot unsupervised domain adaptation. A similar boost is observed on object detection and image classification</p> <p><img src="/assets/img/posts/2023_iccv/poda.png" alt="poda_overview" height="80%" width="80%"/></p> <div class="caption">We perform zero-shot adaptation with natural language prompts. PØDA enables the adaptation of a segmenter model (here, DeepLabv3+ trained on the source dataset Cityscapes) to unseen conditions with only a prompt. Source-only predictions are shown as smaller segmentation masks to the left or right of the test images. </div> <hr/> <h2 id="you-never-get-a-second-chance-to-make-a-good-first-impression-seeding-active-learning-for-3d-semantic-segmentation">You Never Get a Second Chance To Make a Good First Impression: Seeding Active Learning for 3D Semantic Segmentation</h2> <h4 id="authors-nermin-samet-oriane-siméoni-gilles-puy-georgy-ponimatkin-renaud-marlet-vincent-lepetit">Authors: Nermin Samet, Oriane Siméoni, Gilles Puy, Georgy Ponimatkin, Renaud Marlet, Vincent Lepetit</h4> <h4 align="center"> [<a href="https://arxiv.org/abs/2304.11762">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/nerminsamet/seedal">Code</a>]</h4> <p>We are interested in the efficient annotation of sparse 3D point clouds (as captured indoors by depth cameras or outdoors by automotive lidars) for semantic segmentation. 
Active Learning (AL) iteratively selects relevant data fractions to annotate within a given budget, but requires a first fraction of the dataset (a ’seed’) to be already annotated to estimate the benefit of annotating other data fractions. We show that the choice of the seed can significantly affect the performance of many AL methods and propose a method, named SeedAL, for automatically constructing a seed that will ensure good performance for AL. Assuming that images of the point clouds are available, which is common, our method relies on powerful unsupervised image features to measure the diversity of the point clouds. It selects the point clouds for the seed by optimizing the diversity under an annotation budget, which can be done by solving a linear optimization problem. Our experiments demonstrate the effectiveness of our approach compared to random seeding and existing methods on both the S3DIS and SemanticKitti datasets.</p> <p><img src="/assets/img/posts/2023_iccv/seedal.png" alt="seedal_overview" height="70%" width="70%"/></p> <div class="caption"><b>Impact of active learning seed on performance. </b>We show the variability of results obtained with 20 different random seeds (blue dashed lines), within an initial annotation budget of 3% of the dataset, when using various active learning methods for 3D semantic segmentation of S3DIS. We compare it to the result obtained with our seed selection strategy (solid red line), named SeedAL, which performs better or on par with the best (lucky) random seeds among 20, and “protects” from very bad (unlucky) random seeds.</div> <hr/> <h2 id="ep-alm-efficient-perceptual-augmentation-of-language-models">eP-ALM: Efficient Perceptual Augmentation of Language Models</h2> <h4 id="authors-mustafa-shukor-corentin-dancette-matthieu-cord">Authors: Mustafa Shukor, Corentin Dancette, Matthieu Cord</h4> <h4 align="center"> [<a href="https://arxiv.org/abs/2303.11403">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/mshukor/eP-ALM">Code</a>] &nbsp;&nbsp; [<a href="https://mshukor.github.io/eP-ALM.github.io/">Project page</a>]</h4> <p>eP-ALM aims to augment large language models (LLMs) with perception. While most existing approaches train a large number of parameters and rely on extensive multimodal pre-training, we investigate the minimal computational effort required to adapt unimodal models to multimodal tasks. We show that by freezing more than 99% of total parameters, training only one linear projection layer and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and captioning for image, video and audio modalities.</p> <p><img src="/assets/img/posts/2023_iccv/ep-alm.png" alt="epalm_overview" height="70%" width="70%"/></p> <div class="caption"><b>Illustration of the adaptation mechanism in eP-ALM.</b> The perceptual input (image/video/audio) is fed to the perceptual encoder E (e.g., ViT) and the corresponding text to the LM (e.g., OPT), which then generates a text conditioned on the perceptual input. The multimodal interaction is done via the [CLS] tokens acting as Perceptual Prompt, and are extracted from the last layers of the encoder, then injected in the last layers of LM, after passing by the Linear Connection C. The previous [CLS] token is replaced by the new one coming from a deeper layer, keeping the number of tokens fixed. The first layers (grayed) of each model are kept intact without any modality interaction. 
<hr/> <h2 id="zero-shot-spatial-layout-conditioning-for-text-to-image-diffusion-models">Zero-shot spatial layout conditioning for text-to-image diffusion models</h2> <h4 id="authors-guillaume-couairon-marlène-careil-matthieu-cord-stéphane-lathuilière-jakob-verbeek">Authors: Guillaume Couairon, Marlène Careil, Matthieu Cord, Stéphane Lathuilière, Jakob Verbeek</h4> <h4 align="center"> [<a href="https://arxiv.org/abs/2306.13754">Paper</a>]</h4> <p>Large-scale text-to-image diffusion models have considerably improved the state of the art in generative image modeling, and provide an intuitive and powerful user interface to drive the image generation process. In this paper, we propose ZestGuide, a “zero-shot” segmentation guidance approach that can be integrated into pre-trained text-to-image diffusion models and requires no additional training. It exploits the implicit segmentation maps that can be extracted from cross-attention layers, and uses them to align generation with input masks.</p> <p><img src="/assets/img/posts/2023_iccv/zest-guide.png" alt="zest_overview" height="70%" width="70%"/></p> <div class="caption">ZestGuide generates images conditioned on segmentation maps with corresponding free-form textual descriptions. </div> <hr/> <h2 id="diffhpe-robust-coherent-3d-human-pose-lifting-with-diffusion">DiffHPE: Robust, Coherent 3D Human Pose Lifting with Diffusion</h2> <p class="page-description"><a href="https://web.northeastern.edu/smilelab/amfg2023/">ICCV Workshop on Analysis and Modeling of Faces and Gestures</a></p> <h4 id="authors-cédric-rommel-eduardo-valle-mickaël-chen-souhaiel-khalfaoui-renaud-marlet-matthieu-cord-patrick-pérez">Authors: Cédric Rommel, Eduardo Valle, Mickaël Chen, Souhaiel Khalfaoui, Renaud Marlet, Matthieu Cord, Patrick Pérez</h4> <h4 align="center"> [<a href="https://arxiv.org/abs/2309.01575">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/valeoai/diffhpe">Code</a>] &nbsp;&nbsp; [<a href="../publications/diffhpe">Project page</a>]</h4> <p>Diffusion models are making waves across various domains, including computer vision, natural language processing and time-series analysis. However, their application to purely predictive tasks, such as 3D human pose estimation (3D-HPE), remains largely unexplored. While a few pioneering works have shown promising performance metrics in 3D-HPE, the understanding of the benefits of diffusion models over classical supervision — as well as key design choices — is still in its infancy. In this work, we address those concerns, providing an in-depth analysis of the effects of diffusion models on 3D-HPE.</p> <p><img src="/assets/img/posts/2023_iccv/diffhpe.gif" alt="diffhpe_overview" height="100%" width="100%"/></p> <div class="caption">Poses across the learned reverse diffusion process converge to an accurate 3D reconstruction of the corresponding 2D pose in pixel space.</div> <p>More precisely, we propose DiffHPE, a novel strategy to use diffusion models in 3D-HPE, and show that combining diffusion with pre-trained supervised models makes it possible to outperform both pure diffusion and pure supervised models trained separately. Our analysis demonstrates not only that the diffusion framework can be used to enhance accuracy, as previously understood, but also that it can improve robustness and coherence.
Namely, our experiments showcase how poses estimated with diffusion models display better bilateral and temporal coherence, and are more robust to occlusions, even when not perfectly trained for the latter.</p> <hr/> <h2 id="challenges-of-using-real-world-sensory-inputs-for-motion-forecasting-in-autonomous-driving">Challenges of Using Real-World Sensory Inputs for Motion Forecasting in Autonomous Driving</h2> <p class="page-description"><a href="https://sites.google.com/view/road-plus-plus">ROAD++: The Second Workshop and Challenge on Event Detection for Situation Awareness in Autonomous Driving</a></p> <h4 id="authors-yihong-xu-loïck-chambon-éloi-zablocki-mickaël-chen-matthieu-cord-patrick-pérez">Authors: Yihong Xu, Loïck Chambon, Éloi Zablocki, Mickaël Chen, Matthieu Cord, Patrick Pérez</h4> <h4 align="center"> [<a href="https://arxiv.org/abs/2306.09281">Paper</a>] &nbsp;&nbsp; [<a href="../publications/real-world-forecasting/">Project page</a>]</h4> <p>Motion forecasting is crucial in enabling autonomous vehicles to anticipate the future trajectories of surrounding agents. To do so, it requires solving mapping, detection, tracking, and then forecasting problems, in a multi-step pipeline. In this complex system, advances in conventional forecasting methods have been made using curated data, i.e., with the assumption of perfect maps, detection, and tracking. This paradigm, however, ignores any errors from upstream modules. Meanwhile, an emerging end-to-end paradigm, which tightly integrates the perception and forecasting architectures into joint training, promises to solve this issue. So far, however, the evaluation protocols of the two approaches were incompatible and their comparison was not possible. In fact, and perhaps surprisingly, conventional forecasting methods are usually neither trained nor tested in real-world pipelines (e.g., with upstream detection, tracking, and mapping modules). In this work, we aim to bring forecasting models closer to real-world deployment. First, we propose a unified evaluation pipeline for forecasting methods with real-world perception inputs, allowing us to compare the performance of conventional and end-to-end methods for the first time. Second, our in-depth study uncovers a substantial performance gap when transitioning from curated to perception-based data. In particular, we show that this gap (1) stems not only from differences in precision but also from the nature of imperfect inputs provided by perception modules, and that (2) is not trivially reduced by simply finetuning on perception outputs. Based on extensive experiments, we provide recommendations for critical areas that require improvement and guidance towards more robust motion forecasting in the real world. We will release an evaluation library to benchmark models under standardized and practical conditions.</p> <p><img src="/assets/img/posts/2023_iccv/e2e_forecasting.png" alt="forecast_overview" height="90%" width="90%"/></p> <div class="caption"><b>Study overview.</b> We study the challenges of deploying motion forecasting models into the real world when only predicted perception inputs are available. We compare: (1) (top) "conventional methods" (i.e., methods trained on curated input data) where (middle) we directly replace the curated inputs with real-world data, and (2) (bottom) "end-to-end methods" that are trained and used with perception modules. In the real-world setting, evaluation is challenging as the past tracks are estimated with arbitrary identities, making it difficult to establish a direct correspondence to GT identities. Therefore, we propose a matching process (purple) to assign predictions to GT and thus evaluate forecasting performance. Moreover, we study in depth the impact on motion forecasting of switching from curated (green) to real-world (orange) inputs, i.e., of mapping, detection, and tracking errors. </div>
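<p>As an illustration of the kind of assignment involved (the exact matching criterion is detailed in the paper), the sketch below matches predicted agents to ground-truth agents with the Hungarian algorithm on center distances, using hypothetical toy data.</p>
<pre><code class="language-python">
# Illustrative assignment of predictions to ground truth (not the paper's exact
# criterion): predicted agents are matched to GT agents with the Hungarian
# algorithm on the distance between their current centers, on toy data.
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
pred_centers = rng.uniform(-50.0, 50.0, size=(8, 2))               # detected/tracked agents
gt_centers = pred_centers[:6] + rng.normal(0.0, 0.5, size=(6, 2))  # noisy GT counterparts

# Pairwise cost: Euclidean distance between predicted and GT centers.
cost = np.linalg.norm(pred_centers[:, None, :] - gt_centers[None, :, :], axis=-1)
pred_idx, gt_idx = linear_sum_assignment(cost)

gate = 2.0                                                          # gating distance in meters
matches = []
for p, g in zip(pred_idx, gt_idx):
    if gate > cost[p, g]:                                           # discard far-away pairs
        matches.append((int(p), int(g)))
print(matches)        # forecasting metrics are then computed on the matched pairs only
</code></pre>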
<hr/> <h2 id="pop-3d-open-vocabulary-3d-occupancy-prediction-from-images">POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images</h2> <p class="page-description"><a href="https://opensun3d.github.io/">ICCV 2023 Workshop on Open-Vocabulary 3D Scene Understanding (OpenSUN 3D)</a></p> <h4 id="authors-antonin-vobecky-oriane-siméoni-david-hurych-spyros-gidaris-andrei-bursuc-patrick-pérez-josef-sivic">Authors: Antonin Vobecky, Oriane Siméoni, David Hurych, Spyros Gidaris, Andrei Bursuc, Patrick Pérez, Josef Sivic</h4> <h4 align="center"> [<a href="https://data.ciirc.cvut.cz/public/projects/2023POP3D/resources/pop3d_paper.pdf">Paper</a>]</h4> <p>We propose an approach to predict a 3D semantic voxel occupancy map from input 2D images with features allowing 3D grounding, segmentation and retrieval of free-form language queries. To this end, we design a new architecture that consists of a 2D-3D encoder together with occupancy prediction and 3D-language heads, and we develop a tri-modal self-supervised training that leverages three modalities – images, language and LiDAR point clouds – and enables learning the proposed architecture using a strong pre-trained vision-language model without the need for any 3D manual annotations. We quantitatively evaluate the proposed model on the task of zero-shot 3D semantic segmentation using existing datasets and show results on the tasks of 3D grounding and retrieval of free-form language queries.</p> <p><img src="/assets/img/posts/2023_iccv/pop3d.png" alt="pop3d_overview" height="100%" width="100%"/></p> <div class="caption"><b>Method overview.</b> Given surround-view images, POP-3D produces a voxel grid of text-aligned features that support open-vocabulary downstream tasks such as zero-shot occupancy segmentation or text-based grounding and retrieval. </div>]]></content><author><name></name></author><category term="3d-perception"/><category term="multi-sensor"/><category term="limited-supervision"/><category term="reliability"/><category term="domain-adaptation"/><summary type="html"><![CDATA[The IEEE / CVF International Conference on Computer Vision (ICCV) is a landmark event for the increasingly large and diverse community of researchers in computer vision and machine learning. This year, ICCV takes place in Paris, home of the valeo.ai team.
From interns to senior researchers, the valeo.ai team will participate en masse at ICCV and will be looking forward to welcoming you and talking about the exciting progress and ideas in the field.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://valeoai.github.io//assets/img/posts/2023_iccv/iccv_logo.svg"/><media:content medium="image" url="https://valeoai.github.io//assets/img/posts/2023_iccv/iccv_logo.svg" xmlns:media="http://search.yahoo.com/mrss/"/></entry><entry><title type="html">valeo.ai at CVPR 2023</title><link href="https://valeoai.github.io//posts/cvpr-2023" rel="alternate" type="text/html" title="valeo.ai at CVPR 2023"/><published>2023-06-14T00:00:00+00:00</published><updated>2023-06-14T00:00:00+00:00</updated><id>https://valeoai.github.io//posts/valeoai-at-cvpr-2023</id><content type="html" xml:base="https://valeoai.github.io//posts/cvpr-2023"><![CDATA[<p>The <a href="https://cvpr2023.thecvf.com/">IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR)</a> is a key event for researchers and engineers working on computer vision and machine learning. At the 2023 edition, the <a href="../">valeo.ai</a> team will present six <a href="../publications/">papers</a> in the main conference, give one workshop <a href="https://vision4allseason.net/">keynote</a> and organize a <a href="https://osimeoni.github.io/object-localization-for-free/">tutorial</a>. The team will be at CVPR to present these works and will be happy to discuss more about these projects and ideas, and share our exciting ongoing research. We outline four of our team papers below.</p> <h2 id="octet-object-aware-counterfactual-explanations">OCTET: Object-aware Counterfactual Explanations</h2> <h4 id="authors-mehdi-zemni-mickaël-chen-éloi-zablocki-hédi-ben-younes-patrick-pérez-matthieu-cord">Authors: Mehdi Zemni, <a href="https://scholar.google.com/citations?user=QnRpMJAAAAAJ&amp;hl=fr&amp;oi=sra">Mickaël Chen</a>, <a href="https://scholar.google.com/citations?user=dOkbUmEAAAAJ&amp;hl=fr">Éloi Zablocki</a>, <a href="https://scholar.google.com/citations?hl=fr&amp;user=IFLcfvUAAAAJ">Hédi Ben-Younes</a>, <a href="https://ptrckprz.github.io/">Patrick Pérez</a>, <a href="https://cord.isir.upmc.fr/">Matthieu Cord</a></h4> <h4 align="center"> [<a href="https://arxiv.org/abs/2211.12380">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/valeoai/octet">Code</a>] &nbsp;&nbsp; [<a href="https://www.youtube.com/watch?v=Xfq0uRcw9jQ">Video</a>] &nbsp;&nbsp; [<a href="../publications/octet/">Project page</a>]</h4> <p>Nowadays, deep vision models are being widely deployed in safety-critical applications, e.g., autonomous driving, and explainability of such models is becoming a pressing concern. Among explanation methods, counterfactual explanations aim to find minimal and interpretable changes to the input image that would also change the output of the model to be explained. Such explanations point end-users at the main factors that impact the decision of the model. However, previous methods struggle to explain decision models trained on images with many objects, e.g., urban scenes, which are more difficult to work with but also arguably more critical to explain. In this work, we propose to tackle this issue with an object-centric framework for counterfactual explanation generation. Our method, inspired by recent generative modeling works, encodes the query image into a latent space that is structured in a way to ease object-level manipulations. Doing so, it provides the end-user with control over which search directions (e.g., spatial displacement of objects, style modification, etc.) are to be explored during the counterfactual generation. We conduct a set of experiments on counterfactual explanation benchmarks for driving scenes, and we show that our method can be adapted beyond classification, e.g., to explain semantic segmentation models. To complete our analysis, we design and run a user study that measures the usefulness of counterfactual explanations in understanding a decision model.</p> <p><img src="/assets/img/posts/2023_cvpr/octet.png" alt="octet_overview" height="80%" width="80%"/></p> <div class="caption"><b>Counterfactual explanations generated by OCTET.</b> Given a classifier that predicts whether or not it is possible to go left, and a query image (top left), OCTET produces a counterfactual explanation where the most influential features that led to the decision are changed (top right). On the bottom row, we show that OCTET can also operate under different settings that result in differently focused explanations. We report the prediction made by the decision model at the top left of each image. </div>
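<p>The sketch below illustrates the general counterfactual principle only, with toy stand-ins for the generative backbone and the decision model; OCTET itself relies on an object-centric latent space that additionally makes object-level edits possible.</p>
<pre><code class="language-python">
# Generic latent-space counterfactual sketch (PyTorch), not the OCTET code: a latent
# code is optimized to flip the decision of a frozen classifier while staying close
# to the latent code of the query image. Generator and classifier are toy stand-ins.
import torch

torch.manual_seed(0)
generator = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.Tanh(),
                                torch.nn.Linear(64, 3 * 16 * 16))
classifier = torch.nn.Sequential(torch.nn.Linear(3 * 16 * 16, 2))
for module in (generator, classifier):
    for p in module.parameters():
        p.requires_grad_(False)

z_query = torch.randn(1, 32)                       # latent code of the query image
target_class = torch.tensor([1])                    # decision we want the model to flip to

z_cf = z_query.clone().requires_grad_(True)
opt = torch.optim.Adam([z_cf], lr=0.05)
for step in range(200):
    logits = classifier(generator(z_cf))
    flip_loss = torch.nn.functional.cross_entropy(logits, target_class)
    proximity = (z_cf - z_query).abs().mean()       # stay close to the query (minimal change)
    loss = flip_loss + 0.5 * proximity
    opt.zero_grad()
    loss.backward()
    opt.step()

counterfactual_image = generator(z_cf).detach()      # decode the edited latent code
</code></pre>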
<hr/> <h2 id="also-automotive-lidar-self-supervision-by-occupancy-estimation">ALSO: Automotive Lidar Self-supervision by Occupancy estimation</h2> <h4 id="authors-alexandre-boulch-corentin-sautier-björn-michele-gilles-puy-renaud-marlet">Authors: <a href="https://www.boulch.eu/">Alexandre Boulch</a>, <a href="https://scholar.google.com/citations?user=xYDkHEsAAAAJ">Corentin Sautier</a>, <a href="https://scholar.google.com/citations?user=xQcKnXkAAAAJ&amp;hl=en">Björn Michele</a>, <a href="https://sites.google.com/site/puygilles/home">Gilles Puy</a>, <a href="http://imagine.enpc.fr/~marletr/">Renaud Marlet</a></h4> <h4 align="center"> [<a href="https://arxiv.org/abs/2212.05867">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/valeoai/ALSO">Code</a>] &nbsp;&nbsp; [<a href="https://www.youtube.com/watch?v=GGIBKlMvphw">Video</a>] &nbsp;&nbsp; [<a href="../publications/also/">Project page</a>]</h4> <p>We propose a new self-supervised method for pre-training the backbone of deep perception models operating on point clouds. The core idea is to train the model on a pretext task which is the reconstruction of the surface on which the 3D points are sampled, and to use the underlying latent vectors as input to the perception head. The intuition is that if the network is able to reconstruct the scene surface, given only sparse input points, then it probably also captures some fragments of semantic information that can be used to boost an actual perception task. This principle has a very simple formulation, which makes it both easy to implement and widely applicable to a large range of 3D sensors and deep networks performing semantic segmentation or object detection. In fact, it supports a single-stream pipeline, as opposed to most contrastive learning approaches, allowing training on limited resources. We conducted extensive experiments on various autonomous driving datasets, involving very different kinds of lidars, for both semantic segmentation and object detection. The results show the effectiveness of our method in learning useful representations without any annotation, compared to existing approaches.</p> <p><img src="/assets/img/posts/2023_cvpr/also.png" alt="also_overview" height="100%" width="100%"/></p> <div class="caption"><b>ALSO overview.</b> The backbone to pre-train produces latent vectors for each input point. At pre-training time, the latent vectors are fed into a volumetric occupancy head that classifies query points as full or empty. At semantic training or test time, the same latent vectors are fed into a semantic head, e.g., for semantic segmentation or object detection. </div>
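<p>Here is a deliberately simplified PyTorch sketch of such an occupancy pretext task; it is not the ALSO code. A toy per-point backbone is pre-trained so that a small occupancy head can classify query points, sampled along the sensor rays around each measured point, as empty or full.</p>
<pre><code class="language-python">
# Minimal sketch of an occupancy-style pretext task (PyTorch), not the ALSO code.
# A toy per-point "backbone" produces latent vectors; query points sampled in front
# of and behind each measured point along its ray are classified as empty or full
# by a small occupancy head, trained with binary cross-entropy.
import torch

torch.manual_seed(0)
n_points = 1024
points = torch.randn(n_points, 3)                        # lidar points (stand-in)
backbone = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.ReLU(),
                               torch.nn.Linear(64, 64))   # stand-in point backbone
occupancy_head = torch.nn.Sequential(torch.nn.Linear(64 + 3, 32), torch.nn.ReLU(),
                                     torch.nn.Linear(32, 1))

# Query points: just before the measured point along the ray (empty, label 0)
# and just behind it (full, label 1), assuming rays cast from a sensor at the origin.
directions = torch.nn.functional.normalize(points, dim=-1)
queries = torch.cat([points - 0.3 * directions, points + 0.3 * directions], dim=0)
labels = torch.cat([torch.zeros(n_points, 1), torch.ones(n_points, 1)], dim=0)

opt = torch.optim.Adam(list(backbone.parameters()) + list(occupancy_head.parameters()), lr=1e-3)
for step in range(100):
    latents = backbone(points)                            # one latent vector per input point
    # Attach each query to its source point's latent, with the query offset as extra input.
    latents_for_queries = torch.cat([latents, latents], dim=0)
    offsets = queries - torch.cat([points, points], dim=0)
    logits = occupancy_head(torch.cat([latents_for_queries, offsets], dim=-1))
    loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
# After pre-training, the backbone is kept and the occupancy head is discarded.
</code></pre>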
<hr/> <h2 id="unsupervised-object-localization-observing-the-background-to-discover-objects">Unsupervised Object Localization: Observing the Background to Discover Objects</h2> <h4 id="authors-oriane-siméoni-chloé-sekkat-gilles-puy-antonin-vobecky-éloi-zablocki-patrick-pérez">Authors: <a href="https://osimeoni.github.io/">Oriane Siméoni</a>, <a href="https://github.com/chloeskt">Chloé Sekkat</a>, <a href="https://sites.google.com/site/puygilles/home">Gilles Puy</a>, <a href="https://vobecant.github.io/">Antonin Vobecky</a>, <a href="https://scholar.google.com/citations?user=dOkbUmEAAAAJ&amp;hl=fr">Éloi Zablocki</a>, <a href="https://ptrckprz.github.io/">Patrick Pérez</a></h4> <h4 align="center"> [<a href="https://arxiv.org/abs/2212.07834">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/valeoai/FOUND">Code</a>] &nbsp;&nbsp; [<a href="https://youtu.be/jfYQfFcrJBE">Video</a>] &nbsp;&nbsp; [<a href="../publications/found">Project page</a>]</h4> <p>Recent advances in self-supervised visual representation learning have paved the way for unsupervised methods tackling tasks such as object discovery and instance segmentation. However, discovering objects in an image with no supervision is a very hard task; what are the desired objects, when to separate them into parts, how many are there, and of what classes? The answers to these questions depend on the tasks and datasets of evaluation. In this work, we take a different approach and propose to look for the background instead. This way, the salient objects emerge as a by-product without any strong assumption on what an object should be. We propose FOUND, a simple model made of a single conv1 × 1 initialized with coarse background masks extracted from self-supervised patch-based representations. After fast training and refining these seed masks, the model reaches state-of-the-art results on unsupervised saliency detection and object discovery benchmarks. Moreover, we show that our approach yields good results in the unsupervised semantic segmentation retrieval task.</p> <p><img src="/assets/img/posts/2023_cvpr/found.png" alt="found_overview" height="65%" width="65%"/></p> <div class="caption"><b>Overview of FOUND. </b>In the first stage (green upper part), a background mask is discovered by mining a seed patch through a reweighting of the self-attention maps of frozen self-supervised DINO features. This seed is then used to find similar patches likely belonging to the background. In the second stage (blue lower part), we train a lightweight 1 × 1 convolutional layer that produces refined masks from the self-supervised features. It is trained in a self-supervised fashion to predict both the smoothed inverse of the coarse masks from the first step and a smoothed, binarized version of its own output. Blue arrows denote where the gradients flow (in the reverse direction).</div>
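<p>A minimal sketch of the second-stage training is given below, with a random tensor standing in for the frozen DINO patch features and a random binary mask standing in for the mined coarse background; the additional self-training term on the binarized output is omitted.</p>
<pre><code class="language-python">
# Minimal FOUND-style sketch (PyTorch), not the released code. Frozen DINO patch
# features are replaced by a random tensor; a single 1x1 convolution is trained to
# predict the inverse of a coarse background mask (here random). The self-training
# term on smoothed, binarized outputs described above is omitted.
import torch

torch.manual_seed(0)
feats = torch.randn(1, 384, 28, 28)                    # stand-in for frozen DINO patch features
coarse_background = torch.rand(1, 1, 28, 28).round()   # stand-in for the mined coarse mask

head = torch.nn.Conv2d(384, 1, kernel_size=1)          # the whole trainable model: one 1x1 conv
opt = torch.optim.Adam(head.parameters(), lr=1e-3)

for step in range(200):
    logits = head(feats)
    # Foreground/saliency is the inverse of the background mask.
    loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, 1.0 - coarse_background)
    opt.zero_grad()
    loss.backward()
    opt.step()

saliency = torch.sigmoid(head(feats)).round()           # refined binary object mask
</code></pre>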
<hr/> <h2 id="rangevit-towards-vision-transformers-for-3d-semantic-segmentation-in-autonomous-driving">RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in Autonomous Driving</h2> <h4 id="authors-angelika-ando-spyros-gidaris-andrei-bursuc-gilles-puy-alexandre-boulch-renaud-marlet">Authors: Angelika Ando, <a href="https://gidariss.github.io/&amp;hl=en">Spyros Gidaris</a>, <a href="https://abursuc.github.io/">Andrei Bursuc</a>, <a href="https://sites.google.com/site/puygilles/home">Gilles Puy</a>, <a href="https://www.boulch.eu/">Alexandre Boulch</a>, <a href="http://imagine.enpc.fr/~marletr/">Renaud Marlet</a></h4> <h4 align="center"> [<a href="https://arxiv.org/abs/2301.10222">Paper</a>] &nbsp;&nbsp; [<a href="https://github.com/valeoai/rangevit">Code</a>] &nbsp;&nbsp; [<a href="https://www.youtube.com/watch?v=urd2ZIJ70WY">Video</a>] &nbsp;&nbsp; [<a href="../publications/rangevit/">Project page</a>]</h4> <p>Semantic segmentation of LiDAR point clouds permits vehicles to perceive their surrounding 3D environment independently of the lighting conditions, providing useful information to build safe and reliable vehicles. A common approach to segment large-scale LiDAR point clouds is to project the points onto a 2D surface and then to use regular CNNs, originally designed for images, to process the projected point clouds. These projection-based methods usually benefit from fast computations and, when combined with techniques which use other point cloud representations, achieve state-of-the-art results.</p> <p>Today, projection-based methods leverage 2D CNNs but recent advances in computer vision show that vision transformers (ViTs) have achieved state-of-the-art results in many image-based benchmarks. Despite the absence of almost any domain-specific inductive bias apart from the image tokenization process, ViTs have a strong representation learning capacity and achieve excellent results on various image perception tasks, such as image classification, object detection or semantic segmentation. Inspired by this success of ViTs for image understanding, in this work, we show that projection-based methods for 3D semantic segmentation can benefit from these latest improvements on ViTs when combined with three key ingredients, all described in our <a href="https://arxiv.org/abs/2301.10222">paper</a>.</p> <p><img src="/assets/img/posts/2023_cvpr/rangevit-teaser.png" alt="rangevit_teaser" height="75%" width="75%"/></p> <div class="caption"><b>Exploiting vision transformer (ViT) architectures and weights for LiDAR point cloud semantic segmentation.</b> We leverage the flexibility of transformer-based architectures to re-purpose them with minimal changes for processing sparse point clouds in autonomous driving tasks. Using the same ViT backbone across modalities makes it possible to transfer weights pre-trained on large image repositories and, with fine-tuning, improve point cloud segmentation performance. </div> <p>We answer positively, but only after combining them with three key ingredients: (a) ViTs are notoriously hard to train and require a lot of training data to learn powerful representations. By preserving the same backbone architecture as for RGB images, we can exploit the knowledge from long training on large image collections that are much cheaper to acquire and annotate than point clouds. We reach our best results with pre-trained ViTs on large image datasets.
(b) We compensate for ViTs’ lack of inductive bias by substituting a tailored non-linear convolutional stem for the classical linear embedding layer. (c) We refine pixel-wise predictions with a convolutional decoder and a skip connection from the convolutional stem to combine low-level but fine-grained features of the convolutional stem with the high-level but coarse predictions of the ViT encoder. With these ingredients, we show that our method, called RangeViT, outperforms prior projection-based methods on nuScenes and SemanticKITTI.</p> <p><img src="/assets/img/posts/2023_cvpr/rangevit-overview.png" alt="rangevit_overview" height="100%" width="100%"/></p> <div class="caption"><b> Overview of RangeViT architecture.</b> First, the point cloud is projected in a 2D space with range projection. Then, the produced range image is processed by the convolutional stem, the ViT encoder and the decoder to obtain a 2D feature map. It is then processed by a 3D refiner layer for 3D point-wise predictions. Note that there is a single skip connection between the convolutional stem and the decoder. </div> <p>In summary, our work offers the following contributions:</p> <ul> <li>Exploiting the powerful representation learning capacity of vision transformers for LiDAR semantic segmentation.</li> <li>Unifying the network architectures for processing LiDAR point clouds and images, enabling advancements in one domain to benefit both.</li> <li>Demonstrating the utilization of pre-trained ViTs on large-scale natural image datasets for LiDAR point cloud segmentation.</li> </ul> <p>We believe that this finding is highly intriguing. The RangeViT approach can leverage off-the-shelf pre-trained ViT models, enabling direct benefits from ongoing and future advances in training ViT models with natural RGB images - a rapidly growing research field.</p>]]></content><author><name></name></author><category term="3d-perception"/><category term="multi-sensor"/><category term="limited-supervision"/><category term="reliability"/><summary type="html"><![CDATA[The IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR) is a key event for researchers and engineers working on computer vision and machine learning. At the 2023 edition the valeo.ai team will present six papers in the main conference, one workshop keynote and organize a tutorial. The team will be at CVPR to present these works and will be happy to discuss more about these projects and ideas, and share our exciting ongoing research. We outline four of our team papers below.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://valeoai.github.io//assets/img/posts/2023_cvpr/cvpr_banner.svg"/><media:content medium="image" url="https://valeoai.github.io//assets/img/posts/2023_cvpr/cvpr_banner.svg" xmlns:media="http://search.yahoo.com/mrss/"/></entry></feed>