The International Conference on Computer Vision (ICCV) is a leading conference that brings together researchers and practitioners in computer vision and machine learning. At the 2025 edition, the valeo.ai team will present five papers in the main conference. We are also co-organizing the Learning to See: Advancing Spatial Understanding for Embodied Intelligence workshop, and contributing to the Foundational Data for Industrial Tech Transfer workshop with a keynote, "Towards openness of vision foundation models".

The team will be at ICCV to present these works, exchange ideas, and share our exciting ongoing research. We look forward to seeing you in Honolulu!

valeo.ai team at ICCV 2025


DIP: Unsupervised Dense In-Context Post-training of Visual Representations

Authors: Sophia Sirko-Galouchenko    Spyros Gidaris    Antonin Vobecky    Andrei Bursuc    Nicolas Thome

[Paper]    [Code]

dip_teaser

We introduce DIP, a novel unsupervised post-training method designed to enhance dense image representations in large-scale pretrained vision encoders for in-context scene understanding. Unlike prior approaches that rely on complex self-distillation architectures, our method trains the vision encoder using pseudo-tasks that explicitly simulate downstream in-context scenarios, inspired by meta-learning principles. To enable post-training on unlabeled data, we propose an automatic mechanism for generating in-context tasks that combines a pretrained diffusion model and the vision encoder itself. DIP is simple, unsupervised, and computationally efficient, requiring less than 9 hours on a single A100 GPU. By learning dense representations through pseudo in-context tasks, it achieves strong performance across a wide variety of downstream real-world in-context scene understanding tasks. It outperforms both the initial vision encoder and prior methods, offering a practical and effective solution for improving dense representations.
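To give a feel for what a pseudo in-context task looks like at training time, here is a minimal PyTorch-style sketch of one post-training step, assuming a dense encoder that returns patch features and precomputed per-patch pseudo-masks (in the paper these come from a diffusion model combined with the encoder itself). The interfaces, temperature, and label-transfer rule below are illustrative assumptions, not the released DIP code.

```python
# Minimal sketch of one pseudo in-context post-training step (illustrative,
# not the official DIP implementation). `encoder` is any ViT-style model that
# returns dense patch features of shape (batch, n_patches, dim); pseudo-masks
# are assumed to be precomputed per patch.
import torch
import torch.nn.functional as F

def in_context_step(encoder, support_imgs, support_masks, query_img, query_mask, n_classes):
    sup_feats = encoder(support_imgs)              # (S, N, D) support patch features
    qry_feats = encoder(query_img.unsqueeze(0))    # (1, N, D) query patch features

    sup_feats = F.normalize(sup_feats.flatten(0, 1), dim=-1)   # (S*N, D)
    qry_feats = F.normalize(qry_feats.flatten(0, 1), dim=-1)   # (N, D)
    sup_labels = support_masks.flatten()                       # (S*N,) pseudo class ids

    # Soft nearest-neighbour label transfer: each query patch aggregates the
    # labels of similar support patches (the "in-context" prediction).
    sim = qry_feats @ sup_feats.t() / 0.07                     # (N, S*N), assumed temperature
    one_hot = F.one_hot(sup_labels, n_classes).float()         # (S*N, C)
    qry_probs = sim.softmax(dim=-1) @ one_hot                  # (N, C)

    # Train the encoder so that in-context label transfer reproduces the
    # query image's pseudo ground truth.
    return F.nll_loss(qry_probs.clamp_min(1e-8).log(), query_mask.flatten())
```

In an actual run, this loss would be backpropagated into the encoder over many automatically generated episodes.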


GaussRender: Learning 3D Occupancy with Gaussian Rendering

Authors: Loïck Chambon    Éloi Zablocki    Alexandre Boulch    Mickaël Chen    Matthieu Cord

[Paper]    [Code]

gaussrender_teaser

Understanding the 3D geometry and semantics of driving scenes is critical for safe autonomous driving. Recent advances in 3D occupancy prediction improve scene representation but often suffer from spatial inconsistencies, leading to floating artifacts and poor surface localization. Existing voxel-wise losses (e.g., cross-entropy) fail to enforce geometric coherence. GaussRender is a module that improves 3D occupancy learning by enforcing projective consistency. The key idea is to project both predicted and ground-truth 3D occupancy into 2D camera views for supervision, penalizing inconsistent 3D configurations and enforcing coherent 3D structure. To achieve this efficiently, GaussRender leverages differentiable rendering with Gaussian splatting. It integrates seamlessly with existing architectures, requires no inference-time modifications, and significantly improves geometric fidelity across multiple benchmarks (SurroundOcc-nuScenes, Occ3D-nuScenes, SSCBench-KITTI360) and 3D occupancy models (TPVFormer, SurroundOcc, Symphonies).


[Scene visualizations] GaussRender is a 3D occupancy module that can be plugged into any 3D occupancy model to enhance its predictions and ensure 2D-3D consistency while improving mIoU, IoU, and RayIoU.

The core idea is to transform the predicted voxels into Gaussians and render depth and semantic maps from them, so that supervision can be applied in image space, as sketched below.
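Here is a rough, simplified sketch of that idea: an occupancy grid becomes a set of per-voxel Gaussians, both the prediction and the ground truth are rendered into the same camera, and the loss compares the two renderings. The `render_fn` callable, the shapes, and the plain L1 comparisons are assumptions for illustration; the actual module relies on a differentiable Gaussian-splatting rasterizer and is combined with the usual voxel-wise losses.

```python
# Simplified sketch of rendering-based supervision for 3D occupancy
# (illustrative; not the official GaussRender code). `render_fn` stands in for
# a differentiable Gaussian-splatting renderer returning a semantic map (H, W, C)
# and a depth map (H, W) for a given camera.
import torch
import torch.nn.functional as F

def voxels_to_gaussians(logits, voxel_centers, voxel_size):
    """logits: (V, C) per-voxel class logits; voxel_centers: (V, 3) world coords."""
    probs = logits.softmax(dim=-1)                     # per-voxel semantics
    opacity = 1.0 - probs[:, 0]                        # assume class 0 = empty
    means = voxel_centers                              # one Gaussian per voxel
    scales = torch.full_like(means, voxel_size / 2)    # isotropic extent
    return means, scales, probs, opacity

def rendering_loss(pred_logits, gt_labels, voxel_centers, voxel_size, cam, render_fn):
    pred = voxels_to_gaussians(pred_logits, voxel_centers, voxel_size)
    gt_logits = F.one_hot(gt_labels, pred_logits.shape[1]).float() * 10.0  # sharp GT
    gt = voxels_to_gaussians(gt_logits, voxel_centers, voxel_size)

    sem_pred, depth_pred = render_fn(*pred, cam)
    sem_gt, depth_gt = render_fn(*gt, cam)

    # Penalize any 3D configuration whose 2D projection disagrees with the
    # projected ground truth, in both semantics and depth.
    return F.l1_loss(sem_pred, sem_gt) + F.l1_loss(depth_pred, depth_gt)
```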

Results

Experiments on three benchmarks (SurroundOcc-nuScenes, Occ3D-nuScenes, SSCBench-KITTI360) and three base models (TPVFormer, SurroundOcc, Symphonies) show consistent gains when GaussRender is added.

Occ3D-nuScenes

| Models | TPVFormer (ours) | TPVFormer | SurroundOcc (ours) | SurroundOcc | OccFormer | RenderOcc |
|---|---|---|---|---|---|---|
| Type | w/ GaussRender | base | w/ GaussRender | base | base | base |
| mIoU | 30.48 🥇 (+2.65) | 27.83 | 30.38 🥈 (+1.17) | 29.21 | 21.93 | 26.11 |
| RayIoU | 38.3 🥇 (+1.1) | 37.2 | 37.5 🥈 (+2.0) | 35.5 | - | 19.5 |

SurroundOcc-nuScenes

| Models | TPVFormer (ours) | TPVFormer | SurroundOcc (ours) | SurroundOcc | OccFormer | GaussianFormerv2 |
|---|---|---|---|---|---|---|
| Type | w/ GaussRender | base | w/ GaussRender | base | base | base |
| IoU | 32.05 🥈 (+1.19) | 30.86 | 32.61 🥇 (+1.12) | 31.49 | 31.39 | 30.56 |
| mIoU | 20.58 🥈 (+3.48) | 17.10 | 20.82 🥇 (+0.52) | 20.30 | 19.03 | 20.02 |

SSCBench-KITTI360

| Models | SurroundOcc (ours) | SurroundOcc | Symphonies (ours) | Symphonies | OccFormer | MonoScene |
|---|---|---|---|---|---|---|
| Type | w/ GaussRender | base | w/ GaussRender | base | base | base |
| IoU | 38.62 (+0.11) | 38.51 | 44.08 🥇 (+0.68) | 43.40 🥈 | 40.27 | 37.87 |
| mIoU | 13.34 (+0.26) | 13.08 | 18.11 🥇 (+0.29) | 17.82 🥈 | 13.81 | 12.31 |

MoSiC: Optimal-Transport Motion Trajectories for Dense Self-Supervised Learning

Authors: Mohammadreza Salehi    Shashanka Venkataramanan    Ioana Simion    Efstratios Gavves    Cees Snoek    Yuki Asano

[Paper]    [Code]

mosic_teaser

Dense self-supervised learning has shown great promise for learning pixel- and patch-level representations, but extending it to videos remains challenging due to complex motion dynamics. Existing approaches struggle under object deformations, occlusions, and camera movement, leading to inconsistent feature learning over time. In this work, we introduce MoSiC, a motion-guided self-supervised framework that clusters dense point tracks to learn spatiotemporally consistent representations. Using an off-the-shelf point tracker, we extract long-range motion trajectories and optimize feature clustering with a momentum-encoder-based optimal transport mechanism. Temporal coherence is enforced by propagating cluster assignments along tracked points, ensuring feature consistency across views despite viewpoint changes. By leveraging motion as an implicit supervisory signal and initializing from strong image-pretrained models, MoSiC learns robust representations that generalize across frames. Our approach improves state-of-the-art performance by 1% to 6% across six image and video datasets and four evaluation benchmarks.
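For intuition, here is a hedged sketch of how such motion-guided clustering can be wired: features sampled along point tracks are softly assigned to prototypes with a Sinkhorn (optimal-transport) step on the momentum branch, and those assignments supervise the online branch at the other frames of the same tracks. The tracker, encoder interfaces, and hyperparameters are illustrative assumptions rather than the released implementation.

```python
# Minimal sketch of motion-guided clustering in the spirit of MoSiC
# (illustrative, not the official code). Features are assumed to be sampled
# at tracked points, giving (T frames, N tracks, D dims) tensors per clip.
import torch
import torch.nn.functional as F

@torch.no_grad()
def sinkhorn(scores, eps=0.05, iters=3):
    """Balanced soft assignments from (N_tracks, K) prototype scores."""
    q = (scores / eps).exp().t()          # (K, N)
    q /= q.sum()
    K, N = q.shape
    for _ in range(iters):
        q /= q.sum(dim=1, keepdim=True); q /= K
        q /= q.sum(dim=0, keepdim=True); q /= N
    return (q * N).t()                    # (N, K) soft targets

def mosic_loss(student_feats, teacher_feats, prototypes, temp=0.1):
    """
    student_feats: (T, N, D) features at tracked points from the online encoder.
    teacher_feats: (T, N, D) same points from the momentum (EMA) encoder.
    prototypes:    (K, D) learnable cluster centers.
    """
    protos = F.normalize(prototypes, dim=-1)
    loss, T = 0.0, student_feats.shape[0]
    for t_teacher in range(T):
        # Targets from the momentum encoder at frame t_teacher ...
        targets = sinkhorn(F.normalize(teacher_feats[t_teacher], dim=-1) @ protos.t())
        for t_student in range(T):
            if t_student == t_teacher:
                continue
            # ... supervise the same tracks as seen at every other frame.
            logits = F.normalize(student_feats[t_student], dim=-1) @ protos.t() / temp
            loss += -(targets * logits.log_softmax(dim=-1)).sum(dim=-1).mean()
    return loss / (T * (T - 1))
```

Propagating the assignments along tracks is what enforces the temporal coherence described above: a track keeps the same soft cluster target no matter which frame it is observed in.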


FLOSS: Free Lunch in Open-vocabulary Semantic Segmentation

Authors: Yasser Benigmim    Mohammad Fahes    Tuan-Hung Vu    Andrei Bursuc    Raoul de Charette

[Paper]    [Code]

floss_teaser

In Open-Vocabulary Semantic Segmentation (OVSS), class-wise text embeddings are usually averaged over multiple templates (e.g., "a photo of …", "a sketch of …") to form classifiers. We show that for each class there exist single-template classifiers, termed class-experts, that outperform the conventional averaged classifiers. To identify these class-experts without labeled data, we leverage class-wise prediction entropy and select the lowest-entropy classifiers as the most reliable. We then fuse the outputs of these class-experts using a novel plug-and-play process called FLOSS. FLOSS is complementary to existing OVSS methods, requiring no additional labels or training, yet consistently improves performance. Extensive experiments demonstrate that FLOSS enhances state-of-the-art OVSS models, generalizes across datasets with distribution shifts, and yields strong improvements in low-data scenarios with only a few unlabeled images.
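The selection rule is easy to sketch: score every single-template classifier, per class, by an entropy measure computed on unlabeled images, and keep the lowest-entropy template as that class's expert. The exact entropy definition and the fusion step below are simplified assumptions, not the official FLOSS procedure.

```python
# Hedged sketch of entropy-based class-expert selection (illustrative only).
# template_classifiers holds one L2-normalized text embedding per (template, class).
import torch

def pick_class_experts(patch_feats, template_classifiers):
    """
    patch_feats:          (N, D)    L2-normalized dense features from unlabeled images.
    template_classifiers: (T, C, D) text embeddings, one per (template, class).
    Returns the selected template index for each class, shape (C,).
    """
    T, C, _ = template_classifiers.shape
    expert_idx = torch.empty(C, dtype=torch.long)
    for c in range(C):
        entropies = []
        for t in range(T):
            logits = patch_feats @ template_classifiers[t].t()        # (N, C)
            p = logits.softmax(dim=-1)[:, c].clamp(1e-8, 1 - 1e-8)    # confidence in class c
            ent = -(p * p.log() + (1 - p) * (1 - p).log()).mean()     # binary entropy over patches
            entropies.append(ent)
        expert_idx[c] = torch.stack(entropies).argmin()               # lowest-entropy template wins
    return expert_idx

def expert_logits(patch_feats, template_classifiers, expert_idx):
    # Simplified fusion: each class is scored by its own expert template.
    C = expert_idx.numel()
    experts = template_classifiers[expert_idx, torch.arange(C)]       # (C, D)
    return patch_feats @ experts.t()                                  # (N, C)
```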


Analyzing Fine-tuning Representation Shift for Multimodal LLMs Steering Alignment

Authors: Pegah Khayatan    Mustafa Shukor    Jayneel Parekh    Matthieu Cord

[Paper]    [Code]

xlvlm_teaser

Multimodal LLMs have achieved remarkable proficiency in understanding multimodal inputs, yet little attention has been given to explaining how their internal representations evolve during training. Most explainability research focuses only on final model states, ignoring dynamic representational shifts. In this work, we systematically analyze the evolution of hidden state representations during fine-tuning, revealing how models adapt to new multimodal tasks. Using a concept-based approach, we map hidden states to interpretable visual and textual concepts, enabling us to trace concept changes across modalities as training progresses. We introduce shift vectors to capture these changes, allowing recovery of fine-tuned concepts from the original model. Furthermore, we demonstrate practical applications in model steering, such as adjusting answer types, caption styles, or biasing responses without additional training. This work provides novel insights into multimodal representation adaptation and offers tools for interpreting and controlling fine-tuned multimodal LLMs.
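As a rough sketch of the steering mechanism, the snippet below averages per-concept differences between base and fine-tuned hidden states to form shift vectors, then injects a scaled shift into a chosen layer through a forward hook at inference time. The model structure, layer choice, and concept grouping are assumptions for illustration, not the paper's exact procedure.

```python
# Minimal sketch of shift-vector computation and steering (illustrative).
# base_hidden / finetuned_hidden are token representations from the same layer
# of both models on the same prompts; concept_ids assigns each token to a
# (hypothetical) interpretable concept.
import torch

@torch.no_grad()
def concept_shift_vectors(base_hidden, finetuned_hidden, concept_ids, n_concepts):
    """base_hidden, finetuned_hidden: (N, D); concept_ids: (N,) concept label per token."""
    shifts = torch.zeros(n_concepts, base_hidden.shape[1])
    for c in range(n_concepts):
        mask = concept_ids == c
        if mask.any():
            # Average representation change for tokens expressing concept c.
            shifts[c] = (finetuned_hidden[mask] - base_hidden[mask]).mean(dim=0)
    return shifts

def steer_hook(shift, alpha=1.0):
    """Forward hook adding a scaled shift vector to a chosen layer's output."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * shift.to(hidden.dtype).to(hidden.device)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Usage (assumed layer path, for illustration only):
# layer = model.language_model.layers[k]
# handle = layer.register_forward_hook(steer_hook(shifts[concept_of_interest]))
```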