LLM-wrapper: Black-Box Semantic-Aware Adaptation of Vision-Language Foundation Models

Amaia Cardiel    Éloi Zablocki    Oriane Siméoni    Elias Ramzi    Matthieu Cord

ECCV Workshop EVAL-FoMo 2024


Abstract

Vision Language Models (VLMs) have shown impressive performance on numerous tasks, but their zero-shot capabilities can be limited compared to dedicated or fine-tuned models. Yet, fine-tuning VLMs comes with strong limitations: it requires ‘white-box’ access to the model’s architecture and weights, while some recent models are proprietary (e.g., Grounding DINO 1.5). It also requires expertise to design the fine-tuning objectives and to optimize the hyper-parameters, which are specific to each VLM and downstream task. In this work, we propose LLM-wrapper, a novel approach to adapt VLMs in a ‘black-box’ and semantic-aware manner, by leveraging large language models (LLMs) to reason on the VLMs’ outputs. We demonstrate the effectiveness of LLM-wrapper on Referring Expression Comprehension (REC), a challenging open-vocabulary task that requires spatial and semantic reasoning. Our approach significantly boosts the performance of off-the-shelf models, yielding results that are competitive with, or on par with, classic VLM fine-tuning.
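As a rough illustration of the idea, the sketch below adapts a VLM for REC without touching its weights: the VLM's candidate boxes are serialized into text, and an LLM is asked to pick the box that best matches the referring expression. This is a minimal sketch under our own assumptions; the Detection structure, the prompt format, and the query_llm helper are hypothetical stand-ins, not the paper's actual implementation.

# Minimal sketch of the LLM-wrapper idea (illustrative only; the prompt
# format and the LLM interface below are assumptions, not the paper's code).
from dataclasses import dataclass

@dataclass
class Detection:
    label: str    # open-vocabulary label predicted by the VLM
    box: tuple    # (x1, y1, x2, y2) in image coordinates
    score: float  # VLM confidence

def boxes_to_prompt(query: str, detections: list) -> str:
    """Serialize the VLM's outputs into text so an LLM can reason over them."""
    lines = [f"Referring expression: {query!r}", "Candidate boxes:"]
    for i, d in enumerate(detections):
        lines.append(f"{i}: label={d.label}, box={d.box}, score={d.score:.2f}")
    lines.append("Answer with the index of the box that best matches the expression.")
    return "\n".join(lines)

def llm_wrapper(query: str, detections: list, query_llm) -> Detection:
    """Black-box adaptation: only the VLM's (textualized) outputs are used;
    no access to its weights, gradients, or architecture is needed."""
    prompt = boxes_to_prompt(query, detections)
    answer = query_llm(prompt)  # hypothetical LLM call (e.g., an API client)
    return detections[int(answer.strip())]

Because only the VLM's outputs are consumed, the same wrapper applies to proprietary models exposed through an API, with the LLM supplying the semantic and spatial reasoning over the candidates.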


BibTeX

@misc{cardiel2024llmwrapper,
  title         = {LLM-wrapper: Black-Box Semantic-Aware Adaptation of Vision-Language Foundation Models},
  author        = {Amaia Cardiel and
                   \'{E}loi Zablocki and
                   Oriane Sim\'{e}oni and
                   Elias Ramzi and
                   Matthieu Cord},
  year          = {2024},
  eprint        = {2409.11919},
  archivePrefix = {arXiv},
}