Vision Language Models (VLMs) have shown impressive performance on numerous tasks, but their zero-shot capabilities can be limited compared to dedicated or fine-tuned models. Yet, fine-tuning VLMs comes with strong limitations: it requires 'white-box' access to the model's architecture and weights, while some recent models (e.g., Grounding DINO 1.5) are proprietary. It also requires expertise to design the fine-tuning objectives and to tune the hyper-parameters, which are specific to each VLM and downstream task. In this work, we propose LLM-wrapper, a novel approach to adapt VLMs in a 'black-box', semantic-aware manner by leveraging large language models (LLMs) to reason on the VLMs' outputs. We demonstrate the effectiveness of LLM-wrapper on Referring Expression Comprehension (REC), a challenging open-vocabulary task that requires spatial and semantic reasoning. Our approach significantly boosts the performance of off-the-shelf models, yielding results that are competitive with, or on par with, classic VLM fine-tuning.
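To make the idea concrete, below is a minimal, purely illustrative Python sketch of the black-box wrapping principle: a frozen VLM detector produces candidate boxes, these are verbalized into a text prompt, and an LLM is asked to pick the box that best matches the referring expression. The `Detection` structure, the prompt format, and the `query_llm` callable are assumptions made for this sketch, not the paper's actual prompts or pipeline.

```python
# Illustrative sketch of the LLM-wrapper idea (not the authors' exact implementation):
# verbalize a VLM's candidate boxes and let a black-box LLM select the one that
# matches the referring expression.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Detection:
    label: str    # open-vocabulary label predicted by the VLM
    box: tuple    # (x1, y1, x2, y2) in pixels
    score: float  # detection confidence


def build_prompt(expression: str, detections: List[Detection]) -> str:
    """Verbalize the VLM outputs so the LLM can reason over them in text."""
    lines = [
        f'Referring expression: "{expression}"',
        "Candidate boxes (index: label, box, score):",
    ]
    for i, d in enumerate(detections):
        lines.append(f"{i}: {d.label}, box={d.box}, score={d.score:.2f}")
    lines.append("Answer with the index of the box that best matches the expression.")
    return "\n".join(lines)


def select_box(expression: str,
               detections: List[Detection],
               query_llm: Callable[[str], str]) -> Detection:
    """Black-box adaptation step: the LLM chooses among the VLM's candidates."""
    answer = query_llm(build_prompt(expression, detections))
    digits = "".join(c for c in answer if c.isdigit())
    index = int(digits) if digits else 0  # fall back to the top candidate
    return detections[min(index, len(detections) - 1)]


if __name__ == "__main__":
    # Two candidate detections from a hypothetical open-vocabulary detector.
    candidates = [
        Detection("person", (40, 60, 120, 300), 0.91),    # left side of the image
        Detection("person", (400, 55, 480, 310), 0.88),   # right side of the image
    ]
    # Stub LLM call for the demo; replace with any chat-model API of your choice.
    stub_llm = lambda prompt: "1"
    chosen = select_box("the person on the right", candidates, stub_llm)
    print(chosen)
```

In this setup, neither the VLM nor the LLM needs gradient access: the wrapper only consumes the VLM's predictions as text, which is what makes the adaptation 'black-box'.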
@misc{cardiel2024llmwrapper,
  title={LLM-wrapper: Black-Box Semantic-Aware Adaptation of Vision-Language Foundation Models},
  author={Amaia Cardiel and \'{E}loi Zablocki and Oriane Sim\'{e}oni and Elias Ramzi and Matthieu Cord},
  year={2024},
  eprint={2409.11919},
  archivePrefix={arXiv},
}