CLIP's Visual Embedding Projector is a Few-shot Cornucopia

Created by MG96

Tags: External, Public, cs.CV

Statistics

Citations: 2
References: 45
Authors

Mohammad Fahes, Tuan-Hung Vu, Andrei Bursuc, Patrick Pérez, Raoul de Charette
Project Resources

ArXiv Paper (Paper): arXiv
Semantic Scholar Paper (Paper): Semantic Scholar
GitHub Repository (Code Repository): GitHub
Abstract

We consider the problem of adapting a contrastively pretrained vision-language model like CLIP (Radford et al., 2021) for few-shot classification. The literature addresses this problem by learning a linear classifier on the frozen visual features, optimizing word embeddings, or learning external feature adapters. We introduce an alternative approach to few-shot CLIP adaptation that adds no "external" parameters to optimize. We find that simply fine-tuning the embedding projection matrix of the vision encoder leads to better performance than all baselines. Furthermore, we show that regularizing training with the distance between the fine-tuned and pretrained matrices makes CLIP adaptation more reliable, keeping results stable across different learning rates in the "validation-free" setting. This simple approach, coined ProLIP, yields state-of-the-art performance on 11 few-shot classification benchmarks, few-shot cross-dataset transfer, domain generalization, and base-to-new class generalization. We also show that ProLIP significantly outperforms prompt tuning when extended to the task of test-time adaptation, while being one order of magnitude faster to train. Code will be made available at: https://github.com/astra-vision/ProLIP.
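
The sketch below illustrates the core idea described in the abstract: freeze the whole CLIP model except the vision encoder's embedding projection matrix, train it with a standard cross-entropy loss against frozen text-prompt embeddings, and penalize the distance between the fine-tuned and pretrained projection weights. It is a minimal illustration, not the authors' ProLIP implementation; the model checkpoint, class prompts, `lambda_reg` value, learning rate, and the Frobenius-norm form of the regularizer are assumptions made for the example.

```python
# Minimal sketch: fine-tune only CLIP's visual embedding projection,
# regularized toward its pretrained weights. Hyperparameters and the exact
# distance used for regularization are placeholders, not the paper's values.
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

# Freeze everything except the visual embedding projection matrix.
for p in model.parameters():
    p.requires_grad_(False)
model.visual_projection.weight.requires_grad_(True)

# Frozen copy of the pretrained projection, used by the regularizer.
pretrained_proj = model.visual_projection.weight.detach().clone()

# Precompute frozen text embeddings for the class prompts.
class_names = ["cat", "dog"]  # placeholder few-shot classes
text_inputs = processor(text=[f"a photo of a {c}" for c in class_names],
                        return_tensors="pt", padding=True).to(device)
with torch.no_grad():
    text_feats = F.normalize(model.get_text_features(**text_inputs), dim=-1)

optimizer = torch.optim.AdamW([model.visual_projection.weight], lr=1e-4)
lambda_reg = 1.0  # placeholder weight on the distance-to-pretrained penalty

def train_step(pixel_values, labels):
    """One few-shot update on a batch of preprocessed images and labels."""
    image_feats = F.normalize(
        model.get_image_features(pixel_values=pixel_values), dim=-1)
    logits = model.logit_scale.exp() * image_feats @ text_feats.t()
    ce = F.cross_entropy(logits, labels)
    # Regularize with the distance between fine-tuned and pretrained matrices.
    reg = torch.norm(model.visual_projection.weight - pretrained_proj)
    loss = ce + lambda_reg * reg
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this setup the only trainable tensor is the projection matrix, so the number of optimized parameters matches the pretrained model and no external adapter or prompt parameters are introduced, which is the property the abstract emphasizes.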

Note:

No note available for this project.

Contact:

No contact available for this project.
