Vita-CLIP: Video and text adaptive CLIP via Multimodal PromptingShow others and affiliations
2023 (English)In: 2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), IEEE COMPUTER SOC , 2023, p. 23034-23044Conference paper, Published paper (Refereed)
Abstract [en]
Adopting contrastive image-text pretrained models like CLIP towards video classification has gained attention due to its cost-effectiveness and competitive performance. However, recent works in this area face a trade-off. Fine-tuning the pretrained model to achieve strong supervised performance results in low zero-shot generalization. Similarly, freezing the backbone to retain zero-shot capability causes significant drop in supervised accuracy. Because of this, recent works in literature typically train separate models for supervised and zero-shot action recognition. In this work, we propose a multimodal prompt learning scheme that works to balance the supervised and zero-shot performance under a single unified training. Our prompting approach on the vision side caters for three aspects: 1) Global video-level prompts to model the data distribution; 2) Local frame-level prompts to provide per-frame discriminative conditioning; and 3) a summary prompt to extract a condensed video representation. Additionally, we define a prompting scheme on the text side to augment the textual context. Through this prompting scheme, we can achieve state-of-the-art zero-shot performance on Kinetics-600, HMDB51 and UCF101 while remaining competitive in the supervised setting. By keeping the pretrained backbone frozen, we optimize a much lower number of parameters and retain the existing general representation which helps achieve the strong zero-shot performance. Our codes/models will be released at https://github.com/TalalWasim/Vita-CLIP..
Place, publisher, year, edition, pages
IEEE COMPUTER SOC , 2023. p. 23034-23044
Series
IEEE Conference on Computer Vision and Pattern Recognition, ISSN 1063-6919, E-ISSN 2575-7075
National Category
Computer graphics and computer vision
Identifiers
URN: urn:nbn:se:liu:diva-199355DOI: 10.1109/CVPR52729.2023.02206ISI: 001062531307035ISBN: 9798350301298 (electronic)ISBN: 9798350301304 (print)OAI: oai:DiVA.org:liu-199355DiVA, id: diva2:1815364
Conference
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, CANADA, jun 17-24, 2023
2023-11-282023-11-282025-02-07