Vita-CLIP: Video and text adaptive CLIP via Multimodal Prompting
Mohamed Bin Zayed University of Artificial Intelligence, United Arab Emirates.
Mohamed Bin Zayed University of Artificial Intelligence, United Arab Emirates.
Mohamed Bin Zayed University of Artificial Intelligence, United Arab Emirates; Australian National University, Australia.
Linköping University, Department of Electrical Engineering, Computer Vision. Linköping University, Faculty of Science & Engineering. Mohamed Bin Zayed University of Artificial Intelligence, United Arab Emirates.
2023 (English). In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society, 2023, pp. 23034-23044. Conference paper, published paper (refereed).
Abstract [en]

Adapting contrastive image-text pretrained models such as CLIP to video classification has gained attention due to its cost-effectiveness and competitive performance. However, recent works in this area face a trade-off: fine-tuning the pretrained model to achieve strong supervised performance results in low zero-shot generalization, while freezing the backbone to retain zero-shot capability causes a significant drop in supervised accuracy. Because of this, recent works in the literature typically train separate models for supervised and zero-shot action recognition. In this work, we propose a multimodal prompt learning scheme that balances supervised and zero-shot performance under a single unified training. Our prompting approach on the vision side addresses three aspects: 1) global video-level prompts to model the data distribution; 2) local frame-level prompts to provide per-frame discriminative conditioning; and 3) a summary prompt to extract a condensed video representation. Additionally, we define a prompting scheme on the text side to augment the textual context. Through this prompting scheme, we achieve state-of-the-art zero-shot performance on Kinetics-600, HMDB51 and UCF101 while remaining competitive in the supervised setting. By keeping the pretrained backbone frozen, we optimize far fewer parameters and retain the existing general representation, which helps achieve the strong zero-shot performance. Our code and models will be released at https://github.com/TalalWasim/Vita-CLIP.
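For a concrete picture of the vision-side scheme, the sketch below illustrates the three prompt types in PyTorch. It is a minimal illustration under stated assumptions, not the authors' released implementation: the class name VideoPromptWrapper, the prompt counts, and the tensor layout are all hypothetical, and the frozen CLIP image encoder is represented only by its output patch tokens.

```python
# Hypothetical sketch of the prompting scheme described in the abstract:
# learnable prompts are optimized while the backbone stays frozen.
# All names and dimensions here are illustrative assumptions.
import torch
import torch.nn as nn

class VideoPromptWrapper(nn.Module):
    def __init__(self, d_model=768, n_frames=8, n_global=8, n_local_per_frame=1):
        super().__init__()
        # 1) Global video-level prompts: model the video data distribution.
        self.global_prompts = nn.Parameter(torch.randn(n_global, d_model) * 0.02)
        # 2) Local frame-level prompts: per-frame discriminative conditioning.
        self.local_prompts = nn.Parameter(
            torch.randn(n_frames, n_local_per_frame, d_model) * 0.02)
        # 3) A single summary prompt to condense the video representation.
        self.summary_prompt = nn.Parameter(torch.randn(1, d_model) * 0.02)

    def forward(self, frame_tokens):
        # frame_tokens: (batch, n_frames, tokens_per_frame, d_model),
        # e.g. patch tokens from a frozen CLIP image encoder.
        b, t, n, d = frame_tokens.shape
        # Prepend the local prompt(s) to each frame's token sequence.
        local = self.local_prompts.unsqueeze(0).expand(b, -1, -1, -1)
        tokens = torch.cat([local, frame_tokens], dim=2)
        # Flatten frames into one sequence over time.
        tokens = tokens.reshape(b, t * (n + local.shape[2]), d)
        g = self.global_prompts.unsqueeze(0).expand(b, -1, -1)
        s = self.summary_prompt.unsqueeze(0).expand(b, -1, -1)
        # Summary + global prompts are prepended to the full sequence; only
        # these prompt parameters receive gradients during training.
        return torch.cat([s, g, tokens], dim=1)

# Usage: wrap frozen-backbone tokens, then feed the result to the (frozen)
# transformer; only the prompt parameters are passed to the optimizer.
wrapper = VideoPromptWrapper()
dummy = torch.randn(2, 8, 196, 768)
out = wrapper(dummy)  # shape (2, 1 + 8 + 8*(196+1), 768)
```

The design point the abstract emphasizes is that only these prompt parameters are trained; because the pretrained backbone never changes, the general representation responsible for zero-shot transfer is preserved.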

Place, publisher, year, edition, pages
IEEE Computer Society, 2023, pp. 23034-23044.
Series
IEEE Conference on Computer Vision and Pattern Recognition, ISSN 1063-6919, E-ISSN 2575-7075
National Category
Computer graphics and computer vision
Identifiers
URN: urn:nbn:se:liu:diva-199355
DOI: 10.1109/CVPR52729.2023.02206
ISI: 001062531307035
ISBN: 9798350301298 (electronic)
ISBN: 9798350301304 (print)
OAI: oai:DiVA.org:liu-199355
DiVA id: diva2:1815364
Conference
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, Canada, June 17-24, 2023
Available from: 2023-11-28 Created: 2023-11-28 Last updated: 2025-02-07

Open Access in DiVA

No full text in DiVA
