liu.seSearch for publications in DiVA
Change search
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • oxford
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf
S3A: Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment
Mohamed bin Zayed Univ Artificial Intelligence, U Arab Emirates.
Mohamed bin Zayed Univ Artificial Intelligence, U Arab Emirates.
Mohamed bin Zayed Univ Artificial Intelligence, U Arab Emirates; Carnegie Mellon Univ, PA USA.
Mohamed bin Zayed Univ Artificial Intelligence, U Arab Emirates.
Show others and affiliations
2024 (English)In: THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 7, ASSOC ADVANCEMENT ARTIFICIAL INTELLIGENCE , 2024, p. 7278-7286Conference paper, Published paper (Refereed)
Abstract [en]

Large-scale pre-trained Vision Language Models (VLMs) have proven effective for zero-shot classification. Despite the success, most traditional VLMs-based methods are restricted by the assumption of partial source supervision or ideal target vocabularies, which rarely satisfy the open-world scenario. In this paper, we aim at a more challenging setting, Realistic Zero-Shot Classification, which assumes no annotation but instead a broad vocabulary. To address the new problem, we propose the Self Structural Semantic Alignment (S3A) framework, which extracts the structural semantic information from unlabeled data while simultaneously selflearning. Our S3A framework adopts a unique Cluster-Vote-Prompt-Realign (CVPR) algorithm, which iteratively groups unlabeled data to derive structural semantics for pseudo-supervision. Our CVPR algorithm includes iterative clustering on images, voting within each cluster to identify initial class candidates from the vocabulary, generating discriminative prompts with large language models to discern confusing candidates, and realigning images and the vocabulary as structural semantic alignment. Finally, we propose to self-train the CLIP image encoder with both individual and structural semantic alignment through a teacher-student learning strategy. Our comprehensive experiments across various generic and fine-grained benchmarks demonstrate that the S3A method substantially improves over existing VLMs-based approaches, achieving a more than 15% accuracy improvement over CLIP on average. Our codes, models, and prompts are publicly released at https://github.com/shengeatamath/S3A.

Place, publisher, year, edition, pages
ASSOC ADVANCEMENT ARTIFICIAL INTELLIGENCE , 2024. p. 7278-7286
Series
AAAI Conference on Artificial Intelligence, ISSN 2159-5399
National Category
Natural Language Processing
Identifiers
URN: urn:nbn:se:liu:diva-207264DOI: 10.1609/aaai.v38i7.28557ISI: 001239937300087OAI: oai:DiVA.org:liu-207264DiVA, id: diva2:1895691
Conference
38th AAAI Conference on Artificial Intelligence (AAAI) / 36th Conference on Innovative Applications of Artificial Intelligence / 14th Symposium on Educational Advances in Artificial Intelligence, Vancouver, CANADA, feb 20-27, 2024
Available from: 2024-09-06 Created: 2024-09-06 Last updated: 2025-02-07

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text

Search in DiVA

By author/editor
Khan, Fahad
By organisation
Computer VisionFaculty of Science & Engineering
Natural Language Processing

Search outside of DiVA

GoogleGoogle Scholar

doi
urn-nbn

Altmetric score

doi
urn-nbn
Total: 66 hits
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • oxford
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf