Class-Agnostic Object Detection with Multi-modal Transformer
Mohamed bin Zayed Univ, U Arab Emirates.
Mohamed bin Zayed Univ, U Arab Emirates.
Mohamed bin Zayed Univ, U Arab Emirates; Australian Natl Univ, Australia.
Linköping University, Department of Electrical Engineering, Computer Vision. Linköping University, Faculty of Science & Engineering. Mohamed bin Zayed Univ, U Arab Emirates.
2022 (English). In: Computer Vision, ECCV 2022, Part X, Springer International Publishing AG, 2022, Vol. 13670, p. 512-531. Conference paper, Published paper (Refereed)
Abstract [en]

What constitutes an object? This has been a long-standing question in computer vision. Towards this goal, numerous learning-free and learning-based approaches have been developed to score objectness. However, they generally do not scale well across new domains and novel objects. In this paper, we advocate that existing methods lack a top-down supervision signal governed by human-understandable semantics. For the first time in the literature, we demonstrate that Multi-modal Vision Transformers (MViT) trained with aligned image-text pairs can effectively bridge this gap. Our extensive experiments across various domains and novel objects show the state-of-the-art performance of MViTs in localizing generic objects in images. Based on the observation that existing MViTs do not include multi-scale feature processing and usually require longer training schedules, we develop an efficient MViT architecture using multi-scale deformable attention and late vision-language fusion. We show the significance of MViT proposals in a diverse range of applications including open-world object detection, salient and camouflaged object detection, and supervised and self-supervised detection tasks. Further, MViTs can adaptively generate proposals given a specific language query and thus offer enhanced interactivity.
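The query-conditioned, class-agnostic proposal idea described in the abstract can be illustrated with a toy sketch. Everything here is an illustrative invention, not the paper's MViT architecture or code: regions are represented by made-up feature vectors, the query by a made-up text embedding, and "late fusion" is reduced to a cosine-similarity ranking of boxes against the query.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_proposals(proposals, text_embedding, top_k=2):
    """Score class-agnostic box proposals against a text-query embedding
    and return the top_k boxes, highest similarity first."""
    scored = [(cosine(feat, text_embedding), box) for box, feat in proposals]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [box for _, box in scored[:top_k]]

# Toy (box, region-embedding) pairs and a query embedding -- all hypothetical.
proposals = [
    ((10, 10, 50, 50), [0.9, 0.1, 0.0]),
    ((60, 20, 90, 80), [0.1, 0.8, 0.1]),
    ((5, 70, 40, 95),  [0.2, 0.2, 0.9]),
]
query = [1.0, 0.0, 0.1]  # stand-in for an embedded language query

print(rank_proposals(proposals, query, top_k=1))  # → [(10, 10, 50, 50)]
```

Changing `query` changes which boxes are returned, which is the sense in which a language query "adaptively" selects proposals; the actual method fuses vision and language features inside a transformer rather than by post-hoc similarity.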

Place, publisher, year, edition, pages
Springer International Publishing AG, 2022. Vol. 13670, p. 512-531
Series
Lecture Notes in Computer Science, ISSN 0302-9743
Keywords [en]
Object detection; Class-agnostic; Vision transformers
National Category
Computer graphics and computer vision
Identifiers
URN: urn:nbn:se:liu:diva-191235
DOI: 10.1007/978-3-031-20080-9_30
ISI: 000897089200030
ISBN: 9783031200793 (print)
ISBN: 9783031200809 (electronic)
OAI: oai:DiVA.org:liu-191235
DiVA, id: diva2:1731528
Conference
17th European Conference on Computer Vision (ECCV), Tel Aviv, Israel, October 23-27, 2022
Note

Funding Agencies|NSF CAREER [1149783]; VR starting grant [2016-05543]

Available from: 2023-01-27 Created: 2023-01-27 Last updated: 2025-02-07

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text

Search in DiVA

By author/editor
Khan, Fahad
By organisation
Computer Vision, Faculty of Science & Engineering
