Class-Agnostic Object Detection with Multi-modal Transformer
2022 (English). In: COMPUTER VISION, ECCV 2022, PT X, SPRINGER INTERNATIONAL PUBLISHING AG, 2022, Vol. 13670, p. 512-531. Conference paper, Published paper (Refereed)
Abstract [en]
What constitutes an object? This has been a long-standing question in computer vision. Towards this goal, numerous learning-free and learning-based approaches have been developed to score objectness. However, they generally do not scale well across new domains and novel objects. In this paper, we advocate that existing methods lack a top-down supervision signal governed by human-understandable semantics. For the first time in literature, we demonstrate that Multi-modal Vision Transformers (MViT) trained with aligned image-text pairs can effectively bridge this gap. Our extensive experiments across various domains and novel objects show the state-of-the-art performance of MViTs to localize generic objects in images. Based on the observation that existing MViTs do not include multi-scale feature processing and usually require longer training schedules, we develop an efficient MViT architecture using multi-scale deformable attention and late vision-language fusion. We show the significance of MViT proposals in a diverse range of applications including open-world object detection, salient and camouflage object detection, supervised and self-supervised detection tasks. Further, MViTs can adaptively generate proposals given a specific language query and thus offer enhanced interactability.
Place, publisher, year, edition, pages
SPRINGER INTERNATIONAL PUBLISHING AG, 2022. Vol. 13670, p. 512-531
Series
Lecture Notes in Computer Science, ISSN 0302-9743
Keywords [en]
Object detection; Class-agnostic; Vision transformers
National Category
Computer graphics and computer vision
Identifiers
URN: urn:nbn:se:liu:diva-191235
DOI: 10.1007/978-3-031-20080-9_30
ISI: 000897089200030
ISBN: 9783031200793 (print)
ISBN: 9783031200809 (electronic)
OAI: oai:DiVA.org:liu-191235
DiVA, id: diva2:1731528
Conference
17th European Conference on Computer Vision (ECCV), Tel Aviv, Israel, Oct 23-27, 2022
Note
Funding agencies: NSF CAREER [1149783]; VR starting grant [2016-05543]
2023-01-27 2023-01-27 2025-02-07