Video Instance Segmentation via Multi-Scale Spatio-Temporal Split Attention Transformer
MBZUAI, United Arab Emirates.
IIAI, United Arab Emirates.
Tianjin University, People's Republic of China.
MBZUAI, United Arab Emirates.
2022 (English). In: Computer Vision – ECCV 2022, Part XXIX, Springer International Publishing, 2022, Vol. 13689, pp. 666-681. Conference paper, published paper (refereed)
Abstract [en]

State-of-the-art transformer-based video instance segmentation (VIS) approaches typically utilize either single-scale spatio-temporal features or per-frame multi-scale features during the attention computations. We argue that such an attention computation ignores the multi-scale spatio-temporal feature relationships that are crucial to tackle target appearance deformations in videos. To address this issue, we propose a transformer-based VIS framework, named MS-STS VIS, that comprises a novel multi-scale spatio-temporal split (MS-STS) attention module in the encoder. The proposed MS-STS module effectively captures spatio-temporal feature relationships at multiple scales across frames in a video. We further introduce an attention block in the decoder to enhance the temporal consistency of the detected instances in different frames of a video. Moreover, an auxiliary discriminator is introduced during training to ensure better foreground-background separability within the multi-scale spatio-temporal feature space. We conduct extensive experiments on two benchmarks: YouTube-VIS (2019 and 2021). Our MS-STS VIS achieves state-of-the-art performance on both benchmarks. When using the ResNet50 backbone, our MS-STS VIS achieves a mask AP of 50.1%, outperforming the best reported results in the literature by 2.7%, and by 4.8% at the higher overlap threshold of AP75, while being comparable in model size and speed on the YouTube-VIS 2019 val. set. When using the Swin Transformer backbone, MS-STS VIS achieves a mask AP of 61.0% on the YouTube-VIS 2019 val. set.
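
The abstract describes the MS-STS attention only at a high level. As an illustration of the general idea (splitting attention into an intra-frame spatial step and an inter-frame temporal step, applied per feature scale and then fused), here is a minimal PyTorch-style sketch. It is not the authors' implementation: the class name, tensor layout, and the linear fusion of scales are assumptions made for the example.

```python
# Illustrative sketch of a multi-scale spatio-temporal split attention block.
# NOT the authors' released code: names, shapes, and the scale-fusion step
# are assumptions made for this example.
import torch
import torch.nn as nn


class MSSTSBlock(nn.Module):
    """Per feature scale, attention is split into an intra-frame (spatial)
    step and an inter-frame (temporal) step; the per-scale outputs are then
    fused. Each input tensor is assumed to have shape (T, N, B, C):
    T frames, N spatial tokens, batch B, channels C."""

    def __init__(self, embed_dim=256, num_heads=8, num_scales=3):
        super().__init__()
        self.spatial_attn = nn.ModuleList(
            [nn.MultiheadAttention(embed_dim, num_heads) for _ in range(num_scales)]
        )
        self.temporal_attn = nn.ModuleList(
            [nn.MultiheadAttention(embed_dim, num_heads) for _ in range(num_scales)]
        )
        # Simple learned fusion of the per-scale outputs (an assumption;
        # the paper's actual multi-scale fusion differs).
        self.fuse = nn.Linear(num_scales * embed_dim, embed_dim)

    def forward(self, feats):
        # feats: list of per-scale tensors, each of shape (T, N, B, C);
        # the scales are assumed already resized to a common N.
        outs = []
        for s, x in enumerate(feats):
            T, N, B, C = x.shape
            # Intra-frame (spatial) attention: tokens within one frame attend
            # to each other; frames are folded into the batch dimension.
            xs = x.permute(1, 0, 2, 3).reshape(N, T * B, C)
            xs, _ = self.spatial_attn[s](xs, xs, xs)
            xs = xs.reshape(N, T, B, C).permute(1, 0, 2, 3)
            # Inter-frame (temporal) attention: the same spatial location
            # attends across frames; locations are folded into the batch.
            xt = xs.reshape(T, N * B, C)
            xt, _ = self.temporal_attn[s](xt, xt, xt)
            outs.append(xt.reshape(T, N, B, C))
        return self.fuse(torch.cat(outs, dim=-1))


if __name__ == "__main__":
    block = MSSTSBlock(embed_dim=256, num_heads=8, num_scales=3)
    # Three scales, 4 frames, 100 tokens each, batch of 2 (dummy data).
    feats = [torch.randn(4, 100, 2, 256) for _ in range(3)]
    print(block(feats).shape)  # torch.Size([4, 100, 2, 256])
```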

Place, publisher, year, edition, pages
Springer International Publishing, 2022. Vol. 13689, pp. 666-681
Series
Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349
National Category
Computer Vision and Robotics (Autonomous Systems)
Identifiers
URN: urn:nbn:se:liu:diva-191406
DOI: 10.1007/978-3-031-19818-2_38
ISI: 000903735000038
ISBN: 9783031198175 (print)
ISBN: 9783031198182 (electronic)
OAI: oai:DiVA.org:liu-191406
DiVA, id: diva2:1733507
Conference
17th European Conference on Computer Vision (ECCV), Tel Aviv, Israel, October 23-27, 2022
Note

Funding agencies: VR [2016-05543, 2018-04673]; WASP; ELLIIT; SNIC [2018-05973]

Available from: 2023-02-02 Created: 2023-02-02 Last updated: 2023-02-02

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text

Search in DiVA

By author/editor
Felsberg, Michael; Khan, Fahad
By organisation
Computer Vision; Faculty of Science & Engineering
Computer Vision and Robotics (Autonomous Systems)
