Video Instance Segmentation in an Open-World
Mohamed bin Zayed University of Artificial Intelligence, United Arab Emirates.
Technology Innovation Institute, United Arab Emirates.
Mohamed bin Zayed University of Artificial Intelligence, United Arab Emirates.
Mohamed bin Zayed University of Artificial Intelligence, United Arab Emirates; Aalto University, Finland.
2025 (English). In: International Journal of Computer Vision, ISSN 0920-5691, E-ISSN 1573-1405, Vol. 133, no. 1, p. 398-409. Article in journal (Refereed). Published.
Abstract [en]

Existing video instance segmentation (VIS) approaches generally follow a closed-world assumption, where only seen category instances are identified and spatio-temporally segmented at inference. The open-world formulation relaxes this closed-world static-learning assumption as follows: (a) it distinguishes a set of known categories while labelling unknown objects as 'unknown', and (b) it incrementally learns the class of an unknown object as and when the corresponding semantic labels become available. We propose the first open-world VIS approach, named OW-VISFormer, which introduces a novel feature enrichment mechanism and a spatio-temporal objectness (STO) module. The feature enrichment mechanism, based on a lightweight auxiliary network, aims at accurately delineating (unknown) objects from the background at the pixel level, as well as distinguishing category-specific known semantic classes. The STO module strives to generate instance-level pseudo-labels by enhancing the foreground activations through a contrastive loss. Moreover, we introduce an extensive experimental protocol to measure the characteristics of OW-VIS. Our OW-VISFormer performs favorably against a solid baseline in the OW-VIS setting. Further, we evaluate our contributions in the standard fully-supervised VIS setting by integrating them into the recent SeqFormer, achieving an absolute gain of 1.6% AP on the YouTube-VIS 2019 validation set. Lastly, we show the generalizability of our contributions to the open-world object detection (OWOD) setting, outperforming the best existing OWOD method in the literature. Code, models, and OW-VIS splits are available at https://github.com/OmkarThawakar/OWVISFormer.
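
The STO module described above enhances foreground activations through a contrastive loss. Below is a minimal sketch of one plausible form of such a foreground/background contrastive objective; the function name `foreground_contrastive_loss`, the mean-prototype construction, and the temperature value are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
import torch
import torch.nn.functional as F

def foreground_contrastive_loss(embeddings, fg_mask, temperature=0.1):
    """InfoNCE-style loss that pulls foreground pixel embeddings toward
    their mean prototype and pushes them away from background embeddings.

    embeddings: (N, D) pixel embeddings; fg_mask: (N,) boolean mask.
    """
    emb = F.normalize(embeddings, dim=1)
    fg, bg = emb[fg_mask], emb[~fg_mask]
    if fg.shape[0] == 0 or bg.shape[0] == 0:
        # Degenerate batch with no foreground or no background: no signal.
        return embeddings.sum() * 0.0
    # The mean foreground embedding serves as the positive "prototype".
    proto = F.normalize(fg.mean(dim=0, keepdim=True), dim=1)   # (1, D)
    pos = fg @ proto.T / temperature                           # (Nf, 1)
    neg = fg @ bg.T / temperature                              # (Nf, Nb)
    logits = torch.cat([pos, neg], dim=1)  # positive sits at index 0
    targets = torch.zeros(fg.shape[0], dtype=torch.long, device=emb.device)
    return F.cross_entropy(logits, targets)

# Hypothetical usage on random features:
emb = torch.randn(64, 32)        # 64 pixel embeddings of dimension 32
mask = torch.rand(64) > 0.5      # stand-in for a predicted foreground mask
loss = foreground_contrastive_loss(emb, mask)
```

Minimizing a loss of this kind raises foreground similarity scores relative to background ones, which is the sort of foreground enhancement an instance-level pseudo-labeling step can rely on.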

Place, publisher, year, edition, pages
Springer, 2025. Vol. 133, no. 1, p. 398-409.
Keywords [en]
Open-world segmentation; Video instance segmentation; Object-detection; Video object detection; Transformers
National Category
Computer graphics and computer vision
Identifiers
URN: urn:nbn:se:liu:diva-207218
DOI: 10.1007/s11263-024-02195-4
ISI: 001283824600001
Scopus ID: 2-s2.0-85200025862
OAI: oai:DiVA.org:liu-207218
DiVA id: diva2:1895188
Note

Funding agencies: Swedish Research Council [2022-06725]

Available from: 2024-09-05. Created: 2024-09-05. Last updated: 2025-10-07. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text
Scopus

Authority records

Khan, Fahad Shahbaz

Search in DiVA

By author/editor
Khan, Fahad Shahbaz
By organisation
Computer Vision, Faculty of Science & Engineering
