S3PT: Scene Semantics and Structure Guided Clustering to Boost Self-Supervised Pre-Training for Autonomous Driving
KTH Royal Institute of Technology. ORCID iD: 0000-0002-3432-6151
Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning. Linköping University, Faculty of Science & Engineering. ORCID iD: 0000-0003-3428-6564
Qualcomm Technologies International GmbH.
Qualcomm Technologies International GmbH.
2025 (English). In: Proceedings of the Winter Conference on Applications of Computer Vision (WACV), Tucson, USA, 2025. Conference paper, Published paper (Refereed)
Abstract [en]

Recent self-supervised clustering-based pre-training techniques such as DINO and CrIBo have shown impressive results on downstream detection and segmentation tasks. However, real-world applications such as autonomous driving face challenges with imbalanced object class and size distributions and complex scene geometries. In this paper, we propose S3PT, a novel scene semantics and structure guided clustering method that provides more scene-consistent objectives for self-supervised training. Specifically, our contributions are threefold: First, we incorporate semantic distribution consistent clustering to encourage better representation of rare classes such as motorcycles or animals. Second, we introduce object diversity consistent spatial clustering to handle imbalanced and diverse object sizes, ranging from large background areas to small objects such as pedestrians and traffic signs. Third, we propose a depth-guided spatial clustering to regularize learning based on geometric information of the scene, thus further refining region separation at the feature level. Our learned representations significantly improve performance in downstream semantic segmentation and 3D object detection tasks on the nuScenes, nuImages, and Cityscapes datasets and show promising domain translation properties.
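The depth-guided spatial clustering idea from the abstract can be illustrated with a minimal sketch (a hypothetical illustration, not the authors' implementation): appending a weighted, normalized depth channel to each patch feature before k-means makes clusters respect scene geometry as well as appearance. The function name `depth_guided_kmeans`, the `depth_weight` factor, and the toy data below are all illustrative assumptions.

```python
# Hypothetical sketch of depth-guided spatial clustering (not the paper's code).
# Idea: concatenate a weighted depth channel onto patch embeddings, then run
# k-means so the resulting regions are consistent with scene geometry.
import numpy as np

def depth_guided_kmeans(features, depth, k=4, depth_weight=2.0, iters=20, seed=0):
    """features: (N, D) patch embeddings; depth: (N,) per-patch depth values."""
    d = (depth - depth.mean()) / (depth.std() + 1e-8)        # normalize depth
    x = np.concatenate([features, depth_weight * d[:, None]], axis=1)
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), k, replace=False)]        # random init
    for _ in range(iters):
        # assign each patch to its nearest center, then recompute centers
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = x[labels == c].mean(0)
    return labels

# Toy example: two appearance groups, each spanning near-to-far depths,
# so depth can split them further into geometry-aware sub-clusters.
feats = np.vstack([np.zeros((20, 8)), np.ones((20, 8))])
depth = np.concatenate([np.linspace(0.0, 1.0, 20)] * 2)
labels = depth_guided_kmeans(feats, depth, k=4)
print(labels.shape)  # one cluster label per patch
```

A larger `depth_weight` pushes the clustering toward geometric separation; setting it to zero recovers plain appearance-based k-means.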

Place, publisher, year, edition, pages
2025.
Keywords [en]
self-supervised learning, vision transformers, dino, few-shot learning, clustering based methods, representation learning, autonomous driving, depth guided clustering
National Category
Computer graphics and computer vision
Identifiers
URN: urn:nbn:se:liu:diva-212817
OAI: oai:DiVA.org:liu-212817
DiVA, id: diva2:1950099
Conference
Winter Conference on Applications of Computer Vision (WACV)
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Available from: 2025-04-04 Created: 2025-04-04 Last updated: 2025-04-04

Open Access in DiVA

No full text in DiVA

Other links

Paper Page

Authority records

Govindarajan, Hariprasath

Search in DiVA

By author/editor
Wozniak, Maciej; Govindarajan, Hariprasath
By organisation
The Division of Statistics and Machine Learning; Faculty of Science & Engineering
Computer graphics and computer vision
