GASP: Unifying Geometric and Semantic Self-Supervised Pre-training for Autonomous Driving
Linköping University, Department of Electrical Engineering, Computer Vision. Linköping University, Faculty of Science & Engineering. Zenseact. ORCID iD: 0000-0002-0194-6346
Zenseact; Chalmers University of Technology, Gothenburg, Sweden.
Zenseact; Lund University, Lund, Sweden.
Zenseact; Chalmers University of Technology, Gothenburg, Sweden.
(English) Manuscript (preprint) (Other academic)
Abstract [en]

Self-supervised pre-training based on next-token prediction has enabled large language models to capture the underlying structure of text, and has led to unprecedented performance on a large array of tasks when applied at scale. Similarly, autonomous driving generates vast amounts of spatiotemporal data, alluding to the possibility of harnessing scale to learn the underlying geometric and semantic structure of the environment and its evolution over time. In this direction, we propose a geometric and semantic self-supervised pre-training method, GASP, that learns a unified representation by predicting, at any queried future point in spacetime, (1) general occupancy, capturing the evolving structure of the 3D scene; (2) ego occupancy, modeling the ego vehicle path through the environment; and (3) distilled high-level features from a vision foundation model. By modeling geometric and semantic 4D occupancy fields instead of raw sensor measurements, the model learns a structured, generalizable representation of the environment and its evolution through time. We validate GASP on multiple autonomous driving benchmarks, demonstrating significant improvements in semantic occupancy forecasting, online mapping, and ego trajectory prediction. Our results demonstrate that continuous 4D geometric and semantic occupancy prediction provides a scalable and effective pre-training paradigm for autonomous driving. For code and additional visualizations, see https://research.zenseact.com/publications/gasp/
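The three prediction targets listed in the abstract can be illustrated with a minimal sketch of a combined per-query loss. This is not the authors' implementation: the abstract does not specify loss functions or weights, so the binary cross-entropy terms, the cosine-distance feature term, the weights, and all names (`QueryPrediction`, `QueryTarget`, `pretraining_loss`) are assumptions made purely for illustration.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class QueryPrediction:
    """Hypothetical model outputs at one queried point (x, y, z, t)."""
    occupancy_logit: float       # (1) general occupancy
    ego_occupancy_logit: float   # (2) ego-vehicle path occupancy
    features: Sequence[float]    # (3) predicted foundation-model features


@dataclass
class QueryTarget:
    """Hypothetical self-supervised targets for the same point."""
    occupied: float              # 1.0 if the point is occupied, else 0.0
    ego_occupied: float          # 1.0 if the ego vehicle passes through it
    features: Sequence[float]    # teacher features to distil


def bce_with_logits(logit: float, target: float) -> float:
    """Numerically stable binary cross-entropy on a raw logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; an assumed choice of distillation loss."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b + 1e-8)


def pretraining_loss(
    preds: Sequence[QueryPrediction],
    targets: Sequence[QueryTarget],
    w_occ: float = 1.0,   # assumed weights, not taken from the paper
    w_ego: float = 1.0,
    w_feat: float = 1.0,
) -> float:
    """Average the three weighted loss terms over all queried points."""
    total = 0.0
    for p, t in zip(preds, targets):
        total += w_occ * bce_with_logits(p.occupancy_logit, t.occupied)
        total += w_ego * bce_with_logits(p.ego_occupancy_logit, t.ego_occupied)
        total += w_feat * cosine_distance(p.features, t.features)
    return total / len(preds)
```

In such a setup, each training step would sample query points in future spacetime, evaluate the model at those points, and minimise a loss of this shape; the actual formulation is given in the paper and code linked above.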

National Category
Computer graphics and computer vision; Artificial Intelligence
Identifiers
URN: urn:nbn:se:liu:diva-218890
DOI: 10.48550/arXiv.2503.15672
OAI: oai:DiVA.org:liu-218890
DiVA, id: diva2:2007146
Available from: 2025-10-17 Created: 2025-10-17 Last updated: 2025-10-27
In thesis
1. On the Road to Safe Autonomous Driving via Data, Learning, and Validation
2025 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Autonomous driving systems hold the promise of safer and more efficient transportation, with the potential to fundamentally reshape what everyday mobility looks like. However, to realize these promises, such systems must perform reliably in both routine driving and in rare, safety-critical situations. To this end, this thesis addresses three core aspects of autonomous driving development: data, learning, and validation.

First, we tackle the fundamental need for high-quality data by introducing the Zenseact Open Dataset (ZOD) in Paper A. ZOD is a large-scale multimodal dataset collected across diverse geographies, weather conditions, and road types throughout Europe, effectively addressing key shortcomings of existing academic datasets.

We then turn to the challenge of learning from this data. First, we develop a method that bypasses the need for intricate image signal processing pipelines and instead learns to detect objects directly from RAW image data in a supervised setting (Paper B). This reduces the reliance on hand-crafted preprocessing but still requires annotations. Although sensor data is typically abundant in the autonomous driving setting, such annotations become prohibitively expensive at scale. To overcome this bottleneck, we propose GASP (Paper C), a self-supervised method that captures structured 4D representations by jointly modeling geometry, semantics, and dynamics solely from sensor data.

Once models are trained, they must undergo rigorous validation. Yet existing evaluation methods often fall short in realism, scalability, or both. To remedy this, we introduce NeuroNCAP (Paper D), a neural rendering-based closed-loop simulation framework that enables safety-critical testing in photorealistic environments. Building on this, we present R3D2 (Paper E), a generative method for realistic insertion of non-native 3D assets into such environments, further enhancing the scalability and diversity of safety-critical testing.

Together, these contributions provide a scalable set of tools for training and validating autonomous driving systems, supporting progress both in mastering the nominal 99% of everyday driving and in validating behavior in the critical 1% of rare, safety-critical situations.

Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2025. p. 65
Series
Linköping Studies in Science and Technology. Dissertations, ISSN 0345-7524 ; 2478
National Category
Computer Vision and Learning Systems
Identifiers
urn:nbn:se:liu:diva-219102 (URN)
10.3384/9789181182453 (DOI)
9789181182446 (ISBN)
9789181182453 (ISBN)
Public defence
2025-11-28, Zero, Zenit Building, Campus Valla, Linköping, 09:15 (English)
Note

Funding agencies: This thesis work was supported by the Wallenberg Artificial Intelligence, Autonomous Systems and Software Program (WASP), funded by the Knut and Alice Wallenberg Foundation, and by Zenseact AB through their industrial PhD program. The computational resources were provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS) at C3SE, partially funded by the Swedish Research Council through grant agreement no. 2022-06725, and by the Berzelius resource, provided by the Knut and Alice Wallenberg Foundation at the National Supercomputer Centre.

Available from: 2025-10-27 Created: 2025-10-27 Last updated: 2025-10-27 (Bibliographically approved)

Open Access in DiVA

Full text from arXiv, CC BY (8295 kB), 21 downloads
File information
File name: FULLTEXT01.pdf
File size: 8295 kB
Checksum (SHA-512): b8df7f31d38907c0105df9f82f86300dbf45f7f2f02e6d7d6da3db8b7d96d27dbd97e16b2a371745126a076a1920d36bfffc7fabd306b50c062e599eaf10bed8
Type: fulltext
Mimetype: application/pdf
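
The SHA-512 checksum listed in the file information can be used to check that a downloaded copy of the full text is intact. The following is a minimal sketch using Python's standard `hashlib` module; the local path `FULLTEXT01.pdf` assumes the file was saved under the name shown in the record, and the helper names are illustrative.

```python
import hashlib
from pathlib import Path

# Checksum as listed in the record's file information.
EXPECTED_SHA512 = (
    "b8df7f31d38907c0105df9f82f86300dbf45f7f2f02e6d7d6da3db8b7d96d27d"
    "bd97e16b2a371745126a076a1920d36bfffc7fabd306b50c062e599eaf10bed8"
)


def sha512_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-512 digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path) -> bool:
    """True if the file's digest equals the checksum in the record."""
    return sha512_of_file(path) == EXPECTED_SHA512


if __name__ == "__main__":
    pdf = Path("FULLTEXT01.pdf")  # assumed local download location
    if pdf.exists():
        print("checksum OK" if matches_record(pdf) else "checksum MISMATCH")
    else:
        print(f"{pdf} not found")
```

Reading in chunks keeps memory use constant regardless of file size.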

Authority records

Ljungbergh, William; Felsberg, Michael
