Spatiotemporal Learning for Motion Estimation and Visual Recognition
Linköping University, Department of Electrical Engineering, Computer Vision. Linköping University, Faculty of Science & Engineering. ORCID iD: 0000-0001-8761-4715
2025 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

The field of computer vision has undergone rapid development. Starting from recognition tasks such as classification, detection, and segmentation, the focus of visual analysis has gradually shifted towards learning spatiotemporal information. This thesis presents research on spatiotemporal learning, with a particular emphasis on motion estimation and visual recognition.

First, we address the problem of video object tracking. Previous methods have primarily relied on learning improved appearance representations, while the spatiotemporal relationships of individual objects have remained underexplored. We propose leveraging optical flow features to achieve higher generalization in semi-supervised video object segmentation, directly incorporating these features into both the target representation and the decoder network. Our experiments and analysis show that enriching feature representations with spatiotemporal information improves segmentation quality and generalization capability.

Next, we investigate spatiotemporal learning in 3D for motion estimation, specifically scene flow estimation. Scene flow estimation is an important research topic in 3D computer vision and is crucial for applications such as robotics, autonomous driving, embodied navigation, and tracking. We investigate the problem from several perspectives: 1. What is the best formulation for solving the problem, and how can a better spatiotemporal feature representation be learned? 2. Can we introduce uncertainty estimation to the task, which is of crucial importance for safety-critical downstream tasks? 3. How can the estimation be scaled to large-scale data, e.g., autonomous driving scenes, while leveraging temporal information without introducing much computational overhead? To answer these questions, we explore the use of transformers for improved feature representation, diffusion models for uncertainty estimation, and efficient feature learning methods for multi-frame, large-scale autonomous driving scenarios.

Finally, we extend our research to joint visual segmentation, tracking, and open-vocabulary recognition in LiDAR sequences, particularly for autonomous scenes. In such environments, precise segmentation, tracking, and recognition of objects are essential for downstream analysis and control. Current human-annotated open-source datasets allow for reasonable tracking of traffic participants such as cars and pedestrians. However, we aim to advance beyond this towards segmenting and tracking any object in LiDAR data. To this end, we propose a pseudo-labeling engine that leverages the 2D visual foundation model SAM v2 and the vision-language model CLIP to automatically label LiDAR streams. We further introduce the SAL-4D model, capable of segmenting, tracking, and recognizing any object in a zero-shot manner.
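
To make the pseudo-labeling idea more concrete, the sketch below shows one plausible way to transfer 2D instance masks and open-vocabulary labels onto LiDAR points: project each point into the camera image, inherit the instance id of the mask it falls into, and label each mask with the closest CLIP text prompt. This is a hedged illustration only; the helper lift_masks_to_lidar, its arguments, and the 0.1 m depth cut-off are hypothetical and assume SAM v2 masks and CLIP embeddings have already been computed, so it does not reproduce the actual SAL-4D labeling engine.

    import numpy as np

    def lift_masks_to_lidar(points, mask_map, mask_clip_emb, text_emb, class_names,
                            K, T_cam_from_lidar):
        """Transfer 2D instance masks and CLIP labels onto LiDAR points (illustrative).

        points:           (N, 3) LiDAR points in the LiDAR frame.
        mask_map:         (H, W) integer image, 0 = background, i > 0 = instance i
                          (e.g. rendered from SAM v2 masks; not computed here).
        mask_clip_emb:    (num_masks, D) CLIP image embeddings, one per mask crop.
        text_emb:         (num_classes, D) CLIP text embeddings of vocabulary prompts.
        class_names:      list of num_classes strings matching text_emb.
        K:                (3, 3) camera intrinsics.
        T_cam_from_lidar: (4, 4) LiDAR-to-camera extrinsics.
        Returns per-point instance ids (-1 = unlabeled) and open-vocabulary labels.
        """
        H, W = mask_map.shape
        # Project every LiDAR point into the camera image.
        pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
        cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]
        z = cam[:, 2]
        valid = z > 0.1                                   # keep points in front of the camera
        uvw = (K @ cam.T).T
        u = np.zeros(len(points), dtype=int)
        v = np.zeros(len(points), dtype=int)
        u[valid] = (uvw[valid, 0] / z[valid]).astype(int)
        v[valid] = (uvw[valid, 1] / z[valid]).astype(int)
        valid &= (u >= 0) & (u < W) & (v >= 0) & (v < H)

        # A point inherits the instance id of the mask pixel it projects onto.
        instance_ids = np.full(len(points), -1, dtype=int)
        instance_ids[valid] = mask_map[v[valid], u[valid]] - 1

        # Each mask gets the vocabulary prompt with the highest cosine similarity.
        sim = (mask_clip_emb / np.linalg.norm(mask_clip_emb, axis=1, keepdims=True)) @ \
              (text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)).T
        mask_labels = [class_names[j] for j in sim.argmax(axis=1)]
        point_labels = [mask_labels[i] if i >= 0 else "unknown" for i in instance_ids]
        return instance_ids, point_labels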

In summary, we explore the learning of spatiotemporal information in both 2D image and 3D point cloud domains. In the image domain, we demonstrate that spatiotemporal information improves video object segmentation quality and generalization. In the 3D point cloud domain, we show that spatiotemporal learning enables more accurate motion estimation and facilitates the first method for zero-shot segmentation, tracking, and open-vocabulary recognition of arbitrary objects.

Abstract [sv]

Research in computer vision has undergone rapid development. From an initial focus on recognition tasks such as classification, detection, and segmentation, the trend in visual analysis has gradually shifted towards learning spatiotemporal information. This thesis presents work focused on spatiotemporal learning, particularly for motion estimation and visual recognition.

First, we address the problem of object tracking in video. Previous tracking methods have largely relied on learning better appearance representations of the objects, while the spatiotemporal relationships of each individual object have been less explored. We propose using optical flow features to achieve better generalization in semi-supervised video object segmentation, and to use the optical flow features directly in the target representation. Our experiments and analyses show that a richer feature representation with spatiotemporal information improves both segmentation quality and recognition accuracy.

Next, we investigate spatiotemporal learning in 3D for motion estimation, namely scene flow estimation. Scene flow estimation is an important research area in 3D computer vision, with applications in robotics, autonomous driving, navigation, and tracking, among others. We approach the problem from several perspectives: 1. What is the best formulation for solving the problem, and how can we learn a better spatiotemporal feature representation? 2. Can we introduce uncertainty estimation into the task, which is crucial for safety-critical applications? 3. How can we scale the computation to large datasets, e.g., autonomous driving scenes, while exploiting temporal information without increasing the computational cost too much? To answer these questions, we investigate the use of transformers for better feature representation, diffusion models for uncertainty estimation, and more efficient feature learning methods for scaling to multi-frame autonomous driving scenes.

Finally, we go a step further and combine visual segmentation, tracking, and open-vocabulary recognition in LiDAR sequences, particularly for autonomous scenarios. In autonomous driving environments, it is essential to segment, track, and recognize every object. With today's human annotations, it is possible to track traffic participants such as cars and pedestrians reasonably well. However, we want to take this a step further: towards segmenting and tracking anything in LiDAR. To this end, we propose a pseudo-labeling engine that uses the 2D vision model SAM and the vision-language model CLIP to automatically label the LiDAR stream. We also propose the SAL-4D model for segmenting, tracking, and recognizing objects in a zero-shot setting.

In summary, we investigate the learning of spatiotemporal information in both the 2D image and 3D point cloud domains. In the image domain, we show that spatiotemporal information improves the quality of video object segmentation as well as the generalization ability. In the 3D point cloud domain, we show that spatiotemporal learning yields more accurate motion estimation and enables the first method for zero-shot segmentation, tracking, and open-vocabulary recognition of arbitrary objects.

Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2025, p. 67
Series
Linköping Studies in Science and Technology. Dissertations, ISSN 0345-7524 ; 2476
National Category
Computer graphics and computer vision
Identifiers
URN: urn:nbn:se:liu:diva-217740
DOI: 10.3384/9789181182323
ISBN: 9789181182316 (print)
ISBN: 9789181182323 (electronic)
OAI: oai:DiVA.org:liu-217740
DiVA, id: diva2:1997857
Public defence
2025-10-13, Ada Lovelace, B-building, Campus Valla, Linköping, 10:15 (English)
Opponent
Supervisors
Note

Funding: I would like to acknowledge the Wallenberg AI, Autonomous Systems and Software Program (WASP) for funding my PhD studies. I am also grateful for the computational resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS) at C3SE, partially funded by the Swedish Research Council (grant 2022-06725), as well as the Berzelius resource, supported by the Knut and Alice Wallenberg Foundation at the National Supercomputer Center.

Available from: 2025-09-15. Created: 2025-09-15. Last updated: 2025-09-15. Bibliographically approved
List of papers
1. Leveraging Optical Flow Features for Higher Generalization Power in Video Object Segmentation
2023 (English). In: 2023 IEEE International Conference on Image Processing: Proceedings, IEEE, 2023, p. 326-330. Conference paper, Published paper (Refereed)
Abstract [en]

We propose to leverage optical flow features for higher generalization power in semi-supervised video object segmentation. Optical flow is usually exploited as additional guidance information in many computer vision tasks. However, its role in video object segmentation has mainly been limited to unsupervised settings or to warping or refining previously predicted masks. In contrast, we propose to directly leverage the optical flow features in the target representation. We show that this enriched representation improves the encoder-decoder approach to the segmentation task. A model to extract the combined information from the optical flow and the image is proposed, which is then used as input to the target model and the decoder network. Unlike previous methods, e.g. in tracking, where concatenation is used to integrate information from image data and optical flow, a simple yet effective attention mechanism is exploited in our work. Experiments on DAVIS 2017 and YouTube-VOS 2019 show that integrating the information extracted from optical flow into the original image branch results in a strong performance gain, especially on unseen classes, which demonstrates its higher generalization power.
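
As a rough illustration of the attention-based fusion described above, the sketch below lets image features act as queries over optical flow features with standard multi-head cross-attention and a residual connection, producing a fused feature map that could feed a target model and decoder. It is a minimal PyTorch sketch under assumed tensor shapes; the module name FlowImageAttentionFusion and its hyperparameters are illustrative and do not reproduce the exact architecture of the paper.

    import torch
    import torch.nn as nn

    class FlowImageAttentionFusion(nn.Module):
        """Fuse image and optical-flow feature maps with cross-attention (illustrative)."""

        def __init__(self, dim=256, num_heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, img_feat, flow_feat):
            # img_feat, flow_feat: (B, C, H, W) feature maps of matching shape.
            B, C, H, W = img_feat.shape
            q = img_feat.flatten(2).transpose(1, 2)    # (B, HW, C) image queries
            kv = flow_feat.flatten(2).transpose(1, 2)  # (B, HW, C) flow keys/values
            fused, _ = self.attn(q, kv, kv)            # every location gathers motion cues
            fused = self.norm(q + fused)               # residual connection + normalization
            return fused.transpose(1, 2).reshape(B, C, H, W)

    # Toy usage with random feature maps.
    if __name__ == "__main__":
        fusion = FlowImageAttentionFusion(dim=256, num_heads=8)
        img = torch.randn(2, 256, 30, 52)
        flow = torch.randn(2, 256, 30, 52)
        print(fusion(img, flow).shape)  # torch.Size([2, 256, 30, 52])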

Place, publisher, year, edition, pages
IEEE, 2023
Keywords
Optical flow features; Attention mechanism; Semi-supervised VOS; Generalization power
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:liu:diva-199057 (URN); 10.1109/ICIP49359.2023.10222542 (DOI); 001106821000063 (); 9781728198354 (ISBN); 9781728198361 (ISBN)
Conference
2023 IEEE International Conference on Image Processing (ICIP), 8–11 October 2023, Kuala Lumpur, Malaysia
Available from: 2023-11-08 Created: 2023-11-08 Last updated: 2025-09-15
2. GMSF: Global Matching Scene Flow
2023 (English). In: Advances in Neural Information Processing Systems 36 (NeurIPS 2023), Neural Information Processing Systems (NIPS), 2023. Conference paper, Published paper (Refereed)
Abstract [en]

We tackle the task of scene flow estimation from point clouds. Given a source and a target point cloud, the objective is to estimate a translation from each point in the source point cloud to the target, resulting in a 3D motion vector field. Previously dominant scene flow estimation methods require complicated coarse-to-fine or recurrent architectures for multi-stage refinement. In contrast, we propose a significantly simpler single-scale one-shot global matching to address the problem. Our key finding is that reliable feature similarity between point pairs is essential and sufficient to estimate accurate scene flow. We thus propose to decompose the feature extraction step via a hybrid local-global-cross transformer architecture, which is crucial for accurate and robust feature representations. Extensive experiments show that the proposed Global Matching Scene Flow (GMSF) sets a new state of the art on multiple scene flow estimation benchmarks. On FlyingThings3D, in the presence of occluded points, GMSF reduces the outlier percentage from the previous best of 27.4% to 5.6%. On KITTI Scene Flow, without any fine-tuning, our proposed method shows state-of-the-art performance. On the Waymo-Open dataset, the proposed method outperforms previous methods by a large margin. The code is available at https://github.com/ZhangYushan3/GMSF.
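
The global matching idea itself can be sketched in a few lines: normalize per-point features, compute all pairwise similarities, turn them into matching weights with a softmax, and read off the flow as the soft correspondence minus the source position. The function below is a simplified PyTorch sketch under assumed input shapes; it omits the local-global-cross transformer backbone and occlusion handling, and the name global_matching_flow and the temperature value are illustrative rather than taken from the GMSF code.

    import torch

    def global_matching_flow(src_xyz, tgt_xyz, src_feat, tgt_feat, temperature=0.1):
        """Estimate scene flow by soft matching of per-point features.

        src_xyz, tgt_xyz:   (N, 3) and (M, 3) point coordinates.
        src_feat, tgt_feat: (N, C) and (M, C) per-point features from some backbone.
        Returns an (N, 3) flow field: soft correspondence minus source position.
        """
        src_feat = torch.nn.functional.normalize(src_feat, dim=-1)
        tgt_feat = torch.nn.functional.normalize(tgt_feat, dim=-1)
        sim = src_feat @ tgt_feat.t() / temperature   # (N, M) pairwise feature similarity
        attn = torch.softmax(sim, dim=-1)             # matching weights over target points
        matched_xyz = attn @ tgt_xyz                  # (N, 3) soft correspondence
        return matched_xyz - src_xyz                  # scene flow vectors

    # Toy usage with random data.
    if __name__ == "__main__":
        N, M, C = 1024, 1024, 64
        src_xyz, tgt_xyz = torch.rand(N, 3), torch.rand(M, 3)
        src_feat, tgt_feat = torch.randn(N, C), torch.randn(M, C)
        print(global_matching_flow(src_xyz, tgt_xyz, src_feat, tgt_feat).shape)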

Place, publisher, year, edition, pages
Neural Information Processing Systems (NIPS), 2023
Series
Advances in Neural Information Processing Systems, ISSN 1049-5258
National Category
Probability Theory and Statistics
Identifiers
urn:nbn:se:liu:diva-207659 (URN); 001224281502037 ()
Conference
37th Conference on Neural Information Processing Systems (NeurIPS), New Orleans, LA, December 10-16, 2023
Note

Funding Agencies: Wallenberg Artificial Intelligence, Autonomous Systems and Software Program (WASP) - Knut and Alice Wallenberg Foundation; Swedish Research Council [2022-06725]; strategic research environment ELLIIT - Swedish government

Available from: 2024-09-17 Created: 2024-09-17 Last updated: 2025-09-15
3. DiffSF: Diffusion Models for Scene Flow Estimation
2024 (English). In: Advances in Neural Information Processing Systems / [ed] A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak and C. Zhang, Neural Information Processing Systems, 2024, Vol. 37, p. 111227-111247. Conference paper, Published paper (Refereed)
Abstract [en]

Scene flow estimation is an essential ingredient for a variety of real-world applications, especially for autonomous agents, such as self-driving cars and robots. While recent scene flow estimation approaches achieve reasonable accuracy, their applicability to real-world systems additionally benefits from a reliability measure. Aiming at improving accuracy while additionally providing an estimate for uncertainty, we propose DiffSF that combines transformer-based scene flow estimation with denoising diffusion models. In the diffusion process, the ground truth scene flow vector field is gradually perturbed by adding Gaussian noise. In the reverse process, starting from randomly sampled Gaussian noise, the scene flow vector field prediction is recovered by conditioning on a source and a target point cloud. We show that the diffusion process greatly increases the robustness of predictions compared to prior approaches, resulting in state-of-the-art performance on standard scene flow estimation benchmarks. Moreover, by sampling multiple times with different initial states, the denoising process predicts multiple hypotheses, which enables measuring the output uncertainty, allowing our approach to detect a majority of the inaccurate predictions. The code is available at https://github.com/ZhangYushan3/DiffSF.
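
The uncertainty mechanism can be sketched independently of the network: draw several scene flow hypotheses by re-running the reverse diffusion process from different noise samples, then use the spread across hypotheses as a per-point uncertainty score. The helper below is a hedged Python sketch of that idea; sample_flow stands in for the actual conditional denoising network, and the hypothesis count, threshold, and toy sampler are purely illustrative.

    import torch

    def flow_with_uncertainty(sample_flow, src_xyz, tgt_xyz, num_hypotheses=8,
                              uncertainty_threshold=0.05):
        """Multi-hypothesis uncertainty estimation via repeated sampling.

        sample_flow: callable that draws one scene-flow hypothesis, i.e. runs the
        reverse diffusion from fresh Gaussian noise conditioned on the two point
        clouds (the actual DiffSF network is not reproduced here).
        Returns the mean flow, per-point uncertainty, and a mask of unreliable points.
        """
        hypotheses = torch.stack(
            [sample_flow(src_xyz, tgt_xyz) for _ in range(num_hypotheses)])  # (K, N, 3)
        mean_flow = hypotheses.mean(dim=0)                                   # (N, 3)
        uncertainty = hypotheses.std(dim=0).norm(dim=-1)                     # (N,)
        unreliable = uncertainty > uncertainty_threshold
        return mean_flow, uncertainty, unreliable

    # Toy usage with a stand-in sampler that adds noise to a fixed offset.
    if __name__ == "__main__":
        def toy_sampler(src, tgt):
            return torch.tensor([0.1, 0.0, 0.0]) + 0.02 * torch.randn_like(src)

        src = torch.rand(2048, 3)
        tgt = src + 0.1
        flow, unc, bad = flow_with_uncertainty(toy_sampler, src, tgt)
        print(flow.shape, unc.shape, bad.float().mean().item())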

Place, publisher, year, edition, pages
Neural Information Processing Systems, 2024
National Category
Fluid Mechanics
Identifiers
urn:nbn:se:liu:diva-216959 (URN); 9798331314385 (ISBN)
Conference
38th Conference on Neural Information Processing Systems (NeurIPS 2024)
Available from: 2025-08-26 Created: 2025-08-26 Last updated: 2025-09-15

Authority records

Zhang, Yushan
