Towards the Next Generation of 3D Reconstruction
Edstedt, Johan. Linköping University, Department of Electrical Engineering, Computer Vision; Linköping University, Faculty of Science & Engineering. ORCID iD: 0000-0002-1019-8634
2025 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Humans perceive their visual surroundings through the projection of light rays through the pupils and onto the retina. Aided by motion, they gain an understanding of their environment, as well as of their location within it. The goal of image-based 3D reconstruction is to imbue machines with similar capabilities. The most prominent paradigm for image-based 3D reconstruction is Structure-from-Motion (SfM). Traditionally, SfM has been approached through handcrafted algorithms, which are brittle when their assumptions do not hold. Humans, on the other hand, understand their environment intuitively and show remarkable robustness in their ability to localize themselves in, and map, the world.

The main purpose of this thesis is the development of a set of methods that strive toward the next generation of SfM, imbued with intelligence and robustness. In particular, we propose methods operating in 2D (learned keypoint detectors, features, and dense feature matching) and in 3D (threshold-robust relative pose estimation and registration of SfM maps).

First, we develop models to detect keypoints, producing a set of 2D image coordinates, and models to describe the image, producing features. One of our key contributions is decoupling these tasks, which have typically been learned jointly, into distinct objectives, resulting in major gains in performance, as well as increased modularity. Paper A introduces this decoupled framework, and Paper B further develops the keypoint objective. In Paper C we revisit the keypoint objective from an entirely self-supervised reinforcement learning perspective, yielding several insights, and further gains in performance. 

We further develop methods for dense feature matching, i.e., matching every pixel between two images. In Paper D we propose the first dense feature matcher capable of outperforming sparse matching for relative pose estimation. This is significant, as previous work had generally indicated that the sparse or semi-dense paradigm was preferable. In Paper E we greatly improve on almost all components of the method of Paper D, resulting in an extremely robust dense matcher, capable of matching almost any pair of images. 

We lift our eyes from the 2D image plane into 3D, and investigate relative pose estimation and 3D registration of SfM maps. Relative pose estimation is a difficult task, as non-robust estimation fails in the presence of outliers. Random Sample Consensus (RANSAC), the gold-standard robust estimation method, requires an outlier threshold, which is non-trivial to set by hand, and poor choices result in significantly worse performance. In Paper F, we develop an algorithm that, starting from an initial guess, automatically estimates this threshold with less bias than previous approaches, leading to robust performance.
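
To make the role of the threshold concrete, the sketch below shows a generic RANSAC loop for robust 2D line fitting, where the `threshold` argument alone decides which residuals count as inliers. This is an illustrative toy example with made-up data, not any of the estimators developed in the thesis.

```python
import numpy as np

def ransac_line(points, threshold, iters=1000, seed=0):
    """Fit a 2D line with RANSAC; `threshold` is the maximum point-to-line
    distance for a point to be counted as an inlier -- the quantity that is
    hard to tune by hand."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(points), dtype=bool)
    for _ in range(iters):
        i, j = rng.choice(len(points), size=2, replace=False)
        p1, p2 = points[i], points[j]
        direction = p2 - p1
        normal = np.array([-direction[1], direction[0]])
        norm = np.linalg.norm(normal)
        if norm < 1e-12:
            continue  # degenerate sample, skip
        normal = normal / norm
        # Unsigned distance of every point to the line through p1 and p2.
        residuals = np.abs((points - p1) @ normal)
        inliers = residuals < threshold
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return best_inliers

# Inliers on y = 2x plus uniform outliers: a too-small threshold rejects
# genuine inliers, a too-large one admits outliers into the consensus set.
x = np.linspace(0.0, 10.0, 50)
line_pts = np.column_stack([x, 2.0 * x + 0.05 * np.random.randn(50)])
outliers = np.random.uniform(0.0, 20.0, size=(20, 2))
pts = np.vstack([line_pts, outliers])
print(ransac_line(pts, threshold=0.2).sum())
```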

Finally, we investigate registering SfM maps together. This is particularly interesting in distributed settings where, e.g., robots need to localize with respect to each other's reference frames in order to collaborate. In this setting, however, image-based localization approaches come with downsides: computational complexity, compatibility issues, and privacy concerns severely limit the potential for such systems to be deployed. In Paper G we propose a new paradigm for registering SfM maps through point cloud registration, circumventing the above limitations. Finding that existing registration models trained on 3D scan data fail on this task, we develop a dataset for SfM registration. Training on the proposed dataset greatly improves performance on the task, showing the potential of the proposed paradigm.

Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2025, p. 121
Series
Linköping Studies in Science and Technology. Dissertations, ISSN 0345-7524 ; 2464
National Category
Computer graphics and computer vision
Identifiers
URN: urn:nbn:se:liu:diva-217639; DOI: 10.3384/9789181181906; ISBN: 9789181181890 (print); ISBN: 9789181181906 (electronic); OAI: oai:DiVA.org:liu-217639; DiVA id: diva2:1997111
Public defence
2025-10-08, Zero, Hus Zenit, Campus Valla, Linköping, 09:15 (English)
Funder
ELLIIT - The Linköping-Lund Initiative on IT and Mobile Communications; Wallenberg AI, Autonomous Systems and Software Program (WASP)
Available from: 2025-09-11 Created: 2025-09-11 Last updated: 2025-09-19
List of papers
1. DeDoDe: Detect, Don’t Describe — Describe, Don’t Detect for Local Feature Matching
2024 (English). In: 2024 International Conference on 3D Vision (3DV), Institute of Electrical and Electronics Engineers (IEEE), 2024. Conference paper, Published paper (Refereed)
Abstract [en]

Keypoint detection is a pivotal step in 3D reconstruction, whereby sets of (up to) K points are detected in each view of a scene. Crucially, the detected points need to be consistent between views, i.e., correspond to the same 3D point in the scene. One of the main challenges with keypoint detection is the formulation of the learning objective. Previous learning-based methods typically jointly learn descriptors with keypoints, and treat the keypoint detection as a binary classification task on mutual nearest neighbours. However, basing keypoint detection on descriptor nearest neighbours is a proxy task, which is not guaranteed to produce 3D-consistent keypoints. Furthermore, this ties the keypoints to a specific descriptor, complicating downstream usage. In this work, we instead learn keypoints directly from 3D consistency. To this end, we train the detector to detect tracks from large-scale SfM. As these points are often overly sparse, we derive a semi-supervised two-view detection objective to expand this set to a desired number of detections. To train a descriptor, we maximize the mutual nearest neighbour objective over the keypoints with a separate network. Results show that our approach, DeDoDe, achieves significant gains on multiple geometry benchmarks. Code is provided at https://github.com/Parskatt/DeDoDe.
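
As a rough illustration of the mutual nearest neighbour criterion mentioned above, the sketch below matches two sets of L2-normalised descriptors and keeps only cycle-consistent pairs. Array shapes and names are assumptions made for the example; this is not the DeDoDe training objective itself.

```python
import numpy as np

def mutual_nearest_neighbours(desc_a, desc_b):
    """Match two descriptor sets (N_A x D and N_B x D, L2-normalised rows).

    Returns index pairs (i, j) such that j is the nearest neighbour of i in B
    and i is the nearest neighbour of j in A (cycle consistency)."""
    sim = desc_a @ desc_b.T                 # cosine similarity matrix
    nn_a_to_b = sim.argmax(axis=1)          # best match in B for each row of A
    nn_b_to_a = sim.argmax(axis=0)          # best match in A for each row of B
    idx_a = np.arange(len(desc_a))
    mutual = nn_b_to_a[nn_a_to_b] == idx_a  # keep only mutual matches
    return np.column_stack([idx_a[mutual], nn_a_to_b[mutual]])

# Toy example with random unit-norm descriptors.
rng = np.random.default_rng(0)
da = rng.normal(size=(100, 128)); da /= np.linalg.norm(da, axis=1, keepdims=True)
db = rng.normal(size=(120, 128)); db /= np.linalg.norm(db, axis=1, keepdims=True)
print(mutual_nearest_neighbours(da, db).shape)  # (num_matches, 2)
```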

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
Series
2024 International Conference on 3D Vision (3DV), ISSN 2378-3826, E-ISSN 2475-7888
National Category
Computer graphics and computer vision
Identifiers
urn:nbn:se:liu:diva-204892 (URN); 10.1109/3dv62453.2024.00035 (DOI); 001250581700028; 9798350362459 (ISBN); 9798350362466 (ISBN)
Conference
International Conference on 3D Imaging, Modeling, Processing, Visualization and Transmission (3DIMPVT), Davos, Switzerland, 18-21 March, 2024.
Note

Funding agencies: Wallenberg Artificial Intelligence, Autonomous Systems and Software Program (WASP) - Knut and Alice Wallenberg Foundation; strategic research environment ELLIIT - Swedish government; Swedish Research Council [2022-06725]; Knut and Alice Wallenberg Foundation at the National Supercomputer Centre

Available from: 2024-06-17 Created: 2024-06-17 Last updated: 2025-09-11
2. DeDoDe v2: Analyzing and Improving the DeDoDe Keypoint Detector
2024 (English). In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), IEEE Computer Society, 2024, p. 4245-4253. Conference paper, Published paper (Refereed)
Abstract [en]

In this paper, we analyze and improve the recently proposed DeDoDe keypoint detector. We focus our analysis on some key issues. First, we find that DeDoDe keypoints tend to cluster together, which we fix by performing non-max suppression on the target distribution of the detector during training. Second, we address issues related to data augmentation. In particular, the DeDoDe detector is sensitive to large rotations. We fix this by including 90-degree rotations as well as horizontal flips. Finally, the decoupled nature of the DeDoDe detector makes evaluation of downstream usefulness problematic. We fix this by matching the keypoints with a pretrained dense matcher (RoMa) and evaluating two-view pose estimates. We find that the original long training is detrimental to performance, and therefore propose a much shorter training schedule. We integrate all these improvements into our proposed detector DeDoDe v2 and evaluate it with the original DeDoDe descriptor on the MegaDepth-1500 and IMC2022 benchmarks. Our proposed detector significantly increases pose estimation results, notably from 75.9 to 78.3 mAA on the IMC2022 challenge. Code and weights are available at github.com/Parskatt/DeDoDe.
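
Non-max suppression on a dense score map can be illustrated with a maximum-filter comparison followed by top-K selection. The sketch below is a generic formulation (assumed H x W score map, scipy-based), not the DeDoDe v2 training code.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def nms_scoremap(scores, window=5):
    """Keep only local maxima of a dense keypoint score map (H x W).

    A pixel survives if it equals the maximum of its window x window
    neighbourhood; all other scores are zeroed out."""
    local_max = maximum_filter(scores, size=window, mode="constant")
    return np.where(scores == local_max, scores, 0.0)

def top_k_keypoints(scores, k=500, window=5):
    """Return (row, col) coordinates of the k highest surviving scores."""
    suppressed = nms_scoremap(scores, window)
    flat = suppressed.ravel()
    idx = np.argpartition(flat, -k)[-k:]
    idx = idx[np.argsort(flat[idx])[::-1]]  # sort selected scores descending
    return np.column_stack(np.unravel_index(idx, scores.shape))

heat = np.random.rand(480, 640)
print(top_k_keypoints(heat, k=100).shape)  # (100, 2)
```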

Place, publisher, year, edition, pages
IEEE Computer Society, 2024
Series
IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, ISSN 2160-7508, E-ISSN 2160-7516
National Category
Computer graphics and computer vision
Identifiers
urn:nbn:se:liu:diva-212421 (URN); 10.1109/CVPRW63382.2024.00428 (DOI); 001327781704041; 2-s2.0-85198087533 (Scopus ID); 9798350365474 (ISBN); 9798350365481 (ISBN)
Conference
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, June 16-22, 2024
Note

Funding agencies: Wallenberg Artificial Intelligence, Autonomous Systems and Software Program (WASP) - Knut and Alice Wallenberg Foundation; strategic research environment ELLIIT - Swedish government; Swedish Research Council [2022-06725]; Knut and Alice Wallenberg Foundation at the National Supercomputer Centre

Available from: 2025-03-19 Created: 2025-03-19 Last updated: 2025-09-11
3. DaD: Distilled Reinforcement Learning for Diverse Keypoint Detection
(English). Manuscript (preprint) (Other academic)
Abstract [en]

Keypoints are what enable Structure-from-Motion (SfM) systems to scale to thousands of images. However, designing a keypoint detection objective is a non-trivial task, as SfM is non-differentiable. Typically, an auxiliary objective involving a descriptor is optimized. This however induces a dependency on the descriptor, which is undesirable. In this paper we propose a fully self-supervised and descriptor-free objective for keypoint detection, through reinforcement learning. To ensure training does not degenerate, we leverage a balanced top-K sampling strategy. While this already produces competitive models, we find that two qualitatively different types of detectors emerge, which are only able to detect light and dark keypoints respectively. To remedy this, we train a third detector, DaD, that optimizes the Kullback-Leibler divergence of the pointwise maximum of both light and dark detectors. Our approach significantly improves upon the SotA across a range of benchmarks. Code and model weights are publicly available at https://github.com/parskatt/dad.
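
A minimal sketch of the kind of distillation target described above, under the assumption that detector outputs are non-negative score maps normalised into distributions over pixels: the teacher is taken as the renormalised pointwise maximum of the light and dark detectors, and a Kullback-Leibler divergence to the student is computed. This is an illustrative reading of the abstract, not the released DaD objective.

```python
import numpy as np

def normalise(p, eps=1e-12):
    """Turn a non-negative score map into a distribution over pixels."""
    p = np.clip(p, eps, None)
    return p / p.sum()

def distillation_kl(student_scores, light_scores, dark_scores):
    """KL(teacher || student), with the teacher taken as the renormalised
    pointwise maximum of the light- and dark-detector distributions."""
    teacher = normalise(np.maximum(normalise(light_scores), normalise(dark_scores)))
    student = normalise(student_scores)
    return float(np.sum(teacher * (np.log(teacher) - np.log(student))))

light = np.random.rand(60, 80)
dark = np.random.rand(60, 80)
student = np.random.rand(60, 80)
print(distillation_kl(student, light, dark))
```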

National Category
Computer graphics and computer vision
Identifiers
urn:nbn:se:liu:diva-217642 (URN); 10.48550/arXiv.2503.07347 (DOI)
Available from: 2025-09-11 Created: 2025-09-11 Last updated: 2025-09-11
4. DKM: Dense Kernelized Feature Matching for Geometry Estimation
2023 (English). In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Communications Society, 2023, p. 17765-17775. Conference paper, Published paper (Refereed)
Abstract [en]

Feature matching is a challenging computer vision task that involves finding correspondences between two images of a 3D scene. In this paper we consider the dense approach instead of the more common sparse paradigm, thus striving to find all correspondences. Perhaps counter-intuitively, dense methods have previously shown inferior performance to their sparse and semi-sparse counterparts for estimation of two-view geometry. This changes with our novel dense method, which outperforms both dense and sparse methods on geometry estimation. The novelty is threefold: First, we propose a kernel regression global matcher. Secondly, we propose warp refinement through stacked feature maps and depthwise convolution kernels. Thirdly, we propose learning dense confidence through consistent depth and a balanced sampling approach for dense confidence maps. Through extensive experiments we confirm that our proposed dense method, Dense Kernelized Feature Matching, sets a new state-of-the-art on multiple geometry estimation benchmarks. In particular, we achieve an improvement on MegaDepth-1500 of +4.9 and +8.9 AUC@5° compared to the best previous sparse method and dense method respectively. Our code is provided at the following repository: https://github.com/Parskatt/DKM.
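
One piece that is easy to illustrate is how a dense warp plus confidence map can be turned into a sparse correspondence set for downstream geometry estimation: sample pixel positions with probability proportional to confidence and read off their matched coordinates. The sketch below is a simplified stand-in with assumed shapes and plain proportional sampling, not DKM's balanced sampling scheme.

```python
import numpy as np

def sample_correspondences(warp, confidence, num=5000, seed=0):
    """Sample sparse matches from a dense warp.

    warp:       H x W x 2 array of matched pixel coordinates in image B,
                one entry per pixel of image A.
    confidence: H x W array in [0, 1], the matcher's certainty per pixel.
    Returns `num` correspondences (x_a, y_a, x_b, y_b), sampled with
    probability proportional to confidence so unreliable regions are avoided."""
    rng = np.random.default_rng(seed)
    h, w = confidence.shape
    probs = confidence.ravel()
    probs = probs / probs.sum()
    idx = rng.choice(h * w, size=num, replace=True, p=probs)
    ys, xs = np.unravel_index(idx, (h, w))
    target = warp[ys, xs]  # matched coordinates in image B
    return np.column_stack([xs, ys, target[:, 0], target[:, 1]])

warp = np.random.rand(120, 160, 2) * [160, 120]
conf = np.random.rand(120, 160)
print(sample_correspondences(warp, conf, num=1000).shape)  # (1000, 4)
```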

Place, publisher, year, edition, pages
IEEE Communications Society, 2023
Series
Proceedings: IEEE Conference on Computer Vision and Pattern Recognition, ISSN 1063-6919, E-ISSN 2575-7075
National Category
Computer graphics and computer vision
Identifiers
urn:nbn:se:liu:diva-197717 (URN); 10.1109/cvpr52729.2023.01704 (DOI); 001062531302008; 9798350301298 (ISBN); 9798350301304 (ISBN)
Conference
2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17-24 June 2023
Note

This work was supported by the Wallenberg Artificial Intelligence, Autonomous Systems and Software Program (WASP), funded by the Knut and Alice Wallenberg Foundation, and by the strategic research environment ELLIIT, funded by the Swedish government. The computational resources were provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS), partially funded by the Swedish Research Council through grant agreement no. 2022-06725, and by the Berzelius resource, provided by the Knut and Alice Wallenberg Foundation at the National Supercomputer Centre.

Available from: 2023-09-11 Created: 2023-09-11 Last updated: 2025-09-11. Bibliographically approved.
5. RoMa: Robust Dense Feature Matching
2024 (English). In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Institute of Electrical and Electronics Engineers (IEEE), 2024, p. 19790-19800. Conference paper, Published paper (Refereed)
Abstract [en]

Feature matching is an important computer vision task that involves estimating correspondences between two images of a 3D scene, and dense methods estimate all such correspondences. The aim is to learn a robust model, i.e., a model able to match under challenging real-world changes. In this work, we propose such a model, leveraging frozen pretrained features from the foundation model DINOv2. Although these features are significantly more robust than local features trained from scratch, they are inherently coarse. We therefore combine them with specialized ConvNet fine features, creating a precisely localizable feature pyramid. To further improve robustness, we propose a tailored transformer match decoder that predicts anchor probabilities, which enables it to express multimodality. Finally, we propose an improved loss formulation through regression-by-classification with subsequent robust regression. We conduct a comprehensive set of experiments that show that our method, RoMa, achieves significant gains, setting a new state-of-the-art. In particular, we achieve a 36% improvement on the extremely challenging WxBS benchmark. Code is provided at github.com/Parskatt/RoMa.
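
The regression-by-classification idea can be sketched in isolation: predict a distribution over a fixed grid of anchor coordinates and recover a point estimate as the expectation over anchors. The example below uses an assumed 8x8 anchor grid and is only a generic illustration, not the RoMa match decoder.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def coordinate_from_classification(logits, anchors):
    """Turn per-bin logits over anchor coordinates into a point estimate.

    logits:  ... x K array of unnormalised scores over K anchor positions.
    anchors: K x 2 array of anchor (x, y) coordinates.
    The full distribution over anchors can express multimodality, while the
    expectation still yields a single coordinate estimate."""
    probs = softmax(logits)
    return probs @ anchors  # ... x 2 expected coordinate

# 8x8 grid of anchors over the unit square, one prediction.
gx, gy = np.meshgrid(np.linspace(0, 1, 8), np.linspace(0, 1, 8))
anchors = np.column_stack([gx.ravel(), gy.ravel()])
logits = np.random.randn(64)
print(coordinate_from_classification(logits, anchors))  # a point in [0, 1]^2
```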

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
Series
Conference on Computer Vision and Pattern Recognition (CVPR), ISSN 1063-6919, E-ISSN 2575-7075
National Category
Computer graphics and computer vision
Identifiers
urn:nbn:se:liu:diva-207702 (URN); 10.1109/CVPR52733.2024.01871 (DOI); 001342515503014; 2-s2.0-85199525100 (Scopus ID); 9798350353006 (ISBN); 9798350353013 (ISBN)
Conference
2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Seattle, WA, USA, 16-22 June 2024.
Note

Funding agencies: Wallenberg Artificial Intelligence, Autonomous Systems and Software Program (WASP) - Knut and Alice Wallenberg Foundation; strategic research environment ELLIIT - Swedish government; Swedish Research Council [2022-06725]; Knut and Alice Wallenberg Foundation at the National Supercomputer Centre

Available from: 2024-09-17 Created: 2024-09-17 Last updated: 2025-09-11
6. Less Biased Noise Scale Estimation for Threshold-Robust RANSAC
(English). Manuscript (preprint) (Other academic)
Abstract [en]

The gold standard for robustly estimating relative pose through image matching is RANSAC. While RANSAC is powerful, it requires setting the inlier threshold that determines whether the error of a correspondence under an estimated model is sufficiently small for it to be included in the consensus set. This threshold is typically set by hand and is difficult to tune without access to ground truth data. Thus, a method capable of automatically determining the optimal threshold would be desirable. In this paper we revisit inlier noise scale estimation, which is an attractive approach as the optimal threshold is linear in the inlier noise scale. We revisit the noise scale estimation method SIMFIT and find bias in its estimate of the noise scale. In particular, we fix underestimates caused by using the same data for fitting the model as for estimating the inlier noise, and by not taking the threshold itself into account. Secondly, since the optimal threshold within a scene is approximately constant, we propose a multi-pair extension of SIMFIT++ that filters estimates across pairs, which further improves results. Our approach yields robust performance across a range of thresholds, as shown in Figure 1. Code is available at https://github.com/Parskatt/simfitpp.
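
Because the optimal threshold is linear in the inlier noise scale, a naive estimator computes the scale from residuals that fall under the current threshold, e.g. via a median-based rule, and then maps it back to a threshold. The sketch below illustrates exactly this naive scheme, including the truncation bias the paper corrects for; it is an assumption-laden toy, not SIMFIT++.

```python
import numpy as np

def estimate_noise_scale(residuals, threshold):
    """Estimate the inlier noise scale from residuals below the current threshold.

    Uses the median of the tentative inliers' (absolute) residuals; for a 1-D
    Gaussian residual, sigma ~= median / 0.6745. A careful estimator must also
    correct for truncation at `threshold` and avoid reusing the data the model
    was fitted on -- exactly the biases the paper addresses."""
    inlier_res = residuals[residuals < threshold]
    if len(inlier_res) == 0:
        return threshold
    return np.median(inlier_res) / 0.6745

def updated_threshold(residuals, threshold, confidence_factor=2.5):
    """Map the estimated noise scale back to a threshold, e.g. k * sigma."""
    return confidence_factor * estimate_noise_scale(residuals, threshold)

# Absolute residuals: Gaussian inliers (sigma = 0.5) plus uniform outliers.
res = np.abs(np.concatenate([0.5 * np.random.randn(500), 20.0 * np.random.rand(100)]))
print(updated_threshold(res, threshold=3.0))
```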

National Category
Computer graphics and computer vision
Identifiers
urn:nbn:se:liu:diva-217640 (URN); 10.48550/arXiv.2503.13433 (DOI)
Available from: 2025-09-11 Created: 2025-09-11 Last updated: 2025-09-11
7. ColabSfM: Collaborative Structure-from-Motion by Point Cloud Registration
2025 (English). In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, p. 6573-6583. Conference paper, Published paper (Refereed)
Abstract [en]

Structure-from-Motion (SfM) is the task of estimating 3D structure and camera poses from images. We define Collaborative SfM (ColabSfM) as sharing distributed SfM reconstructions. Sharing maps requires estimating a joint reference frame, which is typically referred to as registration. However, there is a lack of scalable methods and training datasets for registering SfM reconstructions. In this paper, we tackle this challenge by proposing the scalable task of point cloud registration for SfM reconstructions. We find that current registration methods cannot register SfM point clouds when trained on existing datasets. To this end, we propose an SfM registration dataset generation pipeline, leveraging partial reconstructions from synthetically generated camera trajectories for each scene. Finally, we propose RefineRoITr, a simple but impactful neural refiner on top of the SotA registration method RoITr, which yields significant improvements. Our extensive experimental evaluation shows that the proposed pipeline and model enable ColabSfM. Code is available at https://github.com/EricssonResearch/ColabSfM.
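
Once correspondences between two SfM point clouds have been established, aligning their reference frames reduces to estimating a similarity transform; the closed-form Umeyama/Kabsch solution below illustrates that final alignment step under the assumption of known correspondences. It is a textbook sketch, not the RefineRoITr model.

```python
import numpy as np

def umeyama_alignment(src, dst, with_scale=True):
    """Closed-form similarity transform aligning src -> dst (both N x 3).

    Returns (scale, R, t) minimising || s * R @ src_i + t - dst_i ||^2,
    assuming the rows of src and dst are corresponding 3D points."""
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    cov = dst_c.T @ src_c / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0  # avoid reflections
    R = U @ S @ Vt
    scale = np.trace(np.diag(D) @ S) / src_c.var(axis=0).sum() if with_scale else 1.0
    t = mu_dst - scale * R @ mu_src
    return scale, R, t

# Sanity check: recover a known similarity transform.
rng = np.random.default_rng(0)
pts = rng.normal(size=(200, 3))
R_true, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(R_true) < 0:
    R_true[:, 0] *= -1  # make it a proper rotation
transformed = 1.7 * pts @ R_true.T + np.array([0.3, -1.0, 2.0])
s, R, t = umeyama_alignment(pts, transformed)
print(round(s, 3))  # ~1.7
```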

National Category
Computer graphics and computer vision
Identifiers
urn:nbn:se:liu:diva-217674 (URN)
Conference
CVPR 2025, Nashville, Tennessee, USA
Available from: 2025-09-12 Created: 2025-09-12 Last updated: 2025-09-12

Open Access in DiVA

Fulltext: FULLTEXT01.pdf (application/pdf, 163738 kB)
