DeDoDe: Detect, Don’t Describe — Describe, Don’t Detect for Local Feature Matching
Linköping University, Department of Electrical Engineering, Computer Vision. Linköping University, Faculty of Science & Engineering. ORCID iD: 0000-0002-1019-8634
Chalmers University of Technology.
Linköping University, Department of Electrical Engineering, Computer Vision. Linköping University, Faculty of Science & Engineering. ORCID iD: 0000-0002-0675-2794
Linköping University, Department of Electrical Engineering, Computer Vision. Linköping University, Faculty of Science & Engineering. ORCID iD: 0000-0002-6096-3648
2024 (English). In: 2024 International Conference on 3D Vision (3DV), Institute of Electrical and Electronics Engineers (IEEE), 2024. Conference paper, Published paper (Refereed)
Abstract [en]

Keypoint detection is a pivotal step in 3D reconstruction, whereby sets of (up to) K points are detected in each view of a scene. Crucially, the detected points need to be consistent between views, i.e., correspond to the same 3D point in the scene. One of the main challenges with keypoint detection is the formulation of the learning objective. Previous learning-based methods typically jointly learn descriptors with keypoints, and treat the keypoint detection as a binary classification task on mutual nearest neighbours. However, basing keypoint detection on descriptor nearest neighbours is a proxy task, which is not guaranteed to produce 3D-consistent keypoints. Furthermore, this ties the keypoints to a specific descriptor, complicating downstream usage. In this work, we instead learn keypoints directly from 3D consistency. To this end, we train the detector to detect tracks from large-scale SfM. As these points are often overly sparse, we derive a semi-supervised two-view detection objective to expand this set to a desired number of detections. To train a descriptor, we maximize the mutual nearest neighbour objective over the keypoints with a separate network. Results show that our approach, DeDoDe, achieves significant gains on multiple geometry benchmarks. Code is provided at http://github.com/Parskatt/DeDoDe.
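The mutual nearest neighbour criterion mentioned in the abstract (used as a detection proxy by previous methods, and as the descriptor training objective here) can be sketched as follows. This is a generic NumPy illustration of the matching rule, not the authors' implementation; the function name is ours:

```python
import numpy as np

def mutual_nearest_neighbours(desc_a, desc_b):
    """Return index pairs (i, j) such that desc_a[i] and desc_b[j]
    are each other's nearest neighbour in descriptor space."""
    # Pairwise squared Euclidean distances between the two descriptor sets.
    d2 = ((desc_a[:, None, :] - desc_b[None, :, :]) ** 2).sum(-1)
    nn_ab = d2.argmin(axis=1)  # for each A-descriptor, closest B index
    nn_ba = d2.argmin(axis=0)  # for each B-descriptor, closest A index
    # Keep only pairs on which both directions agree.
    idx_a = np.arange(len(desc_a))
    mutual = nn_ba[nn_ab] == idx_a
    return np.stack([idx_a[mutual], nn_ab[mutual]], axis=1)
```

Treating such matches as positive labels for a detector is exactly the proxy the paper argues against, since mutual agreement in descriptor space need not imply 3D consistency.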

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024.
Series
2024 International Conference on 3D Vision (3DV), ISSN 2378-3826, E-ISSN 2475-7888
National Category
Computer graphics and computer vision
Identifiers
URN: urn:nbn:se:liu:diva-204892
DOI: 10.1109/3dv62453.2024.00035
ISI: 001250581700028
ISBN: 9798350362459 (electronic)
ISBN: 9798350362466 (print)
OAI: oai:DiVA.org:liu-204892
DiVA, id: diva2:1871265
Conference
International Conference on 3D Imaging, Modeling, Processing, Visualization and Transmission (3DIMPVT), Davos, Switzerland, 18-21 March, 2024.
Note

Funding Agencies: Wallenberg Artificial Intelligence, Autonomous Systems and Software Program (WASP) - Knut and Alice Wallenberg Foundation; strategic research environment ELLIIT - Swedish government; Swedish Research Council [2022-06725]; Knut and Alice Wallenberg Foundation at the National Supercomputer Centre

Available from: 2024-06-17 Created: 2024-06-17 Last updated: 2025-09-11
In thesis
1. Towards the Next Generation of 3D Reconstruction
2025 (English)Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Humans perceive our visual surroundings through the projection of light rays through our pupils and onto the retina. Aided by motion, we gain an understanding of our environment, as well as our location within it. The goal of image-based 3D reconstruction is to imbue machines with similar capabilities. The most prominent paradigm for image-based 3D reconstruction is called Structure-from-Motion (SfM). Traditionally, SfM has been approached through handcrafted algorithms, which are brittle when their assumptions do not hold. Humans, on the other hand, understand their environment intuitively and show remarkable robustness in their ability to localize themselves in, and map, the world.

The main purpose of this thesis is the development of a set of methods which strive toward the next generation of SfM, imbued with intelligence and robustness. In particular, we propose methods operating in 2D (learning of keypoint detectors, features, and dense feature matching) and in 3D (threshold-robust relative pose estimation, and registration of SfM maps).

First, we develop models to detect keypoints, producing a set of 2D image coordinates, and models to describe the image, producing features. One of our key contributions is decoupling these tasks, which have typically been learned jointly, into distinct objectives, resulting in major gains in performance, as well as increased modularity. Paper A introduces this decoupled framework, and Paper B further develops the keypoint objective. In Paper C we revisit the keypoint objective from an entirely self-supervised reinforcement learning perspective, yielding several insights, and further gains in performance. 

We further develop methods for dense feature matching, i.e., matching every pixel between two images. In Paper D we propose the first dense feature matcher capable of outperforming sparse matching for relative pose estimation. This is significant, as previous work had generally indicated that the sparse or semi-dense paradigm was preferable. In Paper E we greatly improve on almost all components of the method of Paper D, resulting in an extremely robust dense matcher, capable of matching almost any pair of images. 

We lift our eyes from the 2D image plane into 3D, and investigate relative pose estimation and 3D registration of SfM maps. Relative pose estimation is a difficult task, as non-robust estimation fails in the presence of outliers. Random Sample Consensus (RANSAC), which is the gold-standard robust estimation method, requires setting an outlier threshold, which is non-trivial to set, and poor choices result in significantly worse performance. In Paper F, we develop an algorithm to automatically estimate this threshold from an initial guess that is less biased than previous approaches, leading to robust performance.
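To make the role of the outlier threshold concrete, here is a toy RANSAC for robust 2D line fitting. The function name and the line model are illustrative only; Paper F concerns estimating such a threshold automatically for relative pose estimation rather than fixing it by hand:

```python
import random

def ransac_line(points, threshold, iters=200, seed=0):
    """Toy RANSAC: robustly fit y = a*x + b to 2D points.
    `threshold` is the inlier residual bound whose choice is critical:
    too small rejects true inliers, too large admits outliers."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iters):
        # Minimal sample: two points define a candidate line.
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # vertical line, not representable as y = a*x + b
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        # Score the candidate by counting points within the threshold.
        inliers = [(x, y) for x, y in points
                   if abs(y - (a * x + b)) < threshold]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (a, b), inliers
    return best_model, best_inliers
```

Re-running this with a threshold much larger than the noise level lets outliers vote for wrong models, while a threshold near zero leaves almost no consensus set, which is precisely why automating the choice matters.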

Finally, we investigate registering SfM maps together. This is particularly interesting in distributed settings where, e.g., robots need to localize with respect to each other’s reference frames in order to collaborate. However, in this setting, using image-based localization approaches comes with downsides. In particular, computational complexity, compatibility issues, and privacy concerns severely limit the potential of such systems to be deployed. In Paper G we propose a new paradigm for registering SfM maps through point cloud registration, circumventing the above limitations. Finding that existing registration models trained on 3D scan data fail on this task, we develop a dataset for SfM registration. Training on our proposed dataset greatly improves performance on the task, showing the potential of the proposed paradigm.
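The core geometric step behind point cloud registration, as referenced above, can be illustrated with the classical Kabsch (orthogonal Procrustes) alignment. This is only a sketch of the rigid case: real SfM maps additionally differ in scale and contain outliers, which this closed-form step does not handle, and it is not the learned method of Paper G:

```python
import numpy as np

def kabsch(src, dst):
    """Least-squares rigid alignment (rotation R, translation t)
    mapping src onto dst, i.e. dst ~= src @ R.T + t."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    # Cross-covariance of the centred point sets.
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    # Sign correction so R is a proper rotation (det = +1), not a reflection.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_d - R @ mu_s
    return R, t
```

In practice this estimator is wrapped in a robust loop (e.g. RANSAC over putative 3D correspondences), since a single gross outlier corrupts the least-squares solution.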

Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2025. p. 121
Series
Linköping Studies in Science and Technology. Dissertations, ISSN 0345-7524 ; 2464
National Category
Computer graphics and computer vision
Identifiers
urn:nbn:se:liu:diva-217639 (URN)
10.3384/9789181181906 (DOI)
9789181181890 (ISBN)
9789181181906 (ISBN)
Public defence
2025-10-08, Zero, Hus Zenit, Campus Valla, Linköping, 09:15 (English)
Funder
ELLIIT - The Linköping‐Lund Initiative on IT and Mobile Communications
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Available from: 2025-09-11 Created: 2025-09-11 Last updated: 2025-09-19

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text
arXiv

Authority records

Edstedt, Johan; Wadenbäck, Mårten; Felsberg, Michael
