Remote Sensing 3D Scene Retrieval: Multi-modal Alignment of Text, Images, and Digital Elevation Models
2025 (English) Independent thesis Advanced level (degree of Master (Two Years)), 28 HE credits
Student thesis
Abstract [en]
Multi-modal retrieval has traditionally focused on combining diverse query inputs, such as text and sketches, in remote sensing and computer vision. However, retrieval involving multi-modal target representations, such as paired RGB and depth data, remains largely unaddressed. This work investigates whether incorporating a depth modality can improve the performance of vision-language models in the context of satellite image retrieval. To explore this, a novel dataset, RSITDD, was constructed, combining orthophotos and digital height models, and used to train a CLIP-based remote sensing depth encoder. Experimental results show that models augmented with a depth encoder outperform their text-image-only counterparts across multiple benchmark settings. These findings highlight the potential of depth-enhanced models for remote sensing applications and demonstrate that even simple fusion techniques can yield measurable performance improvements.
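The abstract mentions that even simple fusion techniques improve retrieval. A common "simple" approach in dual-encoder setups is score-level (late) fusion: score each target's image and depth embeddings against the text query separately, then take a weighted sum of the cosine similarities. The sketch below is illustrative only and is not the thesis's implementation; the embedding dimensionality, the weight `alpha`, and the random toy embeddings are all assumptions for demonstration.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Normalize vectors to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def fused_retrieval_scores(text_emb, image_embs, depth_embs, alpha=0.5):
    """Late fusion of image and depth similarities for text-to-scene retrieval.

    text_emb:   (d,)   query embedding from a text encoder
    image_embs: (n, d) gallery embeddings from an image encoder
    depth_embs: (n, d) gallery embeddings from a depth encoder
    alpha:      weight on the depth modality (0 = image-only, 1 = depth-only)
    Returns an (n,) array of fused similarity scores, higher is better.
    """
    t = l2_normalize(text_emb)
    img_sim = l2_normalize(image_embs) @ t   # (n,) cosine similarities
    dep_sim = l2_normalize(depth_embs) @ t
    return (1.0 - alpha) * img_sim + alpha * dep_sim

# Toy example with random embeddings (dimension 512 is a typical CLIP size).
rng = np.random.default_rng(0)
text = rng.normal(size=512)
images = rng.normal(size=(100, 512))
depths = rng.normal(size=(100, 512))
scores = fused_retrieval_scores(text, images, depths, alpha=0.3)
ranking = np.argsort(-scores)  # gallery indices, best match first
```

Because each modality contributes a bounded cosine score, the fused score is also bounded in [-1, 1], and `alpha` can be tuned on a validation split to trade off the two modalities.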
Place, publisher, year, edition, pages
2025, p. 62
Keywords [en]
multi-modal retrieval, remote sensing retrieval, dual-encoder
National Category
Computer graphics and computer vision
Identifiers
URN: urn:nbn:se:liu:diva-219595
ISRN: LiTH-ISY-EX--25/5794--SE
OAI: oai:DiVA.org:liu-219595
DiVA, id: diva2:2015058
External cooperation
Maxar
Subject / course
Computer Engineering
Presentation
2025-08-28, Linköping, 09:00 (English)
Supervisors
Examiners
Available from: 2025-11-21 Created: 2025-11-20 Last updated: 2025-11-21
Bibliographically approved