liu.seSök publikationer i DiVA
Ändra sökning
RefereraExporteraLänk till posten
Permanent länk

Direktlänk
Referera
Referensformat
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • oxford
  • Annat format
Fler format
Språk
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Annat språk
Fler språk
Utmatningsformat
  • html
  • text
  • asciidoc
  • rtf
GeoChat : Grounded Large Vision-Language Model for Remote Sensing
Mohamed bin Zayed Univ AI, U Arab Emirates; Birla Inst Technol & Sci, India.
Mohamed bin Zayed Univ AI, U Arab Emirates.
Mohamed bin Zayed Univ AI, U Arab Emirates.
Birla Inst Technol & Sci, India.
Visa övriga samt affilieringar
2024 (Engelska)Ingår i: 2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), IEEE COMPUTER SOC , 2024, s. 27831-27840Konferensbidrag, Publicerat paper (Refereegranskat)
Abstract [en]

Recent advancements in Large Vision-Language Models (VLMs) have shown great promise in natural image domains, allowing users to hold a dialogue about given visual content. However, such general-domain VLMs perform poorly for Remote Sensing (RS) scenarios, leading to inaccurate or fabricated information when presented with RS domain-specific queries. Such a behavior emerges due to the unique challenges introduced by RS imagery. For example, to handle high-resolution RS imagery with diverse scale changes across categories and many small objects, region-level reasoning is necessary alongside holistic scene interpretation. Furthermore, the lack of domain-specific multimodal instruction following data as well as strong backbone models for RS make it hard for the models to align their behavior with user queries. To address these limitations, we propose GeoChat - the first versatile remote sensing VLM that offers multitask conversational capabilities with high-resolution RS images. Specifically, GeoChat can not only answer image-level queries but also accepts region inputs to hold region-specific dialogue. Furthermore, it can visually ground objects in its responses by referring to their spatial coordinates. To address the lack of domain-specific datasets, we generate a novel RS multi-modal instruction-following dataset by extending image-text pairs from existing diverse RS datasets. We establish a comprehensive benchmark for RS multitask conversations and compare with a number of baseline methods. GeoChat demonstrates robust zero-shot performance on various RS tasks, e.g., image and region captioning, visual question answering, scene classification, visually grounded conversations and referring detection. Our code is available here.

Ort, förlag, år, upplaga, sidor
IEEE COMPUTER SOC , 2024. s. 27831-27840
Serie
IEEE Conference on Computer Vision and Pattern Recognition, ISSN 1063-6919, E-ISSN 2575-7075
Nationell ämneskategori
Datorgrafik och datorseende
Identifikatorer
URN: urn:nbn:se:liu:diva-212429DOI: 10.1109/CVPR52733.2024.02629ISI: 001344387504020Scopus ID: 2-s2.0-85196903910ISBN: 9798350353006 (digital)ISBN: 9798350353013 (tryckt)OAI: oai:DiVA.org:liu-212429DiVA, id: diva2:1945908
Konferens
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, jun 16-22, 2024
Tillgänglig från: 2025-03-19 Skapad: 2025-03-19 Senast uppdaterad: 2025-03-19

Open Access i DiVA

Fulltext saknas i DiVA

Övriga länkar

Förlagets fulltextScopus

Sök vidare i DiVA

Av författaren/redaktören
Khan, Fahad
Av organisationen
DatorseendeTekniska fakulteten
Datorgrafik och datorseende

Sök vidare utanför DiVA

GoogleGoogle Scholar

doi
isbn
urn-nbn

Altmetricpoäng

doi
isbn
urn-nbn
Totalt: 42 träffar
RefereraExporteraLänk till posten
Permanent länk

Direktlänk
Referera
Referensformat
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • oxford
  • Annat format
Fler format
Språk
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Annat språk
Fler språk
Utmatningsformat
  • html
  • text
  • asciidoc
  • rtf