Foundation Models Defining a New Era in Vision: A Survey and Outlook
MBZ Univ AI, U Arab Emirates; Georgia Inst Technol, GA 30332 USA.
Khalifa Univ, U Arab Emirates; Khalifa Univ, U Arab Emirates; Australian Natl Univ, Australia.
MBZ Univ AI, U Arab Emirates; Australian Natl Univ, Australia.
MBZ Univ AI, U Arab Emirates.
2025 (English). In: IEEE Transactions on Pattern Analysis and Machine Intelligence, ISSN 0162-8828, E-ISSN 1939-3539, Vol. 47, no 4, p. 2245-2264. Article in journal (Refereed). Published.
Abstract [en]

Vision systems that see and reason about the compositional nature of visual scenes are fundamental to understanding our world. The complex relations between objects and their locations, ambiguities, and variations in the real-world environment can be better described in human language, which is naturally governed by grammatical rules, and in other modalities such as audio and depth. Models that learn to bridge the gap between such modalities, together with large-scale training data, facilitate contextual reasoning, generalization, and prompt capabilities at test time. These models are referred to as foundation models. The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, having interactive dialogues by asking questions about an image or video scene, or manipulating a robot's behavior through language instructions. In this survey, we provide a comprehensive review of such emerging foundation models, including typical architecture designs to combine different modalities (vision, text, audio, etc.), training objectives (contrastive, generative), pre-training datasets, fine-tuning mechanisms, and the common prompting patterns: textual, visual, and heterogeneous. We discuss the open challenges and research directions for foundation models in computer vision, including difficulties in their evaluation and benchmarking, gaps in their real-world understanding, limitations of contextual understanding, biases, vulnerability to adversarial attacks, and interpretability issues. We review recent developments in this field, covering a wide range of applications of foundation models systematically and comprehensively.

Place, publisher, year, edition, pages
IEEE COMPUTER SOC, 2025. Vol. 47, no 4, p. 2245-2264
Keywords [en]
Adaptation models; Computational modeling; Foundation models; Data models; Surveys; Visualization; Reviews; Computer vision; Computer architecture; Context modeling; Contrastive learning; language and vision; large language models; masked modeling; self-supervised learning
National Category
Computer graphics and computer vision
Identifiers
URN: urn:nbn:se:liu:diva-212285
DOI: 10.1109/TPAMI.2024.3506283
ISI: 001439648900002
PubMedID: 40030979
Scopus ID: 2-s2.0-85215321762
OAI: oai:DiVA.org:liu-212285
DiVA, id: diva2:1945198
Available from: 2025-03-18 Created: 2025-03-18 Last updated: 2025-03-18

Open Access in DiVA

No full text in DiVA

By author/editor
Khan, Fahad
By organisation
Computer Vision; Faculty of Science & Engineering