Intriguing Properties of Vision Transformers
Affiliations: Australian Natl Univ, Australia; Mohamed Bin Zayed Univ AI, U Arab Emirates; SUNY Stony Brook, NY 11794 USA; Monash Univ, Australia.
2021 (English). In: Advances in Neural Information Processing Systems 34 (NeurIPS 2021), Neural Information Processing Systems (NIPS), 2021, Vol. 34. Conference paper, Published paper (Refereed).
Abstract [en]

Vision transformers (ViT) have demonstrated impressive performance across numerous machine vision tasks. These models are based on multi-head self-attention mechanisms that can flexibly attend to a sequence of image patches to encode contextual cues. An important question is how such flexibility (in attending to image-wide context conditioned on a given patch) can facilitate handling nuisances in natural images, e.g., severe occlusions, domain shifts, spatial permutations, and adversarial and natural perturbations. We systematically study this question via an extensive set of experiments encompassing three ViT families and provide comparisons with a high-performing convolutional neural network (CNN). We show and analyze the following intriguing properties of ViTs: (a) Transformers are highly robust to severe occlusions, perturbations, and domain shifts, e.g., they retain up to 60% top-1 accuracy on ImageNet even after 80% of the image content is randomly occluded. (b) The robustness to occlusions is not due to a texture bias; instead, we show that ViTs are significantly less biased towards local textures than CNNs. When properly trained to encode shape-based features, ViTs demonstrate shape-recognition capability comparable to that of the human visual system, previously unmatched in the literature. (c) Using ViTs to encode shape representations leads to an interesting consequence: accurate semantic segmentation without pixel-level supervision. (d) Off-the-shelf features from a single ViT model can be combined to create a feature ensemble, leading to high accuracy across a range of classification datasets in both traditional and few-shot learning paradigms. We show that the effective features of ViTs are due to the flexible and dynamic receptive fields made possible by self-attention mechanisms. Code: https://git.io/Js15X.
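
For concreteness, below is a minimal sketch of the random patch-occlusion test summarized in point (a): 80% of the image patches are blacked out before classification. The timm library, the vit_base_patch16_224 checkpoint, and the occlude_patches helper are assumptions chosen for illustration and are not named in this record; the paper's own protocol (e.g., patch selection and normalization) may differ in details.

import torch
import timm

def occlude_patches(images, patch_size=16, drop_ratio=0.8):
    # Zero out a random drop_ratio fraction of non-overlapping
    # patch_size x patch_size patches in each image of the batch.
    b, c, h, w = images.shape
    grid_h, grid_w = h // patch_size, w // patch_size
    num_patches = grid_h * grid_w
    num_drop = int(drop_ratio * num_patches)
    occluded = images.clone()
    for i in range(b):
        for p in torch.randperm(num_patches)[:num_drop].tolist():
            row, col = divmod(p, grid_w)
            occluded[i, :,
                     row * patch_size:(row + 1) * patch_size,
                     col * patch_size:(col + 1) * patch_size] = 0.0
    return occluded

# Assumed checkpoint name; any ViT with 16x16 patches would do here.
model = timm.create_model("vit_base_patch16_224", pretrained=True).eval()
images = torch.randn(2, 3, 224, 224)  # stand-in for a normalized ImageNet batch
with torch.no_grad():
    logits = model(occlude_patches(images))
print(logits.argmax(dim=1))  # top-1 predictions under 80% occlusion

Aligning the occlusion grid with the model's 16x16 patch grid makes each dropped region correspond exactly to one input token, so the test directly measures how well the remaining tokens carry enough context for classification.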

Place, publisher, year, edition, pages
Neural Information Processing Systems (NIPS), 2021. Vol. 34
Series
Advances in Neural Information Processing Systems, ISSN 1049-5258
National Category
Probability Theory and Statistics
Identifiers
URN: urn:nbn:se:liu:diva-209602
ISI: 000922928204042
OAI: oai:DiVA.org:liu-209602
DiVA id: diva2:1914090
Conference
35th Conference on Neural Information Processing Systems (NeurIPS), held online, December 6-14, 2021
Note

Funding agencies: NSF CAREER grant [1149783]; VR starting grant [2016-05543]; Australian Research Council DECRA fellowship [DE200101100]

Available from: 2024-11-18. Created: 2024-11-18. Last updated: 2024-11-18.

Open Access in DiVA

No full text in DiVA
