On the Effects of Video Grounding on Language Models
Linköping University, Department of Computer and Information Science, Artificial Intelligence and Integrated Computer Systems. Linköping University, Faculty of Science & Engineering. ORCID iD: 0000-0002-5633-5307
Linköping University, Faculty of Science & Engineering. Linköping University, Department of Computer and Information Science, Artificial Intelligence and Integrated Computer Systems. ORCID iD: 0000-0002-2492-9872
2022 (English). In: Proceedings of the First Workshop on Performance and Interpretability Evaluations of Multimodal, Multipurpose, Massive-Scale Models, 2022. Conference paper, Oral presentation with published abstract (Other academic)
Abstract [en]

Transformer-based models trained on text and vision modalities try to improve the performance on multimodal downstream tasks or tackle the problem of lack of grounding, e.g., addressing issues like models’ insufficient commonsense knowledge. While it is more straightforward to evaluate the effects of such models on multimodal tasks, such as visual question answering or image captioning, it is not as well-understood how these tasks affect the model itself, and its internal linguistic representations. In this work, we experiment with language models grounded in videos and measure the models’ performance on predicting masked words chosen based on their imageability. The results show that the smaller model benefits from video grounding in predicting highly imageable words, while the results for the larger model seem harder to interpret.
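The evaluation the abstract describes can be sketched as splitting masked-word predictions by imageability and comparing accuracy per group. This is a minimal illustrative sketch, not the paper's actual code: the function name, the rating scale, the threshold, and all data values below are assumptions for the example.

```python
def accuracy_by_imageability(predictions, imageability, threshold=4.0):
    """Split masked-word predictions into high-/low-imageability groups
    and return the accuracy for each group.

    predictions:  dict word -> bool (was the masked word predicted correctly?)
    imageability: dict word -> float (e.g., a 1-7 human rating scale; the
                  scale and threshold here are illustrative assumptions)
    """
    high = [hit for w, hit in predictions.items() if imageability[w] >= threshold]
    low = [hit for w, hit in predictions.items() if imageability[w] < threshold]
    acc = lambda xs: sum(xs) / len(xs) if xs else float("nan")
    return acc(high), acc(low)

# Toy example: four masked words with made-up outcomes and ratings.
preds = {"apple": True, "dog": True, "idea": False, "truth": True}
ratings = {"apple": 6.5, "dog": 6.1, "idea": 2.3, "truth": 2.0}
print(accuracy_by_imageability(preds, ratings))  # (1.0, 0.5)
```

A comparison of this kind, run once for a grounded model and once for a text-only baseline, is one way to quantify whether video grounding helps specifically on highly imageable words.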

Place, publisher, year, edition, pages
2022.
National Category
Natural Language Processing
Identifiers
URN: urn:nbn:se:liu:diva-198261
OAI: oai:DiVA.org:liu-198261
DiVA, id: diva2:1801535
Conference
First Workshop on Performance and Interpretability Evaluations of Multimodal, Multipurpose, Massive-Scale Models
Available from: 2023-10-02. Created: 2023-10-02. Last updated: 2025-02-07. Bibliographically approved

Open Access in DiVA

No full text in DiVA

Authority records

Doostmohammadi, Ehsan; Kuhlmann, Marco

