liu.seSearch for publications in DiVA
Change search
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • oxford
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf
On the Effects of Video Grounding on Language Models
Linköping University, Department of Computer and Information Science, Artificial Intelligence and Integrated Computer Systems. Linköping University, Faculty of Science & Engineering.ORCID iD: 0000-0002-5633-5307
Linköping University, Faculty of Science & Engineering. Linköping University, Department of Computer and Information Science, Artificial Intelligence and Integrated Computer Systems.ORCID iD: 0000-0002-2492-9872
2022 (English)In: Proceedings of the First Workshop on Performance and Interpretability Evaluations of Multimodal, Multipurpose, Massive-Scale Models, International Conference on Computational Linguistics , 2022Conference paper, Oral presentation with published abstract (Other academic)
Abstract [en]

Transformer-based models trained on text and vision modalities try to improve the performance on multimodal downstream tasks or tackle the problem Transformer-based models trained on text and vision modalities try to improve the performance on multimodal downstream tasks or tackle the problem of lack of grounding, e.g., addressing issues like models’ insufficient commonsense knowledge. While it is more straightforward to evaluate the effects of such models on multimodal tasks, such as visual question answering or image captioning, it is not as well-understood how these tasks affect the model itself, and its internal linguistic representations. In this work, we experiment with language models grounded in videos and measure the models’ performance on predicting masked words chosen based on their imageability. The results show that the smaller model benefits from video grounding in predicting highly imageable words, while the results for the larger model seem harder to interpret.of lack of grounding, e.g., addressing issues like models’ insufficient commonsense knowledge. While it is more straightforward to evaluate the effects of such models on multimodal tasks, such as visual question answering or image captioning, it is not as well-understood how these tasks affect the model itself, and its internal linguistic representations. In this work, we experiment with language models grounded in videos and measure the models’ performance on predicting masked words chosen based on their imageability. The results show that the smaller model benefits from video grounding in predicting highly imageable words, while the results for the larger model seem harder to interpret.

Place, publisher, year, edition, pages
International Conference on Computational Linguistics , 2022.
National Category
Natural Language Processing
Identifiers
URN: urn:nbn:se:liu:diva-198261OAI: oai:DiVA.org:liu-198261DiVA, id: diva2:1801535
Conference
First Workshop on Performance and Interpretability Evaluations of Multimodal, Multipurpose, Massive-Scale Models
Available from: 2023-10-02 Created: 2023-10-02 Last updated: 2026-02-20Bibliographically approved
In thesis
1. Toward Understanding and Enhancing the Training and Evaluation of Language Models: A Study on Vision, Instruction Tuning, and Retrieval Augmentation
Open this publication in new window or tab >>Toward Understanding and Enhancing the Training and Evaluation of Language Models: A Study on Vision, Instruction Tuning, and Retrieval Augmentation
2026 (English)Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

This dissertation advances two complementary aims in the study of large language models: (i) understanding their inner workings and (ii) improving their training and evaluation. It does so through three lines of inquiry: integrating visual signals into language modeling, instruction tuning for English and a low-resource language (Swedish), and retrieval augmentation.

First, to study multimodal grounding, pretrained masked language models are exposed to tokenized video alongside aligned text, enabling analysis of how visual context influences next token prediction. Using the psycholinguistically motivated notion of imageability as an interpretable probe, the work shows that video grounding strengthens representations for concrete, highly imageable words, with the effect most consistent in a smaller model. For less imageable words, gains are mixed, and larger models exhibit increased reliance on visual context. These findings indicate that visual grounding benefits are not uniform; they depend on lexical properties and model capacity, and imageability offers a principled lens on what video–language models internalize.

Second, the thesis develops a practical path for instruction tuning in Swedish by translating existing English instruction corpora and finetuning models of varying size and pretraining exposure. Substantial zero-shot gains demonstrate that translated synthetic instructions can substitute for costly native resources. Complementing this, the work assesses automatic evaluation for instruction-following systems using Pairwise Accuracy as a meta-evaluation criterion. It finds that reliability is task- and length-dependent: ROUGE-L is a competitive, low-cost proxy for short, format-constrained outputs; BERTScore is comparatively stronger for longer, free-form answers; and LLM-as-a-judge aligns well with human judgments primarily when provided with reference answers. Cross-lingual analyses highlight that Swedish outputs exacerbate surfacematching weaknesses and no-reference biases, refining guidance on when human assessment remains necessary.

Third, the dissertation analyzes retrieval augmentation through a RETRO-style model. It shows that perplexity reductions concentrate on tokens with lexical overlap between inputs and retrieved neighbors,revealing a dominant surface-level “copy mode.” Leveraging this, surface-focused retrieval (e.g., BM25) is used to replace the dense retrieval mechanism during inference, which reduces perplexity further within this architecture, while lightweight hybrids (semantic pre-filtering with BM25 re-ranking) recover additional gains at minimal cost. The findings also demonstrate that during pretraining, performance improves sharply once input–neighbor overlap crosses a threshold; deliberately increasing overlap with targeted paraphrases can cut training time by about 40% without degrading downstream short-answer QA, though with a modest increase in eventual perplexity.

Overall, the thesis clarifies what signals large language models actually exploit and provides actionable recommendations for data curation, model selection, metric choice, and training strategies.

Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2026. p. 173
Series
Linköping Studies in Science and Technology. Dissertations, ISSN 0345-7524 ; 2502
National Category
Natural Language Processing
Identifiers
urn:nbn:se:liu:diva-221398 (URN)10.3384/9789181184440 (DOI)9789181184433 (ISBN)9789181184440 (ISBN)
Public defence
2026-03-27, Ada Lovelace, B-huset, Campus Valla, Linköping, 13:15 (English)
Opponent
Supervisors
Available from: 2026-02-20 Created: 2026-02-20 Last updated: 2026-03-05

Open Access in DiVA

fulltext(417 kB)11 downloads
File information
File name FULLTEXT01.pdfFile size 417 kBChecksum SHA-512
84f12f45f02d3e2f9409d7f1bda503a506175c855188d9f4560d7f43a9c4c6fdfb052340a01f121aa629713f4de9e416dff9b8079c92b7e9e41a78f31b655e10
Type fulltextMimetype application/pdf

Other links

Fulltext at ACL

Authority records

Doostmohammadi, EhsanKuhlmann, Marco

Search in DiVA

By author/editor
Doostmohammadi, EhsanKuhlmann, Marco
By organisation
Artificial Intelligence and Integrated Computer SystemsFaculty of Science & Engineering
Natural Language Processing

Search outside of DiVA

GoogleGoogle Scholar
Total: 11 downloads
The number of downloads is the sum of all downloads of full texts. It may include eg previous versions that are now no longer available

urn-nbn

Altmetric score

urn-nbn
Total: 334 hits
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • oxford
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf