2026 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]
This dissertation advances two complementary aims in the study of large language models: (i) understanding their inner workings and (ii) improving their training and evaluation. It does so through three lines of inquiry: integrating visual signals into language modeling, instruction tuning for English and a low-resource language (Swedish), and retrieval augmentation.
First, to study multimodal grounding, pretrained masked language models are exposed to tokenized video alongside aligned text, enabling analysis of how visual context influences next token prediction. Using the psycholinguistically motivated notion of imageability as an interpretable probe, the work shows that video grounding strengthens representations for concrete, highly imageable words, with the effect most consistent in a smaller model. For less imageable words, gains are mixed, and larger models exhibit increased reliance on visual context. These findings indicate that visual grounding benefits are not uniform; they depend on lexical properties and model capacity, and imageability offers a principled lens on what video–language models internalize.
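As an illustration of the probing methodology (not the thesis's actual code; the word list, ratings, and scores below are placeholder values), a word's grounding gain can be defined as the model's log-probability of that word with video context minus its log-probability with text alone, and then correlated with human imageability norms:

```python
# Hypothetical imageability probe: correlate each word's "grounding gain"
# (log-prob with video context minus log-prob with text only) with human
# imageability ratings. All numbers below are illustrative placeholders.
from scipy.stats import spearmanr

# word -> human imageability rating (e.g., a 1-7 psycholinguistic norm)
imageability = {"apple": 6.8, "dance": 5.9, "justice": 2.1, "theory": 1.9}

# word -> model log-probability of the word in context, scored two ways
logp_text_only  = {"apple": -4.2, "dance": -5.0, "justice": -3.8, "theory": -4.1}
logp_with_video = {"apple": -2.9, "dance": -4.1, "justice": -3.7, "theory": -4.2}

words = sorted(imageability)
gain = [logp_with_video[w] - logp_text_only[w] for w in words]  # grounding gain
ratings = [imageability[w] for w in words]

rho, p = spearmanr(ratings, gain)
print(f"Spearman correlation between imageability and grounding gain: {rho:.2f} (p={p:.3f})")
```

A positive correlation under this kind of probe is what the thesis reports as video grounding preferentially strengthening concrete, highly imageable words.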
Second, the thesis develops a practical path for instruction tuning in Swedish by translating existing English instruction corpora and finetuning models of varying size and pretraining exposure. Substantial zero-shot gains demonstrate that translated synthetic instructions can substitute for costly native resources. Complementing this, the work assesses automatic evaluation for instruction-following systems using Pairwise Accuracy as a meta-evaluation criterion. It finds that reliability is task- and length-dependent: ROUGE-L is a competitive, low-cost proxy for short, format-constrained outputs; BERTScore is comparatively stronger for longer, free-form answers; and LLM-as-a-judge aligns well with human judgments primarily when provided with reference answers. Cross-lingual analyses highlight that Swedish outputs exacerbate surface-matching weaknesses and no-reference biases, refining guidance on when human assessment remains necessary.
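A minimal sketch of the Pairwise Accuracy criterion used for this meta-evaluation (the function name and toy data are illustrative assumptions, not the thesis's implementation): a metric is counted as correct on a pair of candidate outputs if it assigns the higher score to the output that human annotators preferred.

```python
# Minimal sketch of Pairwise Accuracy meta-evaluation (data layout is illustrative):
# a metric is "correct" on a pair if it scores the human-preferred output higher.
def pairwise_accuracy(pairs):
    """pairs: list of (metric_score_a, metric_score_b, human_prefers),
    where human_prefers is 'a' or 'b'. Metric ties count as incorrect."""
    correct = 0
    for score_a, score_b, human_prefers in pairs:
        metric_prefers = "a" if score_a > score_b else "b" if score_b > score_a else None
        correct += metric_prefers == human_prefers
    return correct / len(pairs)

# Toy example: ROUGE-L scores for two answers per item plus a human preference.
example_pairs = [
    (0.61, 0.34, "a"),  # metric and human agree
    (0.20, 0.55, "a"),  # metric and human disagree
    (0.48, 0.49, "b"),  # metric and human agree
]
print(f"Pairwise Accuracy: {pairwise_accuracy(example_pairs):.2f}")  # -> 0.67
```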
Third, the dissertation analyzes retrieval augmentation through a RETRO-style model. It shows that perplexity reductions concentrate on tokens with lexical overlap between inputs and retrieved neighbors, revealing a dominant surface-level “copy mode.” Leveraging this, surface-focused retrieval (e.g., BM25) is used to replace the dense retrieval mechanism during inference, which reduces perplexity further within this architecture, while lightweight hybrids (semantic pre-filtering with BM25 re-ranking) recover additional gains at minimal cost. The findings also demonstrate that during pretraining, performance improves sharply once input–neighbor overlap crosses a threshold; deliberately increasing overlap with targeted paraphrases can cut training time by about 40% without degrading downstream short-answer QA, though with a modest increase in eventual perplexity.
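The lightweight hybrid mentioned above can be sketched as semantic pre-filtering followed by BM25 re-ranking; the snippet below uses the rank_bm25 package and a stubbed dense pre-filter as stand-ins for the thesis's actual retrieval pipeline.

```python
# Rough sketch of "semantic pre-filtering + BM25 re-ranking" for neighbor selection.
# The dense pre-filter is faked with a fixed candidate list; in a real RETRO-style
# setup it would come from an approximate nearest-neighbor index over embeddings.
from rank_bm25 import BM25Okapi

corpus = [
    "The cat sat on the mat in the warm afternoon sun.",
    "Retrieval augmentation conditions generation on retrieved neighbor chunks.",
    "BM25 ranks documents by term frequency and inverse document frequency.",
    "Dense retrievers embed queries and passages into a shared vector space.",
]

def dense_prefilter(query, docs, k=3):
    # Placeholder for a dense retriever: here we simply keep the first k documents.
    return docs[:k]

def bm25_rerank(query, candidates, k=2):
    tokenized = [doc.lower().split() for doc in candidates]
    bm25 = BM25Okapi(tokenized)
    return bm25.get_top_n(query.lower().split(), candidates, n=k)

query = "How does retrieval augmentation use neighbor chunks?"
candidates = dense_prefilter(query, corpus, k=3)
neighbors = bm25_rerank(query, candidates, k=2)
print(neighbors)  # lexically overlapping neighbors, passed to the reader as context
```

The re-ranking step is what favors high input–neighbor lexical overlap, the regime in which the analyzed model's "copy mode" yields the largest perplexity reductions.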
Overall, the thesis clarifies what signals large language models actually exploit and provides actionable recommendations for data curation, model selection, metric choice, and training strategies.
Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2026. p. 173
Series
Linköping Studies in Science and Technology. Dissertations, ISSN 0345-7524 ; 2502
National Category
Natural Language Processing
Identifiers
urn:nbn:se:liu:diva-221398 (URN)
10.3384/9789181184440 (DOI)
9789181184433 (ISBN)
9789181184440 (ISBN)
Public defence
2026-03-27, Ada Lovelace, B-huset, Campus Valla, Linköping, 13:15 (English)
Opponent
Supervisors
2026-02-20, 2026-02-20, 2026-03-05