Toward Understanding and Enhancing the Training and Evaluation of Language Models: A Study on Vision, Instruction Tuning, and Retrieval Augmentation
Linköping University, Department of Computer and Information Science, Artificial Intelligence and Integrated Computer Systems. Linköping University, Faculty of Science & Engineering. ORCID iD: 0000-0002-5633-5307
2026 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

This dissertation advances two complementary aims in the study of large language models: (i) understanding their inner workings and (ii) improving their training and evaluation. It does so through three lines of inquiry: integrating visual signals into language modeling, instruction tuning for English and a low-resource language (Swedish), and retrieval augmentation.

First, to study multimodal grounding, pretrained masked language models are exposed to tokenized video alongside aligned text, enabling analysis of how visual context influences next token prediction. Using the psycholinguistically motivated notion of imageability as an interpretable probe, the work shows that video grounding strengthens representations for concrete, highly imageable words, with the effect most consistent in a smaller model. For less imageable words, gains are mixed, and larger models exhibit increased reliance on visual context. These findings indicate that visual grounding benefits are not uniform; they depend on lexical properties and model capacity, and imageability offers a principled lens on what video–language models internalize.
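The imageability probe described above can be illustrated with a small sketch. This is a hypothetical aggregation, not the thesis's actual code: it assumes masked-word prediction outcomes have already been collected for a text-only and a video-grounded model, and that each target word carries an imageability norm rating (the 100-700 MRC-style scale and the split point are assumptions).

```python
from statistics import mean

def grounding_effect_by_band(results, imageability, threshold=500):
    """Compare grounded vs. text-only masked-word prediction accuracy for
    high- vs. low-imageability target words.

    `results` maps each word to (text_only_correct, grounded_correct) as 0/1
    outcomes aggregated over mask-prediction trials; `imageability` maps words
    to norm ratings. Both structures are illustrative assumptions."""
    effect = {}
    for band, words in (
        ("high", [w for w in results if imageability.get(w, 0) >= threshold]),
        ("low",  [w for w in results if imageability.get(w, 0) < threshold]),
    ):
        if not words:
            effect[band] = 0.0
            continue
        text_acc = mean(results[w][0] for w in words)
        grounded_acc = mean(results[w][1] for w in words)
        # Positive values mean video grounding helped this imageability band.
        effect[band] = grounded_acc - text_acc
    return effect
```

Under the thesis's findings, one would expect the "high" band to show a larger positive effect than the "low" band, at least for the smaller model.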

Second, the thesis develops a practical path for instruction tuning in Swedish by translating existing English instruction corpora and finetuning models of varying size and pretraining exposure. Substantial zero-shot gains demonstrate that translated synthetic instructions can substitute for costly native resources. Complementing this, the work assesses automatic evaluation for instruction-following systems using Pairwise Accuracy as a meta-evaluation criterion. It finds that reliability is task- and length-dependent: ROUGE-L is a competitive, low-cost proxy for short, format-constrained outputs; BERTScore is comparatively stronger for longer, free-form answers; and LLM-as-a-judge aligns well with human judgments primarily when provided with reference answers. Cross-lingual analyses highlight that Swedish outputs exacerbate surface-matching weaknesses and no-reference biases, refining guidance on when human assessment remains necessary.
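Pairwise Accuracy, the meta-evaluation criterion mentioned above, can be sketched in a few lines. This is a minimal sketch of one common formulation, not necessarily the thesis's exact definition; in particular, how ties are handled here is an assumption.

```python
from itertools import combinations

def pairwise_accuracy(metric_scores, human_scores):
    """Fraction of output pairs that an automatic metric orders the same
    way as human ratings do (pairs tied under either scoring are skipped).

    Unlike rank correlations, this avoids assumptions about scale and
    handles ties explicitly, which motivates its use in meta-evaluation."""
    agree, total = 0, 0
    for i, j in combinations(range(len(human_scores)), 2):
        h = human_scores[i] - human_scores[j]
        m = metric_scores[i] - metric_scores[j]
        if h == 0 or m == 0:  # tie under one of the scorings: skip the pair
            continue
        total += 1
        if (h > 0) == (m > 0):
            agree += 1
    return agree / total if total else 0.0
```

For example, a metric that ranks three outputs in the same order as the human raters scores 1.0; one that inverts every human preference scores 0.0.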

Third, the dissertation analyzes retrieval augmentation through a RETRO-style model. It shows that perplexity reductions concentrate on tokens with lexical overlap between inputs and retrieved neighbors, revealing a dominant surface-level “copy mode.” Leveraging this, surface-focused retrieval (e.g., BM25) is used to replace the dense retrieval mechanism during inference, which reduces perplexity further within this architecture, while lightweight hybrids (semantic pre-filtering with BM25 re-ranking) recover additional gains at minimal cost. The findings also demonstrate that during pretraining, performance improves sharply once input–neighbor overlap crosses a threshold; deliberately increasing overlap with targeted paraphrases can cut training time by about 40% without degrading downstream short-answer QA, though with a modest increase in eventual perplexity.
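The hybrid mentioned above (semantic pre-filtering followed by BM25 re-ranking) can be sketched as a self-contained BM25 re-ranker over an already pre-filtered candidate list. This is an illustrative sketch, not the thesis's pipeline: whitespace tokenization, the k1/b values, and computing document statistics over only the candidate set (rather than the full retrieval database) are all simplifying assumptions.

```python
import math
from collections import Counter

def bm25_rerank(query, candidates, k1=1.2, b=0.75):
    """Re-rank pre-filtered candidate chunks by BM25 score against the
    query, best first. Statistics (IDF, average length) are computed over
    the candidate set itself, a simplification for illustration."""
    docs = [c.lower().split() for c in candidates]
    q = query.lower().split()
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))  # document frequency

    def idf(t):
        return math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1.0)

    def score(d):
        tf = Counter(d)
        return sum(idf(t) * tf[t] * (k1 + 1) /
                   (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
                   for t in q if t in tf)

    return sorted(candidates, key=lambda c: score(c.lower().split()),
                  reverse=True)
```

In the hybrid setting, `candidates` would be the top-k neighbors returned by the dense retriever, so BM25's surface-level signal is applied only where it is cheap.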

Overall, the thesis clarifies what signals large language models actually exploit and provides actionable recommendations for data curation, model selection, metric choice, and training strategies.

Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2026, p. 173
Series
Linköping Studies in Science and Technology. Dissertations, ISSN 0345-7524 ; 2502
National Category
Natural Language Processing
Identifiers
URN: urn:nbn:se:liu:diva-221398
DOI: 10.3384/9789181184440
ISBN: 9789181184433 (print)
ISBN: 9789181184440 (electronic)
OAI: oai:DiVA.org:liu-221398
DiVA, id: diva2:2040432
Public defence
2026-03-27, Ada Lovelace, B-huset, Campus Valla, Linköping, 13:15 (English)
Opponent
Supervisors
Available from: 2026-02-20 Created: 2026-02-20 Last updated: 2026-03-05
List of papers
1. On the Effects of Video Grounding on Language Models
2022 (English) In: Proceedings of the First Workshop on Performance and Interpretability Evaluations of Multimodal, Multipurpose, Massive-Scale Models, International Conference on Computational Linguistics, 2022. Conference paper, Oral presentation with published abstract (Other academic)
Abstract [en]

Transformer-based models trained on text and vision modalities try to improve the performance on multimodal downstream tasks or tackle the problem of lack of grounding, e.g., addressing issues like models’ insufficient commonsense knowledge. While it is more straightforward to evaluate the effects of such models on multimodal tasks, such as visual question answering or image captioning, it is not as well-understood how these tasks affect the model itself, and its internal linguistic representations. In this work, we experiment with language models grounded in videos and measure the models’ performance on predicting masked words chosen based on their imageability. The results show that the smaller model benefits from video grounding in predicting highly imageable words, while the results for the larger model seem harder to interpret.

Place, publisher, year, edition, pages
International Conference on Computational Linguistics, 2022
National Category
Natural Language Processing
Identifiers
urn:nbn:se:liu:diva-198261 (URN)
Conference
First Workshop on Performance and Interpretability Evaluations of Multimodal, Multipurpose, Massive-Scale Models
Available from: 2023-10-02 Created: 2023-10-02 Last updated: 2026-02-20 Bibliographically approved
2. Making Instruction Finetuning Accessible to Non-English Languages: A Case Study on Swedish Models
2023 (English) In: Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), 2023, p. 634-642. Conference paper, Published paper (Refereed)
Abstract [en]

In recent years, instruction finetuning models have received increased attention due to their remarkable zero-shot and generalization capabilities. However, the widespread implementation of these models has been limited to the English language, largely due to the costs and challenges associated with creating instruction datasets. To overcome this, automatic instruction generation has been proposed as a resourceful alternative. We see this as an opportunity for the adoption of instruction finetuning for other languages. In this paper we explore the viability of instruction finetuning for Swedish. We translate a dataset of generated instructions from English to Swedish, using it to finetune both Swedish and non-Swedish models. Results indicate that the use of translated instructions significantly improves the models’ zero-shot performance, even on unseen data, while staying competitive with strong baselines ten times their size. We see this paper as a first step and a proof of concept that instruction finetuning for Swedish is within reach, through resourceful means, and that there exist several directions for further improvements.

Keywords
NLP, natural language processing, language models, gpt, instruction tuning, instruction finetuning, multilingual, zero-shot
National Category
Natural Language Processing
Identifiers
urn:nbn:se:liu:diva-196546 (URN)
Conference
NoDaLiDa
Funder
CUGS (National Graduate School in Computer Science); Wallenberg AI, Autonomous Systems and Software Program (WASP)
Available from: 2023-08-11 Created: 2023-08-11 Last updated: 2026-02-20
3. How Reliable Are Automatic Evaluation Methods for Instruction-Tuned LLMs?
2024 (English) In: Findings of the Association for Computational Linguistics: EMNLP 2024 / [ed] Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen, Association for Computational Linguistics, 2024, p. 6321-6336. Conference paper, Published paper (Refereed)
Abstract [en]

Work on instruction-tuned Large Language Models (LLMs) has used automatic methods based on text overlap and LLM judgments as cost-effective alternatives to human evaluation. In this paper, we perform a meta-evaluation of such methods and assess their reliability across a broad range of tasks. In evaluating how well automatic methods align with human evaluations, correlation metrics are the most commonly employed method despite their inherent limitations when dealing with ties and different scales. To address these shortcomings, we use Pairwise Accuracy as an alternative to standard correlation measures. We observe that while automatic evaluation methods can approximate human ratings under specific conditions, their validity is highly context-dependent. Specifically, the simple ROUGE-L metric correlates very well with human ratings for short-answer English tasks but is unreliable in free-form generation tasks and cross-lingual scenarios. The effectiveness of the more advanced method of using GPT-4 as a judge diminishes significantly if reference answers are not included in the prompt, which is the scenario where this method has the potential to provide the most value compared to other metrics. Our findings enhance the understanding of how automatic methods should be applied and interpreted when developing and evaluating instruction-tuned LLMs.

Place, publisher, year, edition, pages
Association for Computational Linguistics, 2024
National Category
Computer Sciences Natural Language Processing
Identifiers
urn:nbn:se:liu:diva-210271 (URN)
10.18653/v1/2024.findings-emnlp.367 (DOI)
001511154406029
Conference
EMNLP 2024, Miami, Florida, USA
Note

Funding Agencies|Wallenberg AI, Autonomous Systems and Software Program (WASP); European Union [101135671]; National Graduate School of Computer Science in Sweden (CUGS); Knut and Alice Wallenberg Foundation

Available from: 2024-12-06 Created: 2024-12-06 Last updated: 2026-02-20 Bibliographically approved
4. On the Generalization Ability of Retrieval-Enhanced Transformers
2023 (English) In: Findings of the Association for Computational Linguistics: EACL 2023 / [ed] Andreas Vlachos, Isabelle Augenstein, Association for Computational Linguistics, 2023, p. 1485-1493. Conference paper, Published paper (Refereed)
Abstract [en]

Recent work on the Retrieval-Enhanced Transformer (Retro) model has shown that offloading memory from trainable weights to a retrieval database can significantly improve language modeling and match the performance of non-retrieval models that are an order of magnitude larger in size. It has been suggested that at least some of this performance gain is due to non-trivial generalization based on both model weights and retrieval. In this paper, we try to better understand the relative contributions of these two components. We find that the performance gains from retrieval largely originate from overlapping tokens between the database and the test data, suggesting less non-trivial generalization than previously assumed. More generally, our results point to the challenges of evaluating the generalization of retrieval-augmented language models such as Retro, as even limited token overlap may significantly decrease test-time loss. We release our code and model at https://github.com/TobiasNorlund/retro
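The token-overlap analysis at the heart of this paper can be sketched as a simple overlap ratio between an input chunk and a retrieved neighbor. This is an illustrative simplification: the paper operates on subword tokens from the model's tokenizer, whereas this sketch uses whitespace tokens.

```python
def overlap_ratio(input_chunk, neighbor):
    """Fraction of input-chunk tokens that also occur somewhere in the
    retrieved neighbor. High values flag the surface-level 'copy mode'
    regime where retrieval gains concentrate."""
    inp = input_chunk.lower().split()
    nb = set(neighbor.lower().split())
    if not inp:
        return 0.0
    return sum(tok in nb for tok in inp) / len(inp)
```

Binning test tokens by such a ratio and comparing per-bin loss is one way to check whether perplexity reductions come mostly from high-overlap regions, as the paper reports.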

Place, publisher, year, edition, pages
Association for Computational Linguistics, 2023
National Category
Natural Language Processing
Identifiers
urn:nbn:se:liu:diva-195609 (URN)
10.18653/v1/2023.findings-eacl.109 (DOI)
001181085100107
2-s2.0-85159856506 (Scopus ID)
9781959429470 (ISBN)
Conference
EACL 2023, May 2-6, 2023, Dubrovnik, Croatia
Note

Funding Agencies|Wallenberg AI, Autonomous Systems and Software Program (WASP) - Knut and Alice Wallenberg Foundation; Swedish Research Council [2022-06725]

Available from: 2023-06-22 Created: 2023-06-22 Last updated: 2026-02-20 Bibliographically approved
5. Surface-Based Retrieval Reduces Perplexity of Retrieval-Augmented Language Models
2023 (English) In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) / [ed] Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki, Association for Computational Linguistics, 2023, p. 521-529. Conference paper, Published paper (Refereed)
Abstract [en]

Augmenting language models with a retrieval mechanism has been shown to significantly improve their performance while keeping the number of parameters low. Retrieval-augmented models commonly rely on a semantic retrieval mechanism based on the similarity between dense representations of the query chunk and potential neighbors. In this paper, we study the state-of-the-art Retro model and observe that its performance gain is better explained by surface-level similarities, such as token overlap. Inspired by this, we replace the semantic retrieval in Retro with a surface-level method based on BM25, obtaining a significant reduction in perplexity. As full BM25 retrieval can be computationally costly for large datasets, we also apply it in a re-ranking scenario, gaining part of the perplexity reduction with minimal computational overhead.

Place, publisher, year, edition, pages
Association for Computational Linguistics, 2023
National Category
Applied Mechanics
Identifiers
urn:nbn:se:liu:diva-196564 (URN)
10.18653/v1/2023.acl-short.45 (DOI)
001181088800045
2-s2.0-85172191772 (Scopus ID)
9781959429715 (ISBN)
Conference
61st Annual Meeting of the Association for Computational Linguistics (ACL), Toronto, Canada, July 9-14, 2023
Note

Funding Agencies|Wallenberg AI, Autonomous Systems and Software Program (WASP) - Knut and Alice Wallenberg Foundation; Alvis - Swedish Research Council [2022-06725]; Alice Wallenberg Foundation at the National Supercomputer Center

Available from: 2023-08-14 Created: 2023-08-14 Last updated: 2026-02-20
6. Studying the Role of Input-Neighbor Overlap in Retrieval-Augmented Language Models Training Efficiency
2025 (English) In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing / [ed] Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng, Association for Computational Linguistics, 2025, p. 26847-26856. Conference paper, Published paper (Other academic)
Abstract [en]

Retrieval-augmented language models have demonstrated performance comparable to much larger models while requiring fewer computational resources. The effectiveness of these models crucially depends on the overlap between query and retrieved context, but the optimal degree of this overlap remains unexplored. In this paper, we systematically investigate how varying levels of query–context overlap affect model performance during both training and inference. Our experiments reveal that increased overlap initially has minimal effect, but substantially improves test-time perplexity and accelerates model learning above a critical threshold. Building on these findings, we demonstrate that deliberately increasing overlap through synthetic context can enhance data efficiency and reduce training time by approximately 40% without compromising performance. We specifically generate synthetic context through paraphrasing queries. We validate our perplexity-based findings on question-answering tasks, confirming that the benefits of retrieval-augmented language modeling extend to practical applications. Our results provide empirical evidence of significant optimization potential for retrieval mechanisms in language model pretraining.

Place, publisher, year, edition, pages
Association for Computational Linguistics, 2025
National Category
Natural Language Processing
Identifiers
urn:nbn:se:liu:diva-221400 (URN)
10.18653/v1/2025.emnlp-main.1363 (DOI)
9798891763326 (ISBN)
Conference
Empirical Methods in Natural Language Processing (EMNLP) 2025, Suzhou, China
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Available from: 2026-02-20 Created: 2026-02-20 Last updated: 2026-02-20

Open Access in DiVA

fulltext (8816 kB), 58 downloads
File information
File name: FULLTEXT01.pdf
File size: 8816 kB
Checksum (SHA-512): fd5ab661bcad6f14ac5d7b9774e826209f7c96ef8cef5fb3b2a7744b265f532f29dccfdb1e0605e3f1bf1c2d45f462689aaf91afdaa2b2797a362c6f3ce7b248
Type: fulltext, Mimetype: application/pdf


Authority records

Doostmohammadi, Ehsan

The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are now no longer available.
