Studying the Role of Input-Neighbor Overlap in Retrieval-Augmented Language Models Training Efficiency
Doostmohammadi, Ehsan
Linköping University, Department of Computer and Information Science, Artificial Intelligence and Integrated Computer Systems. Linköping University, Faculty of Science & Engineering. ORCID iD: 0000-0002-5633-5307
Kuhlmann, Marco
Linköping University, Department of Computer and Information Science, Artificial Intelligence and Integrated Computer Systems. Linköping University, Faculty of Science & Engineering. ORCID iD: 0000-0002-2492-9872
2025 (English). In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing / [ed] Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng, Association for Computational Linguistics, 2025, p. 26847-26856. Conference paper, Published paper (Other academic)
Abstract [en]

Retrieval-augmented language models have demonstrated performance comparable to much larger models while requiring fewer computational resources. The effectiveness of these models crucially depends on the overlap between query and retrieved context, but the optimal degree of this overlap remains unexplored. In this paper, we systematically investigate how varying levels of query–context overlap affect model performance during both training and inference. Our experiments reveal that increased overlap initially has minimal effect, but substantially improves test-time perplexity and accelerates model learning above a critical threshold. Building on these findings, we demonstrate that deliberately increasing overlap through synthetic context can enhance data efficiency and reduce training time by approximately 40% without compromising performance. We specifically generate synthetic context through paraphrasing queries. We validate our perplexity-based findings on question-answering tasks, confirming that the benefits of retrieval-augmented language modeling extend to practical applications. Our results provide empirical evidence of significant optimization potential for retrieval mechanisms in language model pretraining.
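As an illustration of the mechanism the abstract describes, the sketch below scores lexical overlap between an input and its retrieved neighbors and, when no neighbor clears a threshold, appends a synthetic neighbor built by paraphrasing the query. This is a minimal sketch under assumed details: the whitespace tokenization, the `overlap_threshold` value, and the `paraphrase` stub are illustrative choices, not the paper's implementation.

```python
# Minimal sketch of input-neighbor overlap scoring and paraphrase-based
# augmentation. Tokenization, threshold, and the paraphrase stub are
# illustrative assumptions, not the paper's actual pipeline.

def token_overlap(query_tokens: list[str], neighbor_tokens: list[str]) -> float:
    """Fraction of query tokens that also occur in the neighbor."""
    if not query_tokens:
        return 0.0
    neighbor_set = set(neighbor_tokens)
    return sum(t in neighbor_set for t in query_tokens) / len(query_tokens)

def paraphrase(text: str) -> str:
    """Placeholder: a real system would call a paraphrasing model here."""
    return text

def augment_neighbors(query: str, neighbors: list[str],
                      overlap_threshold: float = 0.5) -> list[str]:
    """Append a synthetic (paraphrased) neighbor when no retrieved
    neighbor overlaps enough with the query."""
    q_tokens = query.split()
    best = max((token_overlap(q_tokens, n.split()) for n in neighbors),
               default=0.0)
    if best < overlap_threshold:
        neighbors = neighbors + [paraphrase(query)]
    return neighbors

print(augment_neighbors(
    "retrieval augmented models benefit from overlapping context",
    ["a passage about something unrelated"]))
```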

Place, publisher, year, edition, pages
Association for Computational Linguistics, 2025. p. 26847-26856
National Category
Natural Language Processing
Identifiers
URN: urn:nbn:se:liu:diva-221400
DOI: 10.18653/v1/2025.emnlp-main.1363
ISBN: 9798891763326 (print)
OAI: oai:DiVA.org:liu-221400
DiVA, id: diva2:2040397
Conference
Empirical Methods in Natural Language Processing (EMNLP) 2025, Suzhou, China
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Available from: 2026-02-20 Created: 2026-02-20 Last updated: 2026-02-20
In thesis
1. Toward Understanding and Enhancing the Training and Evaluation of Language Models: A Study on Vision, Instruction Tuning, and Retrieval Augmentation
2026 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

This dissertation advances two complementary aims in the study of large language models: (i) understanding their inner workings and (ii) improving their training and evaluation. It does so through three lines of inquiry: integrating visual signals into language modeling, instruction tuning for English and a low-resource language (Swedish), and retrieval augmentation.

First, to study multimodal grounding, pretrained masked language models are exposed to tokenized video alongside aligned text, enabling analysis of how visual context influences next token prediction. Using the psycholinguistically motivated notion of imageability as an interpretable probe, the work shows that video grounding strengthens representations for concrete, highly imageable words, with the effect most consistent in a smaller model. For less imageable words, gains are mixed, and larger models exhibit increased reliance on visual context. These findings indicate that visual grounding benefits are not uniform; they depend on lexical properties and model capacity, and imageability offers a principled lens on what video–language models internalize.
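To make the imageability probe concrete: one way to operationalize it is to correlate, per word, the loss reduction gained from video context with an external imageability rating. The sketch below does this with made-up numbers; the ratings, loss deltas, and the use of Spearman correlation are illustrative assumptions, not the thesis's exact protocol.

```python
# Sketch: probing visual grounding with imageability.
# For each word, compare the model's loss with and without video context;
# if grounding helps concrete words most, the loss reduction should
# correlate with imageability ratings. All numbers below are made up.
from scipy.stats import spearmanr

# Hypothetical imageability ratings (e.g., on a 1-7 scale) and per-word
# loss reductions (loss_text_only - loss_with_video).
imageability = {"apple": 6.8, "dog": 6.5, "justice": 2.1, "idea": 1.9}
loss_reduction = {"apple": 0.42, "dog": 0.37, "justice": 0.03, "idea": -0.01}

words = sorted(imageability)
rho, p_value = spearmanr([imageability[w] for w in words],
                         [loss_reduction[w] for w in words])
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```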

Second, the thesis develops a practical path for instruction tuning in Swedish by translating existing English instruction corpora and finetuning models of varying size and pretraining exposure. Substantial zero-shot gains demonstrate that translated synthetic instructions can substitute for costly native resources. Complementing this, the work assesses automatic evaluation for instruction-following systems using Pairwise Accuracy as a meta-evaluation criterion. It finds that reliability is task- and length-dependent: ROUGE-L is a competitive, low-cost proxy for short, format-constrained outputs; BERTScore is comparatively stronger for longer, free-form answers; and LLM-as-a-judge aligns well with human judgments primarily when provided with reference answers. Cross-lingual analyses highlight that Swedish outputs exacerbate surface-matching weaknesses and no-reference biases, refining guidance on when human assessment remains necessary.
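Pairwise Accuracy as used here can be read as: over all pairs of system outputs, how often does the automatic metric order a pair the same way as the human judgments? A minimal sketch with made-up scores follows; the tie-handling choice (skipping human ties) is an assumption.

```python
# Sketch: Pairwise Accuracy as a meta-evaluation criterion.
# Over all pairs of outputs, count how often the automatic metric orders
# the pair the same way human judgments do. All scores are made up.
from itertools import combinations

def pairwise_accuracy(metric_scores: list[float],
                      human_scores: list[float]) -> float:
    agree, total = 0, 0
    for i, j in combinations(range(len(metric_scores)), 2):
        m = metric_scores[i] - metric_scores[j]
        h = human_scores[i] - human_scores[j]
        if h == 0:          # skip human ties; they carry no ranking signal
            continue
        total += 1
        if (m > 0) == (h > 0):
            agree += 1
    return agree / total if total else 0.0

# Hypothetical metric scores (e.g., ROUGE-L) vs. human quality judgments.
print(pairwise_accuracy([0.41, 0.35, 0.52, 0.20],
                        [3.0, 2.5, 4.0, 1.0]))  # -> 1.0 (full agreement)
```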

Third, the dissertation analyzes retrieval augmentation through a RETRO-style model. It shows that perplexity reductions concentrate on tokens with lexical overlap between inputs and retrieved neighbors, revealing a dominant surface-level “copy mode.” Leveraging this, surface-focused retrieval (e.g., BM25) is used to replace the dense retrieval mechanism during inference, which reduces perplexity further within this architecture, while lightweight hybrids (semantic pre-filtering with BM25 re-ranking) recover additional gains at minimal cost. The findings also demonstrate that during pretraining, performance improves sharply once input–neighbor overlap crosses a threshold; deliberately increasing overlap with targeted paraphrases can cut training time by about 40% without degrading downstream short-answer QA, though with a modest increase in eventual perplexity.
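A rough sketch of the hybrid described above (semantic pre-filtering with BM25 re-ranking): a first stage narrows the corpus to top-k candidates by a dense similarity, and BM25 then re-ranks only those candidates. To stay self-contained, the “dense” scorer below is a toy bag-of-words cosine standing in for a neural encoder, and the BM25 parameters use common defaults; none of this is the thesis's actual implementation.

```python
# Sketch: hybrid retrieval - dense pre-filtering + BM25 re-ranking.
# The "dense" stage is a toy stand-in (cosine over bag-of-words counts);
# a real system would use a neural encoder. k1/b are common BM25 defaults.
import math
from collections import Counter

def cosine_bow(a: list[str], b: list[str]) -> float:
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def bm25_score(query: list[str], doc: list[str], corpus: list[list[str]],
               k1: float = 1.5, b: float = 0.75) -> float:
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n
    tf = Counter(doc)
    score = 0.0
    for t in query:
        df = sum(1 for d in corpus if t in d)   # document frequency
        if df == 0:
            continue
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        denom = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf[t] * (k1 + 1) / denom
    return score

def hybrid_retrieve(query: str, corpus: list[str], prefilter_k: int = 3) -> str:
    q = query.split()
    docs = [d.split() for d in corpus]
    # Stage 1: "semantic" pre-filter down to top-k candidates.
    candidates = sorted(range(len(docs)),
                        key=lambda i: cosine_bow(q, docs[i]),
                        reverse=True)[:prefilter_k]
    # Stage 2: BM25 re-ranking of the surviving candidates only.
    best = max(candidates, key=lambda i: bm25_score(q, docs[i], docs))
    return corpus[best]

corpus = ["retrieval overlap improves language modeling",
          "today's weather report",
          "language models and retrieval"]
print(hybrid_retrieve("retrieval for language models", corpus))
```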

Overall, the thesis clarifies what signals large language models actually exploit and provides actionable recommendations for data curation, model selection, metric choice, and training strategies.

Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2026. p. 173
Series
Linköping Studies in Science and Technology. Dissertations, ISSN 0345-7524 ; 2502
National Category
Natural Language Processing
Identifiers
urn:nbn:se:liu:diva-221398 (URN)
10.3384/9789181184440 (DOI)
9789181184433 (ISBN)
9789181184440 (ISBN)
Public defence
2026-03-27, Ada Lovelace, B-huset, Campus Valla, Linköping, 13:15 (English)
Available from: 2026-02-20 Created: 2026-02-20 Last updated: 2026-03-05

Open Access in DiVA

fulltext (566 kB), 15 downloads
File information
File name: FULLTEXT01.pdf
File size: 566 kB
Checksum: SHA-512
11ae9191fb853dba539e8cbf6a018bd8458ca3644fb3385fbda98dbaecf1796219564e64c95a06e2cf78e093262d1711fdb3ec52112cd4a9b0db81e678670704
Type: fulltext
Mimetype: application/pdf

Other links

Publisher's full text

Authority records

Doostmohammadi, Ehsan; Kuhlmann, Marco

The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.
