Surface-Based Retrieval Reduces Perplexity of Retrieval-Augmented Language Models
Linköping University, Department of Computer and Information Science, Artificial Intelligence and Integrated Computer Systems; Linköping University, Faculty of Science & Engineering. ORCID iD: 0000-0002-5633-5307
Chalmers University of Technology, Sweden; Recorded Future, Sweden
Linköping University, Department of Computer and Information Science, Artificial Intelligence and Integrated Computer Systems; Linköping University, Faculty of Science & Engineering. ORCID iD: 0000-0002-2492-9872
Chalmers University of Technology, Sweden; University of Gothenburg, Sweden
2023 (English). In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) / [ed] Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki. Association for Computational Linguistics, 2023, p. 521–529. Conference paper, published paper (refereed).
Abstract [en]

Augmenting language models with a retrieval mechanism has been shown to significantly improve their performance while keeping the number of parameters low. Retrieval-augmented models commonly rely on a semantic retrieval mechanism based on the similarity between dense representations of the query chunk and potential neighbors. In this paper, we study the state-of-the-art Retro model and observe that its performance gain is better explained by surface-level similarities, such as token overlap. Inspired by this, we replace the semantic retrieval in Retro with a surface-level method based on BM25, obtaining a significant reduction in perplexity. As full BM25 retrieval can be computationally costly for large datasets, we also apply it in a re-ranking scenario, gaining part of the perplexity reduction with minimal computational overhead.
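For readers unfamiliar with BM25, the sketch below shows the standard Okapi BM25 scoring that surface-level retrieval of this kind builds on. It is a minimal, self-contained Python illustration, not the authors' implementation; the toy corpus, query, and parameter values (k1 = 1.5, b = 0.75, the common defaults) are assumptions chosen for the example.

import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    # Score every document against the query with the Okapi BM25 formula.
    N = len(corpus_tokens)
    avgdl = sum(len(d) for d in corpus_tokens) / N
    # Smoothed inverse document frequency for each distinct query term.
    idf = {}
    for t in set(query_tokens):
        n = sum(1 for d in corpus_tokens if t in d)
        idf[t] = math.log((N - n + 0.5) / (n + 0.5) + 1)
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)  # term frequencies in this document
        norm = k1 * (1 - b + b * len(doc) / avgdl)  # length normalization
        s = sum(idf[t] * tf[t] * (k1 + 1) / (tf[t] + norm) for t in query_tokens)
        scores.append(s)
    return scores

# Illustrative usage: pick the neighbor with the highest surface similarity.
corpus = [d.split() for d in [
    "retrieval augmented language models use nearest neighbors",
    "the cat sat on the mat",
    "token overlap between query and neighbor lowers perplexity",
]]
query = "token overlap with retrieved neighbors".split()
scores = bm25_scores(query, corpus)
print(max(range(len(corpus)), key=lambda i: scores[i]))  # prints 2: the doc sharing the most query tokens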

Place, publisher, year, edition, pages
Association for Computational Linguistics, 2023. p. 521–529
National Category
Applied Mechanics
Identifiers
URN: urn:nbn:se:liu:diva-196564
DOI: 10.18653/v1/2023.acl-short.45
ISI: 001181088800045
Scopus ID: 2-s2.0-85172191772
ISBN: 9781959429715 (print)
OAI: oai:DiVA.org:liu-196564
DiVA id: diva2:1787383
Conference
61st Annual Meeting of the Association for Computational Linguistics (ACL), Toronto, Canada, July 9–14, 2023
Note

Funding agencies: Wallenberg AI, Autonomous Systems and Software Program (WASP) - Knut and Alice Wallenberg Foundation; Alvis - Swedish Research Council [2022-06725]; Knut and Alice Wallenberg Foundation at the National Supercomputer Center

Available from: 2023-08-14 Created: 2023-08-14 Last updated: 2026-02-20
In thesis
1. Toward Understanding and Enhancing the Training and Evaluation of Language Models: A Study on Vision, Instruction Tuning, and Retrieval Augmentation
2026 (English). Doctoral thesis, comprehensive summary (Other academic).
Abstract [en]

This dissertation advances two complementary aims in the study of large language models: (i) understanding their inner workings and (ii) improving their training and evaluation. It does so through three lines of inquiry: integrating visual signals into language modeling, instruction tuning for English and a low-resource language (Swedish), and retrieval augmentation.

First, to study multimodal grounding, pretrained masked language models are exposed to tokenized video alongside aligned text, enabling analysis of how visual context influences next token prediction. Using the psycholinguistically motivated notion of imageability as an interpretable probe, the work shows that video grounding strengthens representations for concrete, highly imageable words, with the effect most consistent in a smaller model. For less imageable words, gains are mixed, and larger models exhibit increased reliance on visual context. These findings indicate that visual grounding benefits are not uniform; they depend on lexical properties and model capacity, and imageability offers a principled lens on what video–language models internalize.

Second, the thesis develops a practical path for instruction tuning in Swedish by translating existing English instruction corpora and fine-tuning models of varying size and pretraining exposure. Substantial zero-shot gains demonstrate that translated synthetic instructions can substitute for costly native resources. Complementing this, the work assesses automatic evaluation for instruction-following systems using Pairwise Accuracy as a meta-evaluation criterion. It finds that reliability is task- and length-dependent: ROUGE-L is a competitive, low-cost proxy for short, format-constrained outputs; BERTScore is comparatively stronger for longer, free-form answers; and LLM-as-a-judge aligns well with human judgments primarily when provided with reference answers. Cross-lingual analyses highlight that Swedish outputs exacerbate surface-matching weaknesses and no-reference biases, refining guidance on when human assessment remains necessary.
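To make the meta-evaluation criterion concrete: Pairwise Accuracy is the fraction of output pairs on which an automatic metric ranks the two outputs in the same order as human judgments do. The Python sketch below is a schematic illustration of that computation; the metric and human scores are hypothetical toy values, not data from the thesis.

from itertools import combinations

def pairwise_accuracy(metric_scores, human_scores):
    # Fraction of pairs (i, j) on which the metric and the human scores
    # agree about which output is better. Tied pairs are skipped.
    agree, total = 0, 0
    for i, j in combinations(range(len(metric_scores)), 2):
        m = metric_scores[i] - metric_scores[j]
        h = human_scores[i] - human_scores[j]
        if m == 0 or h == 0:
            continue  # skip ties in either ranking
        total += 1
        if (m > 0) == (h > 0):
            agree += 1
    return agree / total if total else 0.0

# Hypothetical example: ROUGE-L scores vs. human ratings for five outputs.
rouge_l = [0.42, 0.31, 0.55, 0.28, 0.47]
human = [4, 2, 5, 3, 4]
print(pairwise_accuracy(rouge_l, human))  # 8/9 ≈ 0.89 on this toy data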

Third, the dissertation analyzes retrieval augmentation through a RETRO-style model. It shows that perplexity reductions concentrate on tokens with lexical overlap between inputs and retrieved neighbors, revealing a dominant surface-level “copy mode.” Leveraging this, surface-focused retrieval (e.g., BM25) is used to replace the dense retrieval mechanism during inference, which reduces perplexity further within this architecture, while lightweight hybrids (semantic pre-filtering with BM25 re-ranking) recover additional gains at minimal cost. The findings also demonstrate that during pretraining, performance improves sharply once input–neighbor overlap crosses a threshold; deliberately increasing overlap with targeted paraphrases can cut training time by about 40% without degrading downstream short-answer QA, though with a modest increase in eventual perplexity.
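As a rough picture of the "lightweight hybrid" mentioned above, the following Python sketch wires semantic pre-filtering to a surface-level re-ranking stage. It is a schematic under stated assumptions: dense_scores stands in for similarities produced by a dense retriever (e.g., an approximate nearest-neighbor index), and a simple unigram-overlap score stands in for BM25; neither is the thesis's actual implementation.

def surface_overlap(query_tokens, doc_tokens):
    # Fraction of query tokens that also occur in the candidate neighbor:
    # a cheap stand-in for BM25-style surface similarity.
    q = set(query_tokens)
    return len(q & set(doc_tokens)) / len(q) if q else 0.0

def hybrid_retrieve(query_tokens, corpus_tokens, dense_scores, pre_k=100, k=5):
    # Stage 1: semantic pre-filtering -- keep the pre_k candidates the
    # dense retriever ranks highest (dense_scores[i] is assumed given).
    candidates = sorted(range(len(corpus_tokens)),
                        key=lambda i: dense_scores[i], reverse=True)[:pre_k]
    # Stage 2: surface re-ranking -- reorder those candidates by overlap
    # with the query and return the top k as the retrieved neighbors.
    candidates.sort(key=lambda i: surface_overlap(query_tokens, corpus_tokens[i]),
                    reverse=True)
    return candidates[:k]

The point of this design is that the expensive surface scoring only touches pre_k candidates rather than the whole datastore, which is why such a hybrid can recover part of the perplexity reduction at minimal computational overhead.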

Overall, the thesis clarifies what signals large language models actually exploit and provides actionable recommendations for data curation, model selection, metric choice, and training strategies.

Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2026. p. 173
Series
Linköping Studies in Science and Technology. Dissertations, ISSN 0345-7524 ; 2502
National Category
Natural Language Processing
Identifiers
URN: urn:nbn:se:liu:diva-221398
DOI: 10.3384/9789181184440
ISBN: 9789181184433
ISBN: 9789181184440
Public defence
2026-03-27, Ada Lovelace, B-huset, Campus Valla, Linköping, 13:15 (English)
Available from: 2026-02-20 Created: 2026-02-20 Last updated: 2026-03-05

Open Access in DiVA

fulltext (552 kB), 4 downloads
File information
File name: FULLTEXT01.pdf
File size: 552 kB
Checksum (SHA-512): 3d464f144ce2118ba0277942397f355c4759590044fe6112c6344b1bf750f223abb5d4cd966f3929b9841b30db2c0550242121f28cab787f0d9e273c5de3845a
Type: fulltext
Mimetype: application/pdf

Other links

Publisher's full text
Scopus

Authority records

Doostmohammadi, Ehsan
Kuhlmann, Marco
