How Reliable Are Automatic Evaluation Methods for Instruction-Tuned LLMs?
Doostmohammadi, Ehsan. Linköping University, Department of Computer and Information Science, Artificial Intelligence and Integrated Computer Systems. Linköping University, Faculty of Science & Engineering. ORCID iD: 0000-0002-5633-5307
Holmström, Oskar. Linköping University, Department of Computer and Information Science, Artificial Intelligence and Integrated Computer Systems. Linköping University, Faculty of Science & Engineering.
Kuhlmann, Marco. Linköping University, Faculty of Science & Engineering. Linköping University, Department of Computer and Information Science, Artificial Intelligence and Integrated Computer Systems. ORCID iD: 0000-0002-2492-9872
2024 (English). In: Findings of the Association for Computational Linguistics: EMNLP 2024 / [ed] Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen, Association for Computational Linguistics, 2024, p. 6321-6336. Conference paper, Published paper (Refereed)
Abstract [en]

Work on instruction-tuned Large Language Models (LLMs) has used automatic methods based on text overlap and LLM judgments as cost-effective alternatives to human evaluation. In this paper, we perform a meta-evaluation of such methods and assess their reliability across a broad range of tasks. In evaluating how well automatic methods align with human evaluations, correlation metrics are the most commonly employed method despite their inherent limitations when dealing with ties and different scales. To address these shortcomings, we use Pairwise Accuracy as an alternative to standard correlation measures. We observe that while automatic evaluation methods can approximate human ratings under specific conditions, their validity is highly context-dependent. Specifically, the simple ROUGE-L metric correlates very well with human ratings for short-answer English tasks but is unreliable in free-form generation tasks and cross-lingual scenarios. The effectiveness of the more advanced method of using GPT-4 as a judge diminishes significantly if reference answers are not included in the prompt, which is the scenario where this method has the potential to provide the most value compared to other metrics. Our findings enhance the understanding of how automatic methods should be applied and interpreted when developing and evaluating instruction-tuned LLMs.
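The abstract proposes Pairwise Accuracy as an alternative to correlation measures. As a rough illustration of the idea, the sketch below computes the fraction of item pairs on which an automatic metric orders two outputs the same way as human ratings. The function name and the tie-skipping rule are illustrative choices, not necessarily the paper's exact formulation:

```python
from itertools import combinations

def pairwise_accuracy(human, metric):
    """Fraction of item pairs on which the metric's ordering agrees
    with the human ordering. Pairs that are tied under either scoring
    are skipped here -- one simple tie-handling choice among several."""
    agree, total = 0, 0
    for i, j in combinations(range(len(human)), 2):
        dh = human[i] - human[j]
        dm = metric[i] - metric[j]
        if dh == 0 or dm == 0:  # skip tied pairs
            continue
        total += 1
        if (dh > 0) == (dm > 0):
            agree += 1
    return agree / total if total else 0.0

# Toy example: the metric agrees with humans on 2 of the 3 decisive pairs.
human = [3, 1, 2]
metric = [0.9, 0.2, 0.1]
print(pairwise_accuracy(human, metric))  # 0.666...
```

Unlike Pearson or Spearman correlation, this quantity is insensitive to the absolute scales of the two score sets, which is one motivation the abstract gives for preferring it.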

Place, publisher, year, edition, pages
Association for Computational Linguistics, 2024. p. 6321-6336
National Category
Computer Sciences; Natural Language Processing
Identifiers
URN: urn:nbn:se:liu:diva-210271
DOI: 10.18653/v1/2024.findings-emnlp.367
ISI: 001511154406029
OAI: oai:DiVA.org:liu-210271
DiVA, id: diva2:1919099
Conference
EMNLP 2024, Miami, Florida, USA
Note

Funding agencies: Wallenberg AI, Autonomous Systems and Software Program (WASP); European Union [101135671]; National Graduate School of Computer Science in Sweden (CUGS); Knut and Alice Wallenberg Foundation

Available from: 2024-12-06. Created: 2024-12-06. Last updated: 2026-02-20. Bibliographically approved.
In thesis
1. Toward Understanding and Enhancing the Training and Evaluation of Language Models: A Study on Vision, Instruction Tuning, and Retrieval Augmentation
2026 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

This dissertation advances two complementary aims in the study of large language models: (i) understanding their inner workings and (ii) improving their training and evaluation. It does so through three lines of inquiry: integrating visual signals into language modeling, instruction tuning for English and a low-resource language (Swedish), and retrieval augmentation.

First, to study multimodal grounding, pretrained masked language models are exposed to tokenized video alongside aligned text, enabling analysis of how visual context influences next token prediction. Using the psycholinguistically motivated notion of imageability as an interpretable probe, the work shows that video grounding strengthens representations for concrete, highly imageable words, with the effect most consistent in a smaller model. For less imageable words, gains are mixed, and larger models exhibit increased reliance on visual context. These findings indicate that visual grounding benefits are not uniform; they depend on lexical properties and model capacity, and imageability offers a principled lens on what video–language models internalize.

Second, the thesis develops a practical path for instruction tuning in Swedish by translating existing English instruction corpora and finetuning models of varying size and pretraining exposure. Substantial zero-shot gains demonstrate that translated synthetic instructions can substitute for costly native resources. Complementing this, the work assesses automatic evaluation for instruction-following systems using Pairwise Accuracy as a meta-evaluation criterion. It finds that reliability is task- and length-dependent: ROUGE-L is a competitive, low-cost proxy for short, format-constrained outputs; BERTScore is comparatively stronger for longer, free-form answers; and LLM-as-a-judge aligns well with human judgments primarily when provided with reference answers. Cross-lingual analyses highlight that Swedish outputs exacerbate surface-matching weaknesses and no-reference biases, refining guidance on when human assessment remains necessary.

Third, the dissertation analyzes retrieval augmentation through a RETRO-style model. It shows that perplexity reductions concentrate on tokens with lexical overlap between inputs and retrieved neighbors, revealing a dominant surface-level “copy mode.” Leveraging this, surface-focused retrieval (e.g., BM25) is used to replace the dense retrieval mechanism during inference, which reduces perplexity further within this architecture, while lightweight hybrids (semantic pre-filtering with BM25 re-ranking) recover additional gains at minimal cost. The findings also demonstrate that during pretraining, performance improves sharply once input–neighbor overlap crosses a threshold; deliberately increasing overlap with targeted paraphrases can cut training time by about 40% without degrading downstream short-answer QA, though with a modest increase in eventual perplexity.
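The "copy mode" described above hinges on surface overlap between the input and its retrieved neighbors. As a toy illustration (not the thesis implementation; `lexical_overlap` and `rank_neighbors` are hypothetical helpers, and real BM25 additionally weights terms by rarity and document length), one can score and re-rank candidate neighbors purely by token overlap with the query:

```python
from collections import Counter

def lexical_overlap(query_tokens, neighbor_tokens):
    """Fraction of query tokens also present in the neighbor -- a crude
    proxy for the surface overlap that drives the observed gains."""
    q, n = Counter(query_tokens), Counter(neighbor_tokens)
    shared = sum(min(count, n[tok]) for tok, count in q.items())
    return shared / max(1, sum(q.values()))

def rank_neighbors(query_tokens, candidate_neighbors):
    """Re-rank candidates by lexical overlap with the query, mimicking a
    surface-focused (BM25-like) retrieval stage."""
    return sorted(candidate_neighbors,
                  key=lambda n: lexical_overlap(query_tokens, n),
                  reverse=True)

query = ["the", "cat", "sat"]
neighbors = [["a", "dog", "ran"], ["the", "cat", "slept"]]
best = rank_neighbors(query, neighbors)[0]  # the high-overlap neighbor
```

In a hybrid setup of the kind the abstract mentions, a dense retriever would first produce the candidate pool and a lexical scorer like this would re-rank it.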

Overall, the thesis clarifies what signals large language models actually exploit and provides actionable recommendations for data curation, model selection, metric choice, and training strategies.

Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2026. p. 173
Series
Linköping Studies in Science and Technology. Dissertations, ISSN 0345-7524 ; 2502
National Category
Natural Language Processing
Identifiers
URN: urn:nbn:se:liu:diva-221398
DOI: 10.3384/9789181184440
ISBN: 9789181184433
ISBN: 9789181184440
Public defence
2026-03-27, Ada Lovelace, B-huset, Campus Valla, Linköping, 13:15 (English)
Available from: 2026-02-20 Created: 2026-02-20 Last updated: 2026-03-05

Open Access in DiVA

fulltext (633 kB), 7 downloads
File information
File name: FULLTEXT02.pdf. File size: 633 kB. Checksum: SHA-512
f9ab20dd6c35aed0ec826e01d121811a989b5c5ee2f720c3700a75abe8e31b3c587bb05166e6e8e85237ae4f5d2e8723c7588bee7fefbdc2941b3835f68843cb
Type: fulltext. Mimetype: application/pdf


Authority records

Doostmohammadi, Ehsan; Holmström, Oskar; Kuhlmann, Marco

