liu.se | Search for publications in DiVA
Kuhlmann, Marco, Professor (ORCID iD: orcid.org/0000-0002-2492-9872)
Publications (10 of 53)
Drewes, F., Kuhlmann, M. & Torstensson, O. (2026). Dynamically weighted tree transducers. In: Implementation and Application of Automata: 29th International Conference, CIAA 2025, Palermo, Italy, September 22–25, 2025, Proceedings. Paper presented at 29th International Conference on Implementation and Application of Automata, CIAA 2025, Palermo, Italy, September 22-25, 2025 (pp. 115-128). Cham: Springer
Dynamically weighted tree transducers
2026 (English). In: Implementation and Application of Automata: 29th International Conference, CIAA 2025, Palermo, Italy, September 22–25, 2025, Proceedings. Cham: Springer, 2026, pp. 115-128. Conference paper, Published paper (Refereed)
Abstract [en]

We introduce dynamically weighted tree transducers (dyn-wtts), a weighted generalization of top-down tree transducers with regular look-ahead in which rule weights are determined by external tree weighters mapping input trees to values in a commutative semiring. The general framework allows for any kind of device defining a weighted tree language to serve as a tree weighter. In this paper, we focus on weighters implemented by different classes of tree automata and show how the resulting classes of weighted tree transformations relate to one another and to known classes. In particular, we show how conventional top-down weighted tree transducers (with and without regular look-ahead) can be expressed as dyn-wtts, also in the linear and non-deleting cases.

Place, publisher, year, edition, pages
Cham: Springer, 2026
Series
Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349 ; 15981
Keywords
Regular look-ahead, Weighted tree automata, Weighted tree transducers
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-219598 (URN)
10.1007/978-3-032-02602-6_9 (DOI)
001585674100009 ()
2-s2.0-105014494570 (Scopus ID)
9783032026019 (ISBN)
9783032026026 (ISBN)
Conference
29th International Conference on Implementation and Application of Automata, CIAA 2025, Palermo, Italy, September 22-25, 2025
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP); Swedish Research Council, 2024-05318
Note

Funding Agencies|Wallenberg AI, Autonomous Systems and Software Program (WASP) - Knut and Alice Wallenberg Foundation; Swedish Research Council [2024-05318]

Available from: 2025-11-20 Created: 2025-11-20 Last updated: 2025-12-11
Sanchez Aimar, E., Helgesen, N., Xu, Y., Kuhlmann, M. & Felsberg, M. (2024). Flexible Distribution Alignment: Towards Long-Tailed Semi-supervised Learning with Proper Calibration. In: Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, Gül Varol (Ed.), Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part LIV. Paper presented at 18th European Conference, Milan, Italy, September 29–October 4, 2024 (pp. 307-327). Springer Nature Switzerland, 15112
Flexible Distribution Alignment: Towards Long-Tailed Semi-supervised Learning with Proper Calibration
2024 (English). In: Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part LIV / [ed] Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, Gül Varol. Springer Nature Switzerland, 2024, Vol. 15112, pp. 307-327. Conference paper, Published paper (Refereed)
Abstract [en]

Long-tailed semi-supervised learning (LTSSL) represents a practical scenario for semi-supervised applications, challenged by skewed labeled distributions that bias classifiers. This problem is often aggravated by discrepancies between labeled and unlabeled class distributions, leading to biased pseudo-labels, neglect of rare classes, and poorly calibrated probabilities. To address these issues, we introduce Flexible Distribution Alignment (FlexDA), a novel adaptive logit-adjusted loss framework designed to dynamically estimate and align predictions with the actual distribution of unlabeled data and achieve a balanced classifier by the end of training. FlexDA is further enhanced by a distillation-based consistency loss, promoting fair data usage across classes and effectively leveraging underconfident samples. This method, encapsulated in ADELLO (Align and Distill Everything All at Once), proves robust against label shift, significantly improves model calibration in LTSSL contexts, and surpasses previous state-of-the-art approaches across multiple benchmarks, including CIFAR100-LT, STL10-LT, and ImageNet127, addressing class imbalance challenges in semi-supervised learning. Our code is available at https://github.com/emasa/ADELLO-LTSSL.

Place, publisher, year, edition, pages
Springer Nature Switzerland, 2024
Series
Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349 ; 15112
National Category
Computer Systems
Identifiers
urn:nbn:se:liu:diva-209223 (URN)
10.1007/978-3-031-72949-2_18 (DOI)
001352860600018 ()
2-s2.0-85208545165 (Scopus ID)
9783031729485 (ISBN)
9783031729492 (ISBN)
Conference
18th European Conference, Milan, Italy, September 29–October 4, 2024
Note

Funding Agencies|Wallenberg Artificial Intelligence, Autonomous Systems and Software Program (WASP) - Knut and Alice Wallenberg Foundation; Swedish Research Council [2022-06725]; Knut and Alice Wallenberg Foundation at the National Supercomputer Centre

Available from: 2024-11-06 Created: 2024-11-06 Last updated: 2025-11-18
Doostmohammadi, E., Holmström, O. & Kuhlmann, M. (2024). How Reliable Are Automatic Evaluation Methods for Instruction-Tuned LLMs? Paper presented at Findings of the Association for Computational Linguistics: EMNLP 2024.
How Reliable Are Automatic Evaluation Methods for Instruction-Tuned LLMs?
2024 (English). Conference paper, Published paper (Refereed)
Abstract [en]

Work on instruction-tuned Large Language Models (LLMs) has used automatic methods based on text overlap and LLM judgments as cost-effective alternatives to human evaluation. In this paper, we perform a meta-evaluation of such methods and assess their reliability across a broad range of tasks. In evaluating how well automatic methods align with human evaluations, correlation metrics are the most commonly employed method despite their inherent limitations when dealing with ties and different scales. To address these shortcomings, we use Pairwise Accuracy as an alternative to standard correlation measures. We observe that while automatic evaluation methods can approximate human ratings under specific conditions, their validity is highly context-dependent. Specifically, the simple ROUGE-L metric correlates very well with human ratings for short-answer English tasks but is unreliable in free-form generation tasks and cross-lingual scenarios. The effectiveness of the more advanced method of using GPT-4 as a judge diminishes significantly if reference answers are not included in the prompt, which is the scenario where this method has the potential to provide the most value compared to other metrics. Our findings enhance the understanding of how automatic methods should be applied and interpreted when developing and evaluating instruction-tuned LLMs.
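The Pairwise Accuracy measure mentioned in this abstract can be illustrated with a minimal sketch: for every pair of examples that humans rank differently, check whether the automatic metric orders them the same way. This is a hypothetical illustration of the general idea, not the authors' implementation; the function name and tie handling are assumptions.

```python
from itertools import combinations

def pairwise_accuracy(human_scores, metric_scores):
    """Fraction of human-ranked pairs that an automatic metric orders the same way.

    Hypothetical sketch: pairs tied in the human ratings are skipped,
    since they carry no ranking signal.
    """
    agree = total = 0
    for i, j in combinations(range(len(human_scores)), 2):
        h = human_scores[i] - human_scores[j]
        m = metric_scores[i] - metric_scores[j]
        if h == 0:  # human tie: no preference to agree or disagree with
            continue
        total += 1
        if h * m > 0:  # same sign: metric ranks this pair like the humans
            agree += 1
    return agree / total if total else 0.0
```

Unlike correlation coefficients, this quantity is unaffected by the two score scales and handles ties in the human ratings by simply excluding them, which is one way to sidestep the limitations the abstract notes.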

National Category
Computer Sciences; Natural Language Processing
Identifiers
urn:nbn:se:liu:diva-210271 (URN)
Conference
Findings of the Association for Computational Linguistics: EMNLP 2024
Available from: 2024-12-06 Created: 2024-12-06 Last updated: 2025-02-01. Bibliographically approved
Gestrin, E., Kuhlmann, M. & Seipp, J. (2024). NL2Plan: Robust LLM-Driven Planning from Minimal Text Descriptions. In: ICAPS 2024 Workshop on Human-Aware and Explainable Planning (HAXP). Paper presented at The 34th International Conference on Automated Planning and Scheduling.
NL2Plan: Robust LLM-Driven Planning from Minimal Text Descriptions
2024 (English). In: ICAPS 2024 Workshop on Human-Aware and Explainable Planning (HAXP), 2024. Conference paper, Published paper (Refereed)
Abstract [en]

Today's classical planners are powerful, but modeling input tasks in formats such as PDDL is tedious and error-prone. In contrast, planning with Large Language Models (LLMs) allows for almost any input text, but offers no guarantees on plan quality or even soundness. In an attempt to merge the best of these two approaches, some work has begun to use LLMs to automate parts of the PDDL creation process. However, these methods still require various degrees of expert input. We present NL2Plan, the first domain-agnostic offline LLM-driven planning system. NL2Plan uses an LLM to incrementally extract the necessary information from a short text prompt before creating a complete PDDL description of both the domain and the problem, which is finally solved by a classical planner. We evaluate NL2Plan on four planning domains and find that it solves 10 out of 15 tasks - a clear improvement over a plain chain-of-thought reasoning LLM approach, which only solves 2 tasks. Moreover, in two out of the five failure cases, instead of returning an invalid plan, NL2Plan reports that it failed to solve the task. In addition to using NL2Plan in end-to-end mode, users can inspect and correct all of its intermediate results, such as the PDDL representation, increasing explainability and making it an assistive tool for PDDL creation.

Keywords
Large Language Model, Planning, Domain Modeling, LLMs for Planning
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-215753 (URN)
Conference
The 34th International Conference on Automated Planning and Scheduling
Available from: 2025-06-27 Created: 2025-06-27 Last updated: 2025-06-27
Kunz, J. & Kuhlmann, M. (2024). Properties and Challenges of LLM-Generated Explanations. In: Su Lin Blodgett, Amanda Cercas Curry, Sunipa Dey, Michael Madaio, Ani Nenkova, Diyi Yang, Ziang Xiao (Ed.), Proceedings of the Third Workshop on Bridging Human-Computer Interaction and Natural Language Processing. Paper presented at Third Workshop on Bridging Human-Computer Interaction and Natural Language Processing at NAACL 2024 (pp. 13-27).
Properties and Challenges of LLM-Generated Explanations
2024 (English). In: Proceedings of the Third Workshop on Bridging Human-Computer Interaction and Natural Language Processing / [ed] Su Lin Blodgett, Amanda Cercas Curry, Sunipa Dey, Michael Madaio, Ani Nenkova, Diyi Yang, Ziang Xiao, 2024, pp. 13-27. Conference paper, Published paper (Refereed)
Abstract [en]

The self-rationalising capabilities of large language models (LLMs) have been explored in restricted settings, using task-specific data sets. However, current LLMs do not (only) rely on specifically annotated data; nonetheless, they frequently explain their outputs. The properties of the generated explanations are influenced by the pre-training corpus and by the target data used for instruction fine-tuning. As the pre-training corpus includes a large amount of human-written explanations “in the wild”, we hypothesise that LLMs adopt common properties of human explanations. By analysing the outputs for a multi-domain instruction fine-tuning data set, we find that generated explanations show selectivity and contain illustrative elements, but less frequently are subjective or misleading. We discuss reasons and consequences of the properties’ presence or absence. In particular, we outline positive and negative implications depending on the goals and user groups of the self-rationalising system.

Keywords
Natural Language Processing, Large Language Models, Explainability, Human-AI Interaction
National Category
Natural Language Processing
Identifiers
urn:nbn:se:liu:diva-207112 (URN)10.18653/v1/2024.hcinlp-1.2 (DOI)9798891761117 (ISBN)
Conference
Third Workshop on Bridging Human-Computer Interaction and Natural Language Processing at NAACL 2024
Available from: 2024-09-02 Created: 2024-09-02 Last updated: 2025-07-15
Holmström, O., Kunz, J. & Kuhlmann, M. (2023). Bridging the Resource Gap: Exploring the Efficacy of English and Multilingual LLMs for Swedish. In: Proceedings of the Second Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2023). Paper presented at RESOURCEFUL workshop at NoDaLiDa (pp. 92-110). Tórshavn, the Faroe Islands
Bridging the Resource Gap: Exploring the Efficacy of English and Multilingual LLMs for Swedish
2023 (English). In: Proceedings of the Second Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2023), Tórshavn, the Faroe Islands, 2023, pp. 92-110. Conference paper, Published paper (Refereed)
Abstract [en]

Large language models (LLMs) have substantially improved natural language processing (NLP) performance, but training these models from scratch is resource-intensive and challenging for smaller languages. With this paper, we want to initiate a discussion on the necessity of language-specific pre-training of LLMs. We propose how the “one model-many models” conceptual framework for task transfer can be applied to language transfer and explore this approach by evaluating the performance of non-Swedish monolingual and multilingual models on tasks in Swedish. Our findings demonstrate that LLMs exposed to limited Swedish during training can be highly capable and transfer competencies from English off-the-shelf, including emergent abilities such as mathematical reasoning, while at the same time showing distinct culturally adapted behaviour. Our results suggest that there are resourceful alternatives to language-specific pre-training when creating useful LLMs for small languages.

Place, publisher, year, edition, pages
Tórshavn, the Faroe Islands, 2023
Keywords
NLP, Natural Language Processing, language model, GPT, monolingual, multilingual, cross-lingual, one model-many models
National Category
Natural Language Processing
Identifiers
urn:nbn:se:liu:diva-196545 (URN)
Conference
RESOURCEFUL workshop at NoDaLiDa
Funder
CUGS (National Graduate School in Computer Science)
Available from: 2023-08-11 Created: 2023-08-11 Last updated: 2025-02-07
Norlund, T., Doostmohammadi, E., Johansson, R. & Kuhlmann, M. (2023). On the Generalization Ability of Retrieval-Enhanced Transformers. In: Findings of the Association for Computational Linguistics. Paper presented at EACL 2023, May 2-6, 2023 (pp. 1485-1493). Association for Computational Linguistics (ACL)
On the Generalization Ability of Retrieval-Enhanced Transformers
2023 (English). In: Findings of the Association for Computational Linguistics, Association for Computational Linguistics (ACL), 2023, pp. 1485-1493. Conference paper, Published paper (Refereed)
Abstract [en]

Recent work on the Retrieval-Enhanced Transformer (Retro) model has shown that offloading memory from trainable weights to a retrieval database can significantly improve language modeling and match the performance of non-retrieval models that are an order of magnitude larger in size. It has been suggested that at least some of this performance gain is due to non-trivial generalization based on both model weights and retrieval. In this paper, we try to better understand the relative contributions of these two components. We find that the performance gains from retrieval largely originate from overlapping tokens between the database and the test data, suggesting less non-trivial generalization than previously assumed. More generally, our results point to the challenges of evaluating the generalization of retrieval-augmented language models such as Retro, as even limited token overlap may significantly decrease test-time loss. We release our code and model at https://github.com/TobiasNorlund/retro.

Place, publisher, year, edition, pages
Association for Computational Linguistics (ACL), 2023
National Category
Natural Language Processing
Identifiers
urn:nbn:se:liu:diva-195609 (URN)
001181085100107 ()
9781959429470 (ISBN)
Conference
EACL 2023, May 2-6, 2023
Note

Funding Agencies|Wallenberg AI, Autonomous Systems and Software Program (WASP) - Knut and Alice Wallenberg Foundation; Swedish Research Council [2022-06725]

Available from: 2023-06-22 Created: 2023-06-22 Last updated: 2025-02-07. Bibliographically approved
Doostmohammadi, E., Norlund, T., Kuhlmann, M. & Johansson, R. (2023). Surface-Based Retrieval Reduces Perplexity of Retrieval-Augmented Language Models. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Paper presented at the 61st Annual Meeting of the Association for Computational Linguistics (ACL), Toronto, Canada, July 9-14, 2023 (pp. 521-529). Association for Computational Linguistics (ACL)
Surface-Based Retrieval Reduces Perplexity of Retrieval-Augmented Language Models
2023 (English). In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics (ACL), 2023, pp. 521-529. Conference paper, Published paper (Refereed)
Abstract [en]

Augmenting language models with a retrieval mechanism has been shown to significantly improve their performance while keeping the number of parameters low. Retrieval-augmented models commonly rely on a semantic retrieval mechanism based on the similarity between dense representations of the query chunk and potential neighbors. In this paper, we study the state-of-the-art Retro model and observe that its performance gain is better explained by surface-level similarities, such as token overlap. Inspired by this, we replace the semantic retrieval in Retro with a surface-level method based on BM25, obtaining a significant reduction in perplexity. As full BM25 retrieval can be computationally costly for large datasets, we also apply it in a re-ranking scenario, gaining part of the perplexity reduction with minimal computational overhead.
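The surface-level scoring referred to in this abstract is standard BM25 over token overlap. The following is a hypothetical sketch of that scoring function, not the authors' implementation; the whitespace tokenization and the parameter values k1=1.5, b=0.75 are assumptions.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document in `docs` against `query` with standard BM25.

    Hypothetical sketch: whitespace tokenization, default k1/b values.
    Higher scores mean more surface-level (token) overlap with the query.
    """
    tokenized = [d.split() for d in docs]
    avgdl = sum(len(t) for t in tokenized) / len(tokenized)  # average doc length
    n = len(docs)
    df = Counter()  # document frequency of each term
    for toks in tokenized:
        df.update(set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for w in query.split():
            if w not in tf:
                continue
            # Smoothed inverse document frequency
            idf = math.log((n - df[w] + 0.5) / (df[w] + 0.5) + 1)
            # Saturating term frequency with length normalization
            score += idf * tf[w] * (k1 + 1) / (
                tf[w] + k1 * (1 - b + b * len(toks) / avgdl)
            )
        scores.append(score)
    return scores
```

In a re-ranking setup such as the one the abstract describes, a scoring function of this kind would be applied only to a small candidate set returned by the dense retriever, keeping the computational overhead low.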

Place, publisher, year, edition, pages
ASSOC COMPUTATIONAL LINGUISTICS-ACL, 2023
National Category
Applied Mechanics
Identifiers
urn:nbn:se:liu:diva-196564 (URN)
001181088800045 ()
9781959429715 (ISBN)
Conference
61st Annual Meeting of the Association for Computational Linguistics (ACL), Toronto, Canada, July 9-14, 2023
Note

Funding Agencies|Wallenberg AI, Autonomous Systems and Software Program (WASP) - Knut and Alice Wallenberg Foundation; Alvis - Swedish Research Council [2022-06725]; Alice Wallenberg Foundation at the National Supercomputer Center

Available from: 2023-08-14 Created: 2023-08-14 Last updated: 2024-04-23
Kunz, J., Jirénius, M., Holmström, O. & Kuhlmann, M. (2022). Human Ratings Do Not Reflect Downstream Utility: A Study of Free-Text Explanations for Model Predictions. In: Jasmijn Bastings, Yonatan Belinkov, Yanai Elazar, Dieuwke Hupkes, Naomi Saphra, Sarah Wiegreffe (Ed.), Proceedings of the Fifth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP: . Paper presented at BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, December 8, 2022 (pp. 164-177). Association for Computational Linguistics (ACL), 5, Article ID 2022.blackboxnlp-1.14.
Human Ratings Do Not Reflect Downstream Utility: A Study of Free-Text Explanations for Model Predictions
2022 (English). In: Proceedings of the Fifth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP / [ed] Jasmijn Bastings, Yonatan Belinkov, Yanai Elazar, Dieuwke Hupkes, Naomi Saphra, Sarah Wiegreffe. Association for Computational Linguistics (ACL), 2022, Vol. 5, pp. 164-177, article id 2022.blackboxnlp-1.14. Conference paper, Published paper (Refereed)
Abstract [en]

Models able to generate free-text rationales that explain their output have been proposed as an important step towards interpretable NLP for “reasoning” tasks such as natural language inference and commonsense question answering. However, the relative merits of different architectures and types of rationales are not well understood and hard to measure. In this paper, we contribute two insights to this line of research: First, we find that models trained on gold explanations learn to rely on these but, in the case of the more challenging question answering data set we use, fail when given generated explanations at test time. However, additional fine-tuning on generated explanations teaches the model to distinguish between reliable and unreliable information in explanations. Second, we compare explanations by a generation-only model to those generated by a self-rationalizing model and find that, while the former score higher in terms of validity, factual correctness, and similarity to gold explanations, they are not more useful for downstream classification. We observe that the self-rationalizing model is prone to hallucination, which is punished by most metrics but may add useful context for the classification step.

Place, publisher, year, edition, pages
Association for Computational Linguistics (ACL), 2022
Keywords
Large Language Models, Neural Networks, Transformers, Interpretability, Explainability
National Category
Natural Language Processing Computer Sciences
Identifiers
urn:nbn:se:liu:diva-195615 (URN)
10.18653/v1/2022.blackboxnlp-1.14 (DOI)
2-s2.0-85152897033 (Scopus ID)
Conference
BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, December 8, 2022
Available from: 2023-06-22 Created: 2023-06-22 Last updated: 2025-02-01. Bibliographically approved
Doostmohammadi, E. & Kuhlmann, M. (2022). On the Effects of Video Grounding on Language Models. In: Proceedings of the First Workshop on Performance and Interpretability Evaluations of Multimodal, Multipurpose, Massive-Scale Models. Paper presented at First Workshop on Performance and Interpretability Evaluations of Multimodal, Multipurpose, Massive-Scale Models.
On the Effects of Video Grounding on Language Models
2022 (English). In: Proceedings of the First Workshop on Performance and Interpretability Evaluations of Multimodal, Multipurpose, Massive-Scale Models, 2022. Conference paper, Oral presentation with published abstract (Other academic)
Abstract [en]

Transformer-based models trained on text and vision modalities try to improve the performance on multimodal downstream tasks or tackle the problem of lack of grounding, e.g., addressing issues like models’ insufficient commonsense knowledge. While it is more straightforward to evaluate the effects of such models on multimodal tasks, such as visual question answering or image captioning, it is not as well understood how these tasks affect the model itself and its internal linguistic representations. In this work, we experiment with language models grounded in videos and measure the models’ performance on predicting masked words chosen based on their imageability. The results show that the smaller model benefits from video grounding in predicting highly imageable words, while the results for the larger model seem harder to interpret.

National Category
Natural Language Processing
Identifiers
urn:nbn:se:liu:diva-198261 (URN)
Conference
First Workshop on Performance and Interpretability Evaluations of Multimodal, Multipurpose, Massive-Scale Models
Available from: 2023-10-02 Created: 2023-10-02 Last updated: 2025-02-07. Bibliographically approved
Projects
Accurate and efficient non-projective dependency parsing [2008-00296_VR]; Uppsala University
Identifiers
ORCID iD: orcid.org/0000-0002-2492-9872
