Kuhlmann, Marco, Professor (ORCID iD: orcid.org/0000-0002-2492-9872)
Publications (10 of 51)
Sanchez Aimar, E., Helgesen, N., Xu, Y., Kuhlmann, M. & Felsberg, M. (2024). Flexible Distribution Alignment: Towards Long-Tailed Semi-supervised Learning with Proper Calibration. In: Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, Gül Varol (Ed.), Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part LIV. Paper presented at 18th European Conference, Milan, Italy, September 29–October 4, 2024 (pp. 307-327). Springer Nature Switzerland, 15112
Flexible Distribution Alignment: Towards Long-Tailed Semi-supervised Learning with Proper Calibration
2024 (English). In: Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part LIV / [ed] Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, Gül Varol, Springer Nature Switzerland, 2024, Vol. 15112, p. 307-327. Conference paper, Published paper (Refereed)
Abstract [en]

Long-tailed semi-supervised learning (LTSSL) represents a practical scenario for semi-supervised applications, challenged by skewed labeled distributions that bias classifiers. This problem is often aggravated by discrepancies between labeled and unlabeled class distributions, leading to biased pseudo-labels, neglect of rare classes, and poorly calibrated probabilities. To address these issues, we introduce Flexible Distribution Alignment (FlexDA), a novel adaptive logit-adjusted loss framework designed to dynamically estimate and align predictions with the actual distribution of unlabeled data and achieve a balanced classifier by the end of training. FlexDA is further enhanced by a distillation-based consistency loss, promoting fair data usage across classes and effectively leveraging underconfident samples. This method, encapsulated in ADELLO (Align and Distill Everything All at Once), proves robust against label shift, significantly improves model calibration in LTSSL contexts, and surpasses previous state-of-the-art approaches across multiple benchmarks, including CIFAR100-LT, STL10-LT, and ImageNet127, addressing class imbalance challenges in semi-supervised learning. Our code is available at https://github.com/emasa/ADELLO-LTSSL.
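The adaptive logit-adjusted loss mentioned in this abstract can be illustrated with a minimal sketch of generic logit adjustment against an estimated class prior. This is not the FlexDA/ADELLO implementation (the authors' code is at the repository linked above); the function name, the hand-set prior, and the tau parameter are assumptions made only for illustration.

```python
# Illustrative sketch only: a generic logit-adjusted cross-entropy with an
# estimated class prior. This is NOT the FlexDA/ADELLO implementation; see
# https://github.com/emasa/ADELLO-LTSSL for the authors' code. The function
# name, the fixed prior, and tau are assumptions made for this example.
import torch
import torch.nn.functional as F

def logit_adjusted_loss(logits, targets, class_prior, tau=1.0):
    # Shifting logits by tau * log(prior) penalises head classes and favours
    # tail classes, the basic mechanism behind distribution-alignment losses
    # in long-tailed (semi-)supervised learning.
    adjusted = logits + tau * torch.log(class_prior + 1e-12)
    return F.cross_entropy(adjusted, targets)

# Toy usage: 3 classes with a skewed (here: hand-set) class prior. In FlexDA,
# the prior over the unlabeled data is estimated and updated during training.
logits = torch.randn(8, 3)
targets = torch.randint(0, 3, (8,))
prior = torch.tensor([0.7, 0.2, 0.1])
print(logit_adjusted_loss(logits, targets, prior).item())
```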

Place, publisher, year, edition, pages
Springer Nature Switzerland, 2024
Series
Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349 ; 15112
National Category
Computer Systems
Identifiers
urn:nbn:se:liu:diva-209223 (URN), 10.1007/978-3-031-72949-2_18 (DOI), 001352860600018, 2-s2.0-85208545165 (Scopus ID), 9783031729485 (ISBN), 9783031729492 (ISBN)
Conference
18th European Conference, Milan, Italy, September 29–October 4, 2024
Note

Funding Agencies|Wallenberg Artificial Intelligence, Autonomous Systems and Software Program (WASP) - Knut and Alice Wallenberg Foundation; Swedish Research Council [2022-06725]; Knut and Alice Wallenberg Foundation at the National Supercomputer Centre

Available from: 2024-11-06 Created: 2024-11-06 Last updated: 2024-12-17
Doostmohammadi, E., Holmström, O. & Kuhlmann, M. (2024). How Reliable Are Automatic Evaluation Methods for Instruction-Tuned LLMs? Paper presented at Findings of the Association for Computational Linguistics: EMNLP 2024.
How Reliable Are Automatic Evaluation Methods for Instruction-Tuned LLMs?
2024 (English). Conference paper, Published paper (Refereed)
Abstract [en]

Work on instruction-tuned Large Language Models (LLMs) has used automatic methods based on text overlap and LLM judgments as cost-effective alternatives to human evaluation. In this paper, we perform a meta-evaluation of such methods and assess their reliability across a broad range of tasks. In evaluating how well automatic methods align with human evaluations, correlation metrics are the most commonly employed method despite their inherent limitations when dealing with ties and different scales. To address these shortcomings, we use Pairwise Accuracy as an alternative to standard correlation measures. We observe that while automatic evaluation methods can approximate human ratings under specific conditions, their validity is highly context-dependent. Specifically, the simple ROUGE-L metric correlates very well with human ratings for short-answer English tasks but is unreliable in free-form generation tasks and cross-lingual scenarios. The effectiveness of the more advanced method of using GPT-4 as a judge diminishes significantly if reference answers are not included in the prompt, which is the scenario where this method has the potential to provide the most value compared to other metrics. Our findings enhance the understanding of how automatic methods should be applied and interpreted when developing and evaluating instruction-tuned LLMs.
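A minimal sketch of the Pairwise Accuracy idea discussed in this abstract: the fraction of item pairs that an automatic metric orders the same way as the human ratings. The paper's exact formulation and tie handling may differ; this toy version simply skips pairs that are tied in the human ratings.

```python
# Illustrative only: Pairwise Accuracy between an automatic metric and human
# ratings. The paper's exact definition / tie handling may differ; ties in
# the human ratings are simply skipped here.
from itertools import combinations

def pairwise_accuracy(metric_scores, human_scores):
    agree, total = 0, 0
    for i, j in combinations(range(len(human_scores)), 2):
        if human_scores[i] == human_scores[j]:
            continue  # skip pairs the humans rate as equal
        total += 1
        if (metric_scores[i] > metric_scores[j]) == (human_scores[i] > human_scores[j]):
            agree += 1
    return agree / total if total else float("nan")

# Toy usage with three system outputs scored by a metric and by humans.
print(pairwise_accuracy([0.30, 0.80, 0.55], [1, 5, 2]))  # -> 1.0 (all pairs ordered alike)
```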

National Category
Computer Sciences; Natural Language Processing
Identifiers
urn:nbn:se:liu:diva-210271 (URN)
Conference
Findings of the Association for Computational Linguistics: EMNLP 2024
Available from: 2024-12-06 Created: 2024-12-06 Last updated: 2025-02-01. Bibliographically approved
Kunz, J. & Kuhlmann, M. (2024). Properties and Challenges of LLM-Generated Explanations. In: Su Lin Blodgett, Amanda Cercas Curry, Sunipa Dey, Michael Madaio, Ani Nenkova, Diyi Yang, Ziang Xiao (Ed.), Proceedings of the Third Workshop on Bridging Human-Computer Interaction and Natural Language Processing. Paper presented at the Third Workshop on Bridging Human-Computer Interaction and Natural Language Processing at NAACL 2024.
Properties and Challenges of LLM-Generated Explanations
2024 (English). In: Proceedings of the Third Workshop on Bridging Human-Computer Interaction and Natural Language Processing / [ed] Su Lin Blodgett, Amanda Cercas Curry, Sunipa Dey, Michael Madaio, Ani Nenkova, Diyi Yang, Ziang Xiao, 2024. Conference paper, Poster (with or without abstract) (Refereed)
Abstract [en]

The self-rationalising capabilities of large language models (LLMs) have been explored in restricted settings, using task-specific data sets. However, current LLMs do not (only) rely on specifically annotated data; nonetheless, they frequently explain their outputs. The properties of the generated explanations are influenced by the pre-training corpus and by the target data used for instruction fine-tuning. As the pre-training corpus includes a large amount of human-written explanations “in the wild”, we hypothesise that LLMs adopt common properties of human explanations. By analysing the outputs for a multi-domain instruction fine-tuning data set, we find that generated explanations show selectivity and contain illustrative elements, but less frequently are subjective or misleading. We discuss reasons and consequences of the properties’ presence or absence. In particular, we outline positive and negative implications depending on the goals and user groups of the self-rationalising system.

Keywords
Natural Language Processing, Large Language Models, Explainability, Human-AI Interaction
National Category
Natural Language Processing
Identifiers
urn:nbn:se:liu:diva-207112 (URN), 10.18653/v1/2024.hcinlp-1.2 (DOI)
Conference
Third Workshop on Bridging Human-Computer Interaction and Natural Language Processing at NAACL 2024
Available from: 2024-09-02 Created: 2024-09-02 Last updated: 2025-02-07
Holmström, O., Kunz, J. & Kuhlmann, M. (2023). Bridging the Resource Gap: Exploring the Efficacy of English and Multilingual LLMs for Swedish. In: Proceedings of the Second Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2023). Paper presented at the RESOURCEFUL workshop at NoDaLiDa (pp. 92-110). Tórshavn, the Faroe Islands
Bridging the Resource Gap: Exploring the Efficacy of English and Multilingual LLMs for Swedish
2023 (English). In: Proceedings of the Second Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2023), Tórshavn, the Faroe Islands, 2023, p. 92-110. Conference paper, Published paper (Refereed)
Abstract [en]

Large language models (LLMs) have substantially improved natural language processing (NLP) performance, but training these models from scratch is resource-intensive and challenging for smaller languages. With this paper, we want to initiate a discussion on the necessity of language-specific pre-training of LLMs. We propose how the “one model-many models” conceptual framework for task transfer can be applied to language transfer and explore this approach by evaluating the performance of non-Swedish monolingual and multilingual models on tasks in Swedish. Our findings demonstrate that LLMs exposed to limited Swedish during training can be highly capable and transfer competencies from English off-the-shelf, including emergent abilities such as mathematical reasoning, while at the same time showing distinct culturally adapted behaviour. Our results suggest that there are resourceful alternatives to language-specific pre-training when creating useful LLMs for small languages.

Place, publisher, year, edition, pages
Tórshavn, the Faroe Islands, 2023
Keywords
NLP, Natural Language Processing, language model, GPT, monolingual, multilingual, cross-lingual, one model-many models
National Category
Natural Language Processing
Identifiers
urn:nbn:se:liu:diva-196545 (URN)
Conference
RESOURCEFUL workshop at NoDaLiDa
Funder
CUGS (National Graduate School in Computer Science)
Available from: 2023-08-11 Created: 2023-08-11 Last updated: 2025-02-07
Norlund, T., Doostmohammadi, E., Johansson, R. & Kuhlmann, M. (2023). On the Generalization Ability of Retrieval-Enhanced Transformers. In: Findings of the Association for Computational Linguistics. Paper presented at EACL 2023, May 2-6, 2023 (pp. 1485-1493). Association for Computational Linguistics (ACL)
On the Generalization Ability of Retrieval-Enhanced Transformers
2023 (English). In: Findings of the Association for Computational Linguistics, Association for Computational Linguistics (ACL), 2023, p. 1485-1493. Conference paper, Published paper (Refereed)
Abstract [en]

Recent work on the Retrieval-Enhanced Transformer (Retro) model has shown that offloading memory from trainable weights to a retrieval database can significantly improve language modeling and match the performance of non-retrieval models that are an order of magnitude larger in size. It has been suggested that at least some of this performance gain is due to non-trivial generalization based on both model weights and retrieval. In this paper, we try to better understand the relative contributions of these two components. We find that the performance gains from retrieval largely originate from overlapping tokens between the database and the test data, suggesting less non-trivial generalization than previously assumed. More generally, our results point to the challenges of evaluating the generalization of retrieval-augmented language models such as Retro, as even limited token overlap may significantly decrease test-time loss. We release our code and model at https://github.com/TobiasNorlund/retro
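The token-overlap analysis summarised here can be illustrated by a toy measurement of how many of a test chunk's n-grams also occur in a retrieved neighbour. This is not the authors' code (linked above); the tokenisation, n-gram length, and example strings are assumptions.

```python
# Illustrative only: fraction of a test chunk's n-grams that also occur in a
# retrieved neighbour. Tokenisation, n-gram length and strings are made up;
# the authors' analysis code is in the repository linked in the abstract.
def ngram_set(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(test_tokens, neighbour_tokens, n=4):
    test_ngrams = ngram_set(test_tokens, n)
    if not test_ngrams:
        return 0.0
    return len(test_ngrams & ngram_set(neighbour_tokens, n)) / len(test_ngrams)

test_chunk = "the cat sat on the mat and looked at the dog".split()
neighbour = "yesterday the cat sat on the mat and slept".split()
print(overlap_ratio(test_chunk, neighbour))  # high overlap -> easier test loss
```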

Place, publisher, year, edition, pages
Association for Computational Linguistics (ACL), 2023
National Category
Natural Language Processing
Identifiers
urn:nbn:se:liu:diva-195609 (URN), 001181085100107, 9781959429470 (ISBN)
Conference
EACL 2023, May 2-6, 2023
Note

Funding Agencies|Wallenberg AI, Autonomous Systems and Software Program (WASP) - Knut and Alice Wallenberg Foundation; Swedish Research Council [2022-06725]

Available from: 2023-06-22 Created: 2023-06-22 Last updated: 2025-02-07. Bibliographically approved
Doostmohammadi, E., Norlund, T., Kuhlmann, M. & Johansson, R. (2023). Surface-Based Retrieval Reduces Perplexity of Retrieval-Augmented Language Models. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Paper presented at the 61st Annual Meeting of the Association for Computational Linguistics (ACL), Toronto, Canada, July 9-14, 2023 (pp. 521-529). Association for Computational Linguistics (ACL)
Surface-Based Retrieval Reduces Perplexity of Retrieval-Augmented Language Models
2023 (English). In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics (ACL), 2023, p. 521-529. Conference paper, Published paper (Refereed)
Abstract [en]

Augmenting language models with a retrieval mechanism has been shown to significantly improve their performance while keeping the number of parameters low. Retrieval-augmented models commonly rely on a semantic retrieval mechanism based on the similarity between dense representations of the query chunk and potential neighbors. In this paper, we study the state-of-the-art Retro model and observe that its performance gain is better explained by surface-level similarities, such as token overlap. Inspired by this, we replace the semantic retrieval in Retro with a surface-level method based on BM25, obtaining a significant reduction in perplexity. As full BM25 retrieval can be computationally costly for large datasets, we also apply it in a re-ranking scenario, gaining part of the perplexity reduction with minimal computational overhead.
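A minimal sketch of the BM25 re-ranking idea in the spirit of this abstract: keep a dense retriever's candidate pool but reorder it by surface-level BM25 scores. This is not the authors' implementation; it assumes the third-party rank_bm25 package, whitespace tokenisation, and invented example strings.

```python
# Illustrative only: re-rank a dense retriever's candidates by BM25 scores.
# Assumes the third-party rank_bm25 package and whitespace tokenisation;
# the candidate strings and query are made up.
from rank_bm25 import BM25Okapi

candidates = [
    "neural retrieval returns semantically similar chunks",
    "bm25 scores chunks by surface level term overlap",
    "the quick brown fox jumps over the lazy dog",
]
query = "surface level overlap between chunks"

bm25 = BM25Okapi([c.split() for c in candidates])
scores = bm25.get_scores(query.split())

# Keep the candidate pool, reorder it by BM25 score (highest first).
reranked = [c for _, c in sorted(zip(scores, candidates), reverse=True)]
print(reranked[0])
```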

Place, publisher, year, edition, pages
Association for Computational Linguistics (ACL), 2023
National Category
Applied Mechanics
Identifiers
urn:nbn:se:liu:diva-196564 (URN), 001181088800045, 9781959429715 (ISBN)
Conference
61st Annual Meeting of the Association for Computational Linguistics (ACL), Toronto, Canada, July 9-14, 2023
Note

Funding Agencies|Wallenberg AI, Autonomous Systems and Software Program (WASP) - Knut and Alice Wallenberg Foundation; Alvis - Swedish Research Council [2022-06725]; Alice Wallenberg Foundation at the National Supercomputer Center

Available from: 2023-08-14 Created: 2023-08-14 Last updated: 2024-04-23
Kunz, J., Jirénius, M., Holmström, O. & Kuhlmann, M. (2022). Human Ratings Do Not Reflect Downstream Utility: A Study of Free-Text Explanations for Model Predictions. In: Jasmijn Bastings, Yonatan Belinkov, Yanai Elazar, Dieuwke Hupkes, Naomi Saphra, Sarah Wiegreffe (Ed.), Proceedings of the Fifth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP: . Paper presented at BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, December 8, 2022 (pp. 164-177). Association for Computational Linguistics (ACL), 5, Article ID 2022.blackboxnlp-1.14.
Human Ratings Do Not Reflect Downstream Utility: A Study of Free-Text Explanations for Model Predictions
2022 (English). In: Proceedings of the Fifth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP / [ed] Jasmijn Bastings, Yonatan Belinkov, Yanai Elazar, Dieuwke Hupkes, Naomi Saphra, Sarah Wiegreffe, Association for Computational Linguistics (ACL), 2022, Vol. 5, p. 164-177, article id 2022.blackboxnlp-1.14. Conference paper, Published paper (Refereed)
Abstract [en]

Models able to generate free-text rationales that explain their output have been proposed as an important step towards interpretable NLP for “reasoning” tasks such as natural language inference and commonsense question answering. However, the relative merits of different architectures and types of rationales are not well understood and hard to measure. In this paper, we contribute two insights to this line of research: First, we find that models trained on gold explanations learn to rely on these but, in the case of the more challenging question answering data set we use, fail when given generated explanations at test time. However, additional fine-tuning on generated explanations teaches the model to distinguish between reliable and unreliable information in explanations. Second, we compare explanations by a generation-only model to those generated by a self-rationalizing model and find that, while the former score higher in terms of validity, factual correctness, and similarity to gold explanations, they are not more useful for downstream classification. We observe that the self-rationalizing model is prone to hallucination, which is punished by most metrics but may add useful context for the classification step.

Place, publisher, year, edition, pages
Association for Computational Linguistics (ACL), 2022
Keywords
Large Language Models, Neural Networks, Transformers, Interpretability, Explainability
National Category
Natural Language Processing; Computer Sciences
Identifiers
urn:nbn:se:liu:diva-195615 (URN), 10.18653/v1/2022.blackboxnlp-1.14 (DOI), 2-s2.0-85152897033 (Scopus ID)
Conference
BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, December 8, 2022
Available from: 2023-06-22 Created: 2023-06-22 Last updated: 2025-02-01. Bibliographically approved
Doostmohammadi, E. & Kuhlmann, M. (2022). On the Effects of Video Grounding on Language Models. In: Proceedings of the First Workshop on Performance and Interpretability Evaluations of Multimodal, Multipurpose, Massive-Scale Models. Paper presented at the First Workshop on Performance and Interpretability Evaluations of Multimodal, Multipurpose, Massive-Scale Models.
On the Effects of Video Grounding on Language Models
2022 (English). In: Proceedings of the First Workshop on Performance and Interpretability Evaluations of Multimodal, Multipurpose, Massive-Scale Models, 2022. Conference paper, Oral presentation with published abstract (Other academic)
Abstract [en]

Transformer-based models trained on text and vision modalities try to improve the performance on multimodal downstream tasks or tackle the problem of lack of grounding, e.g., addressing issues like models’ insufficient commonsense knowledge. While it is more straightforward to evaluate the effects of such models on multimodal tasks, such as visual question answering or image captioning, it is not as well-understood how these tasks affect the model itself, and its internal linguistic representations. In this work, we experiment with language models grounded in videos and measure the models’ performance on predicting masked words chosen based on their imageability. The results show that the smaller model benefits from video grounding in predicting highly imageable words, while the results for the larger model seem harder to interpret.
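As a rough illustration of the evaluation idea in this abstract, the sketch below queries a generic masked language model (not a video-grounded one) for masked words that a toy lexicon marks as high or low imageability. The model choice, the probe sentences, and the imageability labels are all assumptions for illustration.

```python
# Illustrative only: probe a generic masked LM (not a video-grounded model)
# on masked words with high vs. low imageability. Assumes the Hugging Face
# transformers library; sentences and imageability labels are made up.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

probes = {
    ("The children played with a red [MASK] in the garden.", "ball"): "high imageability",
    ("Her argument rested on a questionable [MASK].", "assumption"): "low imageability",
}

for (sentence, gold), band in probes.items():
    top_tokens = [p["token_str"].strip() for p in fill(sentence, top_k=5)]
    status = "recovered" if gold in top_tokens else "missed"
    print(f"{band}: '{gold}' {status} (top-5: {top_tokens})")
```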

National Category
Natural Language Processing
Identifiers
urn:nbn:se:liu:diva-198261 (URN)
Conference
First Workshop on Performance and Interpretability Evaluations of Multimodal, Multipurpose, Massive-Scale Models
Available from: 2023-10-02 Created: 2023-10-02 Last updated: 2025-02-07. Bibliographically approved
Kunz, J. & Kuhlmann, M. (2022). Where Does Linguistic Information Emerge in Neural Language Models?: Measuring Gains and Contributions across Layers. In: Nicoletta Calzolari, Chu-Ren Huang, Hansaem Kim, James Pustejovsky, Leo Wanner, Key-Sun Choi, Pum-Mo Ryu, Hsin-Hsi Chen, Lucia Donatelli, Heng Ji, Sadao Kurohashi, Patrizia Paggio, Nianwen Xue, Seokhwan Kim, Younggyun Hahm, Zhong He, Tony Kyungil Lee, Enrico Santus, Francis Bond, Seung-Hoon Na (Ed.), Proceedings of the 29th International Conference on Computational Linguistics: . Paper presented at COLING, October 12–17, 2022 (pp. 4664-4676). International Committee on Computational Linguistics, 29, Article ID 1.413.
Where Does Linguistic Information Emerge in Neural Language Models?: Measuring Gains and Contributions across Layers
2022 (English). In: Proceedings of the 29th International Conference on Computational Linguistics / [ed] Nicoletta Calzolari, Chu-Ren Huang, Hansaem Kim, James Pustejovsky, Leo Wanner, Key-Sun Choi, Pum-Mo Ryu, Hsin-Hsi Chen, Lucia Donatelli, Heng Ji, Sadao Kurohashi, Patrizia Paggio, Nianwen Xue, Seokhwan Kim, Younggyun Hahm, Zhong He, Tony Kyungil Lee, Enrico Santus, Francis Bond, Seung-Hoon Na, International Committee on Computational Linguistics, 2022, Vol. 29, p. 4664-4676, article id 1.413. Conference paper, Published paper (Refereed)
Abstract [en]

Probing studies have extensively explored where in neural language models linguistic information is located. The standard approach to interpreting the results of a probing classifier is to focus on the layers whose representations give the highest performance on the probing task. We propose an alternative method that asks where the task-relevant information emerges in the model. Our framework consists of a family of metrics that explicitly model local information gain relative to the previous layer and each layer’s contribution to the model’s overall performance. We apply the new metrics to two pairs of syntactic probing tasks with different degrees of complexity and find that the metrics confirm the expected ordering only for one of the pairs. Our local metrics show a massive dominance of the first layers, indicating that the features that contribute the most to our probing tasks are not as high-level as global metrics suggest.
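To make the abstract's distinction concrete, the toy sketch below contrasts a global view of probing results (the layer with the highest probe accuracy) with a local, layer-relative view (the layer that adds the most over the previous one). The numbers and the simple difference-based gain are illustrative assumptions, not the paper's exact metrics.

```python
# Illustrative only: global vs. local views of layer-wise probing results.
# The accuracies are hypothetical; the paper defines its own family of
# gain/contribution metrics rather than this simple difference.
per_layer_accuracy = [0.62, 0.81, 0.84, 0.86, 0.87, 0.87]  # probe accuracy per layer

# Local gain: what each layer adds over the previous one (layer 0 counted
# against a trivial baseline of 0, purely for illustration).
gains = [per_layer_accuracy[0]] + [
    per_layer_accuracy[l] - per_layer_accuracy[l - 1]
    for l in range(1, len(per_layer_accuracy))
]

best_global = max(range(len(per_layer_accuracy)), key=per_layer_accuracy.__getitem__)
best_local = max(range(len(gains)), key=gains.__getitem__)
print("highest probe accuracy at layer", best_global)          # the standard, global view
print("largest gain over previous layer at layer", best_local)  # the local view
```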

Place, publisher, year, edition, pages
International Committee on Computational Linguistics, 2022
Series
Proceedings - International Conference on Computational Linguistics, COLING, ISSN 2951-2093
Keywords
NLP, AI, Language Technology, Computational Linguistics, Machine Learning
National Category
Natural Language Processing
Identifiers
urn:nbn:se:liu:diva-191000 (URN), 2-s2.0-85162661424 (Scopus ID)
Conference
COLING, October 12–17, 2022
Available from: 2023-01-12 Created: 2023-01-12 Last updated: 2025-02-07. Bibliographically approved
Kunz, J. & Kuhlmann, M. (2021). Test Harder Than You Train: Probing with Extrapolation Splits. In: Jasmijn Bastings, Yonatan Belinkov, Emmanuel Dupoux, Mario Giulianelli, Dieuwke Hupkes, Yuval Pinter, Hassan Sajjad (Ed.), Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP: . Paper presented at BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, November 11, 2021 (pp. 15-25). Punta Cana, Dominican Republic: Association for Computational Linguistics, 5, Article ID 2.
Test Harder Than You Train: Probing with Extrapolation Splits
2021 (English). In: Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP / [ed] Jasmijn Bastings, Yonatan Belinkov, Emmanuel Dupoux, Mario Giulianelli, Dieuwke Hupkes, Yuval Pinter, Hassan Sajjad, Punta Cana, Dominican Republic: Association for Computational Linguistics, 2021, Vol. 5, p. 15-25, article id 2. Conference paper, Published paper (Refereed)
Abstract [en]

Previous work on probing word representations for linguistic knowledge has focused on interpolation tasks. In this paper, we instead analyse probes in an extrapolation setting, where the inputs at test time are deliberately chosen to be ‘harder’ than the training examples. We argue that such an analysis can shed further light on the open question whether probes actually decode linguistic knowledge, or merely learn the diagnostic task from shallow features. To quantify the hardness of an example, we consider scoring functions based on linguistic, statistical, and learning-related criteria, all of which are applicable to a broad range of NLP tasks. We discuss the relative merits of these criteria in the context of two syntactic probing tasks, part-of-speech tagging and syntactic dependency labelling. From our theoretical and experimental analysis, we conclude that distance-based and hard statistical criteria show the clearest differences between interpolation and extrapolation settings, while at the same time being transparent, intuitive, and easy to control.
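A minimal sketch of an extrapolation split in the sense described here: score each example with a hardness criterion, train the probe on the easier portion, and test on the hardest portion. The length-based criterion and the toy sentences are assumptions; the paper studies several linguistic, statistical, and learning-related criteria.

```python
# Illustrative only: a hardness-based extrapolation split. The criterion here
# (sentence length) stands in for the linguistic/statistical criteria studied
# in the paper; the sentences and the split fraction are made up.
def extrapolation_split(examples, hardness, train_fraction=0.6):
    ranked = sorted(examples, key=hardness)
    cut = int(len(ranked) * train_fraction)
    return ranked[:cut], ranked[cut:]   # train on easy, test on hard

sentences = [
    "Dogs bark .".split(),
    "The old dog barked loudly .".split(),
    "Birds sing in spring .".split(),
    "She said that he thought that they knew .".split(),
    "The dog that the cat that the mouse feared chased barked .".split(),
]
train, test = extrapolation_split(sentences, hardness=len)
print(len(train), "training examples /", len(test), "held-out extrapolation examples")
```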

Place, publisher, year, edition, pages
Punta Cana, Dominican Republic: Association for Computational Linguistics, 2021
Keywords
Natural Language Processing, Neural Language Models, Interpretability, Probing, BERT, Extrapolation
National Category
Natural Language Processing; Computer Sciences
Identifiers
urn:nbn:se:liu:diva-182166 (URN), 10.18653/v1/2021.blackboxnlp-1.2 (DOI), 2-s2.0-85127226402 (Scopus ID), 9781955917063 (ISBN)
Conference
BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, November 11, 2021
Available from: 2022-01-10 Created: 2022-01-10 Last updated: 2025-02-01. Bibliographically approved
Projects
Accurate and efficient non-projective dependency parsing [2008-00296_VR]; Uppsala University
Organisations
Identifiers
ORCID iD: orcid.org/0000-0002-2492-9872
