Publications (7 of 7)
Lent, H., Tatariya, K., Dabre, R., Chen, Y., Fekete, M., Ploeger, E., . . . Bjerva, J. (2024). CreoleVal: Multilingual Multitask Benchmarks for Creoles. Transactions of the Association for Computational Linguistics, 12, 950-978
CreoleVal: Multilingual Multitask Benchmarks for Creoles
2024 (English). In: Transactions of the Association for Computational Linguistics, E-ISSN 2307-387X, Vol. 12, pp. 950-978. Article in journal (Refereed). Published.
Abstract [en]

Creoles represent an under-explored and marginalized group of languages, with few available resources for NLP research. While the genealogical ties between Creoles and a number of highly resourced languages imply a significant potential for transfer learning, this potential is hampered by the lack of annotated data. In this work, we present CreoleVal, a collection of benchmark datasets spanning 8 different NLP tasks, covering up to 28 Creole languages; it is an aggregate of novel development datasets for reading comprehension, relation classification, and machine translation for Creoles, in addition to a practical gateway to a handful of preexisting benchmarks. For each benchmark, we conduct baseline experiments in a zero-shot setting in order to further ascertain the capabilities and limitations of transfer learning for Creoles. Ultimately, we see CreoleVal as an opportunity to empower research on Creoles in NLP and computational linguistics, and, in general, a step towards more equitable language technology around the globe.
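
As a rough illustration of the zero-shot transfer setup described above (a toy sketch, not the paper's actual pipeline), one can train a model on a related high-resource language and evaluate it directly on a Creole test set, with no Creole training data. The file names and the simple character n-gram classifier below are hypothetical stand-ins.

    # Toy zero-shot transfer: train on a high-resource source language,
    # evaluate on a Creole target with no Creole training data.
    # "french_train.tsv" and "haitian_test.tsv" are hypothetical files
    # with one "text<TAB>label" pair per line.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    def read_tsv(path):
        texts, labels = [], []
        with open(path, encoding="utf-8") as f:
            for line in f:
                text, label = line.rstrip("\n").split("\t")
                texts.append(text)
                labels.append(label)
        return texts, labels

    train_texts, train_labels = read_tsv("french_train.tsv")  # source language
    test_texts, test_labels = read_tsv("haitian_test.tsv")    # Creole target

    # Character n-grams tend to transfer across genealogically related
    # languages better than word-level features.
    vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 3))
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vectorizer.fit_transform(train_texts), train_labels)

    predictions = clf.predict(vectorizer.transform(test_texts))
    print("Zero-shot accuracy:", accuracy_score(test_labels, predictions))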

Place, publisher, year, edition, pages
MIT Press, 2024
Keywords
natural language processing, Creole languages, benchmark, dataset
National Category
Natural Language Processing and Computational Linguistics
Identifiers
urn:nbn:se:liu:diva-207509 (URN) · 10.1162/tacl_a_00682 (DOI) · 001305660400007 (ISI)
Funder
Vetenskapsrådet, 2022-06725
Available from: 2024-09-10 Created: 2024-09-10 Last updated: 2025-02-07
Bollmann, M., Schneider, N., Köhn, A. & Post, M. (2023). Two Decades of the ACL Anthology: Development, Impact, and Open Challenges. In: Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023). Paper presented at the Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023), 2023-12-06, Singapore (pp. 83-94). Association for Computational Linguistics
Two Decades of the ACL Anthology: Development, Impact, and Open Challenges
2023 (English). In: Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023), Association for Computational Linguistics, 2023, pp. 83-94. Conference paper, Published paper (Refereed).
Abstract [en]

The ACL Anthology is a prime resource for research papers within computational linguistics and natural language processing, while continuing to be an open-source and community-driven project. Since Gildea et al. (2018) reported on its state and planned directions, the Anthology has seen major technical changes. We discuss what led to these changes and how they impact long-term maintainability and community engagement, describe which open-source data and software tools the Anthology currently provides, and provide a survey of literature that has used the Anthology as a main data source.
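
Among the open-source tools referred to above is a Python library for accessing Anthology data. The snippet below sketches its basic usage; the entry-point and attribute names are taken from the project's README and are assumptions that may differ between package versions.

    # Hedged sketch of querying ACL Anthology data with the "acl-anthology"
    # Python package (pip install acl-anthology); API names follow the
    # project README and are not guaranteed for every version.
    from acl_anthology import Anthology

    # Fetches/uses a local checkout of the official data repository.
    anthology = Anthology.from_repo()

    # Look up a paper by its Anthology ID.
    paper = anthology.get("2020.acl-main.699")
    print(str(paper.title))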

Place, publisher, year, edition, pages
Association for Computational Linguistics, 2023
Keywords
ACL Anthology, open-source software
National Category
Software Engineering
Identifiers
urn:nbn:se:liu:diva-207508 (URN) · 10.18653/v1/2023.nlposs-1.10 (DOI)
Conference
Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023), 2023-12-06, Singapore
Available from: 2024-09-10 Created: 2024-09-10 Last updated: 2024-09-10
Bollmann, M. & Søgaard, A. (2021). Error Analysis and the Role of Morphology. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics. Paper presented at the 16th Conference of the European Chapter of the Association for Computational Linguistics, April 19-23, 2021 (pp. 1887-1900).
Error Analysis and the Role of Morphology
2021 (English). In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics, 2021, pp. 1887-1900. Conference paper, Published paper (Refereed).
Abstract [en]

We evaluate two common conjectures in error analysis of NLP models: (i) Morphology is predictive of errors; and (ii) the importance of morphology increases with the morphological complexity of a language. We show across four different tasks and up to 57 languages that of these conjectures, somewhat surprisingly, only (i) is true. Using morphological features does improve error prediction across tasks; however, this effect is less pronounced with morphologically complex languages. We speculate this is because morphology is more discriminative in morphologically simple languages. Across all four tasks, case and gender are the morphological features most predictive of error.
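
To make the error-prediction setup concrete, a classifier can be fit to predict whether a model errs on a token from that token's morphological features. The sketch below is a toy illustration, not the paper's experimental code; the feature names follow Universal Dependencies conventions and the data is invented.

    # Toy sketch: predict token-level model errors from morphological features.
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    # Each dict holds one token's morphological features; each label says
    # whether a downstream model got that token wrong (invented data).
    features = [
        {"Case": "Nom", "Gender": "Fem", "Number": "Sing"},
        {"Case": "Acc", "Gender": "Masc", "Number": "Plur"},
        {"Case": "Dat", "Gender": "Neut", "Number": "Sing"},
        {"Case": "Nom", "Gender": "Masc", "Number": "Sing"},
    ]
    is_error = [1, 0, 1, 0]

    vectorizer = DictVectorizer()
    clf = LogisticRegression().fit(vectorizer.fit_transform(features), is_error)

    # The learned weights indicate which feature values predict errors,
    # e.g. whether Case or Gender carries the most weight.
    for name, weight in zip(vectorizer.get_feature_names_out(), clf.coef_[0]):
        print(f"{name}: {weight:+.2f}")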

National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:liu:diva-197945 (URN) · 10.18653/v1/2021.eacl-main.162 (DOI)
Conference
16th Conference of the European Chapter of the Association for Computational Linguistics, April 19-23, 2021
Available from: 2023-09-18 Created: 2023-09-18 Last updated: 2023-09-26
Bollmann, M., Aralikatte, R., Murrieta Bello, H., Hershcovich, D., de Lhoneux, M. & Søgaard, A. (2021). Moses and the Character-Based Random Babbling Baseline: CoAStaL at AmericasNLP 2021 Shared Task. In: Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas. Paper presented at the First Workshop on Natural Language Processing for Indigenous Languages of the Americas, June 11, 2021 (pp. 248-254).
Moses and the Character-Based Random Babbling Baseline: CoAStaL at AmericasNLP 2021 Shared Task
2021 (English). In: Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas, 2021, pp. 248-254. Conference paper, Published paper (Refereed).
Abstract [en]

We evaluated a range of neural machine translation techniques developed specifically for low-resource scenarios. Unsuccessfully. In the end, we submitted two runs: (i) a standard phrase-based model, and (ii) a random babbling baseline using character trigrams. We found that it was surprisingly hard to beat (i), in spite of this model being, in theory, a bad fit for polysynthetic languages; and more interestingly, that (ii) was better than several of the submitted systems, highlighting how difficult low-resource machine translation for polysynthetic languages is.
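
A character-trigram "random babbler" of the kind described in (ii) can be sketched in a few lines: estimate which characters follow each two-character context in training text, then generate by sampling from those counts. This is a generic reconstruction of the idea, not the authors' submitted system; the corpus string is a made-up placeholder.

    # Toy character-trigram babbler: learn which characters follow each
    # two-character context, then generate by random sampling.
    import random
    from collections import defaultdict

    def train_trigrams(text):
        followers = defaultdict(list)
        for i in range(len(text) - 2):
            followers[text[i:i + 2]].append(text[i + 2])
        return followers

    def babble(followers, length=80, seed="ka"):
        out = seed
        for _ in range(length):
            candidates = followers.get(out[-2:])
            if not candidates:  # unseen context: restart from a random one
                context = random.choice(list(followers))
                out += context
                candidates = followers[context]
            out += random.choice(candidates)
        return out

    corpus = "kamisaraki kamachkani kamisaraki"  # placeholder for target-language text
    print(babble(train_trigrams(corpus)))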

National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:liu:diva-197946 (URN) · 10.18653/v1/2021.americasnlp-1.28 (DOI)
Conference
First Workshop on Natural Language Processing for Indigenous Languages of the Americas, June 11, 2021
Available from: 2023-09-18 Created: 2023-09-18 Last updated: 2023-09-26
Bollmann, M. & Elliott, D. (2020). On Forgetting to Cite Older Papers: An Analysis of the ACL Anthology. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Paper presented at the 58th Annual Meeting of the Association for Computational Linguistics, July 5-10, 2020 (pp. 7819-7827).
On Forgetting to Cite Older Papers: An Analysis of the ACL Anthology
2020 (English). In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 7819-7827. Conference paper, Published paper (Refereed).
Abstract [en]

The field of natural language processing is experiencing a period of unprecedented growth, and with it a surge of published papers. This represents an opportunity for us to take stock of how we cite the work of other researchers, and whether this growth comes at the expense of “forgetting” about older literature. In this paper, we address this question through bibliographic analysis. By looking at the age of outgoing citations in papers published at selected ACL venues between 2010 and 2019, we find that there is indeed a tendency for recent papers to cite more recent work, but the rate at which papers older than 15 years are cited has remained relatively stable.
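
The measurement behind this analysis is simple arithmetic over citation pairs: the age of an outgoing citation is the citing paper's publication year minus the cited paper's publication year. A minimal sketch with invented data:

    # Toy sketch of the citation-age analysis; the (citing year, cited year)
    # pairs below are invented for illustration.
    citations = [
        (2019, 2017), (2019, 2018), (2019, 2003),
        (2015, 2014), (2015, 1996), (2012, 2011),
    ]

    ages = [citing - cited for citing, cited in citations]
    mean_age = sum(ages) / len(ages)
    share_old = sum(age > 15 for age in ages) / len(ages)

    print(f"Mean citation age: {mean_age:.1f} years")
    print(f"Citations older than 15 years: {share_old:.0%}")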

National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:liu:diva-197947 (URN) · 10.18653/v1/2020.acl-main.699 (DOI)
Conference
58th Annual Meeting of the Association for Computational Linguistics, July 5-10, 2020
Available from: 2023-09-18 Created: 2023-09-18 Last updated: 2023-09-26
Bollmann, M. (2019). A Large-Scale Comparison of Historical Text Normalization Systems. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Paper presented at the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, Minneapolis, Minnesota, June 2-7, 2019 (pp. 3885-3898). Association for Computational Linguistics
A Large-Scale Comparison of Historical Text Normalization Systems
2019 (English). In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, 2019, pp. 3885-3898. Conference paper, Published paper (Refereed).
Abstract [en]

There is no consensus on the state-of-the-art approach to historical text normalization. Many techniques have been proposed, including rule-based methods, distance metrics, character-based statistical machine translation, and neural encoder–decoder models, but studies have used different datasets, different evaluation methods, and have come to different conclusions. This paper presents the largest study of historical text normalization done so far. We critically survey the existing literature and report experiments on eight languages, comparing systems spanning all categories of proposed normalization techniques, analysing the effect of training data quantity, and using different evaluation methods. The datasets and scripts are made publicly available.
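
Since the paper stresses that different evaluation methods have led to different conclusions, it helps to see two common metrics side by side. The sketch below (with invented word pairs) scores a normalization system by word-level accuracy and by character error rate; it illustrates the metrics generically and is not the paper's evaluation code.

    # Two common normalization metrics: word accuracy and character error
    # rate (CER, via Levenshtein distance). The word pairs are invented.

    def edit_distance(a, b):
        # Standard dynamic-programming Levenshtein distance.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    predicted = ["when", "shee", "goes", "home"]
    gold = ["when", "she", "goes", "home"]

    word_acc = sum(p == g for p, g in zip(predicted, gold)) / len(gold)
    total_edits = sum(edit_distance(p, g) for p, g in zip(predicted, gold))
    cer = total_edits / sum(len(g) for g in gold)

    print(f"Word accuracy: {word_acc:.0%}, CER: {cer:.3f}")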

Place, publisher, year, edition, pages
Association for Computational Linguistics, 2019
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:liu:diva-197949 (URN) · 10.18653/v1/n19-1389 (DOI)
Conference
2019 Conference of the North American Chapter of the Association for Computational Linguistics, Minneapolis, Minnesota, June 2-7, 2019
Available from: 2023-09-18 Created: 2023-09-18 Last updated: 2023-09-26
Bollmann, M., Korchagina, N. & Søgaard, A. (2019). Few-Shot and Zero-Shot Learning for Historical Text Normalization. In: Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019). Paper presented at the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019), Hong Kong, China, November 3, 2019 (pp. 104-114). Association for Computational Linguistics
Few-Shot and Zero-Shot Learning for Historical Text Normalization
2019 (English). In: Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019), Association for Computational Linguistics, 2019, pp. 104-114. Conference paper, Published paper (Refereed).
Abstract [en]

Historical text normalization often relies on small training datasets. Recent work has shown that multi-task learning can lead to significant improvements by exploiting synergies with related datasets, but there has been no systematic study of different multi-task learning architectures. This paper evaluates 63 multi-task learning configurations for sequence-to-sequence-based historical text normalization across ten datasets from eight languages, using autoencoding, grapheme-to-phoneme mapping, and lemmatization as auxiliary tasks. We observe consistent, significant improvements across languages when training data for the target task is limited, but minimal or no improvements when training data is abundant. We also show that zero-shot learning outperforms the simple, but relatively strong, identity baseline.
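
The identity baseline mentioned above is trivial but worth spelling out: it predicts that every historical form is already its own normalization, which is hard to beat because most tokens need no change. A minimal sketch with invented token pairs:

    # Identity baseline for historical text normalization: predict that
    # each historical token is already normalized (invented example data).
    historical = ["vppon", "the", "olde", "bridge"]
    gold = ["upon", "the", "old", "bridge"]

    predictions = list(historical)  # the "system" simply copies its input

    accuracy = sum(p == g for p, g in zip(predictions, gold)) / len(gold)
    print(f"Identity baseline accuracy: {accuracy:.0%}")  # 50% on this toy data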

Place, publisher, year, edition, pages
Association for Computational Linguistics, 2019
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:liu:diva-197948 (URN) · 10.18653/v1/d19-6112 (DOI)
Conference
2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019), Hong Kong, China, November 3, 2019
Available from: 2023-09-18 Created: 2023-09-18 Last updated: 2023-09-26
Organisations
Identifiers
ORCID iD: orcid.org/0000-0003-2598-8150
