liu.seSearch for publications in DiVA
Change search
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • oxford
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf
On Inter-Dataset Code Duplication and Data Leakage in Large Language Models
Linköping University, Department of Computer and Information Science, Software and Systems. Linköping University, Faculty of Science & Engineering.
McGill Univ, Canada.
Dalhousie Univ, Canada.
Dalhousie Univ, Canada.
Show others and affiliations
2025 (English)In: IEEE Transactions on Software Engineering, ISSN 0098-5589, E-ISSN 1939-3520, Vol. 51, no 1, p. 192-205Article in journal (Refereed) Published
Abstract [en]

Motivation. Large language models (LLMs) have exhibited remarkable proficiency in diverse software engineering (SE) tasks, such as code summarization, code translation, and code search. Handling such tasks typically involves acquiring foundational coding knowledge on large, general-purpose datasets during a pre-training phase, and subsequently refining on smaller, task-specific datasets as part of a fine-tuning phase. Problem statement. Data leakage i.e., using information of the test set to perform the model training, is a well-known issue in training of machine learning models. A manifestation of this issue is the intersection of the training and testing splits. While intra-dataset code duplication examines this intersection within a given dataset and has been addressed in prior research, inter-dataset code duplication, which gauges the overlap between different datasets, remains largely unexplored. If this phenomenon exists, it could compromise the integrity of LLM evaluations because of the inclusion of fine-tuning test samples that were already encountered during pre-training, resulting in inflated performance metrics. Contribution. This paper explores the phenomenon of inter-dataset code duplication and its impact on evaluating LLMs across diverse SE tasks. Study design. We conduct an empirical study using the CodeSearchNet dataset (csn), a widely adopted pre-training dataset, and five fine-tuning datasets used for various SE tasks. We first identify the intersection between the pre-training and fine-tuning datasets using a deduplication process. Next, we pre-train two versions of LLMs using a subset of csn: one leaky LLM, which includes the identified intersection in its pre-training set, and one non-leaky LLM that excludes these samples. Finally, we fine-tune both models and compare their performances using fine-tuning test samples that are part of the intersection. Results. Our findings reveal a potential threat to the evaluation of LLMs across multiple SE tasks, stemming from the inter-dataset code duplication phenomenon. We also demonstrate that this threat is accentuated by the chosen fine-tuning technique. Furthermore, we provide evidence that open-source models such as CodeBERT, GraphCodeBERT, and UnixCoder could be affected by inter-dataset duplication. Based on our findings, we delve into prior research that may be susceptible to this threat. Additionally, we offer guidance to SE researchers on strategies to prevent inter-dataset code duplication.

Place, publisher, year, edition, pages
IEEE COMPUTER SOC , 2025. Vol. 51, no 1, p. 192-205
Keywords [en]
Codes; Cloning; Training; Data models; Software development management; Tuning; Source coding; Software engineering; Context modeling; Transformers; Large language models (LLMs); inter-dataset code duplication; data leakage
National Category
Software Engineering
Identifiers
URN: urn:nbn:se:liu:diva-211208DOI: 10.1109/TSE.2024.3504286ISI: 001395714800008Scopus ID: 2-s2.0-85210125841OAI: oai:DiVA.org:liu-211208DiVA, id: diva2:1931908
Note

Funding Agencies|Wallenberg AI, Autonomous Systems and Software Program (WASP) - Knut and Alice Wallenberg Foundation; MCIN/AEI [PID2022-140109NB-I]; FEDER/UE

Available from: 2025-01-28 Created: 2025-01-28 Last updated: 2025-01-28

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full textScopus

Search in DiVA

By author/editor
Hernández López, José AntonioVarro, Daniel
By organisation
Software and SystemsFaculty of Science & Engineering
In the same journal
IEEE Transactions on Software Engineering
Software Engineering

Search outside of DiVA

GoogleGoogle Scholar

doi
urn-nbn

Altmetric score

doi
urn-nbn
Total: 85 hits
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • oxford
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf