Enabling Natural Language Interaction With Decentralized Data Sources Using Large Language Models
2025 (English) Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
Student thesis
Alternative title
Möjliggörande av interaktion på naturligt språk med decentraliserade datakällor med hjälp av stora språkmodeller (Swedish)
Abstract [en]
Applications powered by Large Language Models (LLMs) can lower the barrier to querying decentralized data, where using SPARQL remains difficult for non-technical users. This thesis explores how LLMs can be leveraged to facilitate natural language (NL) interaction with decentralized data systems. A bi-directional NL to SPARQL translation pipeline was designed and implemented, then integrated with a simple user interface (UI). The pipeline was evaluated using a gold-standard dataset of 21 competency questions derived from the Onto-DESIDE project, with three independent runs per setting to account for LLM variability. For NL to SPARQL, the metrics included execution success and result accuracy (identical, underfetching, overfetching, and both under- and overfetching). For SPARQL to NL, the pipeline was rated on clarity and faithfulness on a Likert scale, comparing zero-shot and one-shot prompting strategies. Across the runs, the LLM consistently produced syntactically valid SPARQL (execution success > 95% for both zero-shot and one-shot), yet semantic accuracy was low: only around a fifth of the LLM-generated SPARQL queries matched the gold-standard results, with frequent overfetching and/or underfetching. Surprisingly, one-shot prompting did not improve semantic accuracy for NL to SPARQL translation. However, SPARQL to NL explanations were clear and highly faithful, and one-shot prompting further improved clarity and reduced variability. UNION and FILTER NOT EXISTS were identified as categories where the LLM performed worse in the zero-shot approach. Limitations of the thesis include ambiguity in NL questions, the LLM's tendency to return URIs rather than labels, and the non-production-ready approach of including ontologies in the prompts. The findings indicate that the SPARQL to NL module is immediately useful for explaining SPARQL queries, whereas the NL to SPARQL module is not yet reliable for unsupervised use.
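The four result-accuracy categories used above can be pictured as a simple set comparison between the generated query's result set and the gold-standard result set. The sketch below is purely illustrative (the function and category strings are assumptions, not the thesis implementation):

```python
def classify_results(generated: set, gold: set) -> str:
    """Classify a generated SPARQL result set against the gold-standard set."""
    missing = gold - generated   # gold rows the generated query failed to fetch
    extra = generated - gold     # rows the generated query fetched beyond the gold standard
    if not missing and not extra:
        return "identical"
    if missing and extra:
        return "under- and overfetching"
    if missing:
        return "underfetching"
    return "overfetching"

# Result rows modeled as tuples of bindings:
print(classify_results({("a",), ("b",)}, {("a",), ("b",)}))  # identical
print(classify_results({("a",)}, {("a",), ("b",)}))          # underfetching
```

Under this reading, "semantic accuracy" counts only the "identical" outcomes, which is why a query can execute successfully yet still fail the accuracy metric.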
Future work includes refining prompt-engineering strategies, developing dynamic context management for ontology retrieval, and implementing a post-processing validation layer to improve performance.
Place, publisher, year, edition, pages
2025, p. 74
Keywords [en]
Large Language Models, LLMs, NL to SPARQL, SPARQL to NL, Solid Pods, Decentralized Data Access, Semantic Web, RDF Data
National Category
Natural Language Processing; Computer and Information Sciences
Identifiers
URN: urn:nbn:se:liu:diva-219592
ISRN: LIU-IDA/LITH-EX-A--25/102--SE
OAI: oai:DiVA.org:liu-219592
DiVA, id: diva2:2015002
Subject / course
Computer Engineering
Presentation
2025-09-30, Alan Turing, LIU, Linköping, 08:00 (Swedish)
Supervisors
Examiners
Available from: 2025-11-24 Created: 2025-11-19 Last updated: 2025-11-24
Bibliographically approved