Federated RDF query processing is concerned with querying a federation of RDF data sources where the queries are expressed using a declarative query language (typically, the RDF query language SPARQL), and the data sources are autonomous and heterogeneous. The current literature in this context assumes that the data and the data sources are semantically homogeneous, while heterogeneity occurs at the level of data formats and access protocols.
Federations of RDF data sources offer great potential for queries that cannot be answered by a single data source. However, querying such federations poses several challenges, one of which is that different but semantically overlapping vocabularies may be used for the respective RDF data. Since the federation members usually retain their autonomy, this heterogeneity cannot simply be homogenized by modifying the data in the data sources. Therefore, handling this heterogeneity becomes a critical aspect of query planning and execution. We introduce an approach to address this challenge by leveraging vocabulary mappings for the processing of queries over federations with heterogeneous vocabularies. This approach not only translates SPARQL queries but also preserves the correctness of results during query execution. We demonstrate the effectiveness of the approach and measure how the application of vocabulary mappings affects the performance of federated query processing.
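To make the idea of leveraging vocabulary mappings concrete, the following sketch shows how a triple pattern might be translated for a federation member that uses a different vocabulary; the vocabularies, terms, and the assumed equivalence mapping are illustrative examples and are not taken from the paper.

    # Query as phrased by the user, using the FOAF vocabulary
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?name WHERE {
      ?p  a  foaf:Person ;
          foaf:name  ?name .
    }

    # Possible translation for a federation member whose data uses schema.org,
    # assuming a mapping with foaf:Person owl:equivalentClass schema:Person
    # and foaf:name owl:equivalentProperty schema:name
    PREFIX schema: <http://schema.org/>
    SELECT ?name WHERE {
      ?p  a  schema:Person ;
          schema:name  ?name .
    }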
Federated processing of queries over RDF data sources offers significant potential when a SPARQL query cannot be answered by a single data source alone. However, finding efficient plans to execute a query over a federation is challenging, especially if different federation members provide different types of data access interfaces. Different interfaces imply different request types, different forms of responses, and different physical algorithms that can be used, each of which consumes varying amounts of resources during query execution. This heterogeneity poses additional obstacles to the task of planning query executions, in addition to the inherent complexity arising from numerous possible join orderings and various physical algorithms. As a first step to address these challenges, we propose a cost model that captures the resource requirements of different operators depending on the type of federation member, allowing us to estimate the cost of a given query execution plan without actually executing it. To evaluate our approach, we conduct experiments on FedBench with our cost model and compare it to the current state-of-the-art approach to query planning for heterogeneous federations of RDF data sources.
Federations of RDF data sources provide great potential when queried for answers and insights that cannot be obtained from one data source alone. A challenge for planning the execution of queries over such a federation is that the federation may be heterogeneous in terms of the types of data access interfaces provided by the federation members. This challenge has not received much attention in the literature. This paper provides a solid formal foundation for future approaches that aim to address this challenge. Our main conceptual contribution is a formal language for representing query execution plans; additionally, we identify a fragment of this language that can be used to capture the result of selecting relevant data sources for different parts of a given query. As technical contributions, we show that this fragment is more expressive than what is supported by existing source selection approaches, which effectively highlights an inherent limitation of these approaches. Moreover, we show that the source selection problem is NP-hard and in Σ₂ᴾ, and we provide a comprehensive set of rewriting rules that can be used as a basis for query optimization.
GraphQL is a popular new approach to build Web APIs that enable clients to retrieve exactly the data they need. Given the growing number of tools and techniques for building GraphQL servers, there is an increasing need for comparing how particular approaches or techniques affect the performance of a GraphQL server. To this end, we present LinGBM, a GraphQL performance benchmark to experimentally study the performance achieved by various approaches for creating a GraphQL server. In this paper, we discuss the design considerations of the benchmark and describe its main components (data schema; query templates; performance metrics). Thereafter, we present experimental results obtained by applying the benchmark in two different use cases, which demonstrate the broad applicability of LinGBM.
Due to the OPTIONAL operator, the core fragment of the SPARQL query language is non-monotonic. That is, some solutions of a query result can be returned to the user only after having consulted all relevant parts of the queried dataset(s). This property presents an obstacle when developing query execution approaches that aim to reduce response times rather than the overall query execution times. Reducing the response times, i.e., returning as many solutions as early as possible, is important in particular in Web-based client-server query processing scenarios in which network latencies dominate query execution times. Such scenarios are typical in the context of integration of Web data sources where a data integration component executes queries over a decentralized federation of such data sources. In this paper we introduce an alternative operator that is similar in spirit to OPTIONAL but without causing non-monotonicity. We show fundamental properties of this operator and observe that the downside of achieving the desired monotonicity property is a potentially significant increase in query result sizes. We study the extent of this trade-off in practice.
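The following generic SPARQL query illustrates the issue (the vocabulary is an example, not taken from the paper): a solution in which ?email remains unbound may be emitted only once the engine is certain that no matching foaf:mbox triple exists in any of the relevant data, which is exactly what prevents early reporting of such solutions.

    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?person ?email WHERE {
      ?person  a  foaf:Person .
      OPTIONAL { ?person  foaf:mbox  ?email }
    }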
To answer queries over a federation of multiple data sources, a preliminary task is source selection; i.e., identify the federation members that can answer each part of the query and decompose the query into subqueries assigned to the respective federation members. Existing source selection approaches for federations of RDF data sources have been developed based on the assumption that the federation members are SPARQL endpoints. This paper presents an analytical study that investigates whether these approaches are still effective in the context of federations that are heterogeneous in terms of the types of data access interfaces. In particular, we identify what information about the data of the federation members is required by the approaches and analyze the possibilities and the effort of obtaining this information via the different types of data access interfaces. We find that almost all existing source selection approaches can be adopted for heterogeneous federations but obtaining the required information may not be practical.
A federation of RDF data sources offers enormous potential when the answers or insights sought by queries cannot be obtained from a single data source alone. As various interfaces for accessing RDF data have been proposed, one challenge for querying such a federation is that the federation members are heterogeneous in terms of the types of data access interfaces. There does not exist any research on systematic approaches to tackle this challenge. To provide a formal foundation for future approaches that aim to address this challenge, we have introduced a language, called FedQPL, that can be used for representing query execution plans in this setting. With a poster at the conference we generally want to outline the vision for the next generation of query engines for such federations and, in this context, we want to raise awareness in the Semantic Web community for our language. In this extended abstract, we first discuss challenges in query processing over such heterogeneous federations; thereafter, we briefly introduce our proposed language, which we have extended with a few new features that were not part of the originally published version.
While several approaches to query a federation of SPARQL endpoints have been proposed in the literature, very little is known about the effectiveness of these approaches and the behavior of the resulting query engines for cases in which the number of federation members increases. The existing benchmarks that are typically used to evaluate SPARQL federation engines do not consider such a form of scalability. In this paper, we set out to close this knowledge gap by investigating the behavior of 4 state-of-the-art SPARQL federation engines using a novel benchmark designed for scalability experiments. Based on the benchmark, we show that scalability is a challenge for each of these engines, especially with respect to the effectiveness of their source selection and query decomposition approaches. FedShop is freely available online at: https://github.com/GDD-Nantes/FedShop.
This paper summarizes the content of the 28 tutorials that have been given at The Web Conference 2023.
The standard approach to annotate statements in RDF with metadata has a number of shortcomings including data size blow-up and unnecessarily complicated queries. We propose an alternative approach that is based on nesting of RDF triples and of query patterns. The approach allows for a more compact representation of data and queries, and it is backwards compatible with the standard. In this paper we present the formal foundations of our proposal and of different approaches to implement it. More specifically, we formally capture the necessary extensions of the RDF data model and its query language SPARQL, and we define mappings based on which our extended notions can be converted back to ordinary RDF and SPARQL. Additionally, for such types of mappings we define two desirable properties, information preservation and query result equivalence, and we show that the introduced mappings possess these properties.
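As a small illustration of the nesting idea (using example IRIs), a metadata statement can refer to an embedded triple directly, and a query pattern can nest a triple pattern in the same way:

    # Data: a nested (embedded) triple annotated with metadata
    @prefix ex: <http://example.org/> .
    << ex:bob  ex:worksFor  ex:acme >>  ex:certainty  0.9 ;
                                        ex:source     ex:doc42 .

    # Query: a nested triple pattern that retrieves the annotation
    PREFIX ex: <http://example.org/>
    SELECT ?c WHERE {
      << ex:bob  ex:worksFor  ex:acme >>  ex:certainty  ?c .
    }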
The RDF*/SPARQL* approach extends RDF and SPARQL with means to capture and to query annotations of RDF triples, which is a feature that is natively available in graph databases modeled as Labeled Property Graphs (LPGs). Hence, the approach presents a step towards making the different graph database models interoperable. This paper takes this step further by providing a solid theoretical foundation for converting LPGs into RDF* data and for querying LPGs using the query language SPARQL*. Regarding the latter, the contributions in this paper consider approaches that materialize the RDF* representation of the LPGs into an RDF*-enabled triplestore as well as approaches in which the queried LPGs may reside in an LPG-specific database system.
The standard approach to annotate statements in RDF with metadata has a number of shortcomings including data size blow-up and complicated queries. We propose an alternative approach that is based on nesting of RDF triples and of query patterns. The approach allows for a more compact representation of data and queries, and it is backwards compatible with the standard.
The Triple Pattern Fragment (TPF) interface is a recent proposal for reducing server load in Web-based approaches to execute SPARQL queries over public RDF datasets. The price for less overloaded servers is a higher client-side load and a substantial increase in network load (in terms of both the number of HTTP requests and data transfer). In this paper, we propose a slightly extended interface that allows clients to attach intermediate results to triple pattern requests. The response to such a request is expected to contain triples from the underlying dataset that do not only match the given triple pattern (as in the case of TPF), but that are guaranteed to contribute in a join with the given intermediate result. Our hypothesis is that a distributed query execution using this extended interface can reduce the network load (in comparison to a pure TPF-based query execution) without reducing the overall throughput of the client-server system significantly. Our main contribution in this paper is twofold: we empirically verify the hypothesis and provide an extensive experimental comparison of our proposal and TPF.
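Conceptually, a request against the extended interface carries a triple pattern together with a set of bindings from an intermediate result; the sketch below conveys the idea in SPARQL-like notation (it is not the actual wire format of the interface, and the IRIs are examples):

    # Triple pattern of the request
    ?paper  <http://purl.org/dc/terms/creator>  ?author .
    # Intermediate result attached to the request, here written as a VALUES clause
    VALUES ?paper { <http://example.org/paper1>  <http://example.org/paper2> }
    # The server returns only those matching triples whose subject is one of the
    # listed papers, i.e., triples guaranteed to contribute to the join.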
After years of research and development, standards and technologies for semantic data are sufficiently mature to be used as the foundation of novel data science projects that employ semantic technologies in various application domains such as bio-informatics, materials science, criminal intelligence, and social science. Typically, such projects are carried out by domain experts who have a conceptual understanding of semantic technologies but lack the expertise to choose and to employ existing data management solutions for the semantic data in their project. For such experts, including domain-focused data scientists, project coordinators, and project engineers, our tutorial delivers a practitioner's guide to semantic data management. We discuss the following important aspects of semantic data management and demonstrate how to address these aspects in practice by using mature, production-ready tools: i) storing and querying semantic data; ii) understanding, iii) searching, and iv) visualizing the data; v) automated reasoning; vi) integrating external data and knowledge; and vii) cleaning the data.
GraphQL is a highly popular new approach to build Web APIs. An important component of this approach is the GraphQL schema definition language (SDL). The original purpose of this language is to define a so-called GraphQL schema that specifies the types of objects that can be queried when accessing a specific GraphQL Web API. This paper focuses on the question: Can we repurpose this language to define schemas for graph databases that are based on the Property Graph model? This question is relevant because there does not exist a commonly adopted approach to define schemas for Property Graphs, and because the form in which GraphQL APIs represent their underlying data sources is very similar to the Property Graph model. To answer the question we propose an approach to adopt the GraphQL SDL for Property Graph schemas. We define this approach formally and show its fundamental properties such as the complexity of checking the satisfiability of schemas and of validating data against a schema.
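As an illustration of the general idea (the types and fields are hypothetical examples, not taken from the paper), a Property Graph schema expressed in the GraphQL SDL may declare node types as object types, node properties as scalar fields, and edge types as fields whose value type is another object type:

    type Person {
      name: String!
      birthYear: Int
      actedIn: [Movie]     # outgoing ACTED_IN edges
    }

    type Movie {
      title: String!
      releaseYear: Int
      actors: [Person]     # ACTED_IN edges traversed in the reverse direction
    }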
GRADES-NDA is the premier workshop series on graph data management and analytics that aims to bring together researchers from academia, industry, and governmental organizations. GRADES-NDA'24 is a forum for discussing recent advances in (large-scale) graph data management and analytics systems, as well as proposing and discussing novel methods and techniques for addressing domain-specific challenges. In 2024, GRADES-NDA is in its seventh edition.
The Linked Data Fragment (LDF) framework has been proposed as a uniform view to explore the trade-offs of consuming Linked Data when servers provide (possibly many) different interfaces to access their data. Every such interface has its own particular properties regarding performance, bandwidth needs, caching, etc. Several practical challenges arise. For example, before exposing a new type of LDFs in some server, can we formally say something about how this new LDF interface compares to other interfaces previously implemented in the same server? From the client side, given a client with some restricted capabilities in terms of time constraints, network connection, or computational power, which is the best type of LDFs to complete a given task? Today there are only a few formal theoretical tools to help answer these and other practical questions, and researchers have embarked on solving them mainly by experimentation. In this paper we propose the Linked Data Fragment Machine (LDFM) which is the first formalization to model LDF scenarios. LDFMs work as classical Turing Machines with extra features that model the server and client capabilities. By proving formal results based on LDFMs, we draw a fairly complete expressiveness lattice that shows the interplay between several combinations of client and server capabilities. We also show the usefulness of our model to formally analyze the fine-grained interplay between several metrics such as the number of requests sent to the server, and the bandwidth of communication between client and server.
This paper provides an overview of a model for capturing properties of client-server-based query computation setups. This model can be used to formally analyze different combinations of client and server capabilities, and compare them in terms of various fine-grained complexity measures. While the motivations and the focus of the presented work are related to querying the Semantic Web, the main concepts of the model are general enough to be applied in other contexts as well.
The Web of Linked Data is composed of a vast number of RDF documents interlinked with one another, forming a huge repository of distributed semantic data. Effectively querying this distributed data source is an important open problem in the Semantic Web area. In this paper, we propose LDQL, a declarative language to query Linked Data on the Web. One of the novelties of LDQL is that it expresses separately (i) patterns that describe the expected query result, and (ii) Web navigation paths that select the data sources to be used for computing the result. We present a formal syntax and semantics, prove equivalence rules, and study the expressiveness of the language. In particular, we show that LDQL is strictly more expressive than all the query formalisms that have been proposed previously for Linked Data on the Web. We also study some computability issues regarding LDQL. We first prove that when considering the Web of Linked Data as a fully accessible graph, the evaluation problem for LDQL can be solved in polynomial time. Nevertheless, when the limited data access capabilities of Web clients are considered, the scenario changes drastically; there are LDQL queries for which a complete execution is not possible in practice. We formally study this issue and provide a sufficient syntactic condition to avoid this problem; queries satisfying this condition are guaranteed to have a procedure by which they can be effectively evaluated over the Web of Linked Data.
GraphQL is a recently proposed, and increasingly adopted, conceptual framework for providing a new type of data access interface on the Web. The framework includes a new graph query language whose semantics has been specified informally only. This has prevented the formal study of the main properties of the language. We embark on the formalization and study of GraphQL. To this end, we first formalize the semantics of GraphQL queries based on a labeled-graph data model. Thereafter, we analyze the language and show that it admits very efficient evaluation methods. In particular, we prove that the complexity of the GraphQL evaluation problem is NL-complete. Moreover, we show that the enumeration problem can be solved with constant delay. This implies that a server can answer a GraphQL query and send the response byte-by-byte while spending just a constant amount of time between every byte sent. Despite these positive results, we prove that the size of a GraphQL response might be prohibitively large for an internet scenario. We present experiments showing that current practical implementations suffer from this issue. We provide a solution to cope with this problem by showing that the total size of a GraphQL response can be computed in polynomial time. Our results on polynomial-time size computation plus the constant-delay enumeration can help developers to provide more robust GraphQL interfaces on the Web.
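A brief, illustrative example of the response-size issue (the field names are hypothetical): in a cyclic schema, every additional level of nesting in a query such as the one below can multiply the number of objects in the response, so even a short query may produce a very large result.

    query {
      user(id: "alice") {
        name
        friends {
          name
          friends {
            name
            friends { name }
          }
        }
      }
    }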
Linked Data on the Web represents an immense source of knowledge suitable to be automatically processed and queried. In this respect, there are different approaches for Linked Data querying that differ on the degree of centralization adopted. On one hand, the SPARQL query language, originally defined for querying single datasets, has been enhanced with features to query federations of datasets; however, this attempt is not sufficient to cope with the distributed nature of data sources available as Linked Data. On the other hand, extensions or variations of SPARQL aim to find trade-offs between centralized and fully distributed querying. The idea is to partially move the computational load from the servers to the clients. Despite the variety and the relative merits of these approaches, as of today, there is no standard language for querying Linked Data on the Web. A specific requirement for such a language to capture the distributed, graph-like nature of Linked Data sources on the Web is support for graph navigation. Recently, SPARQL has been extended with a navigational feature called property paths (PPs). However, the semantics of SPARQL restricts the scope of navigation via PPs to single RDF graphs. This restriction limits the applicability of PPs for querying distributed Linked Data sources on the Web. To fill this gap, in this paper we provide formal foundations for evaluating PPs on the Web, thus contributing to the definition of a query language for Linked Data. We first introduce a family of reachability-based query semantics for PPs that distinguish between navigation on the Web and navigation at the data level. Thereafter, we consider another, alternative query semantics that couples Web graph navigation and data level navigation; we call it context-based semantics. Given these semantics, we find that for some PP-based SPARQL queries a complete evaluation on the Web is not possible. To study this phenomenon we introduce a notion of Web-safeness of queries, and prove a decidable syntactic property that enables systems to identify queries that are Web-safe. In addition to establishing these formal foundations, we conducted an experimental comparison of the context-based semantics and a reachability-based semantics. Our experiments show that when evaluating a PP-based query under the context-based semantics one experiences a significantly smaller number of dereferencing operations, but the computed query result may contain fewer solutions.
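For readers unfamiliar with property paths, the following query is a simple example (using the FOAF vocabulary and an example IRI): it asks for everyone reachable from a given person via one or more foaf:knows links. Under the standard SPARQL semantics, such a path is navigated within a single RDF graph only, whereas the semantics studied in this paper let the navigation follow data links across documents on the Web.

    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?person WHERE {
      <http://example.org/alice>  foaf:knows+  ?person .
    }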
Facebook’s GraphQL is a recently proposed, and increasingly adopted, conceptual framework for providing a new type of data access interface on the Web. The framework includes a new graph query language whose semantics has been specified informally only. The goal of this paper is to understand the properties of this language. To this end, we first provide a formal query semantics. Thereafter, we analyze the language and show that it has a very low complexity for evaluation. More specifically, we show that the combined complexity of the main decision problems is in NL (Nondeterministic Logarithmic Space) and, thus, they can be solved in polynomial time and are highly parallelizable.
The traversal-based approach to execute queries over Linked Data on the WWW fetches data by traversing data links and, thus, is able to make use of up-to-date data from initially unknown data sources. While the downside of this approach is the delay before the query engine completes a query execution, user-perceived response time may be improved significantly by returning as many elements of the result set as early as possible. To this end, the query engine requires a traversal strategy that enables the engine to fetch result-relevant data as early as possible. The challenge for such a strategy is that the query engine does not know a priori which of the data sources discovered during the query execution will contain result-relevant data. In this paper, we investigate 14 different approaches to rank traversal steps and achieve a variety of traversal strategies. We experimentally study their impact on response times and compare them to a baseline that resembles a breadth-first traversal. While our experiments show that some of the approaches can achieve noteworthy improvements over the baseline in a significant number of cases, we also observe that for every approach, there is a non-negligible chance to achieve response times that are worse than the baseline.
GRADES-NDA is the premier workshop series on graph data management and analytics that aims to bring together researchers from academia, industry, and government. GRADES-NDA'23 is a forum for discussing recent advances in (large-scale) graph data management and analytics systems, as well as proposing and discussing novel methods and techniques for addressing domain-specific challenges or handling noise in real-world graphs. In 2023, GRADES-NDA is in its sixth edition.
RDF Stream Processing (RSP) has been proposed as a candidate for bringing together the Complex Event Processing (CEP) paradigm and the Semantic Web standards. In this paper, we investigate the impact of explicitly representing and processing uncertainty in RSP for use in CEP. Additionally, we provide a representation for capturing the relevant notions of uncertainty in the RSP-QL* data model and describe query functions that can operate on this representation. The impact evaluation is based on a use case within electronic healthcare, where we compare the query execution overhead of different uncertainty options in a prototype implementation. The experiments show that the influence on query execution performance varies greatly, but that uncertainty can have a noticeable impact on query execution performance. On the other hand, the overhead grows linearly with respect to the stream rate for all uncertainty options in the evaluation, and the observed performance is sufficient for many use cases. Extending the representation and operations to support more uncertainty options and investigating different query optimization strategies to reduce the impact on execution performance remain important areas for future research.
RDF Stream Processing (RSP) has been proposed as a way of bridging the gap between the Complex Event Processing (CEP) paradigm and the Semantic Web standards. Uncertainty has been recognized as a critical aspect in CEP, but it has received little attention within the context of RSP. In this paper, we investigate the impact of different RSP optimization strategies for uncertainty management. The paper describes (1) an extension of the RSP-QL* data model to capture bind expressions, filter expressions, and uncertainty functions; (2) optimization techniques related to lazy variables and caching of uncertainty functions, and a heuristic for reordering uncertainty filters in query plans; and (3) an evaluation of these strategies in a prototype implementation. The results show that using a lazy variable mechanism for uncertainty functions can improve query execution performance by orders of magnitude while introducing negligible overhead. The results also show that caching uncertainty function results can improve performance under most conditions, but that maintaining this cache can potentially add overhead to the overall query execution process. Finally, the effect of the proposed heuristic on query execution performance was shown to depend on multiple factors, including the selectivity of uncertainty filters, the size of intermediate results, and the cost associated with the evaluation of the uncertainty functions.
RSP-QL was developed by the W3C RDF Stream Processing (RSP) community group as a common way to express and query RDF streams. However, RSP-QL does not provide any way of annotating data on the statement level, for example, to express the uncertainty that is often associated with streaming information. Instead, the only way to provide such information has been to use RDF reification, which adds additional complexity to query processing, and is syntactically verbose. In this paper, we define an extension of RSP-QL, called RSP-QL*, that provides an intuitive way for supporting statement-level annotations in RSP. The approach leverages the concepts previously described for RDF* and SPARQL*. We illustrate the proposed approach based on a scenario from a research project in e-health. An open-source implementation of the proposal is provided and compared to the baseline approach of using RDF reification. The results show that this way of dealing with statement-level annotations offers advantages with respect to both data transfer bandwidth and query execution performance.
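As a brief illustration of what statement-level annotation buys compared to reification (the IRIs and property names are examples; prefix declarations are omitted):

    # RDF-star style annotation of a single streamed statement:
    << :patient1  :hasHeartRate  112 >>  :confidence  0.87 .

    # The same information with standard RDF reification: four reification
    # triples plus the annotation triple.
    _:s  rdf:type       rdf:Statement ;
         rdf:subject    :patient1 ;
         rdf:predicate  :hasHeartRate ;
         rdf:object     112 ;
         :confidence    0.87 .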
Process mining has significantly transformed business process management by introducing innovative data-based analysis techniques and empowering organizations to unveil hidden insights previously buried within their recorded data. The analysis is conducted on event logs structured by conceptual models. Traditional models were defined based on only a single case notion, e.g., order or item in the purchase process. This limitation hinders the application of process mining in practice, for which new data models have been developed, namely the Event Knowledge Graph (EKG) and the Object-Centric Event Log (OCEL). While several tools have been developed for OCEL, there is a lack of process mining tooling around the EKG. In addition, there is a lack of comparison regarding the practical implications of choosing one approach over another. To fill this gap, the contribution of this paper is threefold. First, it defines and implements an algorithm to transform event logs represented as EKGs into OCEL. The implementation is used to transform five real event logs, based on which the approach is evaluated. Second, it compares the performance of analyzing event logs represented in these two models. Third, it compares and reveals similarities and differences in analyzing processes based on event logs represented in these two models. The results highlight ten important findings, including different approaches in calculating directly-follows relations when analyzing filtered event logs in these models and the limitations of OCEL in supporting event lifecycle and inter-log relation analysis.
Today's space of graph database solutions is characterized by two main technology stacks that have evolved separately from one another: on one hand, there are systems that focus on supporting the RDF family of standards; on the other hand, there is the Property Graph category of systems. As a basis for bringing these stacks together and, in particular, to facilitate data exchange between the different types of systems, different direct mappings between the underlying graph data models have been introduced in the literature. While fundamental properties are well-documented for most of these mappings, the same cannot be said about the practical implications of choosing one mapping over another. Our research aims to contribute towards closing this gap. In this paper we report on a preliminary study for which we have selected two direct mappings from (Labeled) Property Graphs to RDF, where one of them uses features of the RDF-star extension to RDF. We compare these mappings in terms of the query performance achieved by two popular commercial RDF stores, GraphDB and Stardog, in which the converted data is imported. While we find that, for both of these systems, none of the mappings is a clear winner in terms of guaranteeing better query performance, we also identify types of queries that are problematic for the systems when using one mapping but not the other.
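To convey the flavor of such direct mappings (the following is an illustrative sketch with example IRIs, not necessarily the exact mappings selected for the study), consider an LPG edge (alice)-[:RATED {stars: 5}]->(movie1) and two ways to represent it in RDF:

    # Without RDF-star: the edge is materialized as an intermediate resource
    # so that the edge property can be attached to it.
    ex:r1  a         ex:RATED ;
           ex:from   ex:alice ;
           ex:to     ex:movie1 ;
           ex:stars  5 .

    # With RDF-star: the edge becomes a triple, and the edge property is
    # attached to the nested triple directly.
    << ex:alice  ex:RATED  ex:movie1 >>  ex:stars  5 .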
GraphQL is a query language for APIs that has been increasingly adopted by web developers since its specification was open sourced in 2015. The GraphQL framework lets API clients tailor data requests by using queries that return JSON objects described using a GraphQL schema. We present initial results of an exploratory empirical study with the goal of characterizing GraphQL schemas in open code repositories and package registries. Our first approach identifies over 20 thousand GraphQL-related projects in publicly accessible repositories hosted by GitHub. Our second, and complementary, approach uses package registries to select 30 GraphQL “reference” packages (the ones with the highest dependency counts), and then finds their 90 thousand dependent packages (and the related repositories in GitHub, GitLab, and Bitbucket). In addition, over 2 thousand schema files were loaded into the GraphQL.js reference implementation to conduct a detailed analysis of the schema information. Our study provides insights into the usage of different schema constructs, the number of distinct types and the most popular types in schemas, as well as the presence of cycles in schemas.
Many datasets change over time. As a consequence, long-running applications that cache and repeatedly use query results obtained from a SPARQL endpoint may resubmit the queries regularly to ensure up-to-dateness of the results. While this approach may be feasible if the number of such regular refresh queries is manageable, with an increasing number of applications adopting this approach, the SPARQL endpoint may become overloaded with such refresh queries. A more scalable approach would be to use a middleware component at which the applications register their queries and get notified with updated query results once the results have changed. Then, this middleware can schedule the repeated execution of the refresh queries without overloading the endpoint. In this paper, we study the problem of scheduling refresh queries for a large number of registered queries by assuming an overload-avoiding upper bound on the length of a regular time slot available for testing refresh queries. We investigate a variety of scheduling strategies and compare them experimentally in terms of the time slots needed before they recognize changes and the number of changes that they miss.
In the materials design domain, much of the data from materials calculations is stored in different heterogeneous databases with different data and access models. Therefore, accessing and integrating data from different sources is challenging. As ontology-based access and integration alleviate these issues, in this paper we address data access and interoperability for computational materials databases by developing the Materials Design Ontology. This ontology is inspired and guided by the OPTIMADE effort, which aims to make materials databases interoperable and includes many of the data providers in computational materials science. In this paper, first, we describe the development and the content of the Materials Design Ontology. Then, we use a topic-model-based approach to propose additional candidate concepts for the ontology. Finally, we show the use of the Materials Design Ontology by a proof-of-concept implementation of a data access and integration system for materials databases based on the ontology.
Amazon Neptune is a graph database service that supports two graph models: the W3C's Resource Description Framework (RDF) and Labeled Property Graphs (LPG). Customers choose one or the other model. This choice determines which data modeling features can be used and - perhaps more importantly - which query languages are available. The choice between the two technology stacks is difficult and time consuming. It requires consideration of data modeling aspects, query language features, their adequacy for current and future use cases, as well as developer knowledge. Even in cases where customers evaluate the pros and cons and make a conscious choice that fits their use case, over time we often see requirements from new use cases emerge that could be addressed more easily with a different data model or query language. It is therefore highly desirable that the choice of the query language can be made without consideration of what graph model is chosen and can be easily revised or complemented at a later point. To this end, we advocate and explore the idea of OneGraph ("1G" for short), a single, unified graph data model that embraces both RDF and LPGs. The goal of 1G is to achieve interoperability at both the data level, by supporting the co-existence of RDF and LPG in the same database, and the query level, by enabling queries and updates over the unified data model with a query language of choice. In this paper, we sketch our vision and investigate technical challenges towards a unification of the two graph data models.
A GraphQL server contains two building blocks: (1) a GraphQL schema defining the types of data objects that can be requested; (2) resolver functions fetching the relevant data from underlying data sources. GraphQL can be used for data integration if the GraphQL schema provides an integrated view of data from multiple data sources, and the resolver functions are implemented accordingly. However, there does not exist a semantics-aware approach to use GraphQL for data integration. We proposed a framework using GraphQL for data integration in which a global domain ontology informs the generation of a GraphQL server. Furthermore, we implemented a prototype of this framework, OBG-gen. In this paper, we demonstrate OBG-gen in a real-world data integration scenario in the materials design domain and in a synthetic benchmark scenario.
In a GraphQL Web API, a so-called GraphQL schema defines the types of data objects that can be queried, and so-called resolver functions are responsible for fetching the relevant data from underlying data sources. Thus, we can expect to use GraphQL not only for data access but also for data integration, if the GraphQL schema reflects the semantics of data from multiple data sources, and the resolver functions can obtain data from these data sources and structure the data according to the schema. However, there does not exist a semantics-aware approach to employ GraphQL for data integration. Furthermore, there are no formal methods for defining a GraphQL API based on an ontology. In this work, we introduce a framework for using GraphQL in which a global domain ontology informs the generation of a GraphQL server that answers requests by querying heterogeneous data sources. The core of this framework consists of an algorithm to generate a GraphQL schema based on an ontology and a generic resolver function based on semantic mappings. We provide a prototype, OBG-gen, of this framework, and we evaluate our approach over a real-world data integration scenario in the materials design domain and two synthetic benchmark scenarios (Linköping GraphQL Benchmark and GTFS-Madrid-Bench). The experimental results of our evaluation indicate that: (i) our approach is feasible for generating GraphQL servers for data access and integration over heterogeneous data sources, thus avoiding a manual construction of GraphQL servers, and (ii) our data access and integration approach is general and applicable to different domains where data is shared or queried in different ways.
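A minimal sketch of the general idea behind the schema generation (the ontology terms, type names, and field names are hypothetical and chosen for illustration only): an ontology class with a datatype property and an object property could give rise to GraphQL object types such as the following, while the generic resolver function would use the semantic mappings to fetch and assemble the corresponding data from the underlying sources.

    type Material {
      id: ID!
      chemicalFormula: String        # derived from a datatype property of the class
      calculations: [Calculation]    # derived from an object property to another class
    }

    type Calculation {
      id: ID!
      method: String
    }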
The runtime optimization of federated SPARQL query engines is of central importance to ensure the usability of the Web of Data in real-world applications. The efficient selection of sources (SPARQL endpoints in our case) as well as the generation of optimized query plans belong to the most important optimization steps in this respect. This paper presents CostFed, an index-assisted federation engine for federated SPARQL query processing over multiple SPARQL endpoints. CostFed makes use of statistical information collected from endpoints to perform efficient source selection and cost-based query planning. In contrast to the state of the art, it relies on a non-linear model for the estimation of the selectivity of joins. Therewith, it is able to generate better plans than the state-of-the-art federation engines. In an experimental evaluation based on the FedBench benchmark, we show that CostFed is 3 to 121 times faster than state-of-the-art SPARQL endpoint federation engines.