The Smart Data Analytics Group (UBO) is very pleased to announce that the group had 14 papers accepted for presentation at ISWC 2019, the 18th International Semantic Web Conference, held October 26–30, 2019 in Auckland, New Zealand.
The International Semantic Web Conference (ISWC) is the premier international forum where Semantic Web / Linked Data / Knowledge Graph researchers, practitioners, and industry specialists come together to discuss, advance, and shape the future of semantic technologies on the web, within enterprises and in the context of the public institution. ISWC is an A-ranked conference (CORE ranking) and currently 11th in Google Scholar in the category “Databases & Information Systems” with an h5-index of 41, as well as 4th among WWW-related conferences in MS Academic Search.
Here is the list of the accepted papers with their abstracts:
- “A Scalable Framework for Quality Assessment of RDF Datasets” by Gezim Sejdiu, Anisa Rula, Jens Lehmann, and Hajira Jabeen (Resources track).
Topic: Scalability, Data Quality
Abstract: Over the last years, Linked Data has grown continuously. Today, we count more than 10,000 datasets being available online following Linked Data standards. These standards allow data to be machine readable and inter-operable. Nevertheless, many applications, such as data integration, search, and interlinking, cannot take full advantage of Linked Data if it is of low quality. There exist a few approaches for the quality assessment of Linked Data, but their performance degrades with the increase in data size and quickly grows beyond the capabilities of a single machine. In this paper, we present DistQualityAssessment — an open source implementation of quality assessment of large RDF datasets that can scale out to a cluster of machines. This is the first distributed, in-memory approach for computing different quality metrics for large RDF datasets using Apache Spark. We also provide a quality assessment pattern that can be used to generate new scalable metrics that can be applied to big data. The work presented here is integrated with the SANSA framework and has been applied to at least three use cases beyond the SANSA community. The results show that our approach is more generic, efficient, and scalable as compared to previously proposed approaches.
- “Sparklify: A Scalable Software Component for Efficient evaluation of SPARQL queries over distributed RDF datasets” by Claus Stadler, Gezim Sejdiu, Damien Graux, and Jens Lehmann (Resources track).
Topic: Scalability, KG Querying
Abstract: One of the key traits of Big Data is its complexity in terms of representation, structure, or formats. One existing way to deal with it is offered by Semantic Web standards. Among them, RDF, which proposes to model data with triples representing edges in a graph, has received a large success and the semantically annotated data has grown steadily towards a massive scale. Therefore, there is a need for scalable and efficient query engines capable of retrieving such information. In this paper, we propose Sparklify: a scalable software component for efficient evaluation of SPARQL queries over distributed RDF datasets. It uses Sparqlify as a SPARQL-to-SQL rewriter for translating SPARQL queries into Spark executable code. Our preliminary results demonstrate that our approach is more extensible, efficient, and scalable as compared to state-of-the-art approaches. Sparklify is integrated into a larger SANSA framework and it serves as a default query engine and has been used by at least three external use scenarios.
- “Squerall: Virtual Ontology-Based Access to Heterogeneous and Large Data Sources” by Mohamed Nadjib Mami, Damien Graux, Simon Scerri, Hajira Jabeen, Sören Auer, and Jens Lehmann (Resources track).
Topic: Scalability, Querying
Abstract: The last two decades witnessed a remarkable evolution in terms of data formats, modalities, and storage capabilities. Instead of having to adapt one’s application needs to the earlier, limited storage options, today there is a wide array of options to choose from to best meet an application’s needs. This has resulted in vast amounts of data available in a variety of forms and formats which, if interlinked and jointly queried, can generate valuable knowledge and insights. In this article, we describe Squerall: a framework that builds on the principles of Ontology-Based Data Access (OBDA) to enable the querying of disparate heterogeneous sources using a unique query language, SPARQL. In Squerall, original data is queried on-the-fly without prior data materialization or transformation. In particular, Squerall allows the aggregation and joining of large data in a distributed manner. Squerall supports five data sources out of the box and, moreover, can be programmatically extended to cover more sources and incorporate new query engines. The framework provides user interfaces for the creation of necessary inputs, as well as guiding non-SPARQL experts to write SPARQL queries. Squerall is integrated into the popular SANSA stack and available as open-source software via GitHub and as a Docker image.
- “Entity Enabled Relation Linking” by Jeff J Pan, Mei Zhang, Kuldeep Singh, Frank Van Harmelen, Jinguang Gu, and Zhi Zhang (Research track).
Topic: QA/KG Querying
Abstract: Relation linking is an important problem for knowledge graph-based Question Answering. Given a natural language question and a knowledge graph, the task is to identify relevant relations from the given knowledge graph. Since existing techniques for entity extraction and linking are more stable compared to relation linking, our idea is to exploit entities extracted from the question to support relation linking. In this paper, we propose a novel approach, based on DBpedia entities, for computing relation candidates. We have empirically evaluated our approach on different standard benchmarks. Our evaluation shows that our approach significantly outperforms existing baseline systems in recall, precision, and runtime.
- “QaldGen: Towards Microbenchmarking of Question Answering Systems over Knowledge Graph” by Kuldeep Singh, Muhammad Saleem, Abhishek Nadgeri, Felix Conrads, Jeff Pan, Axel-Cyrille Ngonga Ngomo, and Jens Lehmann (Resources track).
Topic: QA
Abstract: Over the last years, a number of Linked Data-based Question Answering (QA) systems have been developed. Consequently, the series of Question Answering Over Linked Data (QALD1–QALD9) challenges and other datasets have been proposed to evaluate these systems. However, the QA datasets contain a fixed number of natural language questions and do not allow users to generate micro benchmarks tailored towards specific use-cases. We propose QaldGen, a natural language benchmark generation framework for Knowledge Graphs which is able to generate customised QA benchmarks from existing QA repositories. The framework is flexible enough to generate benchmarks of varying sizes and according to user-defined criteria on the most important features to be considered for QA benchmarking. This is achieved using different clustering algorithms. We compare state-of-the-art QA systems over knowledge graphs by using different QA benchmarks. The observed results show that specialised micro-benchmarking is important to pinpoint the limitations of the various components of QA systems.
- “Incorporating Literals into Knowledge Graph Embeddings” by Agustinus Kristiadi, Mohammad Asif Khan, Denis Lukovnikov, Jens Lehmann and Asja Fischer (Research track).
Topic: KG Embeddings
Abstract: Knowledge graphs are composed of different elements: entity nodes, relation edges, and literal nodes. Each literal node contains an entity’s attribute value (e.g. the height of an entity of type person) and thereby encodes information which in general cannot be represented by relations between entities alone. However, most of the existing embedding or latent-feature-based methods for knowledge graph analysis only consider entity nodes and relation edges, and thus do not take the information provided by literals into account. In this paper, we extend existing latent feature methods for link prediction by a simple portable module for incorporating literals, which we name LiteralE. Unlike in concurrent methods where literals are incorporated by adding a literal-dependent term to the output of the scoring function and thus only indirectly affect the entity embeddings, LiteralE directly enriches these embeddings with information from literals via a learnable parameterized function. This function can be easily integrated into the scoring function of existing methods and learned along with the entity embeddings in an end-to-end manner. In an extensive empirical study over three datasets, we evaluate LiteralE-extended versions of various state-of-the-art latent feature methods for link prediction and demonstrate that LiteralE presents an effective way to improve their performance. For these experiments, we augmented standard datasets with their literals, which we publicly provide as testbeds for further research. Moreover, we show that LiteralE leads to a qualitative improvement of the embeddings and that it can be easily extended to handle literals from different modalities.
- “SemanGit: A Linked Dataset from git” by Dennis Oliver Kubitza, Matthias Böckmann, and Damien Graux (Resources track).
Topic: Data Modelling
Abstract: The growing interest in free and open-source software which occurred over the last decades has accelerated the usage of versioning systems to help developers collaborate on the same projects. As a consequence, specific tools such as git and specialized open-source on-line platforms gained importance. In this study, we introduce and share SemanGit which provides a resource at the crossroads of both Semantic Web and git web-based version control systems. SemanGit is actually the first collection of linked data extracted from GitHub based on a git ontology we designed and extended to include specific GitHub features. In this article, we present the dataset, describe the extraction process according to the ontology, show some promising analyses of the data and outline how SemanGit could be linked with external datasets or enriched with new sources to allow for more complex analyses.
- “SEO: A Scientific Events Data Model” by Said Fathallah, Sahar Vahdati, Christoph Lange, and Sören Auer (Resources track).
Topic: Data Modelling
Abstract: Scientific events have become a key factor of scholarly communication for many scientific domains. They are considered as the focal point for establishing scientific relations between scholarly objects such as people (e.g., chairs, participants), places (e.g., location), actions (e.g., roles of participants), and artifacts (e.g., proceedings) in the scholarly communication domain. Metadata of scientific events have been made available in unstructured or semi-structured formats, which hides the interconnected and complex relationships between them and prevents transparency. To facilitate the management of such metadata, the representation of event-related information in an interoperable form requires a uniform conceptual modeling. The Scientific Events Ontology (OR-SEO) has been engineered to represent metadata of scientific events. We describe a systematic redesign of the information model that is used as a schema for the event pages of the OpenResearch.org community wiki, reusing well-known vocabularies to make OR-SEO interoperable in different contexts. OR-SEO is now in use on thousands of OpenResearch.org events pages, which enables users to represent structured knowledge about events without tackling technical implementation challenges and ontology development.
- “The KEEN Universe: An Ecosystem for Knowledge Graph Embeddings with a Focus on Reproducibility and Transferability” by Mehdi Ali, Hajira Jabeen, Charles Tapley Hoyt and Jens Lehmann (Resources track).
Topic: KG Embeddings
Abstract: There is an emerging trend of embedding knowledge graphs (KGs) in continuous vector spaces in order to use those for machine learning tasks. Recently, many knowledge graph embedding (KGE) models have been proposed that learn low dimensional representations while trying to maintain the structural properties of the KGs such as the similarity of nodes depending on their edges to other nodes. KGEs can be used to address tasks within KGs such as the prediction of novel links and the disambiguation of entities. They can also be used for downstream tasks like question answering and fact-checking. Overall, these tasks are relevant for the semantic web community. Despite their popularity, the reproducibility of KGE experiments and the transferability of proposed KGE models to research fields outside the machine learning community can be a major challenge. Therefore, we present the KEEN Universe, an ecosystem for knowledge graph embeddings that we have developed with a strong focus on reproducibility and transferability. The KEEN Universe currently consists of the Python packages PyKEEN (Python KnowlEdge Graph EmbeddiNgs), BioKEEN (Biological KnowlEdge Graph EmbeddiNgs), and the KEEN Model Zoo for sharing trained KGE models with the community.
- “DBpedia FlexiFusion – Best of Wikipedia > Wikidata > Your Data” by Johannes Frey, Marvin Hofer, Daniel Obraczka, Jens Lehmann and Sebastian Hellmann (Resources track).
Topic: Data Integration
Abstract: The data quality improvement of DBpedia has been in the focus of many publications in the past years with topics covering both knowledge enrichment techniques such as type learning, taxonomy generation, interlinking as well as error detection strategies such as property or value outlier detection, type checking, ontology constraints, or unit-tests, to name just a few. The concrete innovation of the DBpedia FlexiFusion workflow, leveraging the novel DBpedia PreFusion dataset, which we present in this paper, is to massively cut down the engineering workload to apply any of the vast methods available in shorter time and also make it easier to produce customized knowledge graphs or DBpedias. While FlexiFusion is flexible to accommodate other use cases, our main use case in this paper is the generation of richer, language-specific DBpedias for the 20+ DBpedia chapters, which we demonstrate on the Catalan DBpedia. In this paper, we define a set of quality metrics and evaluate them for Wikidata and DBpedia datasets of several language chapters. Moreover, we show that an implementation of FlexiFusion, performed on the proposed PreFusion dataset, increases data size, richness as well as quality in comparison to the source datasets.
- “Pretrained Transformers for Simple Question Answering” by Denis Lukovnikov, Asja Fischer and Jens Lehmann (Research track).
Topic: QA
Abstract: Answering simple questions over knowledge graphs is a well-studied problem in question answering. Previous approaches for this task built on recurrent and convolutional neural network (RNN and CNN) based architectures that use pretrained word embeddings. It was recently shown that a pretrained transformer network (BERT) can outperform RNN- and CNN-based approaches on various natural language processing tasks. In this work, we investigate how well BERT performs on the entity span prediction and relation prediction subtasks of simple QA. In addition, we provide an evaluation of both BERT and BiLSTM-based models in limited data scenarios.
- “LC-QuAD 2.0: A large dataset for complex question answering over Wikidata and DBpedia” by Mohnish Dubey, Debayan Banerjee, Abdelrahman Abdelkawi and Jens Lehmann (Resources track).
Topic: QA
Abstract: Providing machines with the capability of exploring knowledge graphs and answering natural language questions has been an active area of research over the past decade. In this direction, translating natural language questions to formal queries has been one of the key approaches. To advance the research area, several datasets like WebQuestions, QALD and LCQuAD have been published in the past. The biggest data set available for complex questions (LCQuAD) over knowledge graphs contains five thousand questions. We now provide LC-QuAD 2.0 (Large-Scale Complex Question Answering Dataset) with 30,000 questions, their paraphrases and their corresponding SPARQL queries. LC-QuAD 2.0 is compatible with both Wikidata and DBpedia 2018 knowledge graphs. In this article, we explain how the dataset was created and the variety of questions available with corresponding examples. We further provide a statistical analysis of the dataset.
- “Non-Goal Oriented Dialogues using KG-Copy Networks” by Debanjan Chaudhuri, Md Rashad Al Hasan Rony, Simon Kwoczek and Jens Lehmann (Research track).
Topic: Dialogue Systems / QA
Abstract: Non-goal oriented, generative dialogue systems lack the ability to generate answers with grounded facts. A knowledge graph can be considered an abstraction of the real world consisting of well-grounded facts. This paper addresses the problem of generating well-grounded responses by integrating knowledge graphs into the dialogue system's response generation process, in an end-to-end manner. A dataset for non-goal oriented dialogues is proposed in this paper in the domain of soccer, conversing on different clubs and national teams along with a knowledge graph for each of these teams. A novel neural network architecture is also proposed as a baseline on this dataset, which can integrate knowledge graphs into the response generation process, producing well-articulated, knowledge-grounded responses. Empirical evidence suggests that the proposed model performs better than other state-of-the-art models for knowledge graph integrated dialogue systems.
- “Learning to Rank Query Graphs for Complex Question Answering over Knowledge Graphs” by Gaurav Maheshwari, Priyansh Trivedi, Denis Lukovnikov, Nilesh Chakraborty, Asja Fischer, and Jens Lehmann (Research track).
Topic: KGQA
Abstract: In this paper, we conduct an empirical investigation of neural query graph ranking approaches for the task of complex question answering over knowledge graphs. We propose a novel self-attention based slot matching model which exploits the inherent structure of query graphs, our logical form of choice. Our proposed model generally outperforms other ranking models on two QA datasets over the DBpedia knowledge graph, evaluated in different settings. We also show that domain adaption and pre-trained language model based transfer learning yield improvements, effectively offsetting the general lack of training data.
Acknowledgment
This work was partly supported by the EU Horizon2020 projects BigDataOcean (GA no. 732310), Boost4.0 (GA no. 780732), QROWD (GA no. 723088), SLIPO (GA no. 731581), BETTER (GA 776280), QualiChain (GA 822404), CLEOPATRA (GA no. 812997), LIMBO (Grant no. 19F2029I), OPAL (no. 19F2028A), KnowGraphs (no. 860801), SOLIDE (no. 13N14456), Bio2Vec (grant no. 3454), LAMBDA (#809965), FAIRplus (#802750), the ERC project ScienceGRAPH (#819536), “Industrial Data Space Plus” (GA 01IS17031), Fraunhofer Cluster of Excellence “Cognitive Internet Technologies” (CCIT), “InclusiveOCW” (grant no. 01PE17004D), the German nationally funded BMBF project MLwin, the National Natural Science Foundation of China (61673304) and the Key Projects of National Social Science Foundation of China (11&ZD189), EPSRC grant EP/M025268/1, WWTF grant VRG18-013, the WMF-funded GlobalFactSync project, and by the ADAPT Centre for Digital Content Technology funded under the SFI Research Centres Programme (Grant 13/RC/2106) and co-funded under the European Regional Development Fund.