Swift Logic for Big Data and Knowledge Graphs


Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17)

Luigi Bellomarini(1), Georg Gottlob(1,2), Andreas Pieris(3), and Emanuel Sallinger(1)
(1) Department of Computer Science, University of Oxford, UK
(2) Institute of Information Systems, TU Wien, Austria
(3) School of Informatics, University of Edinburgh, UK

Abstract

Many modern companies wish to maintain knowledge in the form of a corporate knowledge graph and to use and manage this knowledge via a knowledge graph management system (KGMS). We formulate various requirements for a fully-fledged KGMS. In particular, such a system must be capable of performing complex reasoning tasks but, at the same time, achieve efficient and scalable reasoning over Big Data with an acceptable computational complexity. Moreover, a KGMS needs interfaces to corporate databases, the web, and machine learning and analytics packages. We present KRR formalisms and a system achieving these goals.

1 Introduction

The so-called knowledge economy, characteristic of the current Information Age, is rapidly gaining ground. According to [Amidon et al., 2005], as cited in [Wikipedia, 2017a], "The knowledge economy is the use of knowledge [...] to generate tangible and intangible values. Technology, and, in particular, knowledge technology, help to transform a part of human knowledge to machines. This knowledge can be used by decision support systems in various fields and generate economic value." The importance of knowledge as an essential economic driving force has been evident to most corporate decision makers since the late 1970s, and the idea of storing knowledge and processing it to derive valuable new knowledge existed in the context of expert systems. Alas, it seems that the technology of those 'early' times was not sufficiently mature: the available hardware was too slow and main memory too tight for more complex reasoning tasks; database management systems were too slow and too rigid; there was no web where an expert system could acquire data; machine learning, and, in particular, neural networks were ridiculed as largely unsuccessful; ontological reasoning was in its infancy and the available formalisms were much too complex for Big Data applications. Meanwhile, there has been huge technological progress, and also much research progress that has led to a better understanding of many aspects of knowledge processing and reasoning with large amounts of data. Hardware has evolved, database technology has significantly improved, there is a (semantic) web with linked open data, companies can participate in social networks, machine learning has made a dramatic breakthrough, and there is a better understanding of scalable reasoning mechanisms.

Because of this, and of some eye-opening showcase projects such as IBM Watson [High, 2012], thousands of large and medium-sized companies suddenly wish to manage their own knowledge graphs, and are looking for adequate knowledge graph management systems (KGMS).

The term knowledge graph originally only referred to Google's Knowledge Graph, namely, "a knowledge base used by Google to enhance its search engine's search results with semantic-search information gathered from a wide variety of sources" [Wikipedia, 2017b]. Meanwhile, further Internet giants (e.g. Facebook, Amazon) as well as some other very large companies have constructed their own knowledge graphs, and many more companies would like to maintain a private corporate knowledge graph incorporating large amounts of data in the form of facts, both from corporate and public sources, as well as rule-based knowledge. Such a corporate knowledge graph is expected to contain relevant business knowledge, for example, knowledge about customers, products, prices, and competitors rather than mainly world knowledge from Wikipedia and similar sources. It should be managed by a KGMS, i.e., a knowledge base management system (KBMS), which performs complex rule-based reasoning tasks over very large amounts of data and, in addition, provides methods and tools for data analytics and machine learning, whence the equation:

KGMS = KBMS + Big Data + Analytics

The word 'graph' in this context is often misunderstood to the extent that some IT managers think that acquiring a graph database system and feeding it with data is sufficient to achieve a corporate knowledge graph. Others erroneously think that knowledge graphs necessarily use RDF triple stores instead of plain relational data. Yet others think that knowledge graphs are limited to storing and analyzing social network data only. While knowledge graphs should indeed be able to manipulate graph data and reason over RDF and social networks, they should not be restricted to this. For example, restricting a knowledge graph to contain RDF data only would exclude the direct inclusion of standard relational data and the direct interaction with corporate databases.

Not much has been described in the literature about the architecture of a KGMS and the functions it should ideally fulfil. In Section 2 we briefly list what we believe are the main requirements for a fully-fledged KGMS. As indicated in Figure 1, which depicts our reference architecture, the central component of a KGMS is its core reasoning engine, which has access to a rule repository. Grouped around it are various modules that provide relevant data access and analytics functionalities (see Section 2 for details). We expect a KGMS to fulfil many of these functions.

Figure 1: KGMS Reference Architecture.

The reasoning core of a KGMS needs to provide a language for knowledge representation and reasoning (KRR). The data format for factual data should, as said, match the standard relational formalism so as to smoothly integrate corporate databases and data warehouses, and at the same time be suited for RDF and graph data. The rule language and reasoning mechanism should achieve a careful balance between expressive power and complexity. In Section 3 we present VADALOG, a Datalog-based language that matches this requirement. VADALOG belongs to the Datalog± family of languages that extend Datalog by existential quantifiers in rule heads, as well as by other features, and restrict at the same time its syntax so as to achieve decidability and data tractability; see, e.g., [Calì et al., 2013; Calì et al., 2012a; Calì et al., 2010; Calì et al., 2012b]. The logical core of the VADALOG language corresponds to Warded Datalog± [Arenas et al., 2014; Gottlob and Pieris, 2015], which captures plain Datalog as well as SPARQL queries under the entailment regime for OWL 2 QL [Glimm et al., 2013], and is able to perform ontological reasoning tasks. Reasoning with the logical core of VADALOG is computationally efficient.

After discussing the logical core of VADALOG and its beneficial properties in Section 3.1, we describe in Section 3.2 several features that have been added to it for achieving more powerful reasoning and data manipulation capabilities. To give just one example here, the language is augmented by monotonic aggregations [Shkapsky et al., 2015], which permit the use of aggregation (via summation, product, max, min, count) even in the presence of recursion. This enables us to swiftly solve problems such as the company control problem (studied e.g. in [Ceri et al., 2012]) as explained in the following example, which will serve as a running example.

Example 1.1 (Running Example) Assume the ownership relationship among a large number of companies is stored via facts (i.e., tuples of a database relation) of the form Own(comp1, comp2, w) meaning that company comp1 directly owns a fraction w of company comp2, with 0 < w ≤ 1. A company x controls a company y if x directly owns more than half of the shares of y or if x controls a set S of companies that jointly own more than half of y. Computing a predicate Control(x, y), expressing that company x controls company y, is then achieved in VADALOG by two rules:

Own(x, y, w), w > 0.5 → Control(x, y)
Control(x, y), Own(y, z, w), v = msum(w, ⟨y⟩), v > 0.5 → Control(x, z).

Here, for fixed x, the aggregate construct msum(w, ⟨y⟩) forms the sum over all values w such that for some company y, Control(x, y) is true, and Own(y, z, w) holds, i.e., company y directly owns fraction w of company z.

In Section 4 we introduce the VADALOG KGMS, which builds on the VADALOG language and combines it with existing and novel techniques from database and AI practice such as stream query processing, dynamic in-memory indexing and aggressive recursion control. The VADALOG system is Oxford's contribution to the VADA (Value Added Data Systems) research project [VADA, 2016; Furche et al., 2016; Konstantinou et al., 2017], which is a joint effort of the universities of Edinburgh, Manchester, and Oxford. An outlook on future research and developments is given in Section 5.

2 Desiderata for a KGMS

We proceed to briefly summarize what we think are the most important desiderata for a fully-fledged KGMS. We will list these requirements according to three categories, keeping in mind, however, that these categories are interrelated.

2.1 Language and System for Reasoning

There should be a logical formalism for expressing facts and rules, and a reasoning engine that uses this language, which should provide the following features.

Simple and Modular Syntax: It should be easy to add and delete facts and to add new rules. As in logic programming, facts should conceptually coincide with database tuples.

High Expressive Power: Datalog [Ceri et al., 2012; Huang et al., 2011] is a good yardstick for the expressive power of rule languages. Over ordered structures (which we may assume here), Datalog with very mild negation captures PTIME; see, e.g., [Dantsin et al., 2001]. A rule language should thus ideally be at least as expressive as plain recursive Datalog, possibly with mild negation.

Numeric Computation and Aggregations: The basic logical formalism and inference engine should be enriched by features for dealing with numeric values, including appropriate aggregate functions.

Probabilistic Reasoning: The language should be suited for incorporating appropriate methods of probabilistic reasoning, and the system should propagate probabilities or certainty values along the reasoning process, that is, compute probabilities or certainty values for derived facts, and make adjustments wherever necessary. Probabilistic models may range from simple triangular norm operators (T-norms; cf. [Hájek, 1998]) over probabilistic database models [Suciu et al., 2011] to Markov logic networks [Richardson and Domingos, 2006].

Ontological Reasoning: Ontological reasoning and query answering should be provided. We have two yardsticks here.

First, ontological reasoning to the extent of tractable description logics such as DL-Lite_R should be possible. Recall that DL-Lite_R forms the logical underpinning of the OWL 2 QL profile of the Web Ontology Language as standardized by the W3C. Second, it should be expressive enough to cover all SPARQL queries over RDF datasets under the entailment regime for OWL 2 QL [Glimm et al., 2013].

Low Complexity: Reasoning should be tractable in data complexity (i.e. when the rules are assumed to be fixed and the fact base is considered the input). Whenever possible, the system should recognize and take advantage of rule sets that can be processed within low space complexity classes such as NLOGSPACE (e.g. for SPARQL) or even AC0 (e.g. for traditional conjunctive database queries).

Rule Repository, Rule Management, and Ontology Editor: A library for storing recurring rules and definitions should be provided, as well as a user interface for rule management in the spirit of the ontology editor Protégé [Noy et al., 2001].

Dynamic Orchestration: For larger applications, there must be a master module to allow the orchestration of complex dataflows. For simple systems, the process must be easily specifiable. For complex systems, the process must be dynamically controllable through intelligent reasoning techniques or external control facilities and tools (e.g. BPM).

2.2 Accessing and Handling Big Data

Big Data Access: The system must be able to provide efficient access to Big Data sources and systems and fast reasoning algorithms over Big Data. In particular, the possibility of out-of-memory reasoning must be given in case the relevant data does not fit into main memory. Integration of Big Data processing techniques should be possible where the volume of data makes it necessary (see e.g. [Shkapsky et al., 2016]).

Database and Data Warehouse Access: Seamless access to relational and graph databases, data warehouses, RDF stores, and major NoSQL stores should be granted. Data in such repositories should be directly usable as factual data for reasoning.

Ontology-based Data Access (OBDA): OBDA [Calvanese et al., 2011] allows a system to compile a query that has been formulated on top of an ontology into one directly on the database. OBDA should be possible whenever appropriate.

Multi-Query Support: Where possible and appropriate, partial results from repeated (sub-)queries should be evaluated once [Roy et al., 2000] and optimized in this regard.

Data Cleaning, Exchange and Integration: Integrating, exchanging and cleaning data should be supported both directly (through an appropriate KRR formalism that is made available through various applications in the knowledge repository), and by allowing integration of third-party software.

Web Data Extraction, Interaction, and IoT: A KGMS should be able to interact with the web by (i) extracting relevant web data (e.g. prices advertised by competitors) and integrating these data into the local fact base, and (ii) exchanging data with web forms and servers that are available through a web interface. One way to achieve this will be discussed in Section 3.2. Similar methods can be used for interacting with the IoT through appropriate network-accessible APIs.

2.3 Embedding Procedural and Third-Party Code

Procedural Code: The system should have encapsulation methods for embedding procedural code (proprietary and third party) written in a variety of programming languages and offer a logical interface to it.

Third-Party Packages for Machine Learning, Text Mining, NLP, Data Analytics, and Data Visualization: The system should be equipped with direct access to powerful existing software packages for machine learning, text mining, data analytics, and data visualization. Given that excellent third-party software for these purposes exists, we believe that a KGMS should be able to use a multitude of such packages via appropriate logical interfaces.

3 The VADALOG Language

As said before, VADALOG is a KR language that achieves a careful balance between expressive power and complexity, and it can be used as the reasoning core of a KGMS. In Section 3.1 we discuss the logical core of VADALOG and some interesting fragments of it, while in Section 3.2 we discuss how this language can be extended with additional features that are much needed in real-world applications.

3.1 Core Language

The logical core of VADALOG is a member of the Datalog± family of knowledge representation languages, which we call Warded Datalog±. The main goal of Datalog± languages is to extend the well-known language Datalog with useful modeling features such as existential quantifiers in rule heads (the '+' in the symbol '±'), and at the same time restrict the rule syntax in such a way that the decidability and data tractability of reasoning is guaranteed (the '−' in the symbol '±').

The core of Datalog± languages consists of rules known as existential rules or tuple-generating dependencies, which essentially generalize Datalog rules with existential quantifiers in rule heads; henceforth, we adopt the term existential rule. An example of such an existential rule is

Person(x) → ∃y HasFather(x, y), Person(y)

which encodes that every person has a father who is also a person. In general, an existential rule is a first-order sentence

∀x̄ ∀ȳ (ϕ(x̄, ȳ) → ∃z̄ ψ(x̄, z̄))

where ϕ (the body) and ψ (the head) are conjunctions of atoms with constants and variables.

The semantics of a set of existential rules Σ over a database D, denoted Σ(D), is defined via the well-known chase procedure. Roughly, the chase adds new atoms to D (possibly involving null values used for satisfying the existentially quantified variables) until the final result Σ(D) satisfies all the existential rules of Σ. Notice that, in general, Σ(D) is infinite. Here is a simple example of the chase procedure.

Example 3.1 Consider the database D = {Person(Bob)}, and the existential rule

Person(x) → ∃y HasFather(x, y), Person(y).

The database atom triggers the above existential rule, and the chase adds to D the atoms

HasFather(Bob, ν1) and Person(ν1)

in order to satisfy it, where ν1 is a (labeled) null representing some unknown value. The new atom Person(ν1) triggers again the existential rule, and the chase adds the atoms

HasFather(ν1, ν2) and Person(ν2),

where ν2 is a new null. The result of the chase is the instance

{Person(Bob), HasFather(Bob, ν1)} ∪ ⋃_{i>0} {Person(νi), HasFather(νi, ν{i+1})},

where ν1, ν2, ... are (labeled) nulls.

Given a pair Q = (Σ, Ans), where Σ is a set of existential rules and Ans an n-ary predicate, the evaluation of Q over a database D, denoted Q(D), is defined as the set of tuples over the set C_D of constant values occurring in the database D that are entailed by D and Σ, i.e., the set

{⟨t1, ..., tn⟩ : Ans(t1, ..., tn) ∈ Σ(D) and each ti ∈ C_D}.

The main reasoning task that we are interested in is tuple inference: given a database D, a pair Q = (Σ, Ans), and a tuple of constants t̄, decide whether t̄ ∈ Q(D). This problem is very hard; in fact, it is undecidable, even when Q is fixed and only D is given as input [Calì et al., 2013]. This has led to a flurry of activity for identifying restrictions on existential rules that make the above problem decidable. Each such restriction gives rise to a new Datalog± language.

A set of existential rules is warded if, for each rule:

1. all the "dangerous" variables of the body, i.e., variables that may be unified with null values during the construction of the chase and are propagated to the head, occur in a single body atom, called the ward; and
2. the ward can share only "harmless" variables with the rest of the body, i.e., variables that are unified only with database constants during the construction of the chase.

Warded Datalog± consists of all the (finite) sets of warded existential rules. The rule in Example 3.1 is clearly warded. Another example of a warded set of existential rules follows:

Example 3.2 Consider the following rules encoding part of the OWL 2 direct semantics entailment regime for OWL 2 QL (see [Arenas et al., 2014; Gottlob and Pieris, 2015]):

Type(x, y), Restriction(y, z) → ∃w Triple(x, z, w)
Type(x, y), SubClass(y, z) → Type(x, z)
Triple(x, y, z), Inverse(y, w) → Triple(z, w, x)
Triple(x, y, z), Restriction(w, y) → Type(x, w).

It is easy to verify that the above set is warded, with the Type atoms in the first two rules and the Triple atoms in the last two acting as the wards. Indeed, a variable that occurs in an atom of the form Restriction(·, ·), or SubClass(·, ·), or Inverse(·, ·), is trivially harmless. However, variables that appear in the first position of Type, or in the first/third position of Triple, can be dangerous. Thus, the Type and Triple body atoms are indeed acting as the wards.

Let us now intuitively explain the meaning of the above set of existential rules: The first rule states that if a is of type b, encoded via the atom Type(a, b), while b represents the class that corresponds to the first attribute of some binary relation c, encoded via the atom Restriction(b, c), then there exists some value d such that the tuple (a, d) occurs in the binary relation c, encoded as the atom Triple(a, c, d). Analogously, the other rules encode the usual meaning of subclasses, inverses and the effect of restrictions on types.

Let us clarify that Warded Datalog± is a refinement of Weakly-Frontier-Guarded Datalog±, which is defined in the same way but without the condition (2) given above [Baget et al., 2011].
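To make the chase procedure of Example 3.1 concrete, here is a small Python sketch (our own toy encoding: facts as tuples, labeled nulls as strings; the function name and representation are ours, not part of any VADALOG API). Since the full chase result is infinite, the sketch bounds the number of chase rounds:

```python
import itertools

# A few chase rounds for the rule Person(x) -> ∃y HasFather(x, y), Person(y)
# over a database of facts such as {("Person", "Bob")}. Each unsatisfied
# Person atom triggers the rule, which invents a fresh labeled null.
def chase_person(db, rounds):
    fresh = itertools.count(1)          # generator of labeled nulls ν1, ν2, ...
    facts = set(db)
    for _ in range(rounds):
        new = set()
        for (pred, *args) in facts:
            if pred == "Person":
                x = args[0]
                # Trigger the rule only if no father of x is known yet.
                if not any(p == "HasFather" and a[0] == x
                           for (p, *a) in facts):
                    nu = f"ν{next(fresh)}"   # fresh null witnessing ∃y
                    new.add(("HasFather", x, nu))
                    new.add(("Person", nu))
        if not new - facts:
            break
        facts |= new
    return facts
```

Two rounds over D = {Person(Bob)} produce exactly the atoms listed in the example: HasFather(Bob, ν1), Person(ν1), HasFather(ν1, ν2), Person(ν2); further rounds would continue the infinite chain.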

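Returning to the running example, the two Control rules of Example 1.1 admit a naive fixpoint reading, sketched here in Python (a toy illustration with our own encoding of the Own relation as weighted triples; this is not VADALOG code and ignores the system's optimizations):

```python
# Naive fixpoint computation of Control from Example 1.1.
def compute_control(own):
    """own: set of (comp1, comp2, w) triples; comp1 owns fraction w of comp2."""
    # Rule 1: direct majority ownership.
    control = {(x, y) for (x, y, w) in own if w > 0.5}
    changed = True
    while changed:
        changed = False
        # Rule 2: for each x, sum (msum) the shares of each z held by
        # companies y that x is already known to control.
        for x in {a for (a, _) in control}:
            totals = {}
            for (y, z, w) in own:
                if (x, y) in control:
                    totals[z] = totals.get(z, 0.0) + w
            for z, v in totals.items():
                if v > 0.5 and (x, z) not in control:
                    control.add((x, z))
                    changed = True   # new control facts may enable more sums
    return control
```

For instance, if A owns 60% of B and 70% of D, while B and D each own 30% of C, then A controls B and D directly, and the aggregated 60% held through them yields Control(A, C).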