Personal data are increasingly disseminated over the Web through mobile devices and smart environments, and are exploited to develop more and more sophisticated services and applications. All these advances come with serious risks of privacy breaches that may reveal private information that data producers want to keep undisclosed. It is therefore of utmost importance to help them identify the privacy risks raised by the requests that service providers make for utility purposes. In this paper, we first formalize privacy risks through privacy queries, expressed (and kept secret) by each data producer to specify the data they do not want disclosed. Then, we develop a formal approach for detecting incompatibility between privacy and utility queries expressed as temporal aggregate conjunctive queries. The distinguishing point of our approach is that it is data-independent and comes with an explanation based on the query expressions only. This explanation is intended to help data producers understand the detected privacy breaches and guide their choice of the appropriate technique to correct them.
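As a rough illustration of the data-independent flavour of the compatibility check described in the abstract above, the sketch below represents privacy and utility queries as sets of triple patterns and flags a potential conflict whenever a utility pattern unifies with a privacy pattern. It is only a simplified Python sketch under assumed representations; the actual formalization uses temporal aggregate conjunctive queries and a much finer analysis.

```python
# Minimal sketch: query-level overlap test between privacy and utility
# queries, each represented as a set of triple patterns (subject, predicate,
# object). Strings starting with '?' are variables. This is an illustrative
# simplification, not the paper's actual algorithm.

def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def unifiable(p1, p2):
    """Two triple patterns unify if, position by position,
    the terms are equal or at least one of them is a variable."""
    return all(a == b or is_var(a) or is_var(b) for a, b in zip(p1, p2))

def potential_conflicts(privacy_query, utility_query):
    """Return pairs of patterns that overlap, i.e. the utility query
    may ask for data the producer wants to keep undisclosed."""
    return [(p, u) for p in privacy_query for u in utility_query
            if unifiable(p, u)]

privacy = {("?person", "hasDisease", "?d")}             # to keep secret
utility = {("?person", "hasDisease", "Diabetes"),        # requested by provider
           ("?person", "livesIn", "?city")}

for p, u in potential_conflicts(privacy, utility):
    print("privacy pattern", p, "overlaps with utility pattern", u)
```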
In this paper, we present PARTICUL, a novel algorithm for unsupervised learning of part detectors from datasets used in fine-grained recognition. It exploits the macro-similarities of all images in the training set in order to mine for recurring patterns in the feature space of a pre-trained convolutional neural network. We propose new objective functions enforcing the locality and unicity of the detected parts. Additionally, we embed our detectors with a confidence measure based on correlation scores, allowing the system to estimate the visibility of each part. We apply our method on two public fine-grained datasets (Caltech-UCSD Bird 200 and Stanford Cars) and show that our detectors can consistently highlight parts of the object while providing a good measure of the confidence in their prediction. We also demonstrate that these detectors can be directly used to build part-based fine-grained classifiers that provide a good compromise between the transparency of prototype-based approaches and the performance of non-interpretable methods.
Despite their potential, CONSTRUCT queries have gained little traction so far among data practitioners, vendors and researchers. In this paper, we first exhibit performance bottlenecks of existing triplestores for supporting CONSTRUCT queries over large knowledge graphs. Then, we describe a novel Spark-based architecture for big triplestores, called TESS, that we have designed and implemented to overcome the above limitations by using parallel computing. TESS ensures the ACID properties that are required for a sound and complete implementation of CONSTRUCT-based forward-chaining rule reasoning.
Specialized ontologies are constructed to capture the skills of experienced experts, with the goal of sharing them with a larger community of trainees and less experienced experts in the domain. A challenging task in the engineering pipeline of specialized ontologies is ontology updating, which encompasses enrichment and population. Domain experts require an understanding of the RDF notation and OWL semantics to update their ontologies, which is not often the case. In this paper, we present IOPE, an interactive framework for the automatic construction of a Graphical User Interface (GUI) to support the controlled update process, by enabling the experts to easily interact with their ontology. We contribute a set of "mapping rules" to transform the ontological constraints into interactive widgets organized in the form of pre-filled Web pages in the GUI, and a set of "binding rules" to transform the expert interactions into RDF graphs, and hence perform the updates. In an extensive set of experiments, we illustrate the efficacy of IOPE in empowering medical experts to update their specialized ontology.
Privacy is a major concern when publishing new datasets in the context of Linked Open Data (LOD). A new dataset published in the LOD is indeed exposed to privacy breaches due to the linkage to objects already present in the other datasets of the LOD. In this paper, we focus on the problem of building safe anonymizations of an RDF graph to guarantee that linking the anonymized graph with any external RDF graph will not cause privacy breaches. Given a set of privacy queries as input, we study the data-independent safety problem and the sequence of anonymization operations necessary to enforce it. We provide sufficient conditions under which an anonymization instance is safe given a set of privacy queries. Additionally, we show that our algorithms for RDF data anonymization are robust in the presence of sameAs links that can be explicit or inferred by additional knowledge.
We introduce OntoSIDES, the core of an ontology-based learning management system in Medicine, in which the educational content, the traces of students' activities and the correction of exams are linked and related to items of an official reference program in a unified RDF data model. OntoSIDES is an RDF knowledge base comprising a lightweight domain ontology, which serves as a pivot high-level vocabulary for the query interface with users, and a dataset of factual statements relating individual entities to classes and properties of the ontology. Thanks to automatic mapping-based data materialization and rule-based data saturation, OntoSIDES contains around 8 million triples to date, and provides integrated access to useful information for student progress monitoring, using a powerful query language (namely SPARQL) that allows users to express their specific data exploration and analysis needs. Since we do not expect end-users to master the raw syntax of SPARQL and to express complex queries directly in SPARQL, we have designed a set of parametrized queries that users can customize through a user-friendly interface.
We introduce and develop a declarative framework for privacy-preserving Linked Data publishing in which privacy and utility policies are specified as SPARQL queries. Our approach is data-independent and only inspects the privacy and utility policies in order to determine the sequence of anonymization operations applicable to any graph instance for satisfying the policies. We prove the soundness of our algorithms and gauge their performance through experiments.
Linked Data provides access to huge, continuously growing amounts of open data and ontologies in RDF format that describe entities, links and properties on those entities. Equipping Linked Data with inference paves the way to making the Semantic Web a reality. In this survey, we describe a unifying framework for RDF ontologies and databases that we call deductive RDF triplestores. It consists in equipping RDF triplestores with Datalog inference rules. This rule language makes it possible to capture, in a uniform manner, OWL constraints that are useful in practice, such as property transitivity or symmetry, but also domain-specific rules with practical relevance for users in many domains of interest. The expressivity and the genericity of this framework are illustrated for modeling Linked Data applications and for developing inference algorithms. In particular, we show how it allows modeling the problem of data linkage in Linked Data as a reasoning problem on possibly decentralized data. We also explain how it makes it possible to efficiently extract expressive modules from Semantic Web ontologies and databases with formal guarantees, whilst effectively controlling their succinctness. Experiments conducted on real-world datasets have demonstrated the feasibility of this approach and its usefulness in practice for data integration and information extraction.
Discovering whether or not two URIs described in Linked Data, in the same or different RDF datasets, refer to the same real-world entity is crucial for building applications that exploit the cross-referencing of open data. A major challenge in data interlinking is to design tools that effectively deal with incomplete and noisy data, and exploit uncertain knowledge. In this paper, we model data interlinking as a reasoning problem with uncertainty. We introduce a probabilistic framework for modelling and reasoning over uncertain RDF facts and rules that is based on the semantics of probabilistic Datalog. We have designed an algorithm, ProbFR, based on this framework. Experiments on real-world datasets have shown the usefulness and effectiveness of our approach for data linkage and disambiguation.
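To give a concrete feel for the kind of reasoning over uncertain facts and rules described above, here is a small Python sketch that combines fact probabilities within a derivation and across alternative derivations of a target same-as fact. It assumes independence of facts and derivations (a simplification) and is not the ProbFR implementation; the example probabilities are invented for illustration.

```python
# Illustrative sketch (not ProbFR): combine the probabilities of independent
# uncertain facts used in alternative derivations of a target fact, following
# the usual probabilistic-Datalog reading (product within a derivation,
# noisy-or across derivations, both under an independence assumption).

from functools import reduce

def derivation_prob(fact_probs):
    """Probability that one derivation holds, assuming its facts are independent."""
    return reduce(lambda acc, p: acc * p, fact_probs, 1.0)

def fact_prob(derivations):
    """Noisy-or over alternative, assumed-independent derivations."""
    p_none = 1.0
    for fact_probs in derivations:
        p_none *= 1.0 - derivation_prob(fact_probs)
    return 1.0 - p_none

# Two alternative derivations of sameAs(a, b): one via a shared identifier
# (facts with probabilities 0.9 and 0.8), one via matching names and dates
# (0.7 and 0.6). The numbers are purely illustrative.
derivations = [[0.9, 0.8], [0.7, 0.6]]
print("P(sameAs(a, b)) =", round(fact_prob(derivations), 3))
```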
Ontology-Based Data Access has been studied so far for relational structures and deployed on top of relational databases. This paradigm enables a uniform access to heterogeneous data sources, also coping with incomplete information. Whether OBDA is suitable also for non-relational structures, like those shared by increasingly popular NOSQL languages, is still an open question. In this paper, we study the problem of answering ontology-mediated queries on top of key-value stores. We formalize the data model and core queries of these systems, and introduce a rule language to express lightweight ontologies on top of data. We study the decidability and data complexity of query answering in this setting.
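The toy Python sketch below illustrates, in a very loose way, what a lightweight rule applied on top of key-value records can look like: records are enriched by rules before a query is answered. The record shapes, the rule encoding and the helper names are hypothetical illustrations, not the formal data model or rule language studied in the paper.

```python
# Toy sketch of a lightweight rule over key-value records (JSON-like dicts).
# A rule "if a record has key 'directed_by' then it is of type Movie" is
# applied before answering a query asking for all records of type Movie.

records = [
    {"id": "r1", "title": "Metropolis", "directed_by": "Fritz Lang"},
    {"id": "r2", "name": "Fritz Lang", "born": 1890},
]

rules = [
    # (condition on a record, key/value pair to add when the condition holds)
    (lambda r: "directed_by" in r, ("type", "Movie")),
    (lambda r: "born" in r, ("type", "Person")),
]

def saturate(records, rules):
    """Apply every rule whose condition holds, adding the derived key/value."""
    for r in records:
        for cond, (key, value) in rules:
            if cond(r):
                r.setdefault(key, value)
    return records

def query_by_type(records, wanted):
    return [r["id"] for r in records if r.get("type") == wanted]

print(query_by_type(saturate(records, rules), "Movie"))   # ['r1']
```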
We introduce TopPI, a new semantics and algorithm designed to mine long-tailed datasets. For each item, and regardless of its frequency, TopPI finds the k most frequent closed itemsets that item belongs to. For example, in our retail dataset, TopPI finds the itemset "nori seaweed, wasabi, sushi rice, soy sauce", which occurs in only 133 store receipts out of 290 million. It also finds the itemset "milk, puff pastry", which appears 152,991 times. Thanks to a dynamic threshold adjustment and an adequate pruning strategy, TopPI efficiently traverses the relevant parts of the search space and can be parallelized on multi-cores. Our experiments on datasets with different characteristics show the high performance of TopPI and its superiority when compared to state-of-the-art mining algorithms. We show experimentally on real datasets that TopPI allows the analyst to explore and discover valuable itemsets.
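To make the TopPI semantics concrete, the following naive Python sketch computes, for every item of a toy dataset, the k most frequent closed itemsets containing it, by exhaustive enumeration. The real algorithm avoids this exhaustive search through dynamic threshold adjustment, pruning and multi-core parallelism; this sketch only illustrates the output semantics.

```python
# Naive sketch of the TopPI semantics on a toy dataset: for every item,
# return the k most frequent *closed* itemsets containing it.

from itertools import combinations
from collections import Counter, defaultdict

transactions = [
    {"milk", "puff pastry"},
    {"milk", "puff pastry", "butter"},
    {"nori", "wasabi", "rice", "soy sauce"},
    {"milk", "butter"},
]

def closed_itemsets(transactions):
    support = Counter()
    for t in transactions:
        for size in range(1, len(t) + 1):
            for itemset in combinations(sorted(t), size):
                support[itemset] += 1
    closed = {}
    for itemset, sup in support.items():
        # An itemset is closed if no strict superset has the same support.
        has_equal_superset = any(
            sup == other_sup and set(itemset) < set(other)
            for other, other_sup in support.items()
        )
        if not has_equal_superset:
            closed[itemset] = sup
    return closed

def top_per_item(closed, k=2):
    per_item = defaultdict(list)
    for itemset, sup in closed.items():
        for item in itemset:
            per_item[item].append((sup, itemset))
    return {item: sorted(lst, reverse=True)[:k] for item, lst in per_item.items()}

for item, best in top_per_item(closed_itemsets(transactions)).items():
    print(item, "->", best)
```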
In this work, we present a novel ontology describing the development of the human embryo through deforming 3D models. Beyond describing how organs and structures are composed, our ontology integrates a procedural description of their 3D representations, temporal deformations and relations over the course of their development. We also created inference rules to express complex connections between entities. The result is a unified description of both the knowledge of organ deformation and its 3D representations, enabling dynamic visualization of embryo deformation during the Carnegie stages. Through a simplified ontology, containing representative entities which are linked to spatial position and temporal process information, we illustrate the added value of such a declarative approach for interactive simulation and visualization of 3D embryos.
We present a novel semantics for extracting bounded-level modules from RDF ontologies and databases augmented with safe inference rules, à la Datalog. Dealing with a recursive rule language poses challenging issues for defining the module semantics, and also makes module extraction algorithmically unsolvable in some cases. Our results include a set of module extraction algorithms compliant with the novel semantics. Experimental results show that the resulting framework is effective in extracting expressive modules from RDF datasets with formal guarantees, whilst controlling their succinctness.
In this paper we model the problem of data linkage in Linked Data as a reasoning problem on possibly decentralized data. We describe a novel import-by-query algorithm that alternates steps of sub-query rewriting and of tailored querying of the Linked Data cloud in order to import data as specific as possible for inferring or contradicting given target same-as facts. Experiments conducted on a real-world dataset have demonstrated the feasibility of this approach and its usefulness in practice for data linkage and disambiguation.
First, we provide a new ontology of anatomy called My Corporis Fabrica (MyCF), which conforms to FMA but extends it by making explicit how anatomical structures are composed, how they contribute to functions, and also how they can be related to complex 3D objects. Second, we have equipped MyCF with automatic reasoning capabilities that enable model checking and complex query answering. We illustrate the added value of such a declarative approach for interactive simulation and visualization as well as for teaching applications.
The problem that we address in this paper is how to improve the accuracy of retrieving specialized information within a textual scientific corpus. We present a new approach in which the keywords expressing the bibliographical needs of a researcher are related to a domain ontology. We illustrate how such a declarative ontology-based approach can be used both for computing varied statistics and for helping experts find useful fine-grained information within a textual corpus.
The analysis of multimedia application traces can reveal important information to enhance program execution comprehension. However, typical traces can be gigabytes in size, which hinders their effective exploitation by application developers. In this paper, we study the problem of finding a set of sequences of events that allows a reduced-size rewriting of the original trace. These sequences of events, which we call blocks, can simplify the exploration of large execution traces by allowing application developers to see an abstraction instead of low-level events. The problem of computing such a set of blocks is NP-hard, and naive approaches lead to prohibitive running times that prevent analysing real-world traces. We propose a novel algorithm that directly mines the set of blocks. Our experiments show that our algorithm can analyse real traces of up to two hours of video. We also show experimentally the quality of the proposed set of blocks, and the interest of the rewriting for understanding actual trace data.
In this paper we introduce a mechanism of trust adapted to semantic peer-to-peer (P2P) networks in which every peer is free to organize its local resources as instances of classes of its own ontology. Peers use their ontologies to query other peers, and alignments between peers' ontologies make it possible to reformulate queries from one local peer's vocabulary to another. Alignments are typically the result of manual or (semi-)automatic ontology matching. However, resulting alignments may be unsound and/or incomplete, and, therefore, query reformulation based on alignments may lead to unsatisfactory answers. Trust can assist peers to select the peers in the network that are better suited to answer their queries. In our model, the trust that a peer has towards another peer depends on a specific query, and it represents the probability that the latter peer will provide a satisfactory answer to the query. In order to compute trust, we perform Bayesian inference that exploits ontologies, alignments and user feedback. We have implemented our method and conducted an evaluation. Experimental results show that trust values converge as more queries are sent and answers received. Furthermore, when query answering is guided by trust, the quality of peers' answers, measured with precision and recall, is improved.
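The sketch below illustrates only the Bayesian-update part of such a query-specific trust measure: trust is modelled as the posterior mean of a Beta distribution over the probability that a peer returns a satisfactory answer to a given query, updated from user feedback. This is a simplified Python sketch with invented identifiers; the actual model also exploits ontologies and alignments, which this sketch ignores.

```python
# Simplified sketch of query-specific trust as a Bayesian estimate: the
# posterior mean of a Beta distribution over the probability that a peer
# provides a satisfactory answer to a given query, given observed feedback.

from collections import defaultdict

class TrustModel:
    def __init__(self, alpha=1.0, beta=1.0):
        # Beta(alpha, beta) prior; (1, 1) is uniform over [0, 1].
        self.alpha, self.beta = alpha, beta
        self.feedback = defaultdict(lambda: [0, 0])  # (peer, query) -> [pos, neg]

    def record(self, peer, query, satisfactory):
        pos_neg = self.feedback[(peer, query)]
        pos_neg[0 if satisfactory else 1] += 1

    def trust(self, peer, query):
        pos, neg = self.feedback[(peer, query)]
        return (self.alpha + pos) / (self.alpha + self.beta + pos + neg)

model = TrustModel()
for answer_ok in (True, True, False, True):        # feedback on 4 answers
    model.record("peer42", "q:jazz-recordings", answer_ok)
print(round(model.trust("peer42", "q:jazz-recordings"), 3))  # about 0.667
```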
In this paper, we present ParaMiner, a generic and parallel algorithm for closed pattern mining. ParaMiner is built on the principles of pattern enumeration in strongly accessible set systems. Its efficiency is due to a novel dataset reduction technique (that we call EL-reduction), combined with novel techniques for performing dataset reduction in a parallel execution on a multi-core architecture. We illustrate ParaMiner's genericity by using this algorithm to solve three different pattern mining problems: the frequent itemset mining problem, the problem of mining frequent connected relational graphs, and the problem of mining gradual itemsets. In this paper, we prove the soundness and the completeness of ParaMiner. Furthermore, our experiments show that despite being a generic algorithm, ParaMiner can compete with specialized state-of-the-art algorithms designed for the pattern mining problems mentioned above. Besides, for the particular problem of gradual itemset mining, ParaMiner outperforms the state-of-the-art algorithm by two orders of magnitude.
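For readers unfamiliar with closed pattern mining, the toy, sequential Python sketch below enumerates closed frequent itemsets using the standard closure operator (the intersection of all transactions containing a pattern). It is only meant to convey the basic notion; ParaMiner itself is generic over strongly accessible set systems, runs in parallel and relies on EL-reduction, none of which is reproduced here.

```python
# Toy, sequential sketch of closed frequent itemset mining via the closure
# operator: closure(X) = intersection of all transactions containing X.

transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c", "d"},
    {"b", "c"},
]

def support_set(pattern, transactions):
    return [t for t in transactions if pattern <= t]

def closure(pattern, transactions):
    """Largest itemset contained in every transaction that contains `pattern`."""
    supporting = support_set(pattern, transactions)
    result = set(supporting[0])
    for t in supporting[1:]:
        result &= t
    return frozenset(result)

def closed_frequent(transactions, min_support=2):
    items = set().union(*transactions)
    found, todo = set(), [frozenset()]
    while todo:
        pattern = todo.pop()
        for item in items - pattern:
            candidate = pattern | {item}
            if len(support_set(candidate, transactions)) < min_support:
                continue                      # infrequent extension, prune
            closed = closure(candidate, transactions)
            if closed not in found:
                found.add(closed)
                todo.append(closed)
    return found

for pattern in sorted(closed_frequent(transactions), key=sorted):
    print(sorted(pattern), "support =", len(support_set(pattern, transactions)))
```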
The current trend for building an ontology-based data management system (DMS) is to capitalize on efforts made to design a preexisting well-established DMS (a reference system). The method amounts to extracting from the reference DMS a piece of schema relevant to the new application needs, called a module, possibly personalizing it with extra constraints specific to the application under construction, and then managing a dataset using the resulting schema. In this paper, we extend the existing definitions of modules and we introduce novel properties of robustness that provide means for checking easily that a robust module-based DMS evolves safely w.r.t. both the schema and the data of the reference DMS. We carry out our investigations in the setting of description logics which underlie modern ontology languages, like RDFS, OWL, and OWL2 from W3C. Notably, we focus on the DL-LiteA dialect of the DL-Lite family, which encompasses the foundations of the QL profile of OWL2 (i.e., DL-LiteR): the W3C recommendation for efficiently managing large datasets.
The ambition of this book is to cover the many facets of distributed data management on the Web. We will explain the foundations of the Web standard for data management, XML. We will travel through logical countries (e.g., description logic) that provide foundations for the Semantic Web emerging in modern data integration applications. We will show the beauty of software tools that everyone is already using today, for example Web search engines. And finally, we will explain the impressive machinery used nowadays to manipulate amounts of data of unprecedented size. We are witnessing the emergence of a new, global information system created, explored, and shared by the whole of humankind. The book aims at exposing the recent achievements that help make this system usable.
In a semantic P2P network, peers use separate ontologies and rely on alignments between their ontologies for translating queries. Nonetheless, alignments may be limited (unsound or incomplete) and generate flawed translations, leading to unsatisfactory answers. In this paper we present a trust mechanism that can assist peers to select those in the network that are better suited to answer their queries. The trust that a peer has towards another peer depends on a specific query and represents the probability that the latter peer will provide a satisfactory answer. In order to compute trust, we exploit both alignments and peers' direct experience, and perform Bayesian inference. We have implemented our technique and conducted an evaluation. Experimental results showed that trust values converge as more queries are sent and answers received. Furthermore, the use of trust improves both precision and recall.
In this paper, we investigate a principled approach for defining and discovering probabilistic mappings between two taxonomies. First, we compare two ways of modeling probabilistic mappings which are compatible with the logical constraints declared in each taxonomy. Then we describe a generate-and-test algorithm which minimizes the number of calls to the probability estimator for determining those mappings whose probability exceeds a certain threshold. Finally, we provide an experimental analysis of this approach.
This paper provides a decentralized data model and associated algorithms for peer data management systems (PDMS) based on the DL-LiteR description logic. Our approach relies on reducing query reformulation and consistency checking for DL-LiteR into reasoning in propositional logic. This enables a straightforward deployment of DL-LiteR PDMSs on top of SomeWhere, a scalable propositional peer-to-peer inference system. We also show how to use the state-of-the-art MiniCon algorithm for rewriting queries using views in DL-LiteR in both the centralized and decentralized cases.
The reference reconciliation problem consists in deciding whether different identifiers refer to the same data, i.e., correspond to the same real-world entity. In this article we present a reference reconciliation approach which combines a logical method for reference reconciliation called L2R and a numerical one called N2R. This approach exploits the schema and data semantics, which is translated into a set of Horn FOL reconciliation rules. These rules are used in L2R to infer exact decisions of both reconciliation and non-reconciliation. In the second method, N2R, the semantics of the schema is translated into an informed similarity measure used for the numerical computation of the similarity of reference pairs. This similarity measure is expressed in a non-linear equation system, which is solved using an iterative method. Experiments with the methods on two different domains show good results for both recall and precision. The two methods can be used separately or in combination, and we have shown that their combination improves runtime performance.
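The sketch below gives a rough idea of the iterative, fixed-point resolution of such a similarity equation system: the similarity of a pair of references combines a fixed literal-based similarity with the maximum similarity of the pairs it depends on. The weights, the toy dependency graph and the combination function are illustrative assumptions, not the actual N2R equations derived from the schema semantics.

```python
# Rough sketch of an iterative fixed-point computation for a non-linear
# similarity equation system, in the spirit of N2R.

def solve(pairs, literal_sim, depends_on, weight=0.5, eps=1e-6, max_iter=100):
    sim = {p: literal_sim[p] for p in pairs}
    for _ in range(max_iter):
        new_sim = {}
        for p in pairs:
            neighbours = depends_on.get(p, [])
            # Propagate the best similarity among related pairs; isolated
            # pairs simply keep their literal similarity.
            propagated = max((sim[q] for q in neighbours), default=literal_sim[p])
            new_sim[p] = (1 - weight) * literal_sim[p] + weight * propagated
        if max(abs(new_sim[p] - sim[p]) for p in pairs) < eps:
            return new_sim
        sim = new_sim
    return sim

pairs = ["(museum1, museum2)", "(city1, city2)"]
literal_sim = {"(museum1, museum2)": 0.6, "(city1, city2)": 0.9}
# The museum pair depends on the similarity of the cities they are located in.
depends_on = {"(museum1, museum2)": ["(city1, city2)"]}

for pair, value in solve(pairs, literal_sim, depends_on).items():
    print(pair, "->", round(value, 3))
```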
In this paper we present a new tree mining algorithm, DryadeParent, based on the hooking principle first introduced in Dryade. In the experiments, we demonstrate that the branching factor and depth of the frequent patterns to find are key factors of complexity for tree mining algorithms. We show that DryadeParent outperforms the current fastest algorithm, CMTreeMiner, by orders of magnitude on data sets where the frequent tree patterns have a high branching factor.
Semantic peer-to-peer (P2P) systems are fully decentralized overlay networks of people or machines (called peers) sharing and searching varied resources (documents, videos, photos, data, services) based on their semantic annotations using ontologies. They provide a support for the emergence of open and decentralized electronic social networks, in which no central or external authority can control the reliability of the peers participating in the network. This lack of control may however cause some of the results provided by some peers to be unsatisfactory, because of inadequate or obsolete annotations. In this paper, we propose a probabilistic model to handle trust in a P2P setting. Our model supports a local computation of the trust of peers in classes of other peers. We claim that our model is well suited to the dynamics of P2P networks and to the freedom of each peer within the network to have different viewpoints towards the peers with whom they interact.
The reference reconciliation problem consists in deciding whether different identifiers refer to the same data, i.e., correspond to the same real-world entity. The L2R system exploits the semantics of a rich data model, which extends RDFS by a fragment of OWL-DL and SWRL rules. In L2R, the semantics of the schema is translated into a set of logical reconciliation rules, which are then used to infer correct decisions of both reconciliation and non-reconciliation. In contrast with other approaches, the L2R method has a precision of 100% by construction. First experiments show promising results for recall, and most importantly significant increases when rules are added.
The Semantic Web envisions a world-wide distributed architecture where computational resources will easily inter-operate to coordinate complex tasks such as query answering. Semantic marking up of web resources using ontologies is expected to provide the necessary glue for making this vision work. Using ontology languages, (communities of) users will build their own ontologies in order to describe their own data. Adding semantic mappings between those ontologies, in order to semantically relate the data to share, gives rise to the Semantic Web: data on the web that are annotated by ontologies networked together by mappings. In this vision, the Semantic Web is a huge semantic peer data management system. In this paper, we describe the SomeRDFS peer data management systems that promote a "simple is beautiful" vision of the Semantic Web based on data annotated by RDFS ontologies.
XML documents are becoming ubiquitous because of their rich and flexible format that can be used for a variety of applications. Given the increasing size of XML collections as information sources, mining techniques that traditionally exist for text collections or databases need to be adapted, and new methods invented, to exploit the particular structure of XML documents. Basically, XML documents can be seen as trees, which are well known to be complex structures. This chapter describes various ways of using and simplifying this tree structure to model documents and support efficient mining algorithms. We focus on three mining tasks: classification and clustering, which are standard for text collections, and discovery of frequent tree structures, which is especially important for heterogeneous collections. This chapter presents some recent approaches and algorithms to support these tasks together with experimental evaluation on a variety of large XML collections.
In this invited talk, we present the SomeWhere approach and infrastructure for building semantic peer-to-peer data management systems based on simple personalized ontologies distributed at a large scale. SomeWhere is based on a simple class-based data model in which the data is a set of resource identifiers (e.g., URIs), the schemas are (simple) definitions of classes possibly constrained by inclusion, disjunction or equivalence statements, and mappings are inclusion, disjunction or equivalence statements between classes of different peer ontologies. In this setting, query answering over peers can be done by distributed query rewriting, which can be equivalently reduced to distributed consequence finding in propositional logic. It is done by using the message-passing distributed algorithm that we have implemented for consequence finding of a clause w.r.t. a set of distributed propositional theories. We summarize its main properties (soundness, completeness and termination), and we report experiments showing that it already scales up to a thousand peers. Finally, we mention ongoing work on extending the current data model to RDF(S) and on handling possible inconsistencies between the ontologies of different peers.
In a peer-to-peer inference system, there is no centralized control or hierarchical organization: each peer is equivalent in functionality and cooperates with other peers in order to solve a collective reasoning task. Since peer theories model possibly different viewpoints, even if each local theory is consistent, the global theory may be inconsistent. We exhibit a distributed algorithm detecting inconsistencies in a fully decentralized setting. We provide a fully distributed reasoning algorithm, which computes only well-founded consequences of a formula, i.e., with a consistent set of support.
This paper introduces a compact representation which helps to avoid the exponential blow-up in space of the Least Common Subsumer (lcs) of two ALE-concept descriptions. Based on the compact representation, we define a space of specific graphs which represents all ALE-concept descriptions including the lcs. Next, we propose an algorithm exponential in time and polynomial in space for deciding subsumption between concept descriptions represented by graphs in this space. These results provide a better understanding of the double exponential blow-up of the approximation of ALC-concept descriptions by ALE-concept descriptions: the double exponential size of the approximation in the ordinary representation is unavoidable in the worst case.
In a peer-to-peer inference system, each peer can reason locally but can also solicit some of its acquaintances, which are peers sharing part of its vocabulary. In this paper, we consider peer-to-peer inference systems in which the local theory of each peer is a set of propositional clauses defined upon a local vocabulary. An important characteristic of peer-to-peer inference systems is that the global theory (the union of all peer theories) is not known (as opposed to partition-based reasoning systems). The main contribution of this paper is to provide the first consequence finding algorithm in a peer-to-peer setting: it is anytime and computes consequences gradually from the solicited peer to peers that are more and more distant. We exhibit a sufficient condition on the acquaintance graph of the peer-to-peer inference system for guaranteeing the completeness of this algorithm. Another important contribution is to apply this general distributed reasoning setting to the setting of the Semantic Web through the SomeWhere semantic peer-to-peer data management system. The last contribution of this paper is to provide an experimental analysis of the scalability of the peer-to-peer infrastructure that we propose, on large networks of 1000 peers.
In this paper, we describe the SomeWhere semantic peer-to-peer data management system that promotes a "small is beautiful" vision of the Semantic Web based on simple personalized ontologies (e.g., taxonomies of classes) that are distributed at a large scale. In this vision of the Semantic Web, no user imposes their own ontology on others. Logical mappings between ontologies make possible the creation of a web of people in which personalized semantic marking up of data cohabits nicely with a collaborative exchange of data. In this view, the Web is a huge peer-to-peer data management system based on simple distributed ontologies and mappings.
In peer-to-peer inference systems, each peer can reason locally but also solicit some of its acquaintances, sharing part of its vocabulary. This paper studies both theoretically and experimentally the problem of computing proper prime implicates for propositional peer-to-peer systems, the global theory (union of all peer theories) of which is not known (as opposed to partition-based reasoning).
In this paper, we present a new tree mining algorithm, DRYADEPARENT, based on the hooking principle first introduced in DRYADE. In the experiments, we demonstrate that the branching factor and depth of the frequent patterns to find are key factors of complexity for tree mining algorithms. We show that DRYADEPARENT outperforms the current fastest algorithm, CMTreeMiner, by orders of magnitude on datasets where the frequent patterns have a high branching factor.
In this paper we present a method for integrating and querying XML data in a relational setting. Though this method is generic, it has been motivated and validated by a knowledge management application in Microbiology: the e.dot project.
In 1984, Peter Patel-Schneider published a paper entitled "Small can be Beautiful in Knowledge Representation" in which he advocated for limiting the expressive power of knowledge representation formalisms in order to guarantee good computational properties and thus make knowledge-based systems usable as part of larger systems. In this paper, I aim at showing that the same argument holds for the Semantic Web: if we want to give a chance for the Semantic Web to scale up and to be broadly used, we have to limit the expressive power of the ontologies serving as semantic marking up of resources. In addition, due to the scale of the Web and the disparity of its users, it is unavoidable to have to deal with distributed heterogeneous ontologies. In this paper, I will argue that a peer-to-peer infrastructure enriched with simple distributed ontologies (e.g., taxonomies) reconciled through mappings is appropriate and scalable for supporting the future Semantic Web.
In this paper, we present a novel algorithm for discovering tree patterns in a tree database. This algorithm uses a relaxed tree inclusion definition, which makes the problem complex (checking tree inclusion is NP-complete) but allows mining highly heterogeneous databases. To obtain good performance, our DRYADE algorithm discovers only closed frequent tree patterns.
In a peer-to-peer inference system, each peer can reason locally but can also solicit some of its acquaintances, which are peers sharing part of its vocabulary. In this paper, we consider peer-to-peer inference systems in which the local theory of each peer is a set of propositional clauses defined upon a local vocabulary. An important characteristic of peer-to-peer inference systems is that the global theory (the union of all peer theories) is not known (as opposed to partition-based reasoning systems). The contribution of this paper is twofold. We provide the first consequence finding algorithm in a peer-to-peer setting: it is incremental, anytime and computes consequences gradually from the solicited peer to peers that are more and more distant. We exhibit a sufficient condition on the acquaintance graph of the peer-to-peer inference system for guaranteeing the completeness of this algorithm. We report first promising experimental results.
Extensible Markup Language (XML) is playing an increasingly important role in the exchange of a wide variety of data on the Web and elsewhere. It is a simple, very flexible text format used to annotate data by means of markup. XML documents can be checked for syntactic well-formedness and semantic coherence through DTD and schema validation, which makes their processing easier. In particular, data with nested structure can be easily represented with embedded tags. This structured representation should be used in information retrieval models that take structure into account. As such, it is meta-data and therefore a contribution to the Semantic Web. However, nowadays there exist huge quantities of raw text, and the issue is how to find an easy way to provide these texts with a sensible XML structure. Here we present an automatic method to extract tree structure from raw texts. This work has been supported by Paris XI University (BQR2002 project).
In this paper, we investigate a first step towards the long-term vision of the Semantic Web by studying the problem of answering queries posed through a mediated ontology to multiple information sources whose content is described as views over the ontology relations. The contributions of this paper are twofold. We first offer a uniform logical setting which allows us to encompass and to relate the existing work on answering and rewriting queries using views. In particular, we make clearer the connection between the problem of rewriting queries using views and the problem of answering queries using extensions of views. Then we focus on an instance of the problem of rewriting conjunctive queries using views through an ontology expressed in a description logic, for which we exhibit a complete algorithm.
This article presents a simple but scalable framework for peer-to-peer data-sharing systems, in which the problem of answering queries over a network of semantically related peers is always decidable. The approach is characterized by a simple class-based language for defining peer schemas as hierarchies of atomic classes, and mappings as inclusions of logical combinations of atomic classes. We provide an anytime and incremental method for computing all the answers to a query posed to a given peer such that the answers are ordered from ones involving peers close to the queried peer to those involving more distant peers.
An information integration system provides a uniform query interface to a collection of distributed and heterogeneous information sources, giving users or other agents the illusion that they interrogate a centralized and homogeneous information system. In this paper, we focus on the use of knowledge representation techniques for building mediators for information integration. A mediator is based on the specification of a single mediated schema describing a domain of interest, and on a set of source descriptions expressing how the content of each source available to the system is related to the domain of interest. These source descriptions, also called mappings because they model the correspondence between the mediated schema and the schemas of the data sources, play a central role in the query answering process. We present two recent information integration systems, namely Picsel and Xyleme, which are illustrative of two radically different choices concerning the expressivity of the mediated schema.
Xyleme is a huge warehouse integrating XML data of the Web. Xyleme considers a simple data model with data trees and tree types for describing the data sources, and a simple query language based on tree queries with boolean conditions. The main components of the data model are a mediated schema modeled by an abstract tree type, as a view of a set of tree types associated with actual data trees, called concrete tree types, and a mapping expressing the connection between the mediated schema and the concrete tree types. The first contribution of this paper is formal: we provide a declarative model-theoretic semantics for Xyleme tree queries, a way of checking tree query containment, and a characterization of tree queries as a composition of branch queries. The other contributions are algorithmic and handle the potentially huge size of the mapping relation which is a crucial issue for semantic integration and query evaluation in Xyleme. First, we propose a method for pre-evaluating queries at compile time by storing some specific meta-information about the mapping into map translation tables. These map translation tables summarize the set of all the branch queries that can be generated from the mediated schema and the set of all the mappings. Then, we propose different operators and strategies for relaxing queries which, having an empty map translation table, will have no answer if they are evaluated against the data. Finally, we present a method for semi-automatically generating the mapping relation.
In this paper, we consider the problem of searching for frequent trees in a collection of tree-structured data modeling XML data. The TreeFinder algorithm aims at finding trees such that their exact or perturbed copies are frequent in a collection of labelled trees. To cope with complexity issues, TreeFinder is correct but not complete: it finds a subset of the actually frequent trees. The lack of completeness is experimentally investigated on artificial medium-size datasets; it is shown that TreeFinder reaches completeness or falls short of it for a range of experimental settings.
This paper deals with the representation of multi-valued data by clustering them into a small number of classes organized in a hierarchy and described at an appropriate level of abstraction. First, we investigate a partial order, namely nesting, relating Galois lattices. Second, we investigate the intensional and extensional aspects of the languages used in our Zoom system. Finally, the nesting order between the Galois lattices corresponding to these various languages is exploited in the interactive system Zoom.
In this paper, we characterize the logical correspondence between conjunctive queries and concept descriptions. We exhibit a necessary and sufficient condition for the compilation of a conjunctive query into an equivalent ALE concept description. We provide a necessary and sufficient condition for the approximation of conjunctive queries by maximally subsumed ALN concept descriptions.
The current development of the Web and the generalization of XML technology provide a major opportunity to radically change the management of distributed data. We have developed a prototype of a dynamic warehouse for XML data, namely Xyleme. In the present paper, we briefly present some motivations and important aspects of the work performed in the framework of the Xyleme project.
The main issue for a Web ontology language is less the choice of a language for describing ontologies than the choice of a language for representing mappings between ontologies.
An information integration agent provides a uniform query interface to a collection of distributed and heterogeneous information sources, giving users or other agents the illusion that they interrogate a centralized and homogeneous information system. In this chapter we focus on information integration agents that follow a mediator approach. A mediator is based on the specification of a single mediated schema describing a domain of interest, and on a set of source descriptions expressing how the content of each source available to the system is related to the domain of interest. These source descriptions, also called mappings because they model the correspondence between the mediated schema and the schemas of the data sources, play a central role in the query answering process. We present two recent information integration agents, namely Picsel and Xyleme, which are illustrative of two radically different choices concerning the expressivity of the mediated schema.
This paper aims at extracting common tree structures from heterogeneous XML data. The approach extends frequent itemset techniques to a tree structure representation. We provide an algorithm which extracts all subtrees embedded in the data above a given frequency threshold. As a result, this technique provides both a clustering of the data and a symbolic characterization of each cluster. This characterization is made of the maximally specific subtrees embedded in all the data trees of each cluster. The scalability of the algorithm is investigated on artificial medium-sized data.
The growing amount of data accessible through networks (intranets, the Internet, etc.) raises the problem of integrating pre-existing, often remote and heterogeneous information sources in order to make them easier to query for a broad audience. One of the first approaches proposed for information integration advocates the construction of mediators. A mediator acts as a query interface between a user and data sources. It gives users the illusion of querying a homogeneous and centralized system, sparing them from having to find the data sources relevant to their query, query them one by one, and combine the resulting information themselves. The goal of this article is to present the PICSEL project, which offers a declarative environment for building mediators. PICSEL differs from existing information integration systems in that the mediator schema can be expressed in a language (Carin) that combines the expressive power of a rule-based formalism and a class-based formalism (the description logic ALN). PICSEL includes a query refinement module, the first building block of a cooperative dialogue module between a mediator and its users.
We are interested in defining and querying a large and highly heterogeneous database repository for XML data and their DTDs. We consider a very simple data model with data trees and tree types for describing the data sources, and a very simple query language based on tree queries with boolean conditions. The main components of the data model are a mediated schema, as a view of a set of concrete tree types, and a mapping expressing the connection between the mediated schema and the concrete tree types. We provide a general tree query semantics and a characterization of tree queries as a composition of branch queries. We also show that we can pre-evaluate queries by storing some specific meta-information related to the mapping. We also consider the problem of query containment for tree queries and we propose some relaxation policies for queries without answers. The research presented here was motivated by the Xyleme project at INRIA, whose objective is to develop a data warehouse for Web XML documents.
In many applications, it becomes crucial to help users access a huge amount of data by clustering it into a small number of classes described at an appropriate level of abstraction. In this paper, we present an approach based on the use of two class description languages for the automatic clustering of multi-valued data. The first language of classes has a high power of abstraction and guides the construction of a lattice of classes covering the whole set of the data. The second language of classes, more expressive and more precise, is the basis for the refinement of the part of the lattice that the user wants to focus on.
A new approach for constructing pseudo-keywords, referred to as Sense Units, is proposed. Sense Units are obtained by a word clustering process, where the underlying similarity reflects both statistical and semantic properties, respectively detected through Latent Semantic Analysis and WordNet. Sense Units are used to recode documents and are evaluated from the performance increase they permit in classification tasks. Experimental results show that accounting for semantic information in fact decreases performance compared to standalone LSI. The main weaknesses of the current hybrid scheme are discussed and several directions for improvement are sketched.
In many applications, it becomes crucial to help users access a huge amount of data by clustering it into a small number of classes described at an appropriate level of abstraction. In this paper, we present an approach based on the use of two class description languages for the automatic clustering of semistructured data. The first language of classes has a high power of abstraction and guides the construction of a basic lattice of classes covering the whole set of the data. The second language of classes, more expressive and more precise, is the basis for the refinement of the part of the lattice that the user wants to focus on.
Picsel is an information integration system over sources that are distributed and possibly heterogeneous. The approach which has been chosen in Picsel is to define an information server as a knowledge-based mediator in which Carin is used as the core logical formalism to represent both the domain of application and the contents of information sources relevant to that domain. In this paper, we describe the way the expressive power of the Carin language is exploited in the Picsel information integration system, while maintaining the decidability of query answering. We illustrate it on examples coming from the tourism domain, which is the first real case that we have considered in Picsel, in collaboration with France Telecom (CNET) and the travel agency Degriftour.
In databases, rewriting queries using views has received significant attention because of its relevance to several fields such as query optimization, data warehousing, and information integration. In those settings, the data used to answer a query are restricted to be extensions of a set of predefined queries (views). The information integration context is typical of the need to rewrite queries using views in order to answer queries. Users of information integration systems do not pose queries directly to the (possibly remote) sources in which data are stored but to a set of virtual relations that have been designed to provide a uniform and homogeneous access to a domain of interest. When the contents of the sources are described as views over the (virtual) domain relations, the problem of reformulating a user query into a query that refers directly to the relevant sources becomes a problem of rewriting queries using views. In this paper, we study the problem of rewriting conjunctive queries over DL expressions into conjunctive queries using a set of views that are a set of distinguished DL expressions, for three DLs allowing existential restrictions: FLE, ALE and ALEN. For FLE, we present an algorithm that computes a representative set of all the rewritings of a query. In the full version of this paper (cf. the technical report), we show how to adapt it to deal with the negation of atomic concepts in the queries and in the views, in order to obtain a rewriting algorithm for ALE. Finally, we show that obtaining a finite set representative of all the rewritings of a query is not guaranteed in ALEN.
We consider the set of sentences in a decidable fragment of first order logic, restricted to unary and binary predicates, expressed on a finite set of objects. Probability distributions on that set are defined and studied. For large sets of objects, it is shown that under these distributions, random sentences typically have a very particular form, and that all monotone symmetric properties of these random sentences have a probability close to 0 or 1.
In this paper, we address the problem of determining the expansions of a query in description logics possibly combined with Datalog rules, and we show its use for information integration. Expanding a query consists of determining all the ways of deriving it from atoms built on some distinguished predicates.
In this paper, we address the problem of answering (unions of) conjunctive queries on Aboxes. We propose a backward-chaining query evaluation based on query expansion. Expanding a query consists of determining all the ways of deriving it from atoms built on some distinguished predicates.
We describe CARIN, a novel family of representation languages that combine the expressive power of Horn rules and of description logics. We address the issue of providing sound and complete inference procedures for such languages. We identify existential entailment as a core problem in reasoning in CARIN, and describe an existential entailment algorithm for the ALCNR description logic. As a result, we obtain a sound and complete algorithm for reasoning in non-recursive CARIN-ALCNR knowledge bases, and an algorithm for rule subsumption over ALCNR. We show that in general, the reasoning problem for recursive CARIN-ALCNR knowledge bases is undecidable, and identify the constructors of ALCNR causing the undecidability. We show two ways in which CARIN-ALCNR knowledge bases can be restricted while obtaining sound and complete reasoning.
This paper is an extended version of two papers that were earlier published in the conferences AAAI'96 and ECAI'96.
PICSEL is an information integration system over sources that are distributed and possibly heterogeneous. The approach which has been chosen in PICSEL is to define an information server as a knowledge-based mediator in which CARIN is used as the core logical formalism to represent both the domain of application and the contents of information sources relevant to that domain. The goal of this paper is to point out the way we exploit the expressive power of CARIN so as to avoid the general problem of query reformulation, by mixing the Global-as-View and Local-as-View approaches. We illustrate it on examples coming from the tourism domain, which is the first real case that we have considered in PICSEL, in collaboration with the Web (and Minitel) travel agent Degriftour (http://www.degriftour.fr/).
Building complex knowledge based applications requires encoding large amounts of domain knowledge. After acquiring knowledge from domain experts, much of the effort in building a knowledge base goes into verifying that the knowledge is encoded correctly. A knowledge base is verified if it can be shown that certain constraints always hold between the inputs and the outputs. We consider the knowledge base verification problem for Horn rule knowledge bases and for three kinds of constraints: I/O consistency constraints, I/O dependency constraints and Input completeness constraints. For the first two cases, we establish tight complexity results on the problem, and show in what cases it is decidable. In the third case, we show that the problem is, in general, undecidable, and we identify two decidable cases. In our analysis we show how the properties of the problem vary depending on the presence of recursion in the Horn rules, the presence of interpreted predicates, and the presence of negation in the antecedents of the rules. Our approach to the verification problem is based on showing a close relationship to the problem of query containment, studied in the database literature. This connection also provides novel algorithms for the knowledge base verification problem. Finally, we provide the first algorithm for verifying hybrid knowledge bases that combine the expressive power of Horn rules and the description logic ALCNR.
This paper is an extended version of a paper that was earlier published in the conference AAAI'96 and that obtained a best paper award.
This paper investigates the issues raised by checking potential update anomalies of a web site. It is of course impossible to control the updates for the whole World Wide Web. However, it is conceivable to provide tools for pointing out potential missing updates for a given web site (i.e., a set of Web pages, for instance related to a given matter or to a given institution, which is under the control of a web site manager). The assumption that we make in this paper is that modeling the semantics of the considered Web site is possible, while modeling the semantics of the sources outside of it, which can be connected to it, is not possible. Therefore, two different update problems have to be distinguished, depending on whether the updates that have to be done inside the web site are triggered by changes occurring inside or outside the web site.
Notions of correctness and completeness of a KB are impossible to capture completely by a formal definition. However, when the knowledge base is represented in a declarative logical formalism, they can be approached by a logical analysis of its contents. A logical analysis of the knowledge base and constraints that are known to hold on the domain enables us to detect anomalies or discrepancies between the knowledge represented in the KB and the domain. This paper describes a unified logical framework for the verification problem of knowledge bases represented by logical rules (i.e., Horn rules and some extensions). We consider several instances of the verification problem, describe algorithms for verification, and establish the computational complexity of the verification problem. In particular, we consider the verification w.r.t. consistency constraints that are specified separately on the inputs and on the outputs. Next, we consider dependency constraints on the relations between inputs and outputs. Finally, we consider several extensions to our logical formalism that enable us to express richer classes of constraints. The extensions involve the use of negation and the use of constructors of description logics in the constraints.
Building a correct knowledge base (KB) is an incremental and difficult process, made of successive verification and revision steps. Verification steps aim at ensuring that a KB is correct. Revision steps aim at modifying the knowledge base in order to eliminate the anomalies that the verification steps have pointed out. In this paper, we propose a method for the revision of an incorrect rule base, based on characterizing the possible causes of the detected inconsistencies. The possible actions of revision that are considered are adding integrity constraints, deleting rules, and more interestingly refining existing rules by adding new conjuncts.
The problem of rewriting queries using views is to find a query expression that uses only a set of views and is equivalent to (or maximally contained in) a given query. Rewriting queries using views is important for query optimization and for applications such as information integration and data warehousing. Description logics are a family of logics that were developed for modeling complex hierarchical structures, and can also be viewed as a query language with an interesting tradeoff between complexity and expressive power. We consider the problem of rewriting queries using views expressed in description logics and conjunctive queries over description logics. We show that if the view definitions do not contain existential variables, then it is always possible to find a rewriting that is a union of conjunctive queries, and furthermore, this rewriting produces the maximal set of answers possible from the views. If the views have existential variables, the rewriting may be recursive. We present an algorithm for producing a recursive rewriting that is guaranteed to be maximal when the underlying database forms a tree of constants. We show that in general, it is not always possible to find a maximal rewriting.
We describe CARIN, a novel family of representation languages, which integrate the expressive power of Horn rules and of description logics. We address the key issue in designing such a language, namely, providing a sound and complete inference procedure. We identify existential entailment as a core problem in reasoning in CARIN, and describe an existential entailment algorithm for CARIN languages whose description logic component is ALCNR. This algorithm entails several important results for reasoning in CARIN, most notably: (1) a sound and complete inference procedure for non recursive CARIN-ALCNR, and (2) an algorithm for determining rule subsumption over ALCNR.
Description logics are good candidates for representing complex objects. Their expressive and computational properties have been extensively studied and are well understood. However, from a knowledge engineering perspective, the current reasoning services provided by description logic systems do not suffice for representing objects. Adding services (e.g., help for debugging) is necessary to make description logic systems useful tools for building object-centered knowledge bases. In this paper, we present a principled glass-box approach both for trace-based debugging and for supporting transparent modeling of whole-part relationships, using description logics. This approach has proven very useful in a real application of modeling components of power plants.
Building complex knowledge based applications requires encoding large amounts of domain knowledge. After acquiring knowledge from domain experts, much of the effort in building a knowledge base goes into verifying that the knowledge is encoded correctly. We consider the problem of verifying hybrid knowledge bases that contain both Horn rules and a terminology in a description logic. Our approach to the verification problem is based on showing a close relationship to the problem of query containment. Our first contribution, based on this relationship, is presenting a thorough analysis of the decidability and complexity of the verification problem, for knowledge bases containing recursive rules and the standard interpreted predicates. Second, we show that important new classes of constraints on correct inputs and outputs can be expressed in a hybrid setting, in which a description logic class hierarchy is also considered, and we present the first complete algorithm for verifying such hybrid knowledge bases.
Horn rule languages have formed the basis for many Artificial Intelligence application languages, but are not expressive enough to model domains with a rich hierarchical structure. Description logics have been designed especially to model rich hierarchies. Several applications would significantly benefit from combining the expressive power of both formalisms. This paper focuses on combining recursive function-free Horn rules with the expressive description logic ALCNR, and shows exactly when a hybrid language with decidable inference can be obtained. First, we show that several of the core constructors of description logics lead by themselves to undecidability of inference when combined with recursive function-free Horn rules. We then show that without these constructors we obtain a maximal subset of ALCNR that yields a decidable hybrid language. Finally, we describe a restriction on the Horn rules that guarantees decidable inference when combined with all of ALCNR, and covers many of the common usages of recursive rules.