Chantal Reynaud, Brigitte Safar-
IEEE Transactions on Knowledge and Data Engineering 2009
To identify mappings between the concepts of two ontologies, much recent work focuses on the use of complementary knowledge, called "background" or support knowledge, represented in the form of a third ontology. Their common objective is to complement classical matching techniques, which exploit the structure or the richness of the ontology representation language and which no longer apply when the ontologies to be matched are weakly structured or limited to simple taxonomies. This article has two parts. The first presents a study of various works using background knowledge, starting with their common general schema, followed by an analysis of the works according to the type of background knowledge used. The second part is devoted to the TaxoMap alignment system. We present the system and its context of use. We then describe the use of WordNet as background knowledge, along with the experimental results obtained.
Veronique Ventos, Raul Cordovil, David Forge, Thomas Zaslavsky-
IEEE Transactions on Knowledge and Data Engineering 2009 September - 16(1) (R121):1-31
A gain graph is a graph whose edges are labelled invertibly by
gains from a group. Switching is a transformation of gain graphs
that generalizes conjugation in a group. A weak chromatic function
of gain graphs with gains in a fixed group satisfies three laws:
deletion-contraction for links with neutral gain, invariance under
switching, and nullity on graphs with a neutral loop. The laws are
analogous to those of the chromatic polynomial of an ordinary graph,
though they are different from those usually assumed of gain graphs
or matroids. The three laws lead to the weak chromatic group of gain
graphs, which is the universal domain for weak chromatic functions. We
find expressions, valid in that group, for a gain graph in terms of
minors without neutral-gain edges, or with added complete neutral-gain
subgraphs, that generalize the expression of an ordinary chromatic
polynomial in terms of monomials or falling factorials. These
expressions imply relations for all switching-invariant functions
of gain graphs, such as chromatic polynomials, that satisfy the
deletion-contraction identity for neutral links and are zero on graphs
with neutral loops. Examples are the total chromatic polynomial of
any gain graph, including its specialization the zero-free chromatic
polynomial, and the integral and modular chromatic functions of an
integral gain graph.
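In symbols, the three laws read as follows (a compact restatement in notation assumed here, not taken verbatim from the paper: Φ is a gain graph, e a link with neutral gain, Φ^ζ the result of switching Φ by ζ):

```latex
F(\Phi) = F(\Phi \setminus e) - F(\Phi / e) \quad \text{(deletion-contraction for a neutral link } e\text{)}
F(\Phi^{\zeta}) = F(\Phi) \quad \text{(invariance under switching)}
F(\Phi) = 0 \quad \text{(nullity when } \Phi \text{ has a neutral loop)}
```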
We apply our relations to some special integral gain graphs including
those that correspond to the Shi, Linial, and Catalan arrangements,
thereby obtaining new evaluations of and new ways to calculate the
zero-free chromatic polynomial and the integral and modular chromatic
functions of these gain graphs, hence the characteristic polynomials
and hypercubical lattice-point counting functions of the arrangements.
The proof involves gain graphs between the Catalan and Shi graphs
whose polynomials are expressed in terms of descending-path vertex
partitions of the graph of (-1)-gain edges.
We also calculate the total chromatic polynomial of any gain graph
and especially of the Catalan, Shi, and Linial gain graphs.
Nathalie Pernelle, Marie-Christine Rousset, Fatiha Sais-
IEEE Transactions on Knowledge and Data Engineering 2009 - 12:66-94
Veronique Ventos, Henry Soldano, Dominique Bouthinon-
IEEE Transactions on Knowledge and Data Engineering 2009 July - 5632:465-478
Chantal Reynaud and Brigitte Safar-
IEEE Transactions on Knowledge and Data Engineering 2009 - 28(2)
The work described in this paper concerns the integration of heterogeneous XML information sources into an information server, following a hybrid approach combining mediation and data warehousing. This server has a schema, or ontology, used for access both to external sources and to local data. The method we propose is a unified method that relies on the ontology and enables both source integration and data integration. Our contribution is twofold. It concerns, on the one hand, the automated generation of correspondences, or mappings, between the ontology and a new source to be integrated, and on the other hand, the automatic construction of wrappers, going from the description of the abstract content of this new source to the extraction of its data. Experiments were carried out on real data in the tourism domain.
Nathalie Pernelle, Marie-Christine Rousset, Chantal Reynaud, Brigitte Safar, Fatiha Sais-
IEEE Transactions on Knowledge and Data Engineering 2009 August
This chapter deals with the integration of heterogeneous XML sources into a data warehouse whose data are defined in terms of a global abstract schema or ontology. The authors present an approach supporting the acquisition of data from a set of external sources available for an application of interest, including data extraction, data transformation and data integration or reconciliation. The integration middleware that the authors propose extracts data from external XML sources which are relevant according to an RDFS+ ontology, transforms the returned XML data into RDF facts conforming to the ontology, and reconciles the RDF data in order to resolve possible redundancies.
Marie-Christine Rousset, Francois Goasdou, Nada Abdallah-
IEEE Transactions on Knowledge and Data Engineering 2009
This paper provides a decentralized data model and associated algorithms for peer data management systems (PDMS) based on the DL-liteR description logic. Our approach relies on reducing query reformulation and consistency checking for DL-liteR into reasoning in propositional logic. This enables a straightforward deployment of DL-liteR PDMSs on top of SomeWhere, a scalable propositional peer-to-peer inference system. We also show how to use the state-of-the-art Minicon algorithm for rewriting queries using views in DL-liteR in the centralized and decentralized cases.
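To give the flavour of this reduction (an illustrative encoding, not necessarily the paper's exact translation), DL-lite concept inclusions have a direct propositional reading:

```latex
A \sqsubseteq B \;\rightsquigarrow\; a \Rightarrow b, \qquad A \sqsubseteq \neg C \;\rightsquigarrow\; a \Rightarrow \neg c
```

so that consistency checking and query reformulation can be delegated to propositional reasoning, as performed by SomeWhere.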
Serge Abiteboul, Bogdan Marinoiu, Pierre Bourhis-
IEEE Transactions on Knowledge and Data Engineering 2009
Many Web applications are based on dynamic interactions between Web components exchanging flows of information. Such a situation arises, for instance, in mashup systems or when monitoring distributed autonomous systems. Our work is set in this challenging context, which has recently generated a lot of attention; see Web 2.0. We introduce the axlog formal model for capturing such interactions and show how this model can be supported efficiently. The central component is the axlog widget, defined by one or more tree-pattern queries over an active document (in the Active XML style) that includes some input streams of updates. A widget generates a stream of updates for each query, namely the updates that are needed to maintain the view corresponding to the query. We exploit an array of known technologies: datalog optimization techniques such as Differential or MagicSet, constraint query languages, and efficient XML filtering (YFilter). The novel optimization technique we propose is based on fundamentally new notions: relevance (different from that of MagicSet), satisfiability and provenance for active documents. We briefly discuss an implementation of an axlog engine, an application that we used to test the approach, and results of experiments.
Chantal Reynaud, Yolaine Bourda, Fabrice Popineau, Nadjet Zemirline-
IEEE Transactions on Knowledge and Data Engineering 2009
Nowadays, several efforts are focused on re-using generic platforms to create new systems, in order to make the design process easier and faster. Often, the designer has his own models and resources and would like to reuse the generic system over his resources. This means he has to integrate his models and resources in the system, and then directly reuse the generic system. But many problems occur. One of them is that the designer needs to translate his models into the specific format understood by the system and to use the vocabulary specific to that system. Furthermore, he also needs to translate all the instantiations of his models (i.e. the resources and their metadata). We think that this task is tedious and time-consuming and we want to avoid it. Our objective is to allow the designer to reuse his models (his vocabulary) and his models' instantiations without any change of format or vocabulary. For example, a generic Adaptive Hypermedia System (AHS) is made of a generic adaptation model relying on generic user and domain models. The designer would like to integrate his models and instances in the generic models in order to reuse the generic adaptation engine.
Specific systems can be obtained by specializing the generic models. However, this specialization process is not always easy to perform. It has to be supported to make the design process easier and faster. This paper focuses on assisting designers in specializing generic models using their own models. We aim to automate this process, which has so far been entirely manual. Our objectives are twofold: to create a support for defining mappings between elements in generic models and elements in the designer's personal models, and to help create consistent and relevant models that integrate the generic and specific ones, taking into account the mappings between them. The proposed approach relies on OWL, a W3C standard, and SWRL, a W3C proposal.
Chantal Reynaud, Brigitte Safar, Haifa Zargayouna, Faycal Hamdi-
IEEE Transactions on Knowledge and Data Engineering 2009 January:409-420
Ontology alignment is an important task in integration systems, since it enables resources described by different ontologies to be taken into account jointly by identifying correspondences between concepts. With the emergence of very large ontologies in domains such as medicine or agronomy, alignment techniques, which often involve complex computations, face a challenge: scaling up. To meet this challenge, we propose in this article two partitioning methods designed to take the alignment objective into account as early as possible. These methods decompose the two ontologies to be aligned into two sets of blocks of limited size, such that the elements likely to be matched are concentrated in a minimal set of blocks, which are then the only ones actually compared. The results of tests performed with our two methods on various pairs of ontologies demonstrate their effectiveness.
Serge Abiteboul, Tova Milo, Bogdan Cautis-
IEEE Transactions on Knowledge and Data Engineering 2009
We introduce in this paper a class of constraints for describing how an XML document can evolve, namely XML update constraints. For these constraints, we study the implication problem, giving algorithms and complexity results for constraints of varying expressive power. Besides classical constraint implication, we also consider an instance-based approach in which we take into account data. More precisely, we study implication with respect to a current tree instance, resulting from a series of unknown updates. The main motivation of our work is reasoning about data integrity under update restrictions in contexts where owners may lose control over their data, such as in publishing or exchange.
Serge Abiteboul, Bogdan Marinoiu, Pierre Bourhis-
IEEE Transactions on Knowledge and Data Engineering 2009
Serge Abiteboul, Bogdan Marinoiu, Pierre Bourhis, Alban Galland-
IEEE Transactions on Knowledge and Data Engineering 2009
Towards a data-centric workflow approach, we
introduce an artifact model to capture data and workflow
management activities in distributed settings. The model is built
on Active XML, i.e., XML trees including Web service calls. We
argue that the model captures the essential features of business
artifacts as described informally in [1] or discussed in [2].
To illustrate, we briefly consider the monitoring of distributed
systems and the verification of temporal properties for them.
Chantal Reynaud, Yolaine Bourda, Fabrice Popineau, Nadjet Zemirline-
IEEE Transactions on Knowledge and Data Engineering 2008
The design of Adaptive Hypermedia is a difficult task which can be made easier if generic systems and AH creators' models are reused. We address the design problem in the setting of the GLAM platform, made up only of generic components. In this paper, we assume the GLAM platform is used to create a specific adaptive hypermedia. We present a pattern- and rule-based approach helping an AH creator reuse his user and domain models and instances in order to have them taken into account. This semi-automatic approach treats the creator's models as specialisations of the GLAM generic models and requires the creator to express a minimum set of mappings between his models and the generic ones. The process results in a merged model consisting of the generic and the corresponding specific model, fully compliant with the GLAM adaptation model. A plug-in and experiments in the e-learning domain have been partially designed.
Marie-Christine Rousset, Philippe Chatalic, Gia-Hien Nguyen-
IEEE Transactions on Knowledge and Data Engineering 2008 March:59-65
Semantic peer-to-peer (P2P) systems are fully decentralized overlay networks of people or machines (called peers) sharing and searching varied resources (documents, videos, photos, data, services) based on their semantic annotations using ontologies. They provide a support for the emergence of open and decentralized electronic social networks, in which no central or external authority can control the reliability of the peers participating in the network.
In this paper, we propose a probabilistic model to handle trust in a P2P setting. It supports a local computation and a simple form of propagation of the trust of peers in classes of other peers. We claim that it is well suited to the dynamics of P2P networks and to the freedom of each peer within the network to have different viewpoints towards the peers with which it interacts.
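To illustrate the kind of local, class-based computation such a model supports (a minimal sketch under an assumed Beta-distribution estimate; all names are hypothetical, not the paper's):

```python
from collections import defaultdict

class PeerTrust:
    """Local, class-based trust estimates a peer maintains about
    classes of other peers. Hypothetical sketch: trust in a class is
    the mean of a Beta distribution over observed interactions."""

    def __init__(self):
        # [satisfactory, unsatisfactory] counts per peer class,
        # starting from a uniform Beta(1, 1) prior
        self.counts = defaultdict(lambda: [1, 1])

    def observe(self, peer_class, satisfactory):
        # local computation: update counts after an interaction
        self.counts[peer_class][0 if satisfactory else 1] += 1

    def trust(self, peer_class):
        # point estimate: mean of Beta(a, b)
        a, b = self.counts[peer_class]
        return a / (a + b)

    def merge(self, peer_class, other_counts):
        # simple propagation: pool counts reported by another peer
        self.counts[peer_class][0] += other_counts[0]
        self.counts[peer_class][1] += other_counts[1]
```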
Marie-Christine Rousset, Philippe Chatalic, Gia-Hien Nguyen-
IEEE Transactions on Knowledge and Data Engineering 2008 July:881-882
Semantic peer-to-peer (P2P) systems are fully decentralized overlay networks of people or machines (called peers) sharing and searching varied resources (documents, videos, photos, data, services) based on their semantic annotations using ontologies. They provide a support for the emergence of open and decentralized electronic social networks, in which no central or external authority can control the reliability of the peers participating in the network.
In this paper, we propose a probabilistic model to handle trust in a P2P setting. It supports a local computation and a simple form of propagation of the trust of peers in classes of other peers. We claim that it is well suited to the dynamics of P2P networks and to the freedom of each peer within the network to have different viewpoints towards the peers with which it interacts.
Serge Abiteboul, Bogdan Marinoiu, Pierre Bourhis-
IEEE Transactions on Knowledge and Data Engineering 2008 March
We consider the evolution of active documents, and, more precisely, of Active XML documents with incoming streams of data. A main contribution of the paper is a study of "document satisfiability" defined as follows. A formula is satisfiable for a document if it may hold in some future (depending on data brought by the incoming streams). We show how to evaluate document satisfiability of tree-pattern queries using non-recursive datalog. We characterize the complexity of this problem for important variants of the document model and the query language. We use this notion together
with known datalog techniques (Differential and MagicSet) for maintaining views over active documents. We also exhibit optimization
techniques that consist in pushing filters to the incoming streams and garbage collecting data and streams that are useless with respect to the views of interest.
Chantal Reynaud, Yolaine Bourda, Fabrice Popineau, Nadjet Zemirline-
IEEE Transactions on Knowledge and Data Engineering 2008 July
The design of Adaptive Hypermedia is a difficult task which can be made easier if generic systems and AH creators' models are reused. We address this design problem in the setting of the GLAM platform, made up only of generic components. We present a rule-based approach helping an AH creator reuse his user and domain models to create a specific adaptive hypermedia. This semi-automatic approach treats the creator's models as specialisations of the GLAM generic models and requires the creator to express a minimum set of mappings between his models and the generic ones. The process results in a merged model consisting of the generic and the corresponding specific model. This merged model can be used by the adaptation model.
Pierre Senellart, Vincent Blondel-
IEEE Transactions on Knowledge and Data Engineering 2008 January
Francois Goasdou, Nada Abdallah-
IEEE Transactions on Knowledge and Data Engineering 2008
In this article, we study consequence finding in propositional peer-to-peer inference systems (P2PIS) with oriented mappings. In these systems, a mapping from one peer to another specifies a set of knowledge that the first peer must observe, as well as the knowledge it must notify to the second peer if the observed knowledge is satisfied. Since these new P2PIS can model many real applications, it is important to equip them with key AI inference services, in this case consequence finding. Our contributions are twofold. We first define the first logical framework for representing propositional P2PIS with oriented mappings. We then study consequence finding in this new framework. In particular, we propose a fully decentralized algorithm for this problem.
Francois Goasdou, Nada Abdallah-
IEEE Transactions on Knowledge and Data Engineering 2008
In a peer-to-peer inference system (P2PIS), a peer extends its knowledge base (KB) with those of the other peers, in order to use their knowledge to answer the queries it is asked. However, the extension of a KB is not necessarily conservative. A conservative extension guarantees that the meaning of a KB is the same whether it is considered alone or together with its extension. In contrast, a non-conservative extension can radically change the meaning of a KB within the resulting theory. It is therefore crucial for a peer to know whether a P2PIS is a conservative extension of its KB. Indeed, if it is not, (i) its users no longer have the right interpretation of its KB, hence of queries in terms of its KB and of the answers to them, and (ii) the knowledge it provides to other peers is not what is expected.
This article is the first to address the notion of conservative extension in the P2PIS setting. Our contribution is a theoretical and algorithmic study of the problem of testing whether a propositional P2PIS is a conservative extension of a given peer. In particular, we resort to consequence finding in order to reformulate this problem and to design a decentralized algorithm to solve it.
Serge Abiteboul, Pierre Bourhis, Bogdan Marinoiu-
IEEE Transactions on Knowledge and Data Engineering 2008
Observing highly dynamic Peer-to-Peer systems is essential for many applications such as fault management or business processing. We demonstrate P2PMonitor, a P2P system for monitoring such systems. Alerters deployed on the monitored peers are designed to detect particular kinds of local events. They generate streams of XML data that form the primary sources of information for P2PMonitor. The core of the system is composed of processing components implementing the operators of an algebra over data streams.
From a user viewpoint, monitoring a P2P system can be as simple as querying an XML document. The document is an ActiveXML document that aggregates a (possibly very large) number of streams generated by alerters on the monitored peers. Behind the scenes, P2PMonitor compiles the monitoring query into a distributed monitoring plan, deploys alerters and stream algebra processors, and issues notifications that are sent to users.
The system functionalities are demonstrated by simulating the supply chain of a large company.
Chantal Reynaud, Nicolas Guelfi, Cedric Pruski-
IEEE Transactions on Knowledge and Data Engineering 2008
Finding relevant information on the Web is often difficult for most users. Although Web search applications are improving, they still need to be more intelligent in order to adapt to the search domain targeted by users, the evolution of this domain, and users' characteristics. In this paper, we present an experimental assessment of the TARGET framework for improving the relevance of documents when users search the Web using adaptive ontologies. We first introduce the TARGET approach, briefly presenting the ontologies used and their ability to adapt to domain evolution. We then detail the TARGET tool used in our experiments, including its architecture, its ability to carry out the ontology adaptation process, and the way it searches the Web and ranks the returned results. Finally, we discuss the results obtained using the tool through the presentation of our case study, devoted to the retrieval of scientific articles.
Chantal Reynaud, Francois-Elie Calvier-
IEEE Transactions on Knowledge and Data Engineering 2008 June
This article focuses on ontology matching in a decentralized setting. The work takes place in the MediaD project. Ontologies describe the data of peers belonging to the SomeRDFS peer data management system.
First, we present some particularities of ontology matching in a peer-to-peer setting, together with the data model and query answering in SomeRDFS PDMSs. Then, we show how to take advantage of query answering in order to help discover new mappings between ontologies, either mapping shortcuts corresponding to a composition of pre-existing mappings, or mappings which cannot be inferred from the network but are nevertheless relevant. This work has been partly implemented. We present the results of some experiments which confirm the interest of our approach.
Chantal Reynaud, Francois-Elie Calvier-
IEEE Transactions on Knowledge and Data Engineering 2008 October
This article focuses on ontology matching in a decentralized setting. The work takes place in the MediaD project. Ontologies describe the data of peers belonging to the SomeRDFS peer data management system. We show how to take advantage of query answering in order to help discover new mappings between ontologies, either mapping shortcuts corresponding to a composition of pre-existing mappings, or mappings which cannot be inferred from the network but are nevertheless relevant.
Serge Abiteboul, Tova Milo, Ohad Greenshpan-
IEEE Transactions on Knowledge and Data Engineering 2008
We introduce a formal model for capturing the notion of mashup in its globality. The basic component in our model is the mashlet. A mashlet may query data sources, import other mashlets, use external Web services, and specify complex interaction patterns between its components. A mashlet's state is modeled by a set of relations and its logic is specified by datalog-style active rules. We are primarily concerned with changes in a mashlet's state relations and rules. The interactions with users and other applications, as well as the consequent effects on the mashlets' composition and behavior, are captured by streams of changes. The model facilitates dynamic mashlet composition, interaction and reuse, and captures the fundamental behavioral aspects of mashups.
Pierre Senellart, Georg Gottlob-
IEEE Transactions on Knowledge and Data Engineering 2008 June
We introduce a theoretical framework for discovering relationships between two database instances over distinct and unknown schemata. This framework is grounded in the context of data exchange. We formalize the problem of understanding the relationship between two instances as that of obtaining a schema mapping so that a minimum repair of this mapping provides a perfect description of the target instance given the source instance. We show that this definition yields "intuitive" results when applied to database instances derived from each other by basic operations. We study the complexity of decision problems related to this optimality notion in the context of different logical languages and show that, even in very restricted cases, the problem is of high complexity.
Chantal Reynaud, Francois-Elie Calvier-
IEEE Transactions on Knowledge and Data Engineering 2008
In this article we propose methods to automate the extraction of alignments, or mappings, between ontologies by using query answering in the peer data management system (PDMS) SomeRDFS, which uses a data model based on RDF(S). Query answering is composed of a rewriting phase and an evaluation phase.
We show that query answering provides information that offers automated support for discovering new mappings. It is used to generate either mapping shortcuts corresponding to mapping compositions, or mappings which cannot be inferred from the network but are nevertheless relevant. The strategy followed by a peer and filtering criteria defined by the administrator of a peer are used to select the most relevant mappings to be represented.
Serge Abiteboul, Ioana Manolescu, Spyros Zoupanos-
IEEE Transactions on Knowledge and Data Engineering 2008
Serge Abiteboul, Ioana Manolescu, Spyros Zoupanos-
IEEE Transactions on Knowledge and Data Engineering 2008
Ioana Manolescu, Angela Bonifati, Andrei Arion, Andrea Pugliese-
IEEE Transactions on Knowledge and Data Engineering 2008 February - 11(1)
Ioana Manolescu, Stefan Manegold-
IEEE Transactions on Knowledge and Data Engineering 2008
Francois Goasdou, Jean Charlet, Alain Leger, Johannes Heinecke, Lyndon J.B. Nixon, Pavel Shvaiko, Paola Hobson-
IEEE Transactions on Knowledge and Data Engineering 2008
Semantic Web technology is being increasingly applied in a large spectrum of applications in which domain knowledge is conceptualized and formalized (e.g., by means of an ontology) in order to support diversified and automated knowledge processing (e.g., reasoning) performed by a machine. Moreover, through an optimal combination of (cognitive) human reasoning and (automated) machine processing (mimicking reasoning), it becomes possible for humans and machines to share more and more complementary tasks. The spectrum of applications is extremely large; to name a few: corporate portals and knowledge management, e-commerce, e-work, e-business, healthcare, e-government, natural language understanding and automated translation, information search, data and services integration, social networks and collaborative filtering, knowledge mining, business intelligence and so on. From a social and economic perspective, this emerging technology should contribute to growth in economic wealth, but it must also show clear-cut value for everyday activities through technological transparency and efficiency. The penetration of Semantic Web technology in industry and in services is progressing slowly but accelerating as new success stories are reported. In this chapter we present ongoing work in the cross-fertilization between industry and academia. In particular, we present a collection of application fields and use cases from enterprises which are interested in the promises of Semantic Web technology. The use cases are focused on the key knowledge processing components that will unlock the deployment of the technology in industry. The chapter ends with the presentation of the current state of the technology and future trends as seen by prominent actors in the field.
S. Abiteboul, L. Segoufin, V.Vianu-
IEEE Transactions on Knowledge and Data Engineering 2008
Active XML is a high-level specification language tailored to data-intensive, distributed, dynamic Web services. Active XML is based on XML documents with embedded function calls. The state of a document evolves depending on the results of internal function calls (local computations) or external ones (interactions with users or other services). Function calls return documents that may be active, and so may activate new subtasks. The focus of the paper is on the verification of temporal properties of runs of Active XML systems, specified in a tree-pattern-based temporal logic, Tree-LTL, that allows expressing a rich class of semantic properties of the application. The main results establish the boundary of decidability and the complexity of automatic verification of Tree-LTL properties.
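As an illustration of the kind of property Tree-LTL is meant to express (a hypothetical example, not taken from the paper), one may require that every received order is eventually delivered:

```latex
\forall x \; \mathbf{G}\,\big(\mathit{Order}(x) \Rightarrow \mathbf{F}\,\mathit{Delivered}(x)\big)
```

where the atoms stand for tree patterns matched against the evolving document.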
Francois Goasdou, Nada Abdallah-
IEEE Transactions on Knowledge and Data Engineering 2008
This article shows why the notion of a non-conservative extension of a knowledge base (KB) is important in peer-to-peer inference systems (P2PIS), also known as semantic peer-to-peer systems. This notion is useful for a peer to detect whether (part of) its KB is corrupted by a P2PIS, or to learn from the P2PIS new knowledge about its own application domain. It is all the more important as it is closely related to the confidentiality of a peer's knowledge within a P2PIS and to the quality of service provided by a P2PIS.
Here we study, from the theoretical and decentralized-algorithmic points of view, the following two problems: (i) deciding whether a P2PIS is a conservative extension of a given peer, and (ii) computing the witnesses of a possible corruption of a given peer's KB by a P2PIS, so that it can be prevented. We consider P2PIS that scale to a thousand peers and whose usefulness has already been demonstrated in Artificial Intelligence and Databases.
Chantal Reynaud, Brigitte Safar, Haifa Zargayouna, Faycal Hamdi-
IEEE Transactions on Knowledge and Data Engineering 2008
TaxoMap is an alignment tool whose aim is to discover rich correspondences between concepts. It performs an oriented alignment (from a source to a target ontology) and takes into account labels and sub-class descriptions. Our participation in last year's edition of the competition highlighted certain limits. TaxoMap2 is a new implementation of TaxoMap that significantly reduces runtime and enables parametrization by specifying the ontology language and the different thresholds used to extract different mapping relations. The new implementation emphasizes terminological techniques; it takes into account synonymy and the multi-label description of concepts. A special effort was made to handle large-scale ontologies by partitioning the input ontologies into modules to align. We conclude the paper by pointing out the necessary improvements that remain to be made.
Ioana Manolescu, Andrei Arion, Pierre Senellart, Loredana Afanasiev, Neoklis Polyzotis, Karl Schnaitter, Spyros Zoupanos, Jens Dittrich, Stefan Manegold, Dennis Shasha-
IEEE Transactions on Knowledge and Data Engineering 2008 March - 37(1)
Ioana Manolescu, Philippe Michiels, Cedric Miachon-
IEEE Transactions on Knowledge and Data Engineering 2008 - 33(2):182-202
Chantal Reynaud, Francois-Elie Calvier-
IEEE Transactions on Knowledge and Data Engineering 2008
In this article, we focus on discovering correspondences between distributed ontologies modelling the knowledge of peers in the SomeRDFS P2P data management system. More precisely, we show how to exploit the reasoning mechanisms implemented in SomeRDFS to help discover mappings between ontologies. This work is carried out within the MediaD project, in partnership with France Telecom.
Michalis Vazirgiannis, Pierre Senellart, Dimitris Drosos, Akrivi Vlachou-
IEEE Transactions on Knowledge and Data Engineering 2008 April
In this paper we propose a method for predicting the ranking position of a Web page. Assuming a set of successive past top-k rankings, we study the evolution of Web pages in terms of ranking trend sequences used for Markov Models training, which are in turn used to predict future rankings. The predictions are highly accurate for all experimental setups and similarity measures.
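A minimal sketch of this scheme (assuming trend states UP/DOWN/STABLE and a first-order model; the names and the smoothing choice are illustrative, not the paper's exact setup):

```python
import numpy as np

# Hypothetical sketch of rank-trend prediction with a first-order
# Markov model: states are trends of a page's rank between two
# successive top-k rankings.
STATES = {"UP": 0, "DOWN": 1, "STABLE": 2}

def trends(ranks):
    """Map a page's successive rank positions to trend states."""
    return [
        "UP" if b < a else "DOWN" if b > a else "STABLE"
        for a, b in zip(ranks, ranks[1:])
    ]

def train(sequences, n=len(STATES)):
    """Estimate the transition matrix from past trend sequences."""
    T = np.ones((n, n))  # add-one smoothing
    for seq in sequences:
        for s, t in zip(seq, seq[1:]):
            T[STATES[s], STATES[t]] += 1
    return T / T.sum(axis=1, keepdims=True)

def predict(T, last_state):
    """Most likely next trend given the last observed one."""
    return list(STATES)[int(np.argmax(T[STATES[last_state]]))]
```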
Serge Abiteboul, Benjamin Nguyen, Ioana Manolescu, Francois Goasdou, Philippe Chatalic, Spyros Zoupanos, Gabriel Vasile, Georges Gardarin, Tristan Allard, Anca Ghitescu, Mohamed Ouazara, Aditya Somani, Nicolas Travers-
IEEE Transactions on Knowledge and Data Engineering 2008
We present the WebContent platform for managing distributed repositories of XML and semantic Web data. The platform allows integrating various data processing building blocks (crawling, translation, semantic annotation, full-text search, structured XML querying, and semantic querying), presented as Web services, into a large-scale efficient platform. Calls to various services are combined inside ActiveXML documents, which are XML documents including service calls. An ActiveXML optimizer is used to: (i) efficiently distribute computations among sites; (ii) perform XQuery-specific optimizations by leveraging an algebraic XQuery optimizer; and (iii) given an XML query, choose among several distributed indices the most appropriate one in order to answer the query.
Serge Abiteboul, Ioana Manolescu, Nicoleta Preda, Neoklis Polyzotis, Chong Sun-
IEEE Transactions on Knowledge and Data Engineering 2008
S. Abiteboul-
IEEE Transactions on Knowledge and Data Engineering 2007:1-11
The sharing of content by communities of users (e.g., scientists) in a P2P context remains cumbersome. We argue that a main reason for this is the lack of a calculus and algebra for distributed data management. We present the ActiveXML language, which extends the XML language with features to handle distribution. More precisely, ActiveXML documents are XML documents with a special syntax for specifying the embedding of Web service calls, e.g. XML queries such as XQueries. We also present the ActiveXML algebra, which extends ActiveXML notably with explicit control of data exchanges. The ActiveXML algebra allows describing query plans and exchanging them between peers.
Nathalie Pernelle, Marie-Christine Rousset, Fatiha Sais-
IEEE Transactions on Knowledge and Data Engineering 2007 January
The reference reconciliation problem consists in deciding whether two descriptions coming from distinct sources refer to the same real-world entity or not. In this article, we study this problem when the data schema is described in RDFS extended with certain OWL-DL primitives. We describe and demonstrate the interest of a logical approach based on reconciliation rules that can be generated automatically from the schema axioms. These rules express declaratively the dependencies between reconciliations that follow from the semantics of the schema. The first results were obtained on real data within the PICSEL 3 project, in collaboration with France Telecom R&D.
Pierre Senellart-
IEEE Transactions on Knowledge and Data Engineering 2007 December
The hidden Web (also known as deep or invisible Web), that is, the part of the Web not directly accessible through hyperlinks, but through HTML forms or Web services, is of great value, but difficult to exploit. We discuss a process for the fully automatic discovery, syntactic and semantic analysis, and querying of hidden-Web services. We propose first a general architecture that relies on a semi-structured warehouse of imprecise (probabilistic) content. We provide a detailed complexity analysis of the underlying probabilistic tree model. We describe how we can use a combination of heuristics and probing to understand the structure of an HTML form. We present an original use of a supervised machine-learning method, namely conditional random fields, in an unsupervised manner, on an automatic, imperfect, and imprecise, annotation based on domain knowledge, in order to extract relevant information from HTML result pages. So as to obtain semantic relations between inputs and outputs of a hidden-Web service, we investigate the complexity of deriving a schema mapping between database instances, solely relying on the presence of constants in the two instances. We finally describe a model for the semantic representation and intensional indexing of hidden-Web sources, and discuss how to process a user's high-level query using such descriptions.
Luc Segoufin, Cristina Sirangelo-
IEEE Transactions on Knowledge and Data Engineering 2007
In this paper we investigate the problem of validating, with constant memory, streaming XML documents with respect to a DTD. Such constant-memory validation can only be performed for some but not all DTDs. This paper gives a nontrivial and interesting step towards characterizing those DTDs for which a constant-memory on-line algorithm exists.
Chantal Reynaud, Francois-Elie Calvier-
IEEE Transactions on Knowledge and Data Engineering 2007
In this article, we focus on discovering correspondences between ontologies modelling the knowledge of the peers of a peer-to-peer data management system (PDMS). This work is carried out within the MediaD project, on the SomeRDFS platform, in which ontologies, data and mappings are expressed in RDF(S). We present this context in a first part. We then describe the specificities of the matching process in the PDMS setting as opposed to the centralized one. Finally, we show how the reasoning performed in a PDMS can help discover mappings between ontologies, and describe some usable mapping discovery techniques.
Victor Vianu, Luc Segoufin, Alan Nash-
IEEE Transactions on Knowledge and Data Engineering 2007
Suppose we are given a set of exact conjunctive views V and a conjunctive query Q. Suppose we wish to answer Q using V, but the classical test for the existence of a conjunctive rewriting of Q using V answers negatively. What can we conclude: (i) there is no way Q can be answered using V, or (ii) a more powerful rewriting language may be needed. This has been an open question, with conventional wisdom favoring (i). Surprisingly, we show that the right answer is actually (ii).
That is, even if V provides enough information to answer Q, it may not be possible to rewrite Q in terms of V using just conjunctive queries --
in fact, no monotonic language is sufficiently powerful. We also exhibit several well-behaved classes of conjunctive views and queries for which conjunctive rewritings remain sufficient.
This continues a previous investigation of rewriting and its connection to semantic
determinacy, for various query and view languages.
Bogdan Cautis-
IEEE Transactions on Knowledge and Data Engineering 2007 June
With more and more information being exchanged or published on the
Web or in peer-to-peer, and with the significant growth in numbers
of distributed, heterogeneous data sources, issues like access
control and data privacy are becoming increasingly complex and
difficult to manage. Very often, when dealing with sensitive
information in such settings, the specification of access control
policies and their enforcement are no longer handled by the actual
data sources, and are (partially) delegated to third-parties.
Besides practical reasons, this is the case when decisions regarding
access depend on factors which overpass the scope and knowledge of
some of the entities involved. More specifically, policies may
depend on private aspects concerning users (accessing data)
or data owners. In this case, the only solution is to entrust some
third-party authority with all the information needed to apply
access policies. However, as the policies themselves depend on
sensitive information, this outsourcing raises new privacy issues,
that were not present in centralized environments. In particular,
information leaks may occur during access control enforcement.
In this paper, we consider these issues and, starting from
non-conventional digital signatures, we take a first step towards an implementation solution for such settings where both data
and access policies are distributed. Our approach involves rewriting
user queries into forms which are authorized, and we illustrate this
for both structured (relational) and semi-structured (XML) data and
queries.
Serge Abiteboul, Bogdan Marinoiu-
IEEE Transactions on Knowledge and Data Engineering 2007
In this paper, we are concerned with the distributed monitoring of P2P systems. We introduce the P2P Monitor system and a new declarative language, namely P2PML, for specifying monitoring tasks. A P2PML subscription is compiled into a distributed algebraic plan which is described using an algebra over XML streams. The operators of this algebra are first alerters, in charge of detecting specific events and acting as stream sources. Other operators process the streams or publish them.
We introduce a filter for streams of XML documents that scales by processing simple conditions first and then, if still needed, evaluating complex queries. We also show how particular tasks can be supported by identifying subtasks that are already provided by existing streams.
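The two-stage idea can be sketched as follows (an illustration only; the predicate names are hypothetical):

```python
def make_filter(simple_conditions, complex_query):
    """Two-stage stream filter: cheap predicates run first, the
    expensive query is evaluated only on documents that pass them."""
    def accept(document):
        if not all(cond(document) for cond in simple_conditions):
            return False                  # discarded by the cheap stage
        return complex_query(document)    # full evaluation if still needed
    return accept
```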
Serge Abiteboul, Dan Vodislav, Radu Pop, Itay Dar, Gabriel Vasile-
IEEE Transactions on Knowledge and Data Engineering 2007 June - 234:209-215
The open-source software communities currently face an increasing complexity of managing the software content among their developers and contributors. This is mainly due to the continuously growing size of the software, the high frequency of the updates, and the heterogeneity of the participants. We propose a distribution system that tackles two main issues in software content management: efficient content dissemination through a P2P system architecture, and advanced information system capabilities, using a distributed index for resource location.
Chantal Reynaud, Brigitte Safar-
IEEE Transactions on Knowledge and Data Engineering 2007
A lot of alignment systems providing mappings between the concepts of two ontologies rely on an additional source, called background knowledge, represented most of the time by a third ontology. The objective is to complement others current matching techniques. In this paper, we present the difficulties encountered when using WordNet as background knowledge and we show how the TaxoMap system we implemented can avoid those difficulties.
Pierre Senellart, Yann Ollivier-
IEEE Transactions on Knowledge and Data Engineering 2007 July
We introduce a new method for finding nodes semantically related to a given node in a hyperlinked graph: the Green method, based on
a classical Markov chain tool. It is generic, adjustment-free and easy to implement. We test it in
the case of the hyperlink structure of the English version of Wikipedia, the on-line encyclopedia. We present an extensive comparative study of the performance of our method versus several other classical methods in the case of Wikipedia. The Green method is found to have both the best average results and the best robustness.
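A minimal sketch of a damped Green measure computed by truncated summation (the parameter names and the damping/truncation choices are illustrative assumptions, not the paper's):

```python
import numpy as np

def green_scores(adj, seed, alpha=0.85, steps=50):
    """Truncated, damped Green measure of the random walk on a graph:
    expected discounted number of visits to each node when starting
    from `seed`. Assumes every node has at least one out-link."""
    P = np.asarray(adj, dtype=float)
    P = P / P.sum(axis=1, keepdims=True)  # row-stochastic transition matrix
    x = np.zeros(len(P))
    x[seed] = 1.0                         # walk starts at the seed node
    g = x.copy()
    for _ in range(steps):                # partial sum of (alpha * P)^t
        x = alpha * (x @ P)
        g += x
    return g                              # higher score = more related to seed
```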
Serge Abiteboul, Bogdan Marinoiu, Pierre Bourhis-
IEEE Transactions on Knowledge and Data Engineering 2007
In this paper, we develop algorithmic datalog-based foundations for the incremental processing of tree-pattern queries over active documents, i.e. documents with incoming streams of data.
We define a notion of query satisfiability for such documents based on a 3-valued logic: "true", "false forever", and "false for now". Also, given an active document and a query, part of the document (and in particular, some incoming streams) may become irrelevant for the query of interest. We introduce an incremental algorithm for detecting such useless data and streams, essential for implementing garbage collection. We also provide a complexity analysis for the problems we study.
Fatiha Sais, Nathalie Pernelle and Marie-Christine Rousset-
IEEE Transactions on Knowledge and Data Engineering 2007 July:329-334
The reference reconciliation problem consists in deciding whether different identifiers refer to the same data, i.e., correspond to the same real-world entity. The L2R system exploits the semantics of a rich data model, which extends RDFS by a fragment of OWL-DL and SWRL rules. In L2R, the semantics of the schema is translated into a set of logical reconciliation rules, which are then used to infer correct decisions of both reconciliation and non-reconciliation. In contrast with other approaches, the L2R method has a precision of 100% by construction. First experiments show promising results for recall, and most importantly significant increases when rules are added.
Serge Abiteboul, Dan Vodislav, Nicoleta Preda, Radu Pop, Itay Dar, Gabriel Vasile-
IEEE Transactions on Knowledge and Data Engineering 2007
Open-source software communities currently face an increasing complexity in managing and distributing software content among their developers and contributors. This is mainly due to the continuously growing size of the software and of the community, the high frequency of updates, and the heterogeneity of the participants. We propose a large-scale P2P distribution system that tackles two main issues in software content management: efficient content dissemination and advanced information system capabilities.
Chantal Reynaud, Nicola Guelfi, Cedric Pruski-
IEEE Transactions on Knowledge and Data Engineering 2007
The use of Semantic Web technologies has become widespread in recent years. This is true, in particular, for ontology definition languages, which requires adapting tools that exploit this type of terminological resource so that they work with knowledge represented in the standard OWL language. This article concerns Web information retrieval based on the use of ontologies. Our contribution is twofold. First, we study how to exploit knowledge represented in OWL within O3, an approach to document retrieval on the Web, and we propose to extend this language by integrating new semantic relations. Second, we show how to use ontologies built in OWL and its extension for query expansion, in order to improve the precision of results when searching for documents on the Web. Some experimental results obtained with the TARGET prototype are presented.
Serge Abiteboul, Bogdan Marinoiu-
IEEE Transactions on Knowledge and Data Engineering 2007
In this paper, we are concerned with the distributed monitoring of P2P systems. We introduce the P2P Monitor system and a new declarative language, namely P2PML, for specifying monitoring tasks. A subscription is compiled into a distributed algebraic plan which is described using an algebra over XML streams.
We introduce a filter for streams of XML documents that scales by processing simple conditions first and then, if still needed, evaluating complex queries. We also show how particular tasks can be supported by identifying subtasks already provided by existing streams.
Serge Abiteboul, Pierre Senellart-
IEEE Transactions on Knowledge and Data Engineering 2007 June
In [3], we introduced a framework for querying and updating probabilistic information over unordered labeled trees, the probabilistic tree model. The data model is based on trees where nodes are annotated with conjunctions of probabilistic event variables. We briefly described an implementation and scenarios of usage. We develop here a mathematical foundation for this model. In particular, we present complexity results. We identify a very large class of queries for which simple variations of querying and updating algorithms from [3] compute the correct answer. A main contribution is a full complexity analysis of queries and updates. We also exhibit a decision procedure for the equivalence of probabilistic trees and prove it is in co-RP. Furthermore, we study the issue of removing less probable possible worlds, and that of validating a probabilistic tree against a DTD. We show that these two problems are intractable in the most general case.
Serge Abiteboul, Tova Milo, Neoklis Polyzotis, Karl Schnaitter-
IEEE Transactions on Knowledge and Data Engineering 2007
This paper introduces COLT (Continuous On-Line Tuning),
a novel framework that continuously monitors the
workload of a database system and enriches the existing
physical design with a set of effective indices. The key idea
behind COLT is to gather performance statistics at different
levels of detail and to carefully allocate profiling resources
to the most promising candidate configurations. Moreover,
COLT uses effective heuristics to self-regulate its own performance,
lowering its overhead when the system is well
tuned and being more aggressive when the workload shifts
and it becomes necessary to re-tune the system. We describe
an implementation of the proposed framework in the
PostgreSQL database system and evaluate its performance
experimentally. Our results validate the effectiveness of
COLT and demonstrate its ability to modify the system configuration
in response to changes in the query load.
Serge Abiteboul, Ioana Manolescu, Spyros Zoupanos-
IEEE Transactions on Knowledge and Data Engineering 2007
The system we propose to present, OptiMAX, applies the principles of distributed query optimization to the problem of distributed evaluation of continuous XML queries. OptiMAX is an optimizer for Active XML documents (AXML in short). It is implemented as a module which can be used next to an AXML peer, and it may be invoked whenever users ask queries on their AXML documents. The optimizer draws up an initial query plan, and then attempts to rewrite it using a combination of heuristics and cost information in order to improve the plan's performance estimates. An interesting feature is that all plans are AXML documents themselves. When the optimizer has retained a plan, it hands it to the AXML peer, which evaluates it directly, following the decisions taken by the optimizer.
Benjamin Nguyen, Ioana Manolescu, Florin Dragan, Nicoleta Preda, Radu Pop, Bogdan Butnaru, Georges Gardarin, Laurent Yeh-
IEEE Transactions on Knowledge and Data Engineering 2007 April
The current abundance and complexity of P2P architectures makes it extremely difficult to assess their performance. P2PTester is the first tool devised to interface with, and measure the performance of, existing P2P data management platforms. We isolate basic components present in current P2P platforms, and insert "hooks" for P2PTester to capture, analyze and trace the interactions taking place in the underlying distributed system.
Nathalie Pernelle, Fatiha Sais-
IEEE Transactions on Knowledge and Data Engineering 2007 January
The integration of data coming from different sources faces two major problems: the heterogeneity of the schemas and the variations in the descriptions of instances. We compared these two problems, focusing on the existence of methods able to cope with scaling up, and thus with the processing of large and possibly distributed amounts of data.
Avin Mittal-
IEEE Transactions on Knowledge and Data Engineering 2007 July
Hidden databases on the web are an important source of information that is not usually indexed by traditional search engines. Documents on the hidden web are not reachable by following the hyperlink structure of the graph, so there is a need for a system that can automatically discover, understand and index hidden-web documents. In this article, we present a system which analyzes the structure of HTML forms, annotates the various fields in the form with concepts from the knowledge domain, probes the fields with domain-specific queries, and analyzes the response pages to find out whether or not they contain any results, to further confirm the annotations. The system also wraps the given HTML form into a web service described by a WSDL document, whose inputs are labeled by the domain concepts and which can be used to query the form to get response pages.
Bogdan Cautis, Alin Deutsch, Nicola Onose-
IEEE Transactions on Knowledge and Data Engineering 2007 April
We study the problem of querying data sources that accept only a limited set of queries, such as sources accessible by Web services which can implement very large (potentially infinite) families of queries. We revisit a classical setting in which the application queries are conjunctive queries and the source accepts families of (possibly parameterized) conjunctive queries specified as the expansions of a (potentially recursive) Datalog program with parameters. We say that query Q is expressible by the program P if it is equivalent to some expansion of P. Q is supported by P if it has an equivalent rewriting using some finite set of P's expansions. We present the first study of expressibility and support for sources that satisfy integrity constraints, which is generally the case in practice. We start by closing a problem left open by prior work even while ignoring constraints, namely the precise relationship between expressibility and support: surprisingly, we show them to be inter-reducible in PTIME in both the absence and the presence of constraints. This enables us to employ the same algorithm for solving both problems. We identify practically relevant restrictions on the program specifying the views that ensure decidability under a mix of key and weakly acyclic foreign key constraints, and beyond. We show that these restrictions are as permissive as possible, since their slightest relaxation leads to undecidability. We present an algorithm that is guaranteed to be sound when applied to unrestricted input and in addition complete under the restrictions. As a side-effect of our investigation, we improve the previously best known upper bound for deciding support in the constraint-free case.
Serge Abiteboul, Tova Milo, Bogdan Cautis-
IEEE Transactions on Knowledge and Data Engineering 2007 June
We introduce in this paper a class of constraints for describing how
an XML document can evolve, namely XML update constraints.
For these constraints, we study the implication problem, giving
algorithms and complexity results for constraints of varying
expressive power. Besides classical constraint implication, we also
consider an instance-based approach. More precisely, we study
implication with respect to a current tree instance, resulting from
a series of unknown updates. The main motivation of our work is
reasoning about data integrity under update restrictions in contexts
where owners may lose control over their data, such as in publishing
or exchange.
Fatiha Sais, Nathalie Pernelle and Marie-Christine Rousset-
IEEE Transactions on Knowledge and Data Engineering 2007 June:521-532
The reference reconciliation problem is a major problem for the integration or fusion of data coming from several sources. It consists in deciding whether two descriptions coming from distinct sources refer to the same real-world entity or not. However, the reference reconciliation task often has to deal with large amounts of data as well as with runtime performance constraints. With a view to scaling up, we present in this article techniques that we integrate into our logical reference reconciliation approach L2R.
Luc Segoufin, Michael Benedikt-
IEEE Transactions on Knowledge and Data Engineering 2007
We consider regular languages of labeled trees. We give an algebraic characterization of the regular languages over such trees that are definable in first-order logic in the language of labeled graphs. These languages are the analog on trees of the "locally threshold testable" languages on strings. We show that this characterization yields a decision procedure for determining whether a regular collection of trees is first-order definable: the procedure is polynomial time in the minimal automaton presenting the regular language. We also give a characterization of definability in first-order logic supplemented with modular quantifiers, and a corresponding decision procedure.
Serge Abiteboul, Dan Vodislav, Radu Pop, Gabriel Vasile-
IEEE Transactions on Knowledge and Data Engineering 2007
Scalability is a key issue in current content distribution systems for web communities, which have to disseminate increasingly large amounts of data to increasingly large communities of users. We believe
that peer-to-peer (P2P) architectures provide good solutions for scalable distribution of large amounts of data over the web. This paper presents EDOS, a P2P system for open-source software distribution, able to disseminate gigabytes of data to thousands of users on the web. We focus on the publishing and querying functionalities and we present several experiments carried out on the Grid'5000 network that confirm the system's scalability.
Serge Abiteboul, Dan Vodislav, Radu Pop, Itay Dar, Gabriel Vasile-
IEEE Transactions on Knowledge and Data Engineering 2007 February
The open-source software communities currently face an increasing complexity of managing the software content among their developers and contributors. This is mainly due to the continuously growing size of the software, the high frequency of the updates, and the heterogeneity of the participants. We propose a distribution system that tackles two main issues in software content management: efficient content dissemination through a P2P system architecture, and advanced information system capabilities, using a distributed index for resource location. We believe that both aspects are essential in improving the performance of the overall system, and each particular choice in the system's architecture completes the overall functionalities.
Marie-Christine Rousset, Francois Goasdou, Philippe Adjiman-
IEEE Transactions on Knowledge and Data Engineering 2007 - 8
The Semantic Web envisions a world-wide distributed architecture where computational resources will easily inter-operate to coordinate complex tasks such as query answering. Semantic marking up of web resources using ontologies is expected to provide the necessary glue for making this vision work. Using ontology languages, (communities of) users will build their own ontologies in order to describe their own data. Adding semantic mappings between those ontologies, in order to semantically relate the data to share, gives rise to the Semantic Web: data on the web that are annotated by ontologies networked together by mappings. In this vision, the Semantic Web is a huge semantic peer data management system. In this paper, we describe the SomeRDFS peer data management systems that promote a "simple is beautiful" vision of the Semantic Web based on data annotated by RDFS ontologies.
Chantal Reynaud, Brigitte Safar, Haifa Zargayouna-
IEEE Transactions on Knowledge and Data Engineering 2007
This paper presents our first participation in the OAEI 2007 campaign. It describes an approach to align taxonomies which relies on terminological and structural techniques applied sequentially. We performed our method with various taxonomies using our prototype, TaxoMap. Previous experimental results were encouraging and demonstrate the relevance of this alignment approach. In this paper, we evaluate the strengths and the weaknesses of TaxoMap in the context of the OAEI campaign, where the ontologies to align differ from the taxonomies we are used to dealing with.
Chantal Reynaud, Brigitte Safar, Francois-Elie Calvier-
IEEE Transactions on Knowledge and Data Engineering 2007 October
To identify mappings between the concepts of two ontologies, much recent work focuses on the use of complementary knowledge, known as "background" or support knowledge, most often represented as a third ontology. The common objective is to complement classical matching techniques, which exploit the structure or the richness of the ontology representation language and which no longer apply when the ontologies to be matched are weakly structured or limited to simple classification hierarchies. This article presents a comparative analysis of works using support knowledge, starting with their common general scheme, followed by an analysis of the works according to the type of support knowledge used. We then study the problems encountered when the support knowledge is WordNet, and we show how our system TaxoMap overcomes these difficulties.
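A rough sketch of the anchoring step that this kind of background-knowledge matching relies on, using WordNet through nltk (the WordNet corpus must be downloaded first); the helper names are ours, not TaxoMap's.

```python
# Requires: pip install nltk, then nltk.download("wordnet").
from nltk.corpus import wordnet as wn

def anchors(label):
    """Candidate WordNet synsets anchoring a concept label."""
    return wn.synsets(label.replace(" ", "_"), pos=wn.NOUN)

def mapping_hint(label1, label2):
    """'isa' if some anchor of label1 is below an anchor of label2 in the
    hypernym hierarchy, 'inverse-isa' for the converse, else None."""
    for s1 in anchors(label1):
        up1 = set(s1.closure(lambda s: s.hypernyms()))
        for s2 in anchors(label2):
            if s2 in up1:
                return "isa"
            if s1 in set(s2.closure(lambda s: s.hypernyms())):
                return "inverse-isa"
    return None

print(mapping_hint("dog", "animal"))   # 'isa': dog is a kind of animal
```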
Chantal Reynaud, Brigitte Safar-
IEEE Transactions on Knowledge and Data Engineering 2007
The work described in this article aims at unifying access to the documents of an application domain. In particular, it increases the number of documents accessible from Web portals without modifying their query interface. Access to documents is assumed to rely on taxonomies, which we propose to align. This article focuses specifically on the structural techniques involved, which are original and specific in that they are suited to taxonomies whose structures are heterogeneous and dissymmetric. We then present and analyze the results of three experiments carried out on various taxonomies: the real taxonomies that motivated our approach, as well as test taxonomies made available to the researchers of the community.
Serge Abiteboul, Neoklis Polyzotis-
IEEE Transactions on Knowledge and Data Engineering 2007
Information ubiquity has created a large crowd of users (most
notably scientists), who could employ DBMS technology to process and
share their data more effectively. Still, this user base prefers to
keep its data in files that can be easily managed by applications such
as spreadsheets, rather than deal with the complexity and rigidity
of modern database systems.
In this paper, we propose a vision for enabling non-experts, such as
scientists, to build \emph{content sharing communities} in a true
database fashion: declaratively. The proposed infrastructure, called
the {\em data ring}, enables users to manage and share their data
with minimal effort; the user points to the data that should be
shared, and the data ring becomes responsible for automatically
indexing the data (to make it accessible), replicating it (for
availability), and reorganizing its physical storage (for better
query performance). We outline the salient features of our proposal,
and outline research challenges that must be addressed in order to
realize this vision.
Luc Segoufin, Michael Benedikt-
IEEE Transactions on Knowledge and Data Engineering 2007
This work deals with the expressive power of logics on finite structures with access to an additional ``arbitrary'' linear order. The queries that can be expressed this way are the order-invariant queries for the logic. For the standard logics used in computer science, such as first-order logic, it is known that access to an arbitrary linear order increases the expressiveness of the logic. However, when we look at the separating examples, we find that they have satisfying models whose Gaifman graph is complex -- unbounded in valence and in treewidth. We thus explore the expressiveness of order-invariant queries over graph-theoretically well-behaved structures. We prove that first-order order-invariant queries over strings and trees have no additional expressiveness over first-order logic in the original signature. We also prove new upper bounds on order-invariant queries over bounded-treewidth and bounded-valence graphs. Our results make use of a new technique of independent interest: the application of algebraic characterizations of definability to show collapse results.
Was presented at CSL 2005.
Chantal Reynaud, Nicolas Guelfi, Cedric Pruski-
IEEE Transactions on Knowledge and Data Engineering 2007
The evolution of Web information is of utmost importance in the design of good Web Information Systems applications. New emerging paradigms, like the Semantic Web, use ontologies for describing metadata and are defined, in part, to aid in Web evolution. In this chapter, we survey techniques for ontology evolution. After identifying the different kinds of evolution the Web is confronted with, we detail the various existing languages and techniques devoted to Web data evolution, with particular attention to Semantic Web concepts, and how these languages and techniques can be adapted to evolving data in order to improve the quality of Web Information Systems applications.
Chantal Reynaud, Nicolas Guelfi, Cedric Pruski-
IEEE Transactions on Knowledge and Data Engineering 2007
Ontologies, which represent domain knowledge in information systems, are effective in enhancing information retrieval. However, information evolves over time, and this evolution should also be expressible at the ontology level. Unfortunately, ontology evolution has barely been studied, and its basic principles have not yet been precisely defined according to our notion of evolution. In this paper, we follow a bottom-up approach consisting of a rigorous analysis of the evolution of a particular domain over a significant period of time (namely the WWW series of conferences over a decade) to highlight concrete domain knowledge evolutions. We then generalize and present a precise set of evolution features that ontology metamodels should offer. We also evaluate the modelling capabilities of OWL for representing these features and, finally, we show the contribution of ontology evolution support to improving Web information retrieval.
Chantal Reynaud, Brigitte Safar-
IEEE Transactions on Knowledge and Data Engineering 2007
After describing the general approach implemented in TaxoMap, we first present the aspects that favour its application on a large scale. We then focus on one of TaxoMap's techniques, specific to the matching of semantically poor ontological models. More precisely, we deal with the use of additional knowledge to compensate for the lack of knowledge contained in the aligned models. Finally, we carry out a comparative analysis of this technique with the state of the art.
Ioana Manolescu, Angela Bonifati, Andrei Arion, Andrea Pugliese-
IEEE Transactions on Knowledge and Data Engineering 2007 5 - 7(2)
XML compression has gained prominence recently because it counters the disadvantage of the
verbose representation XML gives to data. In many applications, such as data exchange and data
archiving, entirely compressing and decompressing a document is acceptable. In other applications,
where queries must be run over compressed documents, compression may not be beneficial since
the performance penalty in running the query processor over compressed data outweighs the data
compression benefits. While balancing the interests of compression and query processing has received significant attention in the domain of relational databases, these results do not immediately
translate to XML data.
In this article, we address the problem of embedding compression into XML databases without
degrading query performance. Since the setting is rather different from relational databases, the
choice of compression granularity and compression algorithms must be revisited. Query execution
in the compressed domain must also be rethought in the framework of XML query processing due
to the richer structure of XML data. Indeed, a proper storage design for the compressed data plays
a crucial role here.
The XQueC system (XQuery Processor and Compressor) covers a wide set of XQuery queries
in the compressed domain and relies on a workload-based cost model to perform the choices of
the compression granules and of their corresponding compression algorithms. As a consequence,
XQueC provides efficient query processing on compressed XML data. An extensive experimental assessment is presented, showing the effectiveness of the cost model, the compression ratios, and
the query execution times.
Marie-Christine Rousset, Chan Le Duc, Nhan Le Thanh-
IEEE Transactions on Knowledge and Data Engineering 2006 - 19(3):239--273
This paper introduces a compact representation which helps to avoid the exponential blow-up in space of the Least Common Subsumer (lcs) of two ALE-concept descriptions. Based on the compact representation we define a space of specific graphs which represents all ALE-concept descriptions including the lcs. Next, we propose an algorithm exponential in time and polynomial in space for deciding subsumption between concept descriptions represented by graphs in this space. These results provide a better understanding of the double exponential blow-up of the approximation of ALC-concept descriptions by ALE-concept descriptions: double exponential size of the approximation in the ordinary representation is unavoidable in the worst case.
Philippe Chatalic, Gia-Hien Nguyen, Marie-Christine Rousset-
IEEE Transactions on Knowledge and Data Engineering 2006 August:352--357
In a peer-to-peer inference system, there is no centralized control or hierarchical organization: each peer is equivalent in functionality and cooperates with other peers in order to solve a collective reasoning task. Since peer theories model possibly different viewpoints, even if each local theory is consistent, the global theory may be inconsistent. We exhibit a distributed algorithm detecting inconsistencies in a fully decentralized setting. We provide a fully distributed reasoning algorithm, which computes only well-founded consequences of a formula, i.e., with a consistent set of support.
Marie-Christine Rousset, Philippe Adjiman, Philippe Chatalic, Francois Goasdoué, Laurent Simon-
IEEE Transactions on Knowledge and Data Engineering 2006 January
In this paper, we describe the SomeWhere semantic peer-to-peer data management system that promotes a "small is beautiful" vision of the Semantic Web based on simple personalized ontologies (e.g., taxonomies of classes) but which are distributed at a large scale. In this vision of the Semantic Web, no user imposes to others his own ontology. Logical mappings between ontologies make possible the creation of a web of people in which personalized semantic marking up of data cohabits nicely with a collaborative exchange of data. In this view, the Web is a huge peer-to-peer data management system based on simple distributed ontologies and mappings.
Serge Abiteboul, Ioana Manolescu, Emanuel Taropa-
IEEE Transactions on Knowledge and Data Engineering 2006
Yannis Papakonstantinou, Veronique Benzaken, Ioana Manolescu, Andrei Arion, Ravi Vijay-
IEEE Transactions on Knowledge and Data Engineering 2006 :13-25
Query processing performance in XML databases can be greatly enhanced by the usage of materialized views whose content has been stored in the database. This requires a method for identifying query subexpressions matching the views, a process known as view-based query rewriting. This process is quite complex for relational databases, and all the more daunting on XML databases.
Current XML materialized view proposals are based on tree patterns, since query navigation is conceptually close to such patterns. However, the existing algorithms for extracting tree patterns from XQuery do not detect patterns across nested query blocks. Thus, complex, useful tree pattern views may be missed by the rewriting algorithm. We present a novel tree pattern extraction algorithm from XQuery queries, able to identify larger patterns than previous methods.
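For illustration, a small sketch of the kind of tree pattern such extraction produces and of how a pattern is matched against a document. The encoding of patterns as (label, children) pairs with child and descendant axes is an assumption of ours, not the paper's representation.

```python
import xml.etree.ElementTree as ET

# A pattern node is (label, [(axis, subpattern), ...]), axis in {'/', '//'}.
def matches(elem, pattern):
    label, children = pattern
    if elem.tag != label:
        return False
    for axis, sub in children:
        # '/' looks at the children, '//' at all strict descendants
        pool = list(elem) if axis == "/" else [d for c in elem for d in c.iter()]
        if not any(matches(e, sub) for e in pool):
            return False
    return True

doc = ET.fromstring("<a><b><c/></b><d/></a>")
pat = ("a", [("//", ("c", [])), ("/", ("d", []))])
print(matches(doc, pat))   # True: a has a descendant c and a child d
```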
Chantal Reynaud, Brigitte Safar, Hassen Kefi-
IEEE Transactions on Knowledge and Data Engineering 2006
Integrating heterogeneous information sources provides unified access without modifying their content. The source schemas are mapped so that a whole set of documents coming from multiple sources can be accessed through a single query system. The specification of these correspondences, or mappings, is thus at the heart of integration. We propose to use different taxonomy alignment techniques to automate their generation. These techniques have been implemented in a software tool, TaxoMap, which searches for mappings and, in case of failure, gives indications to help users specify them themselves. We present and discuss results from experiments carried out in the field of microbiology. We test TaxoMap on different taxonomies from various domains.
Serge Abiteboul, Neoklis Polyzotis-
IEEE Transactions on Knowledge and Data Engineering 2006
Because of information ubiquity, one observes an important trend
towards transferring information management tasks from database
systems to networks. We introduce the notion of Data Ring that can be
seen as a network version of a database or a content warehouse. A main
goal is to achieve better performance for content management without
requiring the acquisition of explicit control over information
resources. We discuss the main traits of Data Rings and argue that
Active XML provides an appropriate basis for such systems.
The collaborating peers that form the Data Ring are autonomous,
heterogeneous and their capabilities may greatly vary, e.g., from a
sensor to a large database. To support effectively this paradigm of
loose integration, the Data Ring enforces a seamless transition
between data and metadata and between explicit and intensional
data. It does not distinguish between data provided by web pages and
data provided by web services, between local (extensional) data and
external data obtained via a Web service call. This is achieved using
the Active XML technology that is based on exchanging XML documents
with embedded service calls both for the logical and physical data
model.
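A toy rendering of the Active XML idea, assuming a local dict of services instead of real Web service calls: service-call elements are materialized in place by invoking the named service. Names are ours, purely illustrative.

```python
import xml.etree.ElementTree as ET

SERVICES = {"getTemperature": lambda: "17"}   # hypothetical service registry

def materialize(elem):
    """Replace every <service-call name="..."/> child by the call result."""
    for i, child in enumerate(list(elem)):
        if child.tag == "service-call":
            result = ET.Element("result")
            result.text = SERVICES[child.get("name")]()
            elem[i] = result        # intensional data becomes extensional
        else:
            materialize(child)

doc = ET.fromstring(
    '<weather city="Paris"><service-call name="getTemperature"/></weather>')
materialize(doc)
print(ET.tostring(doc, encoding="unicode"))
# <weather city="Paris"><result>17</result></weather>
```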
Nathalie Pernelle, Fatiha Sais, Hélène Gagliardi, Ollivier Haemmerlé-
IEEE Transactions on Knowledge and Data Engineering 2006 june
This work aims at automatically enriching a thematic data warehouse from documents of various formats coming from the Web and containing tables. This enrichment is guided by an ontology comprising a set of terms and relations. The automatic character of our approach has led us to use a flexible format for representing the data (terms or relations) whose identification is partial, uncertain or multiple. This article focuses on the discovery, within tables, of unidentified relations, as well as on the discovery of attributes that can enrich or modify the interpretation of identified relations.
Marie-Christine Rousset, Francois Goasdou, Philippe Adjiman, Philippe Chatalic, Laurent Simon-
IEEE Transactions on Knowledge and Data Engineering 2006 - 25:269,314
In a peer-to-peer inference system, each peer can reason locally but can also solicit some of its acquaintances, which are peers sharing part of its vocabulary. In this paper, we consider peer-to-peer inference systems in which the local theory of each peer is a set of propositional clauses defined upon a local vocabulary. An important characteristic of peer-to-peer inference systems is that the global theory (the union of all peer theories) is not known (as opposed to partition-based reasoning systems). The main contribution of this paper is to provide the first
consequence finding algorithm in a peer-to-peer setting: it is anytime and computes consequences gradually from the solicited peer to peers that are more and more distant. We exhibit a sufficient condition on the acquaintance graph of the peer-to-peer inference system for guaranteeing the completeness of this algorithm. Another important contribution is to apply this general distributed reasoning setting to the setting of the Semantic Web through the SomeWhere semantic peer-to-peer data management system. The last contribution of this paper is to provide an experimental analysis of the scalability of the peer-to-peer infrastructure that we propose, on large
networks of 1000 peers.
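A centralized miniature of the consequence-finding task that the distributed algorithm performs peer by peer: saturate a clause set by resolution and keep the subsumption-minimal consequences over a target vocabulary. Integers encode propositional variables; this sketch is ours and is not the paper's anytime message-passing algorithm.

```python
from itertools import combinations

def resolve(c1, c2):
    """All non-tautological resolvents of two clauses (sets of int literals)."""
    for lit in c1:
        if -lit in c2:
            r = frozenset((c1 - {lit}) | (c2 - {-lit}))
            if not any(-l in r for l in r):
                yield r

def implicates(clauses, target):
    """Subsumption-minimal consequences built only from target variables."""
    clauses = {frozenset(c) for c in clauses}
    changed = True
    while changed:
        changed = False
        for c1, c2 in combinations(list(clauses), 2):
            for r in resolve(c1, c2):
                if r not in clauses and not any(c <= r for c in clauses):
                    clauses.add(r)
                    changed = True
    kept = {c for c in clauses if all(abs(l) in target for l in c)}
    return {c for c in kept if not any(d < c for d in kept)}

# Peer 1 knows a -> b, peer 2 knows b -> c; shared vocabulary is {a, c}.
print(implicates([{-1, 2}, {-2, 3}], target={1, 3}))  # {frozenset({-1, 3})}
```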
Thomas Schwentick, Mathias Samuelides, Mikolaj Bojanczyk, Luc Segoufin -
IEEE Transactions on Knowledge and Data Engineering 2006
Two variants of pebble tree-walking automata on trees are considered that were introduced in the literature. It is shown that for each number of pebbles, the two models have the same expressive power both in the deterministic case and in the nondeterministic case. Furthermore, nondeterministic (resp. deterministic) tree-walking automata with n+1 pebbles can recognize more languages than those with n pebbles. Moreover, there is a regular tree language that is not recognized by any tree-walking automaton with pebbles. As a consequence, FO+posTC is strictly included in MSO over trees.
Chantal Reynaud, Cedric Jacquiot, Yolaine Bourda, Fabrice Popineau, Alexandre Delteil-
IEEE Transactions on Knowledge and Data Engineering 2006 June - LNCS:131--140
This paper introduces GLAM, a system based on situation calculus and meta-rules, which is able to provide adaptation by means of selection of actions. It is primarily designed to provide adaptive navigation. The different levels of conception, related to different aspects of the available metadata, are split into different layers in GLAM, in order to ease the design of the adaptation system as well as to increase the potential use of complex adaptation mechanisms. GLAM uses meta-rules to handle these layers.
Chantal Reynaud, Brigitte Safar-
IEEE Transactions on Knowledge and Data Engineering 2006 Janvier
This work falls within the field of the integration of heterogeneous information sources. It targets the integration of a new XML source into an information server by exploiting an ontology, that is, an explicit and declarative description of the semantics of an application domain. We propose to automate the generation of the correspondences, or mappings, between the ontology and the source to be integrated. We show that this process relies not only on the schemas of the models to be connected but also on the associated data. A prototype, DetectHybridemap, developed in Java, has been used for experiments on real data provided by France Telecom R&D in the tourism domain.
Ioana Manolescu, Angela Bonifati, Andrei Arion, Andrea Pugliese-
IEEE Transactions on Knowledge and Data Engineering 2006 :1077-1078
Serge Abiteboul, Pierre Senellart-
IEEE Transactions on Knowledge and Data Engineering 2006 Mar
We present in this paper a new model for representing probabilistic information in a semi-structured (XML) database, based on the
use of probabilistic event variables. This work is motivated by the need of keeping track of both confidence and lineage of the information stored
in a semi-structured warehouse. For instance, the modules of a (Hidden Web) content warehouse may derive information concerning the semantics
of discovered Web services that is by nature not certain. Our model, namely the fuzzy tree model, supports both querying (tree pattern queries with join) and updating (transactions containing an arbitrary set of insertions and deletions) over probabilistic tree data. We highlight its expressive power and discuss implementation issues.
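A toy version of the event-variable idea, assuming independent events and a hypothetical 'events' attribute: the confidence of a node is the product of the probabilities of the events on its root path. Illustrative only, not the paper's fuzzy tree model.

```python
import xml.etree.ElementTree as ET

P = {"e1": 0.9, "e2": 0.5}     # probabilities of the independent events

def confidences(root):
    """Map each tag to the probability that all events on its path hold."""
    conf = {}
    def walk(elem, events):
        events = events + (elem.get("events") or "").split()
        prob = 1.0
        for e in events:
            prob *= P[e]       # independence: probabilities multiply
        conf[elem.tag] = prob
        for child in elem:
            walk(child, events)
    walk(root, [])
    return conf

doc = ET.fromstring('<db><service events="e1"><op events="e2"/></service></db>')
print(confidences(doc))        # {'db': 1.0, 'service': 0.9, 'op': 0.45}
```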
Serge Abiteboul, Victor Vianu, Luc Segoufin-
IEEE Transactions on Knowledge and Data Engineering 2006 - 31(1):208--254
We study the representation and querying of XML with incomplete
information. We consider a simple model for XML data and their DTDs,
a very simple query language, and a representation system for
incomplete information in the spirit of the representation systems
developed by Imielinski and Lipski for relational databases. In the
scenario we consider, the incomplete information about an XML document
is continuously enriched by successive queries to the document. We
show that our representation system can represent partial information
about the source document acquired by successive queries, and that it
can be used to intelligently answer new queries. We also consider the
impact on complexity of enriching our representation system or query
language with additional features. The results suggest that our
approach achieves a practically appealing balance between
expressiveness and tractability.
Francois Goasdou, Laurent Simon, Marie-Christine Rousset, Philippe Adjiman, Philippe Chatalic-
IEEE Transactions on Knowledge and Data Engineering 2006 October:698--703
In this invited talk, we present the SomeWhere approach and infrastructure
for building semantic
peer-to-peer data management systems based on simple personalized
ontologies distributed at a large scale.
Somewhere is based on a simple class-based data model in which the data
is a set of resource
identifiers (e.g., URIs), the schemas are (simple) definitions of
classes possibly constrained by
inclusion, disjunction or equivalence statements, and mappings are
inclusion, disjunction or equivalence
statements between classes of different peer ontologies.
In this setting, query answering over peers can be done by distributed
query rewriting, which can be
equivalently reduced to distributed consequence finding in propositional
logic.
It is done by using the message-passing distributed algorithm that we have
implemented for consequence finding
of a clause w.r.t a set of distributed propositional theories. We
summarize its main properties (soundness,
completeness and termination), and we report experiments showing
that it already scales up to a thousand peers.
Finally, we mention ongoing work on extending the current data model
to RDF(S) and on handling possible
inconsistencies between the ontologies of different peers.
Chantal Reynaud, Brigitte Safar, Hassen Kefi-
IEEE Transactions on Knowledge and Data Engineering 2006 October:39-40
This paper deals with taxonomy alignment and presents the structural techniques of an alignment method suitable for a dissymmetry in the structure of the mapped taxonomies. The aim is to allow uniform access to documents belonging to the same application domain, assuming retrieval of documents is supported by taxonomies. We applied our method to various taxonomies using our prototype TaxoMap.
Chantal Reynaud, Brigitte Safar-
IEEE Transactions on Knowledge and Data Engineering 2006 June
This paper deals with taxonomy alignment and presents structural techniques of an alignment method suitable for a dissymmetry in the structure of the mapped taxonomies. The aim is to allow uniform access to documents belonging to the same application domain, assuming retrieval of documents is supported by taxonomies. We applied our method to various taxonomies using our prototype, TaxoMap. Experimental results are given and analyzed to demonstrate the effectiveness of the alignment approach. We also comment on the test cases and sketch some ideas for improvements in order to widen the scope of the approach.
Yannis Papakonstantinou, Veronique Benzaken, Ioana Manolescu, Andrei Arion-
IEEE Transactions on Knowledge and Data Engineering 2006
The performance of XML database queries can be greatly enhanced by employing materialized
views. We present containment and rewriting algorithms for tree pattern queries that correspond
to a large and important subset of XQuery, in the presence of a structural summary of the database
(i.e., in the presence of a Dataguide). The tree pattern language captures structural identifiers and optional nodes, which allow us to translate nested XQueries into tree patterns. We characterize the complexity of tree pattern containment and rewriting, under the constraints expressed in
the structural summary, whose enhanced form also entails integrity constraints. Our approach is
implemented in the ULoad [5] prototype and we present a performance analysis.
Chantal Reynaud, Brigitte Safar, Hassen Kefi-
IEEE Transactions on Knowledge and Data Engineering 2006
This paper deals with the g
Francois Goasdou, Jean Charlet, Alain Leger, Johannes Heinecke, Lyndon J.B. Nixon, Pavel Shvaiko, Paola Hobson-
IEEE Transactions on Knowledge and Data Engineering 2006
Semantic Web technology is being increasingly applied in a large spectrum of applications in which domain knowledge is conceptualized and formalized (e.g., by means of an ontology) in order to support diversified and automated knowledge processing (e.g., reasoning) performed by a machine. Moreover, through an optimal combination of (cognitive) human reasoning and (automated) machine reasoning and processing, it is possible for humans and machines to share complementary tasks. The spectrum of applications is extremely large and to name a few: corporate portals and knowledge management, e-commerce, e-work, e-business, healthcare, e-government, natural language understanding and automated translation, information search, data and services integration, social networks and collaborative filtering, knowledge mining, business intelligence and so on. From a social and economic perspective, this emerging technology should contribute to growth in economic wealth, but it must also show clear cut value for everyday activities through technological transparency and efficiency. The penetration of Semantic Web technology in industry and in services is progressing slowly but accelerating as new success stories are reported. In this paper and lecture we present ongoing work in the cross-fertilization between industry and academia. In particular, we present a collection of application fields and use cases from enterprises which are interested in the promises of Semantic Web technology. The use cases are detailed and focused on the key knowledge processing components that will unlock the deployment of the technology in the selected application field. The paper ends with the presentation of the current technology roadmap designed by a team of Academic and Industry researchers.
Yannis Papakonstantinou, Veronique Benzaken, Ioana Manolescu, Andrei Arion-
IEEE Transactions on Knowledge and Data Engineering 2006 12
We have described a method for producing, during algebraic
XQuery optimization, tree patterns equivalent to increasingly
larger algebraic plans. The benefit of producing such patterns is to bridge the gap between the seemingly
disparate formalisms, and to enable translation of results on
containment and equivalence under structural constraints,
as well as twig cardinality estimation, directly to algebraic
plans. Our approach is fully implemented in ULoad [8] and
we have demonstrated its low overhead.
Chantal Reynaud, Brigitte Safar-
IEEE Transactions on Knowledge and Data Engineering 2006
This paper deals with taxonomy alignment. It presents structural techniques of an alignment method suitable for a dissymmetry in the structure of the mapped taxonomies. The aim is to allow uniform access to documents belonging to the same application domain, assuming retrieval of documents is supported by taxonomies.
D., P., L.-
IEEE Transactions on Knowledge and Data Engineering 2005 (43)
Nathalie Pernelle, Ollivier Haemmerlé, Fatiha Sais, Hélène Gagliardi-
IEEE Transactions on Knowledge and Data Engineering 2005 october:374--376
Yannis Papakonstantinou, Ioana Manolescu-
IEEE Transactions on Knowledge and Data Engineering 2005 April
Christine Froidevaux, Sarah Cohen-Boulakia, Susan Davidson-
IEEE Transactions on Knowledge and Data Engineering 2005
Biologists face two problems in interpreting their experiments: the
integration of their data with information from multiple heterogeneous
sources and data analysis with bioinformatics tools. It is difficult
for scientists to choose between the numerous sources and tools
without assistance. Following a thorough analysis of
scientists' needs during the querying process, we found that
biologists express preferences concerning the sources to be queried
and the tools to be used. Interviews also showed that the querying
process itself, namely the strategy followed, differs between scientists. In
response to these findings, we have introduced a user-centric
framework allowing various querying processes to be specified. Then we
have developed the BioGuide system which helps the scientists to
choose suitable sources and tools, find complementary information in
sources, and deal with divergent data. It is generic in that it can be
adapted by each user to provide answers respecting his/her
preferences, and obtained following his/her strategies.
Luc Segoufin, Thomas Schwentick, Anca Muscholl-
IEEE Transactions on Knowledge and Data Engineering 2005
An Active Context-Free Game is a game with two players I and II on strings over a finite alphabet. In each move, I selects a position of the current word and II rewrites the corresponding letter according to a rule of a context-free grammar. I wins if a string of the regular target language is reached. The complexity of deciding winning strategies for I is investigated, depending on properties of the grammar, of the target language, and on restrictions on the strategy.
Veronique Ventos, Henry Soldano-
IEEE Transactions on Knowledge and Data Engineering 2005 february:to appear
Veronique Ventos, Henry Soldano-
IEEE Transactions on Knowledge and Data Engineering 2005 (3403):298-313
Nathalie Pernelle, Ollivier Haemmerlé, Fatiha Sais, Hélène Gagliardi-
IEEE Transactions on Knowledge and Data Engineering 2005 july:64--71
Serge Abiteboul, Benjamin Nguyen, Gabriela Ruberg-
IEEE Transactions on Knowledge and Data Engineering 2005
A chapter in the book Managing and Processing Complex Data for Decision Support, J. Darmont and O. Boussaïd, eds.
Luc Segoufin, Anca Muscholl, Mathias Samuelides-
IEEE Transactions on Knowledge and Data Engineering 2005
We consider various kinds of deterministic tree-walking automata, with
and without pebbles, over ranked and unranked trees. For each such kind of
automata we show that there is an equivalent one which never loops. The main
consequence of this result is the closure under complementation of the
various types of automata we consider with a focus on the number of pebbles
used in order to complement the automata.
Serge Abiteboul, Ioana Manolescu, Nicoleta Preda-
IEEE Transactions on Knowledge and Data Engineering 2005 April:1122-1123
Gloria-Lucia Giraldo-Gomez-
IEEE Transactions on Knowledge and Data Engineering 2005
Serge Abiteboul, T. Milo, Z. Abrams, S. Haar-
IEEE Transactions on Knowledge and Data Engineering 2005
We study in this paper the diagnosis of distributed telecommunication systems. We show that (i) the problem can be modeled using Datalog programs, and (ii) it can benefit from the large battery of optimization techniques developed for Datalog. In particular, we show some surprising relationships between techniques used for solving efficiently the diagnosis problem of telecommunication systems and a known Datalog optimization technique, namely Query-sub-query (QSQ). We adapt QSQ to a distributed setting. We show that a simple ``generic'' use of this extended QSQ achieves an optimization as good as that previously provided by the dedicated diagnosis algorithms. Furthermore, we show that it allows solving efficiently a much larger class of system analysis problems.
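The flavor of goal-directed Datalog evaluation that QSQ formalizes, in miniature: bindings are propagated top-down from the goal and memoized, so only subqueries relevant to the initial binding are ever opened. The reachability program and the edge relation below are illustrative, not the paper's setting.

```python
EDGES = {("a", "b"), ("b", "c"), ("c", "d"), ("x", "y")}

def reach(x):
    """Answers to reach(x, Y) for the program
         reach(X,Y) :- edge(X,Y).
         reach(X,Y) :- edge(X,Z), reach(Z,Y).
    evaluated top-down: each subquery is opened (and memoized) at most once."""
    answers, opened, pending = set(), set(), {x}
    while pending:
        z = pending.pop()
        opened.add(z)
        for (u, v) in EDGES:
            if u == z:
                answers.add(v)
                if v not in opened:
                    pending.add(v)
    return answers

print(reach("a"))   # {'b', 'c', 'd'}; the irrelevant ('x','y') is never used
```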
Serge Abiteboul, Tova Milo, Xavier Leroy, Boris Vrdoljak, Roberto Di Cosmo, Stéfane Fermigier, Stéphane Laurière, Frédéric Lepied, Radu Pop, Florent Villard, Jean-Paul Smets, Ciarán Bryce, Klaus R. Dittrich, Assaf Sagi, Yotam Shtossel, Eleonora Panto-
IEEE Transactions on Knowledge and Data Engineering 2005
The open-source software community is now comprised of a very large and growing number of contributors and users. The GNU/Linux operating system for instance has an estimated 18 million users worldwide and its contributing developers can be counted by thousands. The critical mass
of contributors taking part in various open-source projects has helped to ensure high quality for open-source software. However, despite the achievements of the open-source software industry, there are issues in the production of large-scale open-source software (OSS) such as the GNU/Linux operating system that have to be addressed as the numbers of users, of contributors, and of available applications grow. EDOS is a European project supported by IST, started in October 2004 and ending in 2007, whose objective is to provide a new generation of methodologies, theoretical models, technical tools and quality models specifically tailored to OSS engineering and to software distribution over the Internet.
Marie-Christine Rousset, Alexandre Termier, Michèle Sebag, Kouzou Ohara, Takashi Washio, Hiroshi Motoda-
IEEE Transactions on Knowledge and Data Engineering 2005
In this paper, we present a new tree mining algorithm, DRYADEPARENT, based on the hooking principle first introduced in DRYADE. In the experiments, we demonstrate that the branching factor and depth of the frequent patterns to find are key factors of complexity for tree mining algorithms. We show that DRYADEPARENT outperforms the current fastest algorithm, CMTreeMiner, by orders of magnitude on datasets where the frequent patterns have a high branching factor.
Nathalie Pernelle, Marie-Christine Rousset, H Gagliardi, Ollivier Haemmerlé, Fatiha Sais, Damiano Migliori-
IEEE Transactions on Knowledge and Data Engineering 2005
Nathalie Pernelle, H Gagliardi, Ollivier Haemmerlé, Fatiha Sais-
IEEE Transactions on Knowledge and Data Engineering 2005 january - 2:407--419
This work aims at automatically building a thematic data warehouse from documents of various formats coming from the Web. The exploitation of this warehouse is ensured by a query engine based on an ontology. Our attention focuses more precisely on the tables extracted from these documents and converted into XML, with exclusively syntactic tags. This article presents the transformation of these tables, in XML form, into a semantically enriched formalism in which most tags and values are terms built from the ontology.
Serge Abiteboul, Bernd Amann, Tova Milo, Omar Benjelloun, Fred Dang Ngoc-
IEEE Transactions on Knowledge and Data Engineering 2005 March - 30(1):1-40
XML is becoming the universal format for data exchange between applications. Recently, the emergence
of Web services as standard means of publishing and accessing data on the Web introduced
a new class of XML documents, which we call intensional documents. These are XML documents
where some of the data is given explicitly while other parts are defined only intensionally by means
of embedded calls to Web services.
When such documents are exchanged between applications, one has the choice of whether or
not to materialize the intensional data (i.e., to invoke the embedded calls) before the document
is sent. This choice may be influenced by various parameters, such as performance and security
considerations. This article addresses the problem of guiding this materialization process.
We argue that
Amar-Djalil Mezaour-
IEEE Transactions on Knowledge and Data Engineering 2005 June 13-16 - ??:269-278
Ordinary sources, like databases and general-purpose document collections, seem insufficient and inadequate to meet the needs and requirements of the new generation of warehouses: thematic data warehouses. Knowing that more and more online thematic data is available, the web can be considered a useful data source for populating thematic data warehouses. To do so, the warehouse data supplier must be able to filter the heterogeneous web content to keep only the documents corresponding to the warehouse topic. Therefore, building efficient automatic tools to characterize web documents dealing with a given topic is essential to address the warehouse data acquisition issue. In this paper, we present our filtering approach implemented in an automatic tool called "eDotFilter". This tool is used to filter crawled documents to keep only the documents dealing with food risk. These documents are then stored in a thematic warehouse called "eDot". Our filtering approach is based on "WeQueL", a declarative web query language that improves the expressive power of keyword-based queries.
Pierre Senellart-
IEEE Transactions on Knowledge and Data Engineering 2005 April
We present in this paper a method to discover the set of webpages contained in a logical website, based on the link structure of the Web graph. Such a method is useful in the context of Web archiving and website importance computation. To identify the boundaries of a website, we combine the use of an online version of the preflow-push algorithm, an algorithm for the maximum flow problem in traffic networks, and of the Markov CLuster (MCL) algorithm.
Pierre Senellart-
IEEE Transactions on Knowledge and Data Engineering 2005 July:124--129
We present in this paper a method to discover the set of webpages contained in a logical website, based on the link structure of the Web graph. Such a method is useful in the context of Web archiving and website importance computation. To identify the boundaries of a website, we combine the use of an online version of the preflow-push algorithm, an algorithm for the maximum flow problem in traffic networks, and of the Markov CLuster (MCL) algorithm. The latter is used on a crawled portion of the Web graph in order to build a seed of initial webpages, a seed which is extended using the former. An experiment on a subsite of the INRIA Website is described.
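A compact sketch of the Markov CLuster step mentioned above: alternate expansion (matrix power) and inflation (entrywise power followed by column normalization) on the graph's transition matrix. Parameter values and the convergence criterion are simplified assumptions of ours.

```python
import numpy as np

def mcl(adj, e=2, r=2, iters=50):
    m = adj + np.eye(len(adj))               # self-loops help convergence
    m = m / m.sum(axis=0)                    # make columns stochastic
    for _ in range(iters):
        m = np.linalg.matrix_power(m, e)     # expansion: flow spreads
        m = m ** r                           # inflation: strong flows win
        m = m / m.sum(axis=0)
    # read clusters off the rows that kept some mass (the attractors)
    clusters = {tuple(np.nonzero(row > 1e-6)[0]) for row in m if row.sum() > 1e-6}
    return sorted(sorted(int(i) for i in c) for c in clusters)

# Two triangles joined by one bridge edge: expect two clusters.
adj = np.array([[0,1,1,0,0,0],[1,0,1,0,0,0],[1,1,0,1,0,0],
                [0,0,1,0,1,1],[0,0,0,1,0,1],[0,0,0,1,1,0]], float)
print(mcl(adj))   # e.g. [[0, 1, 2], [3, 4, 5]]
```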
Pierre Senellart, Matts Attn, Jean Senellart-
IEEE Transactions on Knowledge and Data Engineering 2005 September
A general mature rule-based MT system is bound to reach a saturation point because of the intrinsic complexity of the natural language description. For such systems, maintenance is a complex task and customization is expensive and time-consuming. Therefore, improving the system's interaction with the linguistic rules has proven to be more productive than endlessly adding new rules and exceptions to reach the theoretical accuracy level of 100%. In this paper, we describe our strategy to "open up" such a system and provide practical details on how this was done on SYSTRAN's classical engines. This approach is the foundation of the SYSTRAN version 5 translation engines. We show the immediate benefits of the evolution of our architecture and the new extension potentiality.
Marie-Christine Rousset, Ollivier Haemmerlé, Damiano Migliori-
IEEE Transactions on Knowledge and Data Engineering 2005
In this paper we present a method for integrating and querying XML data in a relational setting. Though this method is generic, it has been motivated and validated by a knowledge management application in microbiology: the e.dot project.
Boris Vrdoljak, Marko Banek, Zoran Skocir-
IEEE Transactions on Knowledge and Data Engineering 2005 - 1:289-295
A data warehouse is a database that collects and integrates data from heterogeneous sources in order to support a decision-making process. Data exchanged over the Internet and intranets has recently become an important data source, with XML as a standard format for exchange. The possibility of integrating available XML data into data warehouses plays an important role in providing enterprise managers with up-to-date and
relevant information about their business domain. We have developed a methodology for data warehouse design from the source XML Schemas and conforming XML documents. As XML data is semi-structured, data warehouse design from XML brings many particular challenges. In this paper the final steps of deriving a conceptual multidimensional scheme are described, followed by the logical design, where a set of tables is created according to the derived conceptual scheme. A prototype tool
has been developed to test and verify the proposed methodology.
Ioana Manolescu, Loredana Afanasiev, Philippe Michiels-
IEEE Transactions on Knowledge and Data Engineering 2005 September - 3671:144-161
Stefano Ceri, Marco Brambilla, Ioana Manolescu, Sara Comai and Piero Fraternali-
IEEE Transactions on Knowledge and Data Engineering 2005 August - 5(3)
Ioana Manolescu, Angela Bonifati, Andrei Arion, Andrea Pugliese-
IEEE Transactions on Knowledge and Data Engineering 2005 November
We study the applicability of XML path summaries in the context of current-day XML databases. We find that summaries provide an excellent basis for optimizing data access methods, which furthermore mixes very well with a path-partitioned store. We provide practical algorithms for building and exploiting summaries, and prove their benefits through extensive experiments.
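A minimal construction of such a path summary: one entry per distinct root-to-element label path, with its occurrence count. The real summaries also carry the statistics needed for cost estimation; this sketch is ours.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def path_summary(root):
    """One entry per distinct root-to-element label path, with its count."""
    counts = defaultdict(int)
    def walk(elem, prefix):
        path = prefix + "/" + elem.tag
        counts[path] += 1
        for child in elem:
            walk(child, path)
    walk(root, "")
    return dict(counts)

doc = ET.fromstring("<site><people><person/><person/></people></site>")
print(path_summary(doc))
# {'/site': 1, '/site/people': 1, '/site/people/person': 2}
```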
Amar-Djalil Mezaour, Alphonse, Ahmed Amrani, Jerome Aze, Thomas Heitz, Mathieu Roche-
IEEE Transactions on Knowledge and Data Engineering 2005 june
Serge Abiteboul, Pierre Senellart-
IEEE Transactions on Knowledge and Data Engineering 2005 Dec
We present in this paper a new model for representing probabilistic information in a semi-structured (XML) database, based on the use of probabilistic event variables. This work is motivated by the need of keeping track of both confidence and lineage of the information stored
in a semi-structured warehouse. For instance, the modules of a (Hidden Web) content warehouse may derive information concerning the semantics of discovered Web services that is by nature not certain. Our model supports both querying (tree pattern queries with join) and updating (transactions containing an arbitrary set of insertions and deletions) over probabilistic tree data. We show that the model is as expressive as the impractical possible worlds model and that it is more expressive than a simple model where probability (confidence) is stored at each node of
the tree. We briefly discuss implementation issues.
Amar-Djalil Mezaour-
IEEE Transactions on Knowledge and Data Engineering 2005
Serge Abiteboul, Tova Milo, Omar Benjelloun-
IEEE Transactions on Knowledge and Data Engineering 2005
We consider here the exchange of Active XML (AXML) data, i.e., XML documents where some of the data is given explicitly while other parts are given only intensionally as calls to Web services. Peers exchanging AXML data agree on a {\em data exchange schema} that specifies in particular which parts of the data are allowed to be intensional. Before sending a document, a peer may need to {\em rewrite} it to match the agreed data exchange schema, by calling some of the services and {\em materializing} their data. Previous works showed that the rewriting problem is undecidable in the general case and of high complexity in some restricted cases. We argue here that this difficulty is somewhat artificial. Indeed, we study what we believe to be a more adequate, from a practical viewpoint, rewriting problem that is (1) in the spirit of standard 1-unambiguity constraints imposed on XML schemas and (2) can be solved by a single pass over the document with a computational device not stronger than a finite state automaton. Following previous works, we focus on the core of the problem, i.e., on the problem on {\em words}. The results may be extended to (A)XML trees in a straightforward manner.
Luc Segoufin, Michael Benedikt-
IEEE Transactions on Knowledge and Data Engineering 2005
The conference version was incorrect and has been removed. The file is the journal version.
Marie-Christine Rousset, Francois Goasdou, Philippe Adjiman, Philippe Chatalic, Laurent Simon-
IEEE Transactions on Knowledge and Data Engineering 2005
In peer-to-peer inference systems, each peer can reason locally but also solicit some of its acquaintances, sharing part of its vocabulary. This paper studies both theoretically and experimentally the problem of computing proper prime implicates for propositional peer-to-peer systems, the global theory (union of all peer theories) of which is not known (as opposed to partition-based reasoning).
Serge Abiteboul, Tova Milo, Haim Kaplan, Bogdan Cautis, Amos Fiat-
IEEE Transactions on Knowledge and Data Engineering 2005
We consider the secure exchange of information in a multi-party
environment, where intermediaries can modify data and queries in a
controlled way, e.g., by filtering out some information or
strengthening some query. We propose a data exchange model that
permits the specification of admissible updates, along with an
array of cryptography-based techniques to validate updates and to
guarantee the provenance of data and queries in different
scenarios. In particular, we consider the cases where one insists
on exposing or hiding the identity of peers that performed
changes, and the changes they performed. We also consider the
packaging of signatures with the data (to be able to verify it)
and interactive scenarios where validity is proved via message
exchange.
The underlying thesis developed in the paper is that a newly
proposed model, consisting of XML documents with embedded calls to
Web services (namely Active XML), is an appropriate paradigm to
manage secured data exchange in a distributed setting. We show
that Active XML enables addressing in a uniform manner, in a
dynamic and multi-party setting, issues such as provenance,
authenticity of (portions of) documents, access control, and query
processing; issues typically ignored or considered separately by
traditional data exchange models.
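One ingredient of such schemes, in miniature: a Merkle-style digest of an XML tree, so that a signature over the root digest authenticates the whole document and any unauthorized change is detected. The real model additionally handles admissible updates and identity hiding; this sketch is ours.

```python
import hashlib
import xml.etree.ElementTree as ET

def digest(elem):
    """Merkle-style digest: hash of tag, text and children digests in order."""
    h = hashlib.sha256()
    h.update(elem.tag.encode())
    h.update((elem.text or "").encode())
    for child in elem:
        h.update(digest(child))
    return h.digest()

doc = ET.fromstring("<order><item>book</item><qty>2</qty></order>")
signed = digest(doc)               # a peer would sign this root digest

doc.find("qty").text = "3"         # an intermediary tampers with the data
print(digest(doc) == signed)       # False: the change is detected
```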
Serge Abiteboul, Ioana Manolescu, Nicoleta Preda-
IEEE Transactions on Knowledge and Data Engineering 2005 October
The current development of Web standards and technologies has brought new opportunities for large-scale integration of Web resources: data sources (such as XML, HTML, or PDF files), distributed applications (typically accessed via Web Services), and semantic information (represented in formats such as OWL or RDF). At the same time, new peer-to-peer platforms are being developed, and increasingly used for data management at the network scale. We present KADOP, a model, architecture, and system for sharing content in a peer-to-peer environment. By building on distributed hash table technology, XML indexing and query optimization techniques, and on the paradigm of active documents, KADOP enables the publication, discovery, and efficient exploitation of content in a P2P environment. In some sense, with KADOP, the network can be seen as a content warehouse (for a domain of interest) as well as an entry point to external resources, in mediator style. The system presented here has been implemented and is being tested. A novel aspect of the work is the use of active documents to: (i) automatically maintain and refresh the distributed index, (ii) greatly reduce its overall space occupancy, (iii) provide bridges to other data repositories, such as databases or other peer-to-peer networks, and (iv) dynamically combine data-centric Web services to answer complex queries. Keywords: XML, Active XML, data integration, Web services, DHT, intensional indexing, XML query processing.
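The distributed-hash-table substrate mentioned above, reduced to its core: consistent hashing, where each key is stored on the first peer clockwise from its hash, so peers can locate content without a central index. The class and peer names are ours, not KADOP's.

```python
import bisect
import hashlib

def h(s):
    return int(hashlib.sha1(s.encode()).hexdigest(), 16)

class Ring:
    """Consistent-hashing ring: a key lives on the first peer at or after
    its hash, wrapping around; adding a peer only moves nearby keys."""
    def __init__(self, peers):
        self.ring = sorted((h(p), p) for p in peers)

    def lookup(self, key):
        i = bisect.bisect(self.ring, (h(key), ""))
        return self.ring[i % len(self.ring)][1]

ring = Ring(["peer1", "peer2", "peer3"])
for k in ["doc.xml", "index/person", "service/wsdl"]:
    print(k, "->", ring.lookup(k))
```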
Marie-Christine Rousset, Francois Goasdou, Philippe Adjiman, Philippe Chatalic, Laurent Simon-
IEEE Transactions on Knowledge and Data Engineering 2005
In this paper, we describe the SomeWhere semantic peer-to-peer data management system that promotes a "small is beautiful" vision of the Semantic Web based on simple personalized ontologies (e.g., taxonomies of classes) but which are distributed at a large scale. In this vision of the Semantic Web, no user imposes to others his own ontology. Logical mappings between ontologies make possible the creation of a web of people in which personalized semantic marking up of data cohabits nicely with a collaborative exchange of data. In this view, the Web is a huge peer-to-peer data management system based on simple distributed ontologies and mappings.
Pierre Senellart, Jean Senellart-
IEEE Transactions on Knowledge and Data Engineering 2005 November
We present the SYSTRAN Translation Stylesheets (STS) used by the SYSTRAN machine translation engines that power the most prominent online translation services on the Web today (including Google, Yahoo!, AltaVista's BabelFish, and AOL). SYSTRAN develops and markets the world's most widely used machine translation software which powers products and solutions for the desktop, network infrastructures and the Internet that facilitate communication in 40 language combinations and in 20 specialized domains. The choice of leading global corporations, search engines and governments, SYSTRAN's software is applied across diverse best-practice solutions for intra-company communications, content management, online customer support, eCommerce, email systems, chat, and more. SYSTRAN's expertise spans over three decades of building customized translation solutions through open and robust architectures.
XSL Transformation stylesheets are usually used to transform a document described in an XML formalism into another XML formalism, to modify an XML document, or to publish content stored into an XML document to a publishing format (XSL-FO, (X)HTML...). SYSTRAN Translation Stylesheets (STS) use XSLT to drive and control the machine translation of XML documents (native XML document formats or XML representations - such as XLIFF - of other kinds of document formats).
STS do not only provide a simple way to indicate which part of the document text is to be translated, but also enable the fine-tuning of translation, especially by using the structure of the document to help disambiguate natural language semantics and determine proper context. For instance, the phrase "Access From Front Door" is to be analyzed as "The access from front door" within a title, and as "Do access (something) from front door" in the text body. In that case, the STS would pass a title option to the translation engine. The stylesheet can activate specialized domain dictionaries for some parts of the document and can mark some expressions as not to be translated, in the same manner.
Another key application of STS is to consider machine translation as part of the authoring and publishing process: source documents can be annotated with natural language markup produced by the author, markup which will be processed by STS to improve the quality of translation, the gateway to the automatic publishing of a multilingual website from a monolingual (annotated) source. This composition, inside the publishing process, is a real breakthrough for Web content translation. Traditionally, machine translation is applied after publishing and does not have access to the original structure of the document, but only to its HTML representation.
The mechanism is implemented through XSLT extension functions. In particular, the stylesheet uses a systran:translate function to translate an XML fragment, and systran:getValue/systran:pushValue/systran:popValue functions for consulting and for setting linguistics options in the translation engine. Proper management of character properties is also provided so that, for instance, the translation of a phrase in bold font will appear in bold font, even if the phrase has moved within the translated sentence.
This process is highly customizable by the addition of new templates into the stylesheets. Because the translation is driven by the document structure, it is easier to leverage this structure during the translation process and to keep it in the translated document, than with traditional document filters, which process the entire document linearly.
STS are part of SYSTRAN's commercial version 5.0 product line and are also used for and by specific corporate-customers.
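A stdlib approximation of the mechanism described above, with a placeholder translate() standing in for the systran:translate extension function: the document structure decides what gets translated and with which options. None of this is SYSTRAN code.

```python
import xml.etree.ElementTree as ET

def translate(text, title=False):
    """Placeholder for the MT engine call; a 'title' option changes analysis."""
    return f"[{'TITLE' if title else 'BODY'}:{text}]"

def apply_sts(elem, in_title=False):
    in_title = in_title or elem.tag == "title"
    if elem.get("translate") != "no" and elem.text and elem.text.strip():
        elem.text = translate(elem.text, title=in_title)
    for child in elem:
        apply_sts(child, in_title)

doc = ET.fromstring(
    "<doc><title>Access From Front Door</title>"
    "<p>Do access it from the front door.</p>"
    "<p translate='no'>SYSTRAN</p></doc>")
apply_sts(doc)
print(ET.tostring(doc, encoding="unicode"))
```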
Serge Abiteboul, Tova Milo, Omar Benjelloun-
IEEE Transactions on Knowledge and Data Engineering 2005 10
This paper provides an overview of the Active XML project developed at
INRIA over the past three years. Active XML (AXML, for short) is a
declarative framework that harnesses web services for data
integration, and is put to work in a peer-to-peer architecture. An
AXML document is an XML document that may contain calls to Web
services. An AXML "peer" is a repository of AXML documents. It acts
both as a client, invoking the Web services embedded in the documents,
and as a server, providing Web services defined as queries over the
active documents it contains. The approach gracefully combines stored
information with data defined in an intensional manner as well as
dynamic information. This simple, rather classical idea leads to a
number of technically challenging problems, both theoretical and
practical, which will be overviewed in this paper. We will describe
the AXML model and language, overview the research results obtained in
the course of the project, and show how the various pieces are
assembled together in the system implementation.
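A toy sketch of the core idea (ours, not the AXML system's actual API): an XML document embeds call elements, and a "peer" materializes the embedded calls on demand. The namespace, the service registry and the document shape are assumptions made for the example:

import xml.etree.ElementTree as ET

AXML_NS = "{http://example.com/axml}"        # hypothetical namespace
services = {"getTemp": lambda city: "17"}    # stand-in for a remote Web service

def materialize(elem):
    # Replace every embedded axml:call element by the data it returns.
    for child in list(elem):
        if child.tag == AXML_NS + "call":
            result = ET.Element(child.get("result", "data"))
            result.text = services[child.get("service")](child.get("arg"))
            elem.remove(child)
            elem.append(result)
        else:
            materialize(child)

doc = ET.fromstring(
    '<city xmlns:axml="http://example.com/axml" name="Paris">'
    '<axml:call service="getTemp" arg="Paris" result="temp"/></city>')
materialize(doc)
print(ET.tostring(doc, encoding="unicode"))
# -> <city name="Paris"><temp>17</temp></city>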
Serge Abiteboul, Hector Garcia-Molina, J. Widom, J. D. Ullman, Stefano Ceri, Rakesh Agrawal, Philip A. Bernstein, Michael J. Carey, W. Bruce Croft, David J. DeWitt, Michael J. Franklin, Dieter Gawlick, Jim Gray, Laura M. Haas, Alon Y. Halevy, Joseph M. Hellerstein, Yannis E. Ioannidis, Martin L. Kersten, Michael J. Pazzani, Michael Lesk, David Maier, Jeffrey F. Naughton, Hans-J Schek, Timos K. Sellis, R. Silberschatz, M. Stonebraker, R. T. Snodgrass, G. Weikum, S. Zdonik-
IEEE Transactions on Knowledge and Data Engineering 2005 - 48(5):111-118
A group of senior database researchers gathers every few years to assess the state of database research and to point out
problem areas that deserve additional focus. This report summarizes the discussion and conclusions of the sixth ad-hoc
meeting held May 4-6, 2003 in Lowell, Mass. It observes that information management continues to be a critical component
of most complex software systems. It recommends that database researchers increase focus on: integration of text, data, code,
and streams; fusion of information from heterogeneous data sources; reasoning about uncertain data; unsupervised data
mining for interesting correlations; information privacy; and self-adaptation and repair.
Luc Segoufin, Georg Gottlob, Christoph Koch, Reinhard Pichler-
IEEE Transactions on Knowledge and Data Engineering 2005 - 52(2):284--335
We study the complexity of two central XML processing problems. The first is XPath 1.0 query processing, which has been shown to be in PTime in previous work. We prove that both the data complexity and the query complexity of XPath 1.0 fall into lower (highly parallelizable) complexity classes, while the combined complexity is PTime-hard. Subsequently, we study the sources of this hardness and identify a large and practically important fragment of XPath 1.0 for which the combined complexity is LogCFL-complete and, therefore, in the highly parallelizable complexity class NC2. The second problem is the complexity of validating XML documents against various typing schemes like Document Type Definitions (DTDs), XML Schema Definitions (XSDs), and tree automata, both with respect to data and to combined complexity. For data complexity, we prove that validation is in LogSpace and depends crucially on how XML data is represented. For the combined complexity, we show that the complexity ranges from LogSpace to LogCFL, depending on the typing scheme.
D. Le Berre-
IEEE Transactions on Knowledge and Data Engineering 2005 :343--378
SAT Competition 2002 was held in March-May 2002 in conjunction with SAT 2002 (the Fifth International Symposium on the Theory and Applications of Satisfiability Testing). About 30 solvers and 2300 benchmarks took part in the competition, which required more than 2 CPU years to complete the evaluation. In this report, we give the results of the competition, try to interpret them, and give suggestions for future competitions.
Veronique Ventos, Henry Soldano-
IEEE Transactions on Knowledge and Data Engineering 2005 (19):799--827
Veronique Benzaken, Ioana Manolescu, Andrei Arion, Ravi Vijay-
IEEE Transactions on Knowledge and Data Engineering 2005 August:1330-1333
Ioana Manolescu, Angela Bonifati, Andrei Arion, Andrea Pugliese-
IEEE Transactions on Knowledge and Data Engineering 2005 Octobre - 2
Chantal Reynaud, Brigitte Safar, H Gagliardi-
IEEE Transactions on Knowledge and Data Engineering 2005
Building an information server, specific to a given domain, over distributed and heterogeneous information sources, is based on knowledge describing the server domain, called a domain ontology. We address the issue of representing such ontologies in the framework of the PICSEL research project. Two steps are described. The first one is directed by the representation formalism and the functionalities of the information server. The second one aims at refining and optimizing the knowledge obtained in step one. It is guided by the way the functionalities of the server are implemented. The examples in the paper come from the tourism products domain.
Victor Vianu, Luc Segoufin-
IEEE Transactions on Knowledge and Data Engineering 2005
We investigate the question of whether a query Q can be answered using a set V
of views. We first define the problem in information-theoretic terms: we say
that V determines Q if V provides enough information to uniquely determine the
answer to Q. Next, we look at the problem of rewriting Q in terms of V using a
specific language. Given a view language VV and query language QQ, we say
that a rewriting language RR is complete for VV-to-QQ rewritings if every Q in
QQ can be rewritten in terms of V in VV using a query in RR, whenever V
determines Q. While query rewriting using views has been extensively
investigated for some specific languages, the connection to the
information-theoretic notion of determinacy, and the question of completeness
of a rewriting language, have received little attention. In this paper we
investigate systematically the notion of determinacy and its connection to
rewriting. The results concern decidability of determinacy for various view
and query languages, as well as the power required of complete rewriting
languages. We consider languages ranging from first-order to conjunctive
queries.
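A minimal illustration of the two notions (our example, not one from the paper):
\[
V(x) \equiv \exists y\, R(x,y), \qquad Q \equiv \exists x\, \exists y\, R(x,y).
\]
Here $V$ determines $Q$: any two databases with the same answer to $V$ agree on whether that answer is empty, and hence on $Q$. Moreover $Q$ admits the rewriting $Q \equiv \exists x\, V(x)$, so for this pair a conjunctive rewriting language already suffices.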
Veronique Benzaken, Ioana Manolescu, Andrei Arion-
IEEE Transactions on Knowledge and Data Engineering 2005 June
Serge Abiteboul, Antonella Poggi-
IEEE Transactions on Knowledge and Data Engineering 2005
Data integration is the problem of combining data residing at different
sources, and providing the user with a virtual view, called global schema, which
is independent from the model and the physical origin of the sources. Whereas
many data integration systems and theoretical works have been proposed for relational
data, not much investigation has focused yet on XML data integration.
Our goal is therefore to address some of its related issues. In particular, we highlight
two major issues that emerge in the XML context: (i) the global schema may
be characterized by a set of constraints, expressed by means of a DTD and XML
integrity constraints, (ii) the concept of node identity requires the introduction of semantic
criteria to identify nodes coming from different sources. We propose a formal
framework for XML data integration systems based on an expressive XML global
schema, a set of XML data sources and a set of mappings specified by means of
a simple tree language. Then, we define an identification function that aims at
globally identifying nodes coming from different sources. Finally, we propose
algorithms to answer queries under different assumptions for the mappings.
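One plausible shape for such an identification function (a hedged sketch of our own; the paper's actual definition may differ) is a Skolem-style identifier computed from a node's tag and designated key values, so that nodes from different sources describing the same real-world entity collapse to one global node:

import hashlib

KEYS = {"book": ["isbn"], "author": ["name"]}   # assumed keys per element tag

def global_id(tag, key_values):
    # Nodes with the same tag and the same key values get the same global
    # identity, whatever their source, and can therefore be merged.
    raw = tag + "|" + "|".join(key_values)
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()[:12]

# Two sources describing the same book yield one identity:
assert global_id("book", ["2-1234-5678-9"]) == global_id("book", ["2-1234-5678-9"])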
Benjamin Nguyen, Ioana Manolescu, Francois-Xavier Dudouet, Pierre Senellart-
IEEE Transactions on Knowledge and Data Engineering 2005 October
In this article, we describe a novel application of XML and Web based technologies: a sociological study of the W3C standardization process. We propose a new methodology and tools, to be used by sociologists to study the standardization process, illustrated by the W3C XQuery Working Group. The novelty of our approach has many facets. Information Technology (IT) has received little attention from sociologists, yet the standardization of the Web is a crucial issue, both economical and political, based on the use of a semi-structured content warehouse. We introduce a modeling and querying approach of an XML content warehouse, and show it produces high added-value information. This information is used to conduct a preliminary sociological analysis of the XQuery standardization process.
Benjamin Nguyen, Ioana Manolescu, Francois-Xavier Dudouet, Pierre Senellart-
IEEE Transactions on Knowledge and Data Engineering 2005 April
In this article, we describe a novel application of XML and Web based technologies: a sociological
study of the W3C standardization process. We propose a new methodology and tools, to be used by
sociologists to study the standardization process, illustrated by the W3C XQuery Working Group. The novelty of our approach has many facets. Information Technology (IT) has received little attention from sociologists, yet the standardization of the Web is a crucial issue, both economical and political, based on the use of a semi-structured content warehouse. We introduce a modeling and querying approach of an XML content warehouse, and show it produces high added-value information. This information is used to conduct a preliminary sociological analysis of the XQuery standardization process.
Yannis Papakonstantinou, Ioana Manolescu-
IEEE Transactions on Knowledge and Data Engineering 2005 April:1143-1143
Serge Abiteboul, Ioana Manolescu, Nicoleta Preda, Benjamin Nguyen-
IEEE Transactions on Knowledge and Data Engineering 2004 December - 3493:358-371
This article presents our work within the INEX 2004 Heterogeneous Track. We focused on taming the structural diversity within the INEX heterogeneous bibliographic corpus. We demonstrate how semantic models and associated inference techniques can be used to solve the problems raised by the structural diversity within a given XML corpus. The first step automatically extracts a set of concepts from each class of INEX heterogeneous documents. A unified set of concepts is then computed, which synthesizes the interesting concepts from the whole corpus. Individual corpora are connected to the unified set of concepts via conceptual mappings. This approach is implemented as an application of the KadoP platform for peer-to-peer warehousing of XML documents. While this work caters to the structural aspects of XML information retrieval, the extensibility of the KadoP system makes it an interesting test platform into which components developed by several INEX participants could be plugged, exploiting the opportunities of peer-to-peer data and service distribution.
Luc Segoufin, Thomas Schwentick, Anca Muscholl-
IEEE Transactions on Knowledge and Data Engineering 2004
An Active Context-Free Game is a game with two players (Romeo and Juliet) on strings over a finite alphabet. Its rules are defined by a context-free grammar. In each move, Juliet selects a position of the current word and Romeo rewrites the corresponding letter according to a rule of the grammar. Juliet wins if a string of the regular target language is reached. We consider the complexity of deciding and extracting winning strategies for Juliet depending on properties of the grammar, of the regular target language, and on restrictions on the strategy.
Serge Abiteboul, Tova Milo, Omar Benjelloun-
IEEE Transactions on Knowledge and Data Engineering 2004 - 3055:17--27
XML and Web services are revolutionizing the automatic management of
distributed information, somewhat in the same way that HTML, Web
browsers and search engines modified human access to worldwide
information. We argue in this paper that the combination of XML and
Web services allows for a novel query answering paradigm, where the
exchanged information mixes materialized and intensional, active,
information.
We illustrate the flexibility of this approach by presenting Active
XML that is based on embedding Web service calls in XML data. We
discuss advantages of an approach of query answering based on
intensional and active data.
Serge Abiteboul, Tova Milo, Omar Benjelloun, Bogdan Cautis-
IEEE Transactions on Knowledge and Data Engineering 2004
XML and Web services are revolutionizing the automatic management of distributed information, somewhat in the same way that HTML, Web browsers and search engines modified human access to worldwide information. We argue in this paper that the combination of XML and Web services allows for a novel distributed data management paradigm, where the exchanged information mixes materialized and intensional, active, information.
We illustrate the flexibility of this approach by presenting Active XML, a language that is based on embedding Web service calls in XML data. We focus
on two particular issues, namely security and access control.
Omar Benjelloun-
IEEE Transactions on Knowledge and Data Engineering 2004 September
This thesis introduces Active XML (AXML, for short), a declarative
framework that harnesses Web services for distributed data management,
and is put to work in a peer-to-peer architecture.
An AXML document is an XML document that may contain embedded calls to
Web services, whose invocation enriches the document. An AXML service
is a Web service that exchanges AXML documents. An AXML "peer" is a
repository of AXML documents. On the one hand, it acts as a client, by
invoking the service calls embedded in its documents. On the other
hand, a peer acts as a server, by providing AXML services that can be
declaratively specified as queries or updates over the AXML documents
of its repository. The AXML approach allows for gracefully combining
stored information with data defined in an intensional manner (as
service calls). The fact that AXML peers can exchange a mix of
materialized and intensional data (via AXML documents) leads to a very
powerful distributed data management paradigm.
The AXML approach leads to a number of important problems that are
studied in the thesis. First, we address the issue of controlling the
exchange of AXML data. We propose to use declarative schema
specifications, and provide algorithms to statically enforce
them. Second, we propose techniques for the "lazy evaluation" of
queries on AXML documents, that detect which embedded service calls
may contribute to query answers.
An implementation of AXML peers compliant with W3C standards is also
described in the thesis.
Serge Abiteboul, Tova Milo, Omar Benjelloun, Ioana Manolescu, Roger Weber-
IEEE Transactions on Knowledge and Data Engineering 2004 (3-540-40676-X)
We propose in this chapter a peer-to-peer architecture that allows for
the integration of distributed data and web services. It relies on a
language, Active XML, where (1) documents embed calls to web services
that are used to enrich them, and (2) new web services may be defined
by XQuery queries on such active documents.
Embedding calls to functions or even to web services inside data is
not a new idea. Our contribution, however, is to turn them into a
powerful tool for data and services integration. In particular, the
language includes linguistic features to control the timing of service
call activations. Various scenarios are captured, such as mediation,
data warehousing, and distributed computation. A first prototype is
also described.
Veronique Ventos, Henry Soldano, Thibaut Lamadon-
IEEE Transactions on Knowledge and Data Engineering 2004 november:555-558
In many applications there is a need to represent a large number of data by clustering them in a hierarchy of classes described at an appropriate level of abstraction. Our basic representation is a Galois lattice, i.e. a lattice in which the terms of a representation language are partitioned into equivalence classes w.r.t. their extent (the extent of a term is the part of the instance set that satisfies the term). The advantage of such a structure is that it exhaustively represents the whole set of concepts that are distinguishable given the instance set and the representation language. However, the size of the lattice can be very large when dealing with real-world problems. What we propose here is to reduce the size of the lattice, and thus to simplify our view of the data, while conserving its formal structure and exhaustivity. For that purpose we use a preliminary partition of the instance set, representing the association of a "type" to each instance. By redefining the notion of extent of a term in order to cope, to a certain degree (denoted as alpha), with this partition, we define a particular family of Galois lattices denoted as Alpha Galois lattices. We also discuss the related implication rules defined as inclusion of such alpha-extents.
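One plausible reading of the alpha-extent (our interpretation of the abstract, not necessarily the authors' exact definition) can be sketched as follows: the extent of a term is filtered class by class, and a class contributes its matching instances only when they cover at least a fraction alpha of that class.

def alpha_extent(extent, partition, alpha):
    # extent: set of instances satisfying a term
    # partition: list of sets of instances (the "types")
    # alpha: coverage threshold in [0, 1]
    result = set()
    for cls in partition:
        hits = extent & cls
        if cls and len(hits) / len(cls) >= alpha:
            result |= hits
    return result

# With alpha = 0 this is the usual extent; with alpha = 1 a class only
# contributes when all of its instances satisfy the term.
print(alpha_extent({1, 2, 5}, [{1, 2, 3}, {4, 5, 6}], 0.5))   # {1, 2}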
Serge Abiteboul, Tova Milo, Omar Benjelloun, Irini Fundulaki, Bogdan Cautis, Bogdan Alexe, Arnaud Sahuguet-
IEEE Transactions on Knowledge and Data Engineering 2004
Marie-Christine Rousset, Francois Goasdoue-
IEEE Transactions on Knowledge and Data Engineering 2004 - 4(3):255,288
In this paper, we investigate a first step towards the long-term vision of the Semantic Web by studying the problem of answering queries posed through a mediated ontology to multiple information sources whose content is described as views over the ontology relations. The contributions of this paper are twofold. We first offer a uniform logical setting which allows us to encompass and to relate the existing work on answering and rewriting queries using views. In particular, we make clearer the connection between the problem of rewriting queries using views and the problem of answering queries using extensions of views. Then we focus on an instance of the problem of rewriting conjunctive queries using views through an ontology expressed in a description logic, for which we exhibit a complete algorithm.
Chantal Reynaud-
IEEE Transactions on Knowledge and Data Engineering 2004
This paper deals with PICSEL mediator systems which integrate services. We propose a scalable approach which exploits standardized specifications provided by standardization bodies. The paper focuses on the use of such specifications to automate the construction of the mediator. An illustration in the tourism domain with OTA specifications is given.
Serge Abiteboul, Ioana Manolescu, Nicoleta Preda-
IEEE Transactions on Knowledge and Data Engineering 2004
We present KadoP, a distributed infrastructure for warehousing XML resources in a peer-to-peer framework. KadoP allows users to build a shared, distributed repository of resources such as XML documents, semantic information about such documents, Web services, and collections of such items. KadoP builds on distributed hash tables as a peer communication layer, and ActiveXML as a model for constructing and querying the resources in the peer network. We describe KadoP's data model, query language, and query processing paradigm.
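The indexing pattern can be sketched as follows (an in-memory toy standing in for a real distributed hash table; class and method names are illustrative, not KadoP's API):

class ToyDHT:
    def __init__(self):
        self.table = {}                          # term -> set of (peer, resource)

    def publish(self, term, peer, resource):
        self.table.setdefault(term, set()).add((peer, resource))

    def lookup(self, term):
        return self.table.get(term, set())

dht = ToyDHT()
dht.publish("xml", "peer1", "doc42.xml")
dht.publish("xml", "peer2", "services/translate")
print(dht.lookup("xml"))    # resources from both peers are found from any peer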
Nicoleta Preda-
IEEE Transactions on Knowledge and Data Engineering 2004 september
We present KadoP, a distributed infrastructure for warehousing XML resources in a peer-to-peer environment. KadoP allows users to build a shared, distributed repository of resources such as XML documents, semantic information about these documents, Web services, and collections of such items. KadoP uses distributed hash tables for communication between peers and ActiveXML as a model for constructing and querying the resources of the peer network. We describe KadoP's data model, its query language, and its query processing paradigm.
Ioana Manolescu, Stefano Ceri, Marco Brambilla, Sara Comai, Piero Fraternali, Marco Dario-
IEEE Transactions on Knowledge and Data Engineering 2004
Serge Abiteboul-
IEEE Transactions on Knowledge and Data Engineering 2004
XML and Web services are revolutionizing the automatic management of
distributed information, somewhat in the same way that HTML, Web browsers
and search engines modified human access to worldwide information.
To illustrate, we present Active XML that is based on embedding
Web service calls inside XML documents. We mention two particular
applications that take advantage of this new technology and novel
research issues in this setting.
This paper is based primarily on research at INRIA-Futurs in the Gemo
Group, around XML and Web services (in particular, the Xyleme, Active
XML and Spin projects).
Marie-Christine Rousset, Francois Goasdoue, Philippe Adjiman, Philippe Chatalic, Laurent Simon-
IEEE Transactions on Knowledge and Data Engineering 2004
In a peer-to-peer inference system, each peer can reason locally but can also solicit some of its acquaintances, which are peers sharing part of its vocabulary. In this paper, we consider peer-to-peer inference systems in which the local theory of each peer is a set of propositional clauses defined upon a local vocabulary. An important characteristic of peer-to-peer inference systems is that the global theory (the union of all peer theories) is not known (as opposed to partition-based reasoning systems). The contribution of this paper is twofold. We provide the first consequence finding algorithm in a peer-to-peer setting: it is anytime and computes consequences gradually from the solicited peer to peers that are more and more distant. We exhibit a sufficient condition on the acquaintance graph of the peer-to-peer inference system for guaranteeing the completeness of this algorithm. We also present first experimental results that are promising.
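A much-simplified sketch of this style of distributed reasoning (our illustration; the paper's anytime behaviour, loop handling and completeness condition are not reproduced here): each peer resolves a target literal against its local clauses and recursively asks acquaintances about literals of the shared vocabulary.

def neg(l):
    return l[1:] if l.startswith("-") else "-" + l

def consequences(peer, literal, theories, acquaintances, seen=None):
    # Returns clauses (frozensets of literals) implied by `literal`.
    seen = set() if seen is None else seen
    if literal in seen:
        return set()
    seen.add(literal)
    results = set()
    for clause in theories[peer]:
        if neg(literal) in clause:                      # local resolution step
            rest = frozenset(clause - {neg(literal)})
            results.add(rest)
            for l in rest:                              # pursue shared literals remotely
                for other in acquaintances.get((peer, l), []):
                    results |= consequences(other, l, theories, acquaintances, seen)
    return results

theories = {"P1": [frozenset({"-a", "b"})], "P2": [frozenset({"-b", "c"})]}
acquaintances = {("P1", "b"): ["P2"]}                    # P1 and P2 share "b"
print(consequences("P1", "a", theories, acquaintances))  # {b} locally, then {c} via P2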
Marie-Christine Rousset, Alexandre Termier, Michele Sebag-
IEEE Transactions on Knowledge and Data Engineering 2004 :543--546
In this paper, we present a novel algorithm for discovering tree patterns in a tree database. This algorithm uses a relaxed tree inclusion definition, making the problem complex (checking tree inclusion is NP-complete), but allowing to mine highly heterogeneous databases. To obtain good performance, our DRYADE algorithm discovers only closed frequent tree patterns.
Ioana Manolescu, Angela Bonifati, Andrei Arion, Gianni Costa, Andrea Pugliese-
IEEE Transactions on Knowledge and Data Engineering 2004 March
Alexandre Termier-
IEEE Transactions on Knowledge and Data Engineering 2004 April
La d
D. Le Berre-
IEEE Transactions on Knowledge and Data Engineering 2004 to appear
For the third consecutive year, a SAT competition was organized as a joint event with the SAT conference. With 55 solvers from 25 author groups, the competition was a clear success. One of the noticeable facts from the 2004 competition is the superiority of incomplete solvers on satisfiable random k-SAT benchmarks. It can also be pointed out that the complete solvers awarded this year, namely Zchaff, jerusat, satzoo, kcnfs and march-eq, participated in the SAT 2003 competition (at least through former versions of those solvers). This paper does not report exhaustive competition results, which are already available in detail online, but rather focuses on some remarkable results derived from the competition dataset. The SAT 2004 competition ends a 3-year take-off period that attracted new SAT researchers and provided many new benchmarks and solvers to the community. The good participation rate of this year (despite the addition of the anti-blackbox rule) establishes the competition as an awaited yearly event. Some new directions for a new way of thinking about the competition are discussed at the end of the paper.
Amar-Djalil Mezaour-
IEEE Transactions on Knowledge and Data Engineering 2004 december:249--252
Nathalie Pernelle, Marie-Christine Rousset, Alexandre Termier, Helka Folch, Benoit Habert, Michele Jardino-
IEEE Transactions on Knowledge and Data Engineering 2004 - 4:1131,1334
Extensible Markup Language (XML) is playing an increasingly important role in the exchange of a wide variety of data on the Web and elsewhere. It is a simple, very flexible text format, used to annotate data by means of markup. XML documents can be checked for syntactic well-formedness and semantic coherence through DTD and schema validation, which makes their processing easier. In particular, data with nested structure can easily be represented with embedded tags. This structured representation should be used in information retrieval models which take structure into account. As such, it is meta-data and therefore a contribution to the Semantic Web. However, nowadays there exist huge quantities of raw texts, and the issue is how to find an easy way to provide these texts with a sensible XML structure. Here we present an automatic method to extract tree structure from raw texts. This work has been supported by Paris XI University (BQR2002 project, Paris-XI University).
Marie-Christine Rousset, Francois Goasdoue-
IEEE Transactions on Knowledge and Data Engineering 2004
L'int
Chantal Reynaud, Jean Charlet, Philippe Laublet-
IEEE Transactions on Knowledge and Data Engineering 2004 - 4(2):07-20
The goal of this document is to give a synthetic view of the main aspects of the Semantic Web. It was produced by the members of the working and discussion group set up in the framework of the "Semantic Web" specific action (December 2001). The first discussions in this group led to the choice of a number of themes that appeared essential to discuss and analyze in order to better understand the projects, the realities and the perspectives opened by the Semantic Web project. The following chapters of this special issue present each of these themes in turn:
- the languages of the Semantic Web
- metadata and annotations in the Semantic Web
- ontologies for the Semantic Web
- the integration of data sources
- adaptation and personalization in the Semantic Web
- semantic Web services
- applications of the Semantic Web.
Chantal Reynaud, Mohand-Said Hacid-
IEEE Transactions on Knowledge and Data Engineering 2004 - 4(2)
The diversity of distributed information sources and their heterogeneity is one of the main difficulties encountered by Web users today. The Semantic Web infrastructure must enable their integration, giving the user the impression of using a homogeneous system. The solutions to information integration proposed in the setting of the Semantic Web will draw on research into mediator approaches and data warehouses. The first implementations are under way. One of the first scientific obstacles to overcome concerns scaling up to the size of the Web. Future work whose development should be encouraged includes the implementation of decentralized mediation systems, the study of problems related to the integration of multimedia data, real-time integration, and also taking into account the growing complexity of the data to be integrated, a sign of an evolution towards knowledge integration.
Serge Abiteboul, Tova Milo, Omar Benjelloun, Ioana Manolescu, Nicoleta Preda, Bogdan Cautis-
IEEE Transactions on Knowledge and Data Engineering 2004 June
In this paper, we study query evaluation on Active XML documents (AXML
for short), a new generation of XML documents that has recently gained
popularity. AXML documents are XML documents whose content is given
partly extensionally, by explicit data elements, and partly
intensionally, by embedded calls to Web services, which can be invoked
to generate data.
A major challenge in the efficient evaluation of queries over such
documents is to detect which calls may bring data that is relevant for
the query execution, and to avoid the materialization of irrelevant
information. The problem is intricate, as service calls may be
embedded anywhere in the document, and service invocations possibly return data containing
calls to new services. Hence, the detection of relevant calls becomes a
continuous process. Also, a good analysis must take the service
signatures into consideration.
We formalize the problem, and provide algorithms to solve it. We also
present an implementation that is compliant with XML and Web services
standards, and is used as part of the AXML system. Finally, we
experimentally measure the performance gains obtained by a careful
filtering of the service calls to be triggered.
Chantal Reynaud, Brigitte Safar, Hassen Kefi-
IEEE Transactions on Knowledge and Data Engineering 2004
We present a user interface, the OntoRefiner system, for helping the user to navigate the numerous documents retrieved after a search query on a semantic portal integrating a very large number of documents. Retrieved answers are filtered so that the user can be provided only with the answers which are, according to him, the most relevant. The refinement process is based on two techniques: dynamic clustering close to the Galois lattice structure, combined with the use of a domain ontology. The Galois lattice structure provides a sound basis for the query refinement process. However, its construction as a whole is a very costly process, so we propose an approach based on the use of a domain ontology that avoids the construction of the whole Galois lattice. In the paper, we present the algorithm and experimental results.
Luc Segoufin, Sandra de Amo, Nicole Bidoit-
IEEE Transactions on Knowledge and Data Engineering 2004 - 14(2):277--298
The paper investigates temporal properties that are invariant with respect to the temporal ordering and that are expressible by temporal query languages, either explicit like FO(<) or implicit like TL. In the case of an explicit time representation, these "order invariant" temporal properties are simply those expressible in the language FO(=). In the case of an implicit time representation, we introduce a new language, TL(E^i), that captures exactly these properties. The expressive power of the language TL(E^i) is characterized via a game a la Ehrenfeucht-Fraisse. This provides another proof, using a more classical technique, that the implicit temporal language TL is strictly less expressive than the explicit temporal language FO(<). This alternative proof is interesting by itself and opens new perspectives in the investigation of results of the same kind for more expressive implicit temporal languages than TL.
Ioana Manolescu, Angela Bonifati, Andrei Arion, Andrea Pugliese-
IEEE Transactions on Knowledge and Data Engineering 2004 october
We present the path sequence storage model, a new logical model for storing XML documents. This model partitions XML data and content according to the document paths, and uses ordered sequences as logical and physical structures. Its main advantages are fast and selective data access plans, and the intelligent combination of such plans, based on a summary of the document paths. We validate these benefits through extensive experiments.
Serge Abiteboul, Ioana Manolescu, Nicoleta Preda-
IEEE Transactions on Knowledge and Data Engineering 2004
We present KadoP, a distributed infrastructure for warehousing XML resources in a peer-to-peer framework. KadoP allows users to build a shared, distributed repository of resources such as XML documents, semantic information about such documents, Web services, and collections of such items. KadoP builds on
distributed hash tables as a peer communication layer, and ActiveXML as a model for constructing and querying the resources in the peer
network. We describe KadoP's data model, query language, and query processing paradigm.
Serge Abiteboul, Tova Milo, Omar Benjelloun-
IEEE Transactions on Knowledge and Data Engineering 2004 June
The increasing popularity of XML and Web services has given rise to a
new generation of documents, called Active XML documents (AXML), where
some of the data is given explicitly while other parts are given
intensionally, by means of embedded calls to Web services. Web
services in this context can exchange intensional information, using
AXML documents as parameters and results.
The goal of this paper is to provide a formal foundation for this new
generation of AXML documents and services, and to study fundamental
issues they raise. We focus on Web services that are (1) monotone and
(2) defined declaratively as conjunctive queries over AXML documents.
We study the semantics of documents and queries, the confluence of
computations, termination and lazy query evaluation.
Marie-Christine Rousset, Francois Goasdoue, Philippe Adjiman, Philippe Chatalic, Laurent Simon-
IEEE Transactions on Knowledge and Data Engineering 2004
Dans un syst
Amar-Djalil Mezaour-
IEEE Transactions on Knowledge and Data Engineering 2004 January - 2:491-502
Keyword-based query languages for the Web often lack precision when
searching for particular documents that are hard to characterize with
simple keywords (for example, Java courses or Formula One pictures). We
propose a multi-criteria, attribute-value language to increase the
precision of document search on the Web. We have experimentally shown the
gain in precision of document retrieval based on this language.
Bertram Ludascher, Zack Ives, Ioana Manolescu-
IEEE Transactions on Knowledge and Data Engineering 2004 September - 33(3)
Yannis Papakonstantinou, Ioana Manolescu-
IEEE Transactions on Knowledge and Data Engineering 2004 September - 33
Marie-Christine Rousset-
IEEE Transactions on Knowledge and Data Engineering 2004 - 3298:6--16
In 1984, Peter Patel-Schneider published a paper entitled "Small can be Beautiful in
Knowledge Representation" in which he advocated for limiting the expressive power of knowledge
representation formalisms in order to guarantee good computational properties and thus
make knowledge-based systems usable as part of larger systems.
In this paper, I aim at showing that the same argument holds for the Semantic Web:
if we want to give a chance for the Semantic Web to scale up and to be broadly used, we have to limit
the expressive power of the ontologies serving as semantic marking up of resources.
In addition, due to the scale of the Web and the disparity of its users, it is unavoidable
to have to deal with distributed heterogeneous ontologies.
In this paper, I will argue that a peer-to-peer infrastructure enriched with simple distributed ontologies
(e.g., taxonomies) reconciled through mappings is appropriate and scalable
for supporting the future Semantic Web.
Rallou Thomopoulos, Patrice Buche, Ollivier Haemmerle-
IEEE Transactions on Knowledge and Data Engineering 2004 Janvier:147--158
Les sous-ensembles flous peuvent
D., L., M., A. -
IEEE Transactions on Knowledge and Data Engineering 2004 to appear
Maria Halkidi, Iraklis Varlamis, Michalis Vazirgiannis-
IEEE Transactions on Knowledge and Data Engineering 2004 June - 16(6):685--700
With the unstoppable growth of the WWW and the great success of Web search engines, such as Google and AltaVista, users now turn to the Web whenever they are looking for information. However, many users are neophytes when it comes to computer science, yet they are often specialists of a certain domain. These users would like to add more semantics to guide their search through WWW material, whereas currently most search features are based on raw lexical content. We show in this article how the incoming links of a page can be used efficiently to classify a page in a concise manner. This enhances the browsing and querying of web pages. In this article, we focus on the tools needed in order to manage the links and their semantics. We further process these links using a hierarchy of concepts, akin to an ontology, and a thesaurus. This work is demonstrated by a prototype system, called THESUS, that organizes thematic web documents into semantic clusters. Our contributions in this article are the following: 1- a model and language to exploit link semantics information, 2- the THESUS prototype system, 3- its innovative aspects and algorithms, more specifically the novel similarity measure between web documents, applied to different clustering schemes (DB-Scan and COBWEB), 4- a thorough experimental evaluation proving the value of our approach.
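As a hedged illustration of hierarchy-aware similarity (a generic Wu-Palmer-style measure between concepts; the actual THESUS measure differs in its details, and the concept tree below is invented):

parent = {"root": None, "sports": "root", "motorsport": "sports",
          "f1": "motorsport", "rally": "motorsport", "tennis": "sports"}

def ancestors(c):
    path = []
    while c is not None:
        path.append(c)
        c = parent[c]
    return path

def similarity(c1, c2):
    a1, a2 = ancestors(c1), ancestors(c2)
    lca = next(a for a in a1 if a in a2)     # least common ancestor
    depth = lambda c: len(ancestors(c))      # depth of a concept (root = 1)
    return 2.0 * depth(lca) / (depth(c1) + depth(c2))

print(similarity("f1", "rally"))    # 0.75: siblings under motorsport
print(similarity("f1", "tennis"))   # ~0.57: only related through sports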
Ioana Manolescu, Nicolaas Ruberg, Gabriela Ruberg-
IEEE Transactions on Knowledge and Data Engineering 2004 October
Tova Milo, Omar Benjelloun, Shobhit Raj Mathur-
IEEE Transactions on Knowledge and Data Engineering 2004 Dec
We study the problem of exchanging Active XML documents (AXML for
short), an increasingly popular breed of XML documents, where some
parts of the data are only given intensionally, as
calls to Web services.
As with standard XML, the parties exchanging AXML data can decide on
a data exchange schema, to constrain the structure of
exchanged documents. For AXML data, the schema also specifies which
parts of the data should be explicitly present in the document and
which can be intensional. In order to make sure the data conforms
to the exchange schema, the sender may need to materialize
some of its intensional parts, by invoking the relevant service
calls. The challenge is to perform the document analysis, and the
consequent service invocations efficiently, to avoid a bottleneck in
data exchange.
We formalize the problem and provide efficient algorithms to solve
it. We also present an implementation, and show that our solution
is by far superior to previously proposed algorithms.
Patrice Buche, Ollivier Haemmerle, Juliette Dibie-Barthelemy, Mounir Houhou-
IEEE Transactions on Knowledge and Data Engineering 2004 june
This paper describes a new subsystem of the Sym'Previus knowledge base. This knowledge base contains information useful for helping experts in the field of predictive microbiology. The information has several specific properties: it is incomplete, imprecise and heterogeneous. In the pre-existing Sym'Previus knowledge base, stable data are stored in a relational database and data which do not fit the relational structure are stored in a conceptual graph knowledge base. The MIEL language makes it possible to scan both bases simultaneously, in a way that is transparent to the user, using fuzzy queries. The new subsystem described in the paper contains information found on the Web to complete the knowledge base. This information is stored in XML format. Firstly, we extend the XML model of the knowledge base to represent imprecise data as possibility distributions. Secondly, we present the mapping process used to translate a MIEL query into an XML query to scan the XML knowledge base.
Veronique Ventos, Henry Soldano, Thibaut Lamadon-
IEEE Transactions on Knowledge and Data Engineering 2004 June:175-190
Dans beaucoup d'applications il est utile de repr
Veronique Benzaken, Ioana Manolescu, Andrei Arion-
IEEE Transactions on Knowledge and Data Engineering 2004 November
Ollivier Haemmerle, Juliette Dibie-Barthelemy, Eric Salvat-
IEEE Transactions on Knowledge and Data Engineering 2004 Janvier:135--146
Les travaux men
Andrei Arion-
IEEE Transactions on Knowledge and Data Engineering 2004
The goal of this internship is to study query optimization independently of the storage and indexing methods proposed by any particular XML system. This work therefore proposes to identify persistent XML storage structures relevant for answering XQuery queries. We seek to explore the specific problems that arise and to identify the performance limits that genericity imposes.
Ioana Manolescu-
IEEE Transactions on Knowledge and Data Engineering 2004 September
The Active XML Team-
IEEE Transactions on Knowledge and Data Engineering 2003 July
This paper is a general overview of the Active XML language and system.
Serge Abiteboul, Mihai Preda, Gregory Cobena-
IEEE Transactions on Knowledge and Data Engineering 2003 May
The computation of page importance in a huge dynamic graph has
recently attracted a lot of attention because of the web. Page importance,
or page rank, is defined as the fixpoint of a matrix equation.
Previous algorithms compute it off-line and require the use of a lot
of extra CPU as well as disk resources (e.g. to store, maintain and
read the link matrix). We introduce a new algorithm OPIC that
works on-line, and uses much less resources. In particular, it does
not require storing the link matrix. It is on-line in that it continuously
refines its estimate of page importance while the web/graph
is visited. Thus it can be used to focus crawling on the most interesting
pages. We prove the correctness of OPIC. We present Adaptive
OPIC that also works on-line but adapts dynamically to changes of
the web. A variant of this algorithm is now used by Xyleme.
We report on experiments with synthetic data. In particular, we
study the convergence and adaptiveness of the algorithms for various
scheduling strategies for the pages to visit. We also report on
experiments based on crawls of significant portions of the web.
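The heart of the cash-based scheme can be sketched as follows (a simplified illustration of the idea; the paper's exact estimator, its convergence analysis and its treatment of dangling pages are not reproduced):

import random

def opic(graph, steps=20000):
    # graph: page -> list of successor pages
    n = len(graph)
    cash = {p: 1.0 / n for p in graph}       # every page starts with equal cash
    history = {p: 0.0 for p in graph}
    pages = list(graph)
    for _ in range(steps):
        p = random.choice(pages)             # any fair visiting strategy works
        c = cash[p]
        history[p] += c                      # bank the cash into the history
        cash[p] = 0.0
        out = graph[p] or pages              # virtual links if no out-links
        for q in out:
            cash[q] += c / len(out)          # forward the cash to successors
    total = sum(history.values())
    return {p: history[p] / total for p in graph}

graph = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
print(opic(graph))    # banked fractions estimate the pages' importance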
Chantal Reynaud, Gloria Giraldo-
IEEE Transactions on Knowledge and Data Engineering 2003 Juillet - 1
This document is about mediation systems integrating services over the Web. First, we give an overview of the mediator approach. In section 3 we focus on the integration of services over the Web. We explain which problems arise and we give principles to make the mediator approach applicable in this context. The first principle is a solution to semantic heterogeneity. It exploits results in standardization relative to business transactions. The second principle concerns the scalability of the mediator approach. It emphasizes two points: de-coupling of the various components and strong mediation. In the last section, we present an architecture of a mediation system integrating services, designed in the setting of the PICSEL 2 project. We illustrate with an application of e-commerce in the tourism domain.
Brigitte Safar, Hassen Kefi-
IEEE Transactions on Knowledge and Data Engineering 2003
In this paper we study how to help the user refine his query when a search for documents in resources has produced too many answers. Using a domain ontology and a set of resources indexed with terms of this ontology, we show how we can use the immediate readability and intelligibility of the Galois lattice clustering structure without paying the cost of building it in full. We also show how to build a refined query in interaction with the user.
Nathalie Pernelle, Veronique Ventos, Henri Soldano-
IEEE Transactions on Knowledge and Data Engineering 2003
Alpha-Galois lattices for Conceptual Clustering
Michel Scholl, Stephane Grumbach, Philippe Rigaux, Luc Segoufin-
IEEE Transactions on Knowledge and Data Engineering 2003 - 28(6):563,595
This paper presents dedale, a spatial database system which provides an abstract and non-specialized data model and query language for representing and manipulating geometric data in arbitrary dimension. dedale relies on a logical model based on linear constraints. The main features of the constraint model are (1) a uniform representation of all sorts of data, including geometric, spatio-temporal or elevation data, and (2) an algebraic query language whose formal foundations constitute a basis for practical query optimization. We show the practical relevance of the approach by describing an implementation which builds on standard technology for data storage, database indexing, and on the parsing and optimization of SQL. dedale validates the linear constraint model over various applications, proposes a user query language based on SQL which allows one to query the database in a purely declarative way, and gives some first steps towards query optimization. We believe that this experience is a fruitful step towards sound and consistent database models which hide the complexity of arbitrary geometric data, while keeping manipulation languages intuitive and efficient.
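As a minimal illustration of the linear-constraint representation (our example, not taken from the paper): the filled triangle with corners $(0,0)$, $(1,0)$ and $(0,1)$ is stored finitely as
\[
T = \{ (x, y) \mid x \ge 0 \;\land\; y \ge 0 \;\land\; x + y \le 1 \},
\]
and spatio-temporal data is handled the same way by adding a time variable $t$ to the constraints; queries are then evaluated by algebraic manipulation of such constraint sets.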
D. Berre, L., A.-
IEEE Transactions on Knowledge and Data Engineering 2003 June - 2919:468--485
Benjamin Nguyen, Maria Halkidi, Iraklis Varlamis, Michalis Vazirgiannis-
IEEE Transactions on Knowledge and Data Engineering 2003
In this article, we propose a new similarity measure, together with the use of clustering algorithms from the field of spatial databases, in order to allow the clustering of documents characterized by a set of words in a hierarchy (ontology). Web documents correspond to this setting. The main problem we address in this article is the definition of a similarity measure that has adequate semantics and can be computed easily in terms of complexity. The clustering algorithm we use is DB-SCAN, which we have slightly modified in order to build a hierarchy of clusters. The overall system, THESUS, which uses the results of this article, is implemented, and we also give experimental results to illustrate the relevance of our approach.
Benjamin Nguyen-
IEEE Transactions on Knowledge and Data Engineering 2003
Serge Abiteboul, Michalis Vazirgiannis, Evaggelia Pitoura, Dieter Pfoser, George Samaras-
IEEE Transactions on Knowledge and Data Engineering 2003
The challenge of peer-to-peer computing goes beyond simple file sharing. In the DBGlobe project, we view the multitude of peers carrying data and services as a superdatabase. Our goal is to develop a data management system for modeling, indexing and querying data hosted by such massively distributed, autonomous and possibly mobile peers. We employ a service-oriented approach, in that data are encapsulated in services. Direct querying of data is also supported by an XML-based query language. In this paper, we present our research results along the following topics: (a) infrastructure support, including mobile peers and the creation of context-dependent communities, (b) metadata management for services and peers, including location-dependent data, (c) filters for efficiently routing path queries on hierarchical data, and (d) querying using the AXML language that incorporates service calls inside XML documents.
Luc Segoufin, Michael Benedikt, Leonid Libkin, Thomas Schwentick-
IEEE Transactions on Knowledge and Data Engineering 2003 - 50:694-751
We study analogs of classical relational calculus in the context of strings. We start by studying string logics. Taking a classical model-theoretic approach, we fix a set of string operations and look at the resulting collection of definable relations. These form an algebra: a class of n-ary relations for every n, closed under projection and Boolean operations. We show that by choosing the string vocabulary carefully, we get string logics that have desirable properties: computable evaluation and normal forms. We identify five distinct models and study the differences in their model theory and complexity of evaluation. We identify a subset of these models which have additional attractive properties, such as finite VC dimension and quantifier elimination. Once you have a logic, the addition of free predicate symbols gives you a string query language. The resulting languages have attractive closure properties from a database point of view: while SQL does not allow the full composition of string pattern-matching expressions with relational operators, these logics yield compositional query languages that can capture common string-matching queries while remaining tractable. For each of the logics studied in the first part of the paper, we study properties of the corresponding query languages. We give bounds on the data complexity of queries, extend the normal form results from logics to queries, and show that the languages have corresponding algebras expressing safe queries.
Rallou Thomopoulos, Patrice Buche, Ollivier Haemmerle-
IEEE Transactions on Knowledge and Data Engineering 2003 July - 2746:54--68
In the context of a microbiological application, our study proposes to extend the Conceptual Graph Model in order to allow one: (i) to represent imprecise data and queries that include preferences, by using fuzzy sets (from fuzzy set theory) in concept vertices, in order to describe either an imprecise concept type or an imprecise referent; (ii) to query a conceptual graph that may include imprecise data (factual graph) using a conceptual graph that may include preferences (query graph). This is performed in two steps: firstly by extending the projection operation to fuzzy concepts, secondly by defining a comparison operation characterised by two matching degrees: the possibility degree of matching and the necessity degree of matching between two graphs, and particularly between a query graph and a factual graph.
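The two matching degrees can be read in standard possibility-theory terms; the sketch below shows the usual definitions over a common domain (a generic illustration, not necessarily the paper's exact formulas):

def possibility(mu_q, mu_d, domain):
    # Degree to which the datum possibly matches the query: sup of min.
    return max(min(mu_q(x), mu_d(x)) for x in domain)

def necessity(mu_q, mu_d, domain):
    # Degree to which the datum necessarily matches: inf of max with complement.
    return min(max(mu_q(x), 1.0 - mu_d(x)) for x in domain)

# Example: temperatures preferred around 20, datum known imprecisely around 22.
domain = range(15, 30)
mu_q = lambda x: max(0.0, 1 - abs(x - 20) / 5)   # query preference
mu_d = lambda x: max(0.0, 1 - abs(x - 22) / 3)   # imprecise datum
print(possibility(mu_q, mu_d, domain), necessity(mu_q, mu_d, domain))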
Brigitte Safar, Hassen Kefi-
IEEE Transactions on Knowledge and Data Engineering 2003 November:597--601
We present a system for interactively querying a topical resource repository and for navigating the set of responses obtained from the repository. Our system makes use of a domain ontology for dynamically constructing a Galois lattice. Nodes in this structure are resource clusters which are interactively examined by the user during the construction process.
Serge Abiteboul, Tova Milo, Gregory Cobena, Ioana Manolescu, Angela Bonifati-
IEEE Transactions on Knowledge and Data Engineering 2003
The advent of XML as a universal exchange format and of Web services
as a basis for distributed computing, has fostered the emergence of a
new class of documents: dynamic XML documents. These are XML documents
where some of the data is given explicitly, while other parts are given only
intensionally, by means of embedded calls to web services that can be
invoked to generate the required information.
By the sole presence of Web services, dynamic documents already include
inherently some form of distributed computation. A
higher level of distribution that also allows (fragments of) dynamic
documents to be distributed and/or replicated over several sites is highly
desirable in today's Web architecture, and in fact is also relevant
for regular (non dynamic) documents.
The goal of this paper is to study new issues raised by the distribution
and replication of dynamic XML data.
Starting from a data model and a query
language, we describe a complete framework for distributed and replicated
dynamic XML
documents. We provide a comprehensive cost
model for query evaluation and show how it applies to user queries and
service calls. Finally, we describe an algorithm that, for a given
peer, chooses data and services that the peer should replicate to improve the
efficiency of maintaining and querying its dynamic documents.
Serge Abiteboul, Bernd Amann, Tova Milo, Omar Benjelloun, Fred Dang Ngoc-
IEEE Transactions on Knowledge and Data Engineering 2003 June:289--300
XML is becoming the universal format for data exchange between applications.
Recently, the emergence of Web services as standard means of publishing and
accessing data on the Web introduced a new class of XML documents, which we
call intensional documents. These are XML documents where some of the
data is given explicitly while other parts are defined only intensionally by
means of embedded calls to Web services.
When such documents are exchanged between applications, one has the choice to
materialize the intensional data (i.e. to invoke the embedded calls) or not,
before the document is sent. This choice may be influenced by various
parameters, such as performance and security considerations. This paper
addresses the problem of guiding this materialization process.
We argue that, just like for regular XML data, schemas (a la DTD and XML Schema)
may be used to control the exchange of intensional data and, in particular, to
determine which data should be materialized before sending a document, and
which should not. We formalize the problem and provide algorithms to solve it.
We also present an implementation that complies with real life standards for
XML data, schemas, and Web services.
Ioana Manolescu, Stefano Ceri, Marco Brambilla, Piero Fraternali, Sara Comai-
IEEE Transactions on Knowledge and Data Engineering 2003
Recent web service standards, based on WSDL have been adopted as a process-to-process communication paradigm, enabling large-scale, inter-organization application interaction. In this paper, we extend a declarative model and language for specifying data-intensive Web applications, in order to model complex interactions between the application (driven by the user) and remote processes (represented as services). Thus, we demonstrate how a declarative web site specification method can be effectively turned into a tool for web service usage and composition, particularly suited for interactions between human users and web services.
Amar-Djalil Mezaour-
IEEE Transactions on Knowledge and Data Engineering 2003 September:63--74
Keyword-based web query languages suffer from a lack of precision when
searching for a precise kind of documents. Indeed, some documents cannot be
simply characterized by a list of keywords. For example, searching for on-line
pictures dealing with formula one using only simple keywords with
general-purpose search-engines gives imprecise answers. This imprecision is
due to a method that considers a document relevant to a query if it
contains many occurrences of the query keywords. This method is totally
inefficient for documents with poor textual content, like pictures or video
streams... On the other hand, keyword based languages are often not
powerful enough for expressing sophisticated document search like on-line
java or c++ tutorials.
We propose "WeQueL" a multi-criteria query langage for a better
characterization of documents. The aim is to increase the precision of
document retrieval on the web. In our experiments, we show the gain in
accuracy for web document searching using our language.
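A toy illustration of the multi-criteria idea (ours; WeQueL's real syntax and evaluation are richer): a query is a set of attribute-value criteria evaluated against document metadata rather than raw text, so a picture with little textual content can still be retrieved precisely.

def matches(document, query):
    return all(document.get(attr) == val for attr, val in query.items())

docs = [
    {"url": "a.example/tutorial.html", "type": "tutorial", "topic": "java"},
    {"url": "b.example/photo.jpg",     "type": "picture",  "topic": "formula one"},
]
query = {"type": "picture", "topic": "formula one"}
print([d["url"] for d in docs if matches(d, query)])   # ['b.example/photo.jpg']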
Rallou Thomopoulos, Patrice Buche, Ollivier Haemmerle-
IEEE Transactions on Knowledge and Data Engineering 2003 October - 2871:98--107
This paper presents an information system developed to help the assessment of the microbiological risk in food products. UQS (Unified Querying System) is composed of two distinct bases (a relational database and a conceptual graph knowledge base) which are integrated by means of a uniform querying language. The specificity of the system is that both bases include fuzzy data. Moreover, UQS allows the expression of preferences into the queries, by means of the fuzzy set theory.
Marie-Christine Rousset, Chantal Reynaud-
IEEE Transactions on Knowledge and Data Engineering 2003 - 29:03,22
An information integration system provides a uniform query interface to a collection of distributed and heterogeneous information sources, giving users or other agents the illusion that they interrogate a centralized and homogeneous information system. In this paper, we focus on the use of knowledge representation techniques for building mediators for information integration. A mediator is based on the specification of a single mediated schema describing a domain of interest, and on a set of source descriptions expressing how the content of each source available to the system is related to the domain of interest. These source descriptions, also called mappings because they model the correspondence between the mediated schema and the schemas of the data sources, play a central role in the query answering process. We present two recent information integration systems, namely Picsel and Xyleme, which are illustrative of two radically different choices concerning the expressivity of the mediated schema.
Rallou Thomopoulos, Patrick Bosc, Patrice Buche, Ollivier Haemmerle-
IEEE Transactions on Knowledge and Data Engineering 2003 July:173--178
In previous studies, we have extended the conceptual graph model, which is a knowledge representation model belonging to the family of semantic networks, to be able to represent fuzzy values. The basic conceptual graph model has a logical interpretation in first-order logic.
In this paper, we focus on the logical interpretation of the conceptual graph model extended to fuzzy values: we use logical implications stemming from fuzzy logic, so as to extend the logical interpretation of the model to fuzzy values and to comparisons between fuzzy conceptual graphs.
Serge Abiteboul-
IEEE Transactions on Knowledge and Data Engineering 2003 - 2681:4--13
The paper is about warehousing XML content in a P2P environment. The role of an XML warehouse is to offer a centralized entry point to access a variety of information in an enterprise and to provide functionalities to acquire, enrich, monitor and maintain this information. We consider a number of reasons to use a P2P approach for such warehouses, e.g., reducing the cost or sharing the management of information. We mention research issues that are raised.
Serge Abiteboul, Tova Milo, Gregory Cobena, Ioana Manolescu, Angela Bonifati, J Baumgarten, Cosmin Cremarenco, Florin Dragan, Nicoleta Preda-
IEEE Transactions on Knowledge and Data Engineering 2003
Chantal Reynaud, Gloria Giraldo-
IEEE Transactions on Knowledge and Data Engineering 2003 :59--68
This document is about mediation systems integrating services over the Web to make them usable to final users. As services on the Web are numerous and heterogeneous, and as they must sometimes be combined to satisfy final users' requirements, interfaces giving the illusion of a convivial, unique, centralized and homogeneous access are necessary. In this paper we propose to use a mediator approach to design such an interface. First, we give an overview of the mediator approach. Then Section 3 is about problems specific to mediation systems integrating services on the Web. We explain which problems arise and we give principles to make the mediator approach applicable in this context. The first principle is a solution to semantic heterogeneity. It exploits results in standardization relative to business transactions. The second principle concerns the scalability of the mediator approach. It emphasizes two points: de-coupling of the various components and strong mediation. In the last section, we present an architecture of a mediation system integrating services, designed in the setting of the PICSEL 2 project. We illustrate with an application of e-commerce in the tourism domain.
Serge Abiteboul, Gregory Cobena, Benjamin Nguyen, Antonella Poggi-
IEEE Transactions on Knowledge and Data Engineering 2003
We propose a new methodology, a language and tools
for the design and construction of Web data warehouses.
Our approach is Service Oriented, in that
our framework makes an extensive use of Web Services
and semi-structured data (XML) to define the
data structures, the services and the connections between
them. We present an experimental version of
the system that has been built using this framework.
It uses external Web Services such as Google
Gregory Cobena, Benjamin Nguyen, Michalis Vazirgiannis-
IEEE Transactions on Knowledge and Data Engineering 2003
Poster at ECDL 2003
Marie-Christine Rousset, Chantal Reynaud-
IEEE Transactions on Knowledge and Data Engineering 2003
An information integration agent provides a uniform query interface to a collection of distributed and heterogeneous information sources, giving users or other agents the illusion that they interrogate a centralized and homogeneous information system. In this chapter we focus on integration information agents that follow a mediator approach. A mediator is based on the specification of a single mediated schema describing a domain of interest, and on a set of source descriptions expressing how the content of each source available to the system is related to the domain of interest. These source descriptions, also called mappings because they model the correspondence between the mediated schema and the schemas of the data sources, play a central role in the query answering process. We present two recent information integration agents, namely Picsel and Xyleme, which are illustrative of two radically different choices concerning the expressivity of the mediated schema.
Marie-Christine Rousset, Francois Goasdoue-
IEEE Transactions on Knowledge and Data Engineering 2003 - 18(5):60-65
In this paper, we define a simple but scalable framework for peer-to-peer data sharing systems, in which the problem of answering queries over a network of semantically related peers is always decidable. Our approach is characterized by a simple class-based language for defining peer schemas as hierarchies of atomic classes, and mappings as inclusions of logical combinations of atomic classes. We provide an anytime and incremental method for computing all the certain answers to a query posed to a given peer, such that the answers are ordered from those involving peers close to the queried peer to those involving more distant peers.
Luc Segoufin, Michael Benedikt, Martin Grohe, Leonid Libkin-
IEEE Transactions on Knowledge and Data Engineering 2003 - 66(1):169-206
It is known that standard query languages for constraint databases lack the power to express connectivity properties. Such properties are important in the context of geographical databases, where one naturally wishes to ask queries about connectivity (what are the connected components of a given set?) or reachability (is there a path from A to B that lies entirely in a given region?). No existing constraint query languages that allow closed form evaluation can express these properties. In the first part of the paper, we show that in principle there is no obstacle to getting closed languages that can express connectivity and reachability queries. In fact, we show that adding any topological property to standard languages like FO+LIN and FO+POLY results in a closed language. In the second part of the paper, we look for tractable closed languages for expressing reachability and connectivity queries. We introduce path logic, which allows one to state properties of paths with respect to given regions. We show that it is closed, has polynomial time data complexity for linear and polynomial constraints, and can express a large number of reachability properties beyond simple connectivity. Query evaluation in the logic involves obtaining a discrete abstraction of a continuous path, and model-checking of temporal formulae on the discrete structure.
Rallou Thomopoulos, Patrice Buche, Ollivier Haemmerle-
IEEE Transactions on Knowledge and Data Engineering 2003 October - 140(1):111--128
Serge Abiteboul, Tova Milo, Omar Benjelloun, Bernd Amann, J Baumgarten, Frederic Dang Ngoc-
IEEE Transactions on Knowledge and Data Engineering 2003 September:1093--1096
Claude Delobel, Dan Vodislav, Marie-Christine Rousset, Chantal Reynaud, Jean-Pierre Sirot-
IEEE Transactions on Knowledge and Data Engineering 2003 March - 44(3):267-298
Xyleme is a huge warehouse integrating XML data of the web. Xyleme considers a simple data model with data trees and tree types for describing the data sources, and a simple query language based on tree queries with boolean conditions. The main components of the data model are a mediated schema modeled by an abstract tree type, as a view of a set of tree types associated with actual data trees, called concrete tree types, and a mapping expressing the connection between the mediated schema and the concrete tree types. The first contribution of this paper is formal: we provide a declarative model-theoretic semantics for Xyleme tree queries, a way of checking tree query containment, and a characterization of tree queries as a composition of branch queries. The other contributions are algorithmic and handle the potentially huge size of the mapping relation, which is a crucial issue for semantic integration and query evaluation in Xyleme. First, we propose a method for pre-evaluating queries at compile time by storing some specific meta-information about the mapping in map translation tables. These map translation tables summarize the set of all the branch queries that can be generated from the mediated schema and the set of all the mappings. Then, we propose different operators and strategies for relaxing queries which, having an empty map translation table, would have no answers if they were evaluated against the data. Finally, we present a method for semi-automatically generating the mapping relation.
Ioana Manolescu, Stefano Ceri, Marco Brambilla, Sara Comai, Piero Fraternali-
IEEE Transactions on Knowledge and Data Engineering 2003
While the Web consolidates as the ubiquitous application delivery platform, the features of Web applications evolve to cover new requirements, like the capability of managing complex workflows spanning multiple users and organizations. This scenario challenges the Web engineering methods to address a broader class of applications. This paper introduces Workflow Driven Hypertexts, defined as Web-enabled hypertextual applications serving the workflow of multiple users, and proposes a design method integrating process, data, and hypertextual modeling, expressly conceived to support the development of this class of applications.
Ioana Manolescu, Stefano Ceri, Marco Brambilla, Sara Comai, Piero Fraternali-
IEEE Transactions on Knowledge and Data Engineering 2003
While the Web consolidates as the ubiquitous application delivery platform, the features of Webapplications evolve to cover new requirements, like the capability of managing complex workflowsspanning multiple users and organizations. This scenario challenges the Web engineering methods toaddress a broader class of applications. This paper introduces workflow-driven hypertexts, defined asWeb-enabled hypertextual applications serving the workflow of multiple users, and proposes a designmethod integrating data, hypertext, and workflow modeling concepts for modeling lightweight Webenabledworkflows; this approach extends the benefits of high-level conceptual modeling and automaticcode generation to a much broader class of Web applications.
Daniel Le Berre, Laurent Simon-
IEEE Transactions on Knowledge and Data Engineering 2003 - 2919:452--467
The SAT 2003 Competition ran from February to May 2003, in conjunction with SAT'03 (the Sixth International Symposium on the Theory and Applications of Satisfiability Testing). One year after the SAT 2002 competition, it was not clear that significant progress could be made in the area in such a short time. The competition was a success -
Ioana Manolescu, Alberto Lerner-
IEEE Transactions on Knowledge and Data Engineering 2003
Benjamin Nguyen, Maria Halkidi, Iraklis Varlamis, Michalis Vazirgiannis-
IEEE Transactions on Knowledge and Data Engineering 2003 November - 12(4):320--332
The requirements for effective search and management of the WWW content are stronger than ever. Document retrieval in the WWW is problematic with regard to three aspects: i. link semantics are not adequately taken into account when characterizing documents; ii. document similarity is based on binary keyword matching, so different yet semantically close keywords fail to match; iii. search results are presented as lengthy URL lists. These observations lead us to adopt an ontology and a thesaurus to exploit links, in order to provide a better characterization of web documents. The enhanced characterization of documents makes operations such as clustering very interesting. We have devised a system called THESUS to achieve these goals. The system deals with an initial set of web pages, extracts keywords from all pages' incoming links and converts them to semantics by mapping them to a domain's ontology. Subsequently, a clustering algorithm is applied to discover document clusters. The clustering results are used for: i. web page classification; ii. speeding up (in terms of pruning) keyword-based queries. The system is fully functional and testing proved the quality of its results. The innovations lie in the integrated use of: i. incoming link semantics as the fundamental document characterization mechanism; ii. the conversion from keywords to semantics; and iii. the clustering of web content using the semantic similarity between documents.
Luc Segoufin-
IEEE Transactions on Knowledge and Data Engineering 2003
We study the complexity bounds of validating XML documents, viewed as labeled unranked ordered trees, against various
typing systems like DTDs, XML schemas, tree automata... We also consider query evaluation complexities for various
fragments of XPath. For both problems, validation and query evaluation, we consider data and combined complexity
bounds.
Pierre Senellart-
IEEE Transactions on Knowledge and Data Engineering 2003 September
I present in this paper a method to discover the set of webpages contained in a logical website, based on the link structure of the Web graph. Such a method is useful to identify the boundaries of what to crawl, in the context of Web archiving. For this purpose, I combine the use of an online version of the preflow-push algorithm, an algorithm for the maximum flow problem in traffic networks, with the Markov Cluster (MCL) algorithm. The latter is used on a crawled portion of the Web graph to build a seed of initial webpages, which the former then extends. Experiments on subsites of the INRIA Website, which give satisfactory results, are described.
Ioana Manolescu, Angela Bonifati, Andrei Arion, Gianni Costa, Andrea Pugliese, Sandra dAguanno-
IEEE Transactions on Knowledge and Data Engineering 2003 :1065--1068
Nathalie Pernelle, Veronique Ventos, Henry Soldano-
IEEE Transactions on Knowledge and Data Engineering 2003 October
The main drawback of Galois lattices is that their size can be very large when dealing with real-world data sets. One way to reduce the number of nodes is to use a language with a weak expressivity. In this paper, we propose an approach based on the new notion of alpha extension where the number of nodes is reduced according to a change of the extensional function.
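The alpha-extension semantics is specific to the paper; as a loose illustration of how weakening the extensional function can prune lattice nodes, the sketch below enumerates the closed concepts of a toy formal context and keeps only those whose extents reach a hypothetical size threshold alpha (a frequency-style pruning, not the paper's exact definition). All data and names are invented.

    from itertools import combinations

    # Toy formal context: object -> attribute set (invented for illustration).
    context = {
        "o1": {"a", "b"},
        "o2": {"a", "b", "c"},
        "o3": {"a", "c"},
        "o4": {"b", "c"},
    }
    all_attrs = set().union(*context.values())

    def extent(intent):
        # Objects possessing every attribute of the intent.
        return {o for o, attrs in context.items() if intent <= attrs}

    def closure(objects):
        # Attributes shared by every object of the set (top intent if empty).
        return set.intersection(*(context[o] for o in objects)) if objects else set(all_attrs)

    def concepts(alpha):
        # Closed (extent, intent) pairs whose extent has at least alpha objects.
        seen = set()
        for r in range(len(all_attrs) + 1):
            for cand in combinations(sorted(all_attrs), r):
                e = extent(set(cand))
                i = frozenset(closure(e))
                if len(e) >= alpha and i not in seen:  # the threshold prunes small nodes
                    seen.add(i)
                    yield sorted(e), sorted(i)

    for e, i in concepts(alpha=2):
        print(e, i)

Raising alpha discards more nodes, which mirrors the intended effect: a coarser extensional function yields a smaller lattice.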
Gr Cobena, Talel Abdessalem, Yassine Hinnach-
IEEE Transactions on Knowledge and Data Engineering 2002
Change detection is an important part of version management for databases and document archives.
The success of XML has recently renewed interest in change detection on trees and semi-structured data,
and various algorithms have been proposed.
We study here different algorithms and representations of changes based on their formal definition
and on experiments conducted over XML data from the Web.
Our goal is to provide an evaluation of (i) the quality of the results (ii) the performance of the tools.
We also consider in which context each of these algorithms can be best used.
Tova Milo, Haim Kaplan, Ronen Shabo-
IEEE Transactions on Knowledge and Data Engineering 2002
Motivated by a recent application in
XML search engines
we study the problem of labeling the nodes of
a tree (XML file) such that given the labels of two nodes
one can determine whether one node is an ancestor of
the other. We describe several new prefix-based labeling
schemes, where an ancestor query roughly amounts to testing whether one label
is a prefix of the other.
We compare our new schemes to a simple interval-based scheme
currently used by search engines, as well as, to schemes
with the best theoretical guarantee
on the maximum label length.
We performed our experimental evaluation on real XML data
and on some families of random trees.
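As a minimal sketch of the prefix idea (not any of the paper's specific schemes), the following code assigns Dewey-style labels to the nodes of a toy tree, so that the ancestor test reduces to a prefix test on labels; the tuple encoding of trees is an assumption made for illustration.

    def label_tree(node, prefix=(), labels=None):
        # Assign Dewey-style prefix labels: a child's label extends its parent's.
        if labels is None:
            labels = {}
        name, children = node
        labels[name] = prefix
        for i, child in enumerate(children):
            label_tree(child, prefix + (i,), labels)
        return labels

    def is_ancestor(la, lb):
        # la labels an ancestor of lb iff la is a proper prefix of lb.
        return len(la) < len(lb) and lb[:len(la)] == la

    # Toy XML-like tree: (tag, [children]).
    tree = ("root", [("a", [("b", []), ("c", [])]), ("d", [])])
    labels = label_tree(tree)
    print(is_ancestor(labels["root"], labels["c"]))  # True
    print(is_ancestor(labels["a"], labels["d"]))     # False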
Serge Abiteboul, Gregory Cobena, Julien Masanes, Gerald Sedrati-
IEEE Transactions on Knowledge and Data Engineering 2002
The web is an increasingly valuable source of information, and organizations are involved in archiving (portions of) it for various purposes, e.g., the Internet Archive (www.archive.org).
A new mission of the French National Library (BnF) is the ``d
Serge Abiteboul, Benjamin Nguyen-
IEEE Transactions on Knowledge and Data Engineering 2002 January
In this paper, we present both an analysis and experimental results of an efficient algorithm for subset detection. The data structure used in this algorithm is a hash-tree. The algorithm was used as the core of the subscription system in an XML data warehouse. This context sets the range of the parameters of the problem. Both the analysis and the experimental results concur to show that the system can process thousands of documents per second, on a database of millions of subscriptions.
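The hash-tree structure itself is not reproduced here; as a hedged sketch of the underlying subset test, the code below matches a document's keyword set against stored subscriptions using an inverted index and per-subscription counters, so the work per document is proportional to its postings rather than to the total number of subscriptions. Subscription contents and names are illustrative.

    from collections import defaultdict

    # Subscriptions: id -> set of required keywords (illustrative data).
    subs = {1: {"xml", "warehouse"}, 2: {"xml"}, 3: {"sax", "stream"}}

    # Inverted index: keyword -> ids of subscriptions requiring it.
    index = defaultdict(set)
    for sid, kws in subs.items():
        for kw in kws:
            index[kw].add(sid)

    def matching_subscriptions(doc_keywords):
        # A subscription fires iff all of its keywords occur in the document.
        hits = defaultdict(int)
        for kw in set(doc_keywords):
            for sid in index.get(kw, ()):
                hits[sid] += 1
        return sorted(sid for sid, n in hits.items() if n == len(subs[sid]))

    print(matching_subscriptions({"xml", "warehouse", "web"}))  # [1, 2]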
Serge Abiteboul, Tova Milo, Omar Benjelloun, Ioana Manolescu, Roger Weber-
IEEE Transactions on Knowledge and Data Engineering 2002
We present Active XML (AXML, for short), a declarative framework that harnesses web services for data integration, and is put to work in a peer-to-peer architecture. It is centered around AXML documents, which are XML documents that may contain calls to web services. The language allows both the specification of such documents and the definition of new web services based on them. While documents with embedded calls were already proposed before, AXML is the first to actually turn calls to web services embedded in XML documents into a powerful tool for web-scale data integration. In particular, the language includes linguistic features to control the timing of service call activations, the lifespan of data, and the exchange of extensional and intensional data. Various scenarios are captured, such as mediation, data warehousing, and distributed computation. A first prototype is described.
Serge Abiteboul, Tova Milo, Omar Benjelloun, Ioana Manolescu, Roger Weber-
IEEE Transactions on Knowledge and Data Engineering 2002
We demonstrate how the Active XML framework can be used to build a distributed auctioning system.
Alain Bidault-
IEEE Transactions on Knowledge and Data Engineering 2002
Chantal Reynaud, Brigitte Safar-
IEEE Transactions on Knowledge and Data Engineering 2002 January - 1:201-210
Information integration over existing sources that are distributed and possibly heterogeneous is complex. Mediator systems have in common the use of a domain model, also called an ontology. The paper addresses the benefits of using such an ontology to help users pose queries, and presents the inference mechanisms developed in the context of the PICSEL project. We illustrate this work with examples coming from the tourism domain.
Marie-Christine Rousset, Francois Goasdoue-
IEEE Transactions on Knowledge and Data Engineering 2002 :267-271
In this paper, we characterize the logical correspondence between conjunctive queries and concept descriptions. We exhibit a necessary and
sufficient condition for the compilation of a conjunctive query into an equivalent ALE concept description. We provide a necessary and sufficient
condition for the approximation of conjunctive queries by maximally subsumed ALN concept descriptions.
Bernd Amann, Philippe Rigaux-
IEEE Transactions on Knowledge and Data Engineering 2002
De m
Serge Abiteboul, Mihai Preda, Gregory Cobena-
IEEE Transactions on Knowledge and Data Engineering 2002 - 25(1):27--33
The computation of page importance in a huge dynamic graph has recently attracted a lot of attention
because of the web. Page importance or page rank is defined as the fixpoint of a matrix equation.
Previous algorithms compute it off-line and require the use of a lot of extra CPU as well as disk resources
in particular to store and maintain the link matrix of the web. We briefly discuss a new algorithm that works on-line and uses far fewer resources. In particular, it does not require storing the link matrix. It
is on-line in that it continuously refines its estimate of page importance while the web/graph is visited.
When the web changes, page importance changes as well. We modify the algorithm so that it adapts
dynamically to changes of the web. We report on experiments on web data and on synthetic data.
Serge Abiteboul, Mihai Preda, Gregory Cobena-
IEEE Transactions on Knowledge and Data Engineering 2002 March - 25
The computation of page importance in a huge dynamic graph has
recently attracted a lot of attention because of the web. Page
importance or page rank is defined as the fixpoint of a matrix
equation. Previous algorithms compute it off-line and require the use
of a lot of extra CPU as well as disk resources in particular to store
and maintain the link matrix of the web. We briefly discuss a new algorithm that works on-line and uses far fewer resources. In particular, it does not require storing the link matrix. It is
on-line in that it continuously refines its estimate of page
importance while the web/graph is visited. When the web changes, page
importance changes as well. We modify the algorithm so that it adapts
dynamically to changes of the web. We report on experiments on web
data and on synthetic data.
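A simplified, cash-based rendering of such an on-line scheme is sketched below, under the assumption that every page is visited infinitely often: each visit credits the page's accumulated cash to its history and redistributes it to its out-links, and the normalized histories converge toward relative page importance. The toy graph and the uniform random visit policy are illustrative choices, not the paper's.

    import random

    # Toy link graph: page -> out-links (illustrative; no dangling pages).
    graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

    cash = {p: 1.0 / len(graph) for p in graph}  # cash currently held by a page
    history = {p: 0.0 for p in graph}            # total cash seen so far

    def visit(page):
        # Credit the page's cash to its history, then distribute it to out-links.
        history[page] += cash[page]
        share = cash[page] / len(graph[page])
        cash[page] = 0.0
        for succ in graph[page]:
            cash[succ] += share

    for _ in range(100000):
        visit(random.choice(list(graph)))

    total = sum(history.values())
    print({p: round(history[p] / total, 3) for p in graph})

Because the estimate is refreshed at every visit, it can track changes in the graph without recomputing a global fixpoint from scratch.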
Serge Abiteboul, Gregory Cobena, Benjamin Nguyen, Antonella Poggi-
IEEE Transactions on Knowledge and Data Engineering 2002 October
In this article, we are interested in
Marie-Christine Rousset, Chantal Reynaud, Francois Goasdoue, Brigitte Safar, Helene Gagliardi, Alain Bidault, Christine Froidevaux-
IEEE Transactions on Knowledge and Data Engineering 2002 - 2:09--59
The growing amount of data accessible over networks (intranet, internet, etc.) raises the problem of integrating pre-existing, often remote and heterogeneous information sources, so as to make them easier to query by a broad public. One of the first approaches proposed for information integration advocates the construction of mediators. A mediator acts as a query interface between a user and the data sources. It gives the user the illusion of querying a homogeneous, centralized system, sparing him the need to find the data sources relevant to his query, to query them one by one, and to combine the results himself. The goal of this article is to present the PICSEL project, which offers a declarative environment for building mediators. PICSEL differs from existing information integration systems in that the mediator schema can be expressed in a language (Carin) combining the expressive power of a rule-based formalism with that of a class-based formalism (the description logic ALN). PICSEL includes a query refinement module, the first building block of a cooperative dialogue module between a mediator and its users.
Chantal Reynaud, Gloria Giraldo-
IEEE Transactions on Knowledge and Data Engineering 2002 May
A mediator system is composed of a global schema, or domain ontology, whose role is twofold. First, it connects all the autonomous and distributed sources to each other. Second, it allows users to pose queries in terms of the mediated schema rather than directly in terms of the source schemas. In this paper, we focus on automating the construction of an ontology in the context of a domain-specific mediator querying XML documents. The ontology that we want to build is a class-based schema. The approach is based on the DTDs associated with the XML documents. The paper describes it and presents results from the first experiments.
Serge Abiteboul, Sophie Cluet, Tova Milo-
IEEE Transactions on Knowledge and Data Engineering 2002 - 275(1-2):179-213
A primary motivation for new database technology is to provide support
for the broad spectrum of multimedia data available notably through
the network. These data are stored under different formats: relational
or ODMG model (in databases), SGML or LaTex (documents), DX formats
(data exchange formats in scientific data), Step (CAD/CAM data),
etc. Their integration is a very active field of research and
development. In this paper, we provide a formal foundation to
facilitate the integration of such heterogeneous data and the
maintenance of heterogeneous replicated data.
Serge Abiteboul, Amelie Marian, Gregory Cobena-
IEEE Transactions on Knowledge and Data Engineering 2002
We present a 'diff' algorithm for XML data. This work is
motivated by the support for change control in the context of the
Xyleme project that is investigating warehouses capable of storing
massive volume of XML data found on the web. Because of the context,
the algorithm has to be very efficient in terms of speed and memory
space, even at the cost of some loss of 'quality'. Also, it
considers, besides insertions, deletions and updates (standard in
diffs), a 'move' operation on subtrees that is essential in the
context of XML.
Intuitively, our 'diff' algorithm uses signatures to match (large)
subtrees that were left unchanged between the old and new versions.
Such exact matchings are then possibly propagated to ancestors and
descendants to obtain more matchings. We provide a performance
analysis of the algorithm. We show that it runs on average in linear
time vs. quadratic time for previous algorithms. We present
experiments on synthetic data that confirm the analysis. This linear
time is obtained by trading some quality. We present experiments
(again on synthetic data) that show that the output of the algorithm
is reasonably close to the optimal in terms of quality. Finally we
present experiments on a small set of versions of XML pages found on
the Web.
Keywords: XML, Diff, Delta, Tree Pattern Matching, Change Control, Version.
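A hedged sketch of the signature step is given below: bottom-up hashes identify subtrees that are identical across two versions, which a diff can then treat as unchanged (possibly moved) and propagate to ancestors and descendants. The tuple tree encoding and the use of SHA-1 are illustrative assumptions, not the paper's choices.

    import hashlib

    def signature(node):
        # Hash a subtree from its label and its children's signatures.
        label, children = node
        h = hashlib.sha1(label.encode())
        for child in children:
            h.update(signature(child).encode())
        return h.hexdigest()

    def collect(node, table=None):
        # Index every subtree of a version by its signature.
        if table is None:
            table = {}
        table.setdefault(signature(node), []).append(node)
        for child in node[1]:
            collect(child, table)
        return table

    old = ("doc", [("a", [("b", [])]), ("c", [])])
    new = ("doc", [("c", []), ("a", [("b", [])])])  # the 'a' subtree moved

    old_sigs = collect(old)
    unchanged = [n for s, ns in collect(new).items() if s in old_sigs for n in ns]
    print([n[0] for n in unchanged])  # ['c', 'a', 'b']: candidate matches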
Serge Abiteboul-
IEEE Transactions on Knowledge and Data Engineering 2002
The web is turning from a collection of static documents to a global
source of dynamic knowledge. First, HTML is increasingly complemented
by the more structured XML format augmented with mechanisms for
providing semantics such as RDF. Also, we believe that the very
passive vision of web sites promoted by the current HTTP will soon
turn into a much more active one based on pub/sub and web services.
Thus, the monitoring of the web is likely to raise new challenges for
many years. We consider here some applications involving web
monitoring and some issues they raise.
Tova Milo, Haim Kaplan, Edith Cohen-
2002
We present algorithms to label the nodes of an XML tree which is
subject to insertions and deletions of nodes. The labeling is done
such that (1) we label each node immediately when it is inserted and
this label remains unchanged, and (2) from a pair of labels alone, we
can decide whether one node is an ancestor of the other. This problem
arises in the context of XML databases that support queries on the
structure of the documents as well as on the changes made to the
documents over time. We prove that our algorithms assign the shortest
possible labels (up to a constant factor) which satisfy these
requirements.
We also consider the same problem when "clues" that provide guarantees
on possible future insertions are given together with newly inserted
nodes. Such clues can be derived from the DTD or from statistics on
similar XML trees. We present algorithms that use the clues to
assign shorter labels. We also prove that the length of our labels is
close to the minimum possible.
Marie-Christine Rousset, Alexandre Termier, Michele Sebag-
2002
The paper aims at extracting common tree structures in heterogeneous XML data. This approach extends Frequent Item Set techniques to tree
structure representation. We provide an algorithm which extracts all subtrees embedded in the data above a given frequency threshold. As a
result, this technique provides both a clustering of the data and a symbolic characterization of each cluster. This characterization consists of the
maximally specific subtrees embedded in all the data trees of each cluster. The scalability of the algorithm is investigated on artificial
medium-sized data.
Luc Segoufin, Martin Grohe-
IEEE Transactions on Knowledge and Data Engineering 2002 - 3(3):336-358
One important class of spatial database queries is the class of topological queries, i.e. queries invariant under homeomorphisms. We study topological queries expressible in the standard query language on spatial databases, first-order logic with various amounts of arithmetic. Our main technical result is a combinatorial characterization of the expressive power of topological first-order logic on regular spatial databases.
Catriel Beeri, Bernd Amann, Michel Scholl, Irini Fundulaki-
2002
This paper deals with modeling issues
encountered in the context of the integration of
heterogeneous XML sources in the C-Web project. The goal of this project is to design and implement tools for building semantic portals
for communities of users sharing XML Web resources
in a common domain of interest.
The backbone of such a portal is a domain ontology
which is used as the common interface component
for integrating, querying and exchanging information.
Sophie Cluet, Vincent Aguilera, Frederic Boiscuvier, Bruno Koechlin-
2002
The main paradigm underlying a query language for
XML is that of selecting data by matching patterns
described by path expressions. These path
expressions describe traversals along sub-elements
with variables getting bound to nodes along these
paths. Hence, a key operation in query processing
over XML is to produce a set of bindings for
variables, given a pattern consisting of several
path expressions.
In this paper we focus on the implementation of a
special purpose algebraic operator, called
PatternScan, which captures this paradigm.
A PatternScan filters a forest of XML documents
according to some pattern and returns a set
with one tuple per matching tree structure found
somewhere in the forest.
Given an inverted index based on a numbering
scheme for tags/words, we show that a PatternScan
can be evaluated as a sequence of joins between
posting lists, without the need for intermediate
data structures. Furthermore, the evaluation
time is in most cases linear both in the size of
the pattern and the average number of postings
per pattern node.
We believe that those properties are particularly
interesting when considering applications such as
web-scale XML search engines, where the time spent
to compute the first results and the memory overhead of a single query must be as low
as possible.
Experimental results from our system, based on
documents and queries from the XML Benchmark Project XmlBenchmark,
show that the proposed algorithm can dramatically
improve the execution time of a large class of XML queries, together with a relatively low overhead in terms of space needed by the index.
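To make the 'joins between posting lists' concrete, here is a naive sketch: postings carry (document, pre-order, post-order) positions, so an ancestor-descendant step of a path pattern becomes a join on interval containment. The paper evaluates such joins as linear merges over sorted lists; the nested-loop version and the index contents below are simplifications for illustration.

    # Postings: tag -> list of (doc, pre, post) positions, where element A is
    # an ancestor of D iff A.pre < D.pre and A.post > D.post.
    # The index contents are invented for illustration.
    postings = {
        "book":  [(1, 1, 10), (2, 1, 6)],
        "title": [(1, 2, 3), (1, 5, 6), (2, 4, 5)],
    }

    def structural_join(anc_tag, desc_tag):
        # Join two posting lists on the ancestor-descendant relation.
        pairs = []
        for doc_a, pre_a, post_a in postings[anc_tag]:
            for doc_d, pre_d, post_d in postings[desc_tag]:
                if doc_a == doc_d and pre_a < pre_d and post_a > post_d:
                    pairs.append(((doc_a, pre_a), (doc_d, pre_d)))
        return pairs

    # Bindings for the pattern //book//title, one tuple per match.
    print(structural_join("book", "title"))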
Brigitte Safar, Alain Bidault, Christine Froidevaux-
IEEE Transactions on Knowledge and Data Engineering 2002 - 2:653-662
Searching for information across multiple, heterogeneous sources is complex. Mediator approaches propose solutions that exploit a domain ontology. In this paper, we describe how the ontology can help to evaluate the proximity of two queries and, when the query posed by the user fails, makes it possible to choose a close query. The work presented was carried out within the PICSEL project. The examples cited come from the domain of tourism products, the experimentation domain chosen within the project.
Catriel Beeri, Bernd Amann, Michel Scholl, Irini Fundulaki-
IEEE Transactions on Knowledge and Data Engineering 2002
In this paper we propose a mediator architecture for the querying and integration of
Web-accessible XML data sources. Our contributions are (i) the definition of a simple but
expressive mapping language, following the local as view approach and describing XML
resources as local views of some global schema, and (ii) efficient algorithms for rewriting
user queries according to existing source descriptions. The approach has been validated
by the StyX prototype.
Chantal Reynaud, Brigitte Safar-
IEEE Transactions on Knowledge and Data Engineering 2002 Octobre
An information integration system provides a uniform query interface to a collection of autonomous and distributed sources, connected to each other thanks to a global mediated schema, called a domain ontology. The problem addressed in the paper is how to represent such an ontology in CARIN-ALN, a formalism combining classes and rules. We focus on the choices for representing classes, properties and constraints using the characteristics of the formalism. We also propose a two-step method for representing a domain ontology in the framework of a mediator. The first step is directed by the formalism and the functionalities of the mediator. The second step is an optimization phase guided by the way the functionalities of the mediator are implemented.
Brigitte Safar, Alain Bidault, Christine Froidevaux-
IEEE Transactions on Knowledge and Data Engineering 2002 :235-239
Answering queries on remote and heterogeneous sources
is a complex task. Mediators provide techniques that exploit a domain ontology.
To propose a cooperative approach to repairing queries without answers, we introduce the notion of a solution, which is a query close to the user's query. We describe how ontologies can be used to evaluate the similarity of two predicates and then of two queries. The work presented has been developed in the PICSEL project. Examples come from the domain of tourism, the experimentation domain chosen for this project.
Catriel Beeri, Bernd Amann, Michel Scholl, Anne-Marie Vercoustre, Irini Fundulaki-
IEEE Transactions on Knowledge and Data Engineering 2002
Marie-Christine Rousset-
IEEE Transactions on Knowledge and Data Engineering 2002 - 17(2)
The main issue for a Web ontology language is less the choice of a language for describing ontologies than the choice of a language for
representing mappings between ontologies.
Serge Abiteboul, Sophie Cluet, Guy Ferran, Marie-Christine Rousset-
IEEE Transactions on Knowledge and Data Engineering 2002 - 39:225-238
The current development of the Web and the generalization of XML technology provide a major opportunity to radically change the
management of distributed data. We have developed a prototype of a dynamic warehouse for XML data, namely Xyleme. In the present paper, we
briefly present some motivation and important aspects of the work performed in the framework of the Xyleme project.
Marie-Christine Rousset, Alexandre Termier, Michele Sebag-
2002
In this paper, we consider the problem of searching for frequent trees in a collection of tree-structured data modeling XML data. The TreeFinder algorithm aims at finding trees such that their exact or perturbed copies are frequent in a collection of labelled trees. To cope with complexity issues, TreeFinder is correct but not complete: it finds a subset of the actually frequent trees. The lack of completeness is experimentally investigated on artificial medium-sized datasets; TreeFinder is shown to reach completeness, or to fall only slightly short of it, for a range of experimental settings.
Victor Vianu, Luc Segoufin-
IEEE Transactions on Knowledge and Data Engineering 2002
This paper investigates the on-line validation of streaming XML documents with respect to a DTD, under memory constraints. We first consider validation using constant memory, formalized by a finite-state automaton (FSA). We examine two flavors of the problem, depending on whether or not the XML document is assumed to be well-formed. The main results of the paper provide conditions on the DTDs under which validation of either flavor can be done using an FSA. For DTDs that cannot be validated by an FSA, we investigate two alternatives. The first relaxes the constant memory requirement by allowing a stack bounded in the depth of the XML document, while maintaining the deterministic, one-pass requirement. The second approach consists in refining the DTD to provide additional information that allows validation by an FSA.
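The stack-bounded alternative can be pictured with a short sketch: content models are regular expressions over child tags, and a single pass over open/close events keeps one frame per open element, checking each element's children against its model on close. The hypothetical DTD and event encoding are assumptions for illustration.

    import re

    # DTD content models as regular expressions over child tags
    # (a hypothetical DTD, for illustration).
    dtd = {"book": r"title(chapter)+", "title": r"", "chapter": r""}

    def validate(events):
        # One-pass validation with a stack bounded by document depth.
        stack = []  # frames of (tag, list of child tags seen so far)
        for kind, tag in events:
            if kind == "open":
                if stack:
                    stack[-1][1].append(tag)
                stack.append((tag, []))
            else:  # close
                top, children = stack.pop()
                if top != tag or not re.fullmatch(dtd[top], "".join(children)):
                    return False
        return not stack

    stream = [("open", "book"), ("open", "title"), ("close", "title"),
              ("open", "chapter"), ("close", "chapter"), ("close", "book")]
    print(validate(stream))  # True

Concatenating tag names without a separator is safe only when no tag is a prefix of another; a real implementation would tokenize children properly.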
Chantal Reynaud, Gloria Giraldo-
IEEE Transactions on Knowledge and Data Engineering 2002
Sophie Cluet, Tova Milo, Vincent Aguilera, Pierangelo Veltri, Dan Vodislav-
IEEE Transactions on Knowledge and Data Engineering 2002
We are interested in defining and querying views in a huge and highly heterogeneous XML repository (Web scale). In this context, view definitions are very large, involving lots of sources, and there is no apparent limitation to their size. This raises interesting problems that we address in the paper: (i) how to distribute views over several machines without having a negative impact on the query translation process; (ii) how to quickly select the relevant part of a view given a query; (iii) how to minimize the cost of communicating potentially large queries to the machines where they will be evaluated. The solution that we propose is based on a simple view definition language that allows for automatic generation of views. The language maps paths in the view abstract DTD to paths in the concrete source DTDs. It enables a distributed implementation of the view system that is scalable both in terms of data and load. In particular, the query translation algorithm is shown to have a good (linear) complexity.
Serge Abiteboul, Tova Milo, Omar Benjelloun-
IEEE Transactions on Knowledge and Data Engineering 2002
The developments of XML and Web services (e.g. SOAP) are
changing radically the art of data integration. We briefly describe
Web services and consider some of their impact on data integration.
We argue that XML and Web services provide the proper infrastructure
for data integration at the Web scale. This is illustrated by some
work going on at INRIA on Active XML, that is XML extended by
allowing the embedding of calls to Web services.
Serge Abiteboul, Tova Milo, Omar Benjelloun-
IEEE Transactions on Knowledge and Data Engineering 2002
Nathalie Pernelle, Marie-Christine Rousset, Veronique Ventos, Henry Soldano-
IEEE Transactions on Knowledge and Data Engineering 2002 september - 14(2):157--187
This paper deals with the representation of multi-valued data by clustering them in a small number of classes organized in a hierarchy and described at an appropriate level of abstraction. The contribution of this paper is threefold. First, we investigate a partial order, namely nesting, relating Galois lattices. A nested Galois lattice is obtained by reducing (through projections) the original lattice. As a consequence, it makes coarser the equivalence relations defined on extents and intents. Second, we investigate the intensional and extensional aspects of the languages used in our system ZooM. In particular, we discuss the notion of alpha-extension of terms of a class language L. We also present our most expressive language L3, close to a description logic, which expresses optionality and/or multi-valuation of attributes. Finally, the nesting order between the Galois lattices corresponding to various languages and extensions is exploited in the interactive system ZooM. Typically, a ZooM session starts from a propositional language L2 and a coarse view of the data (through alpha-extension). Then, the user selects two ordered nodes in the lattice and ZooM constructs a fine-grained lattice between the antecedents of these nodes. The general purpose of ZooM is thus to give a general view of concepts covering a large data set, and then to focus on part of this coarse taxonomy.
Serge Abiteboul, Tova Milo, Omar Benjelloun-
2001
We propose a peer-based architecture that allows for the integration
of distributed data and web services. It relies on a language, Active
XML, where (1) documents embed calls to web services that are used to
enrich them, and (2) new web services may be defined by XQuery queries
on such active documents.
Embedding calls to functions or even to web services inside data is
not a new idea. Our contribution, however, is in turning them into a
powerful tool for data and services integration. In particular, the
language includes linguistic features to control the time of service
call activations. Various scenarios are captured, such as mediation,
data warehousing, and distributed computation. A first prototype is
described.
Lucie Xyleme-
IEEE Transactions on Knowledge and Data Engineering 2001 - 24(2):40--47
Xyleme is a dynamic warehouse for XML data of the Web supporting query evaluation, change control and data integration. We briefly present our motivations, the general architecture and some aspects of Xyleme. The project we describe here was completed at the end of 2000. A prototype has been implemented. This prototype is now being turned into a product by a start-up company also called Xyleme.
Luc Segoufin, Michael Benedikt, Leonid Libkin, Thomas Schwentick-
IEEE Transactions on Knowledge and Data Engineering 2001
We study algebras of definable string relations -- classes of regular $n$-ary relations that arise as the definable sets within a model whose carrier is the set of all strings. We show that the largest such algebra -- the collection of regular relations -- has some quite undesirable computational and model-theoretic properties. In contrast, we exhibit several definable relation algebras that have much tamer behavior: for example, they admit quantifier elimination, and have finite VC dimension. We show that the properties of a definable relation algebra are not at all determined by the one-dimensional definable sets. We give models whose definable sets are all star-free, but whose binary relations are quite complex, as well as models whose definable sets include all regular sets, but which are much more restricted and tractable than the full algebra of regular relations.
Claude Delobel, Marie-Christine Rousset-
2001
We are interested in defining and querying large and highly heterogeneous database repository for XML data and their DTDs. We consider a
very simple data model with data trees and tree types for describing the data sources and a very simple query language based on tree queries
with boolean conditions. The main components of the data model are a mediated schema as a view of a set of concrete tree types and a mapping
expressing the connection between the mediated schema and the concrete tree types. We provide a general tree query semantics and a
characterization of tree queries as composition of branch queries. We show also that we can pre-evaluate queries by storing some specific
meta-information related to the mapping. We also consider the problem of query containment for tree queries, and we propose some relaxation policies for queries without answers. The research presented here was motivated by the Xyleme project at INRIA, whose objective is to develop a data warehouse for Web XML documents.
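The branch-query characterization can be illustrated with a small sketch that decomposes a tree query into its root-to-leaf branches; answering the tree query then amounts to composing the answers of these branch queries. The tuple encoding of queries below is an assumption for illustration.

    # A tree query as (label, condition, children); illustrative encoding.
    query = ("product", None, [
        ("name", "contains 'laptop'", []),
        ("price", "< 1000", []),
    ])

    def branches(node, path=()):
        # Decompose a tree query into its root-to-leaf branch queries.
        label, cond, children = node
        path = path + ((label, cond),)
        if not children:
            yield path
        for child in children:
            yield from branches(child, path)

    for b in branches(query):
        print(" / ".join(lbl + (f" [{c}]" if c else "") for lbl, c in b))

Each printed line is one branch query; a document answers the tree query only if it answers every branch with compatible root bindings.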
Serge Abiteboul, Bernd Amann, Laurent Mignet, Amelie Marian, Mihai Preda-
2001
We consider the acquisition and maintenance of XML data found on the web. More
precisely, we study the problem of discovering XML data on the web, i.e., in a
world still dominated by HTML, and keeping it up to date with the web as best
as possible, under fixed resources. We present a distributed architecture that
is designed to scale to the billions of pages of the web. In particular, the
distributed management of metadata about HTML and XML pages turns out to be an
interesting issue. The scheduling of page fetching is guided by the
importance of pages, their expected change rate, and subscriptions/
publications of users. The importance of XML pages is defined in the standard
manner based on the link structure of the web graph. It is computed by a
matrix fixpoint computation. HTML pages are of interest for us only in that
they lead to XML pages. Thus their importance is defined in a different manner
and their computation also involves a fixpoint but on the transposed link
matrix this time. The general scheduling problem is stated as an optimization
problem that dispatches the resources to various tasks such as (re)reading HTML
pages for discovering new XML pages or refreshing already known XML pages.
This work is part of the Xyleme system. Xyleme is developed on a cluster of
PCs under Linux with Corba communications. The part of the system described in
this paper has been implemented. We mention first experiments.
Nathalie Pernelle, Marie-Christine Rousset, Veronique Ventos-
IEEE Transactions on Knowledge and Data Engineering 2001 :386-398
In many applications, it becomes crucial to help users access
a huge amount of data by clustering them in a small number of
classes described at an appropriate level of abstraction.
In this paper, we present an approach based on the use of two languages
of description of classes for the automatic clustering of multi-valued data.
The first language of classes has a high power of abstraction and guides the construction of a lattice of classes covering the whole set of the data.
The second language of classes, more expressive and more precise, is the basis for the refinement of a part of the lattice that the user wants to focus on.
Nathalie Pernelle, Marie-Christine Rousset, Veronique Ventos-
IEEE Transactions on Knowledge and Data Engineering 2001
In many applications, it becomes crucial to help users access a huge amount of data by clustering them in a small number of classes described at an appropriate level of abstraction. In this paper, we present an approach based on the use of two languages of description of classes for the automatic clustering of semistructured data. The first language of classes has a high power of abstraction and guides the construction of a lattice of classes covering the whole set of the data. The second language of classes, more expressive and more precise, is the basis for the refinement of a part of the lattice that the user wants to focus on. Our approach has been implemented and experimented on real data in the setting of the GAEL project, which aims at building flexible electronic catalogs organized as a hierarchy of classes of products. Our experiments have been conducted on real data coming from the C/Net (http://www.cnet.com) electronic catalog of computer products.
Serge Abiteboul, Laurent Mignet, Amelie Marian, Gregory Cobena-
IEEE Transactions on Knowledge and Data Engineering 2001
We present a change-centric method to manage versions in a Web warehouse of XML data. By change-centric we mean that we focus on deltas, i.e., the changes themselves, as opposed to other approaches that might focus on snapshots or object history. We introduce a logical representation of changes based on deltas and unique identifiers. We present a novel way of allocating persistent identifiers to nodes. We analyze several implementation issues and compare different storage policies. Based on some requirements (certain functionalities that we want to support efficiently), we motivate our choice of a specific storage policy. We store the last (most current) version and a document containing a list of 'completed' forward deltas with some information on the persistent identifiers. This work is part of the Xyleme system. Xyleme is developed on a cluster of PCs under Linux with Corba communications. The part of the system described in this paper has been implemented. We also mention first experiments.
Marie-Christine Rousset, Alexandre Termier, Michele Sebag-
IEEE Transactions on Knowledge and Data Engineering 2001
A new approach for constructing pseudo-keywords, referred to as Sense Units, is proposed. Sense Units are obtained by a word clustering process, where the underlying similarity reflects both statistical and semantic properties, respectively detected through Latent Semantic Analysis and WordNet. Sense Units are used to recode documents and are evaluated from the performance increase they permit in classification tasks. Experimental results show that accounting for semantic information in fact decreases performance compared to LSI standalone. The main weaknesses of the current hybrid scheme are discussed and several avenues for improvement are sketched.
Serge Abiteboul, Tova Milo, Haim Kaplan-
IEEE Transactions on Knowledge and Data Engineering 2001
No abstract yet.
Pierangelo Veltri-
2001
We consider the problem of solving a large number of simple systems
of linear constraints. This problem occurs in the context of constraint
databases.
The developed methodology is based on a hierarchical evaluation of
the constraints which are first simplified, and replaced by approximations.
We focus on systems of linear constraints over the reals, which model spatial
objects, and consider both geometric and topological approximations, defined
with very simple constraints. We show that these constraints can be used either
to solve the initial systems, or at least to filter out unsatisfiable systems.
The main contribution of the paper
is the development of a set of rewriting rules that allow the transformation
of spatial object queries into equivalent ones that make use of the approximation, reducing
the query evaluation cost.
Nathalie Pernelle, Marie-Christine Rousset, Veronique Ventos-
IEEE Transactions on Knowledge and Data Engineering 2001 - 1(1-2):81-92
In many applications, it becomes crucial to help users access a huge amount of data
by clustering them in a small number of classes described at an appropriate level of abstraction.
In this paper, we present an approach based on the use of two languages of description of classes
for the automatic clustering of semistructured data.
The first language of classes has a high power of abstraction and guides the construction
of a basic lattice of classes covering the whole set of the data.
The second language of classes, more expressive and more precise, is the basis for the
refinement of a part of the lattice that the user wants to focus on.
Laurent Simon, Alvaro del Val-
IEEE Transactions on Knowledge and Data Engineering 2001 :359--365
We present an extensive experimental study of consequence-finding algorithms based on kernel resolution, using both a trie-based and a novel ZBDD-based implementation, which uses Zero-Suppressed Binary Decision Diagrams to concisely store and process very large clause sets. Our study considers both the full prime implicate task and applications of consequence-finding for restricted target languages in abduction, model-based diagnosis, and polynomially-bounded knowledge compilation. We show that the ZBDD implementation can push consequence-finding to a new limit, solving problems which generate over 10^70 clauses.
Veronique Ventos, Pierre Brezellec, Henry Soldano-
2001
This work concerns the use of default knowledge in concept learning from positive and negative examples. Two connectives are added to a description logic, C-CLASSIC, previously defined for concept learning. The new connectives ($\delta$ and $\epsilon$) make it possible to express the idea that some properties of a given concept definition are default properties, and that some properties that should belong to the concept definition actually do not (these are excepted properties). When performing concept learning, both hypotheses and examples are expressed in this new description logic, but prior to learning, a saturation process using default and non-default rules has to be applied to the examples in order to add default and excepted properties to their definition. As in the original C-CLASSIC, disjunctive learning is performed using a standard greedy set covering algorithm whose generalization operator is the Least Common Subsumer operator of C-CLASSIC$_{\delta \epsilon}$. We exemplify concept learning using default knowledge in this framework and show that explicitly expressing default knowledge may result in simpler concept definitions.
Chantal Reynaud, Jean Charlet, R Teulier.-
IEEE Transactions on Knowledge and Data Engineering 2001
Knowledge engineering has been for many years an active field of artificial intelligence research, concerned with the design and modeling of knowledge-based systems (KBS). Like many other modeling disciplines, it consists in designing systems whose operation operationalizes knowledge about the handling or resolution of a given problem. (Semi-)automatic problem solving involves two essential steps: modeling the problem and a resolution method within a given theoretical framework, and then computationally operationalizing the resulting model.
Catriel Beeri, Bernd Amann, Michel Scholl, Anne-Marie Vercoustre, Irini Fundulaki-
2001
Serge Abiteboul, Mihai Preda, Gregory Cobena, Benjamin Nguyen-
IEEE Transactions on Knowledge and Data Engineering 2001 July
We consider the monitoring of a flow of incoming documents. More
precisely, we present here the monitoring used in a very large
warehouse built from XML documents found on the web. The flow of
documents consists of XML pages (that are warehoused) and HTML pages
(that are not). Our contributions are the following:
- a subscription language which specifies the monitoring of
pages when fetched, the periodical evaluation of continuous queries and
the production of XML reports.
- the description of the architecture of the system we implemented
that makes it possible to monitor a flow of millions of pages per day with millions of subscriptions on a single PC, and scales up by using more machines.
- a new algorithm for processing alerts that can be used in a
wider context.
We support monitoring at the page level (e.g., discovery of a new page
within a certain semantic domain) as well as at the element level
(e.g., insertion of a new electronic product in a catalog).
This work is part of the Xyleme system. Xyleme is
developed on a cluster of PCs under Linux with Corba communications.
The part of the system described in this paper has been implemented.
We also mention first experiments.
Keywords: XML, HTML, Change Control, Monitoring, Continuous Query,
Warehouse, Web, Subscription.
Philippe Chatalic, Laurent Simon-
IEEE Transactions on Knowledge and Data Engineering 2001 - 10(4):451--481
This paper presents a system based on new operators for handling sets of propositional clauses compactly represented by means of ZBDDs. The high compression power of such data structures allows efficient encodings of structured instances. A specialized operator for the distribution of sets of clauses is introduced and used for performing multiresolution on clause sets. Cut eliminations between sets of clauses of exponential size may then be performed using polynomial size data structures. The ZRes system, a new implementation of the Davis-Putnam procedure of 1960, solves two hard problems for resolution that are currently out of the scope of the best SAT provers.
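The ZBDD representation is the heart of the system and is not reproduced here; as a hedged, set-based sketch of the underlying 1960 Davis-Putnam loop that multiresolution accelerates, the code below eliminates variables one by one, replacing all clauses on a variable by their resolvents. Clause sets are stored explicitly, which is exactly the blow-up that ZBDDs avoid.

    def resolve_out(clauses, var):
        # Eliminate var by adding all resolvents on var (Davis-Putnam step).
        pos = [c for c in clauses if var in c]
        neg = [c for c in clauses if -var in c]
        rest = {c for c in clauses if var not in c and -var not in c}
        for p in pos:
            for n in neg:
                r = frozenset((p - {var}) | (n - {-var}))
                if not any(-lit in r for lit in r):  # drop tautologies
                    rest.add(r)
        return rest

    # CNF over variables 1..3: (1 or 2) and (-1 or 3) and (-2) and (-3).
    clauses = {frozenset(c) for c in [{1, 2}, {-1, 3}, {-2}, {-3}]}
    for v in (1, 2, 3):
        clauses = resolve_out(clauses, v)
        if frozenset() in clauses:  # the empty clause was derived
            print("unsatisfiable")
            break
    else:
        print("satisfiable")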
Fanny Wattez-
2001
This thesis studies how to optimize queries over data whose structure
is a tree.
On the one hand, data are stored in the OODBMS O$_{2}$.
While running a benchmark of OQL queries, we emphasize that this
system does not cope well with large associative accesses and we
propose solutions to improve performances. Moreover, we show the pros
and cons of navigation versus joins in a real system.
On the other hand, data are XML documents, i.e., semi-structured data.
In this context, optimization techniques are different. In order to
answer quickly structured queries over a huge number of documents,
which technique can we use? We have chosen to index documents in
full-text indexes. We explain how we extend this technology and how,
once documents are indexed, queries are evaluated.
Sophie Cluet, Vincent Aguilera, Frederic Boiscuvier-
2001
In this paper, we focus on the implementation of
a special purpose algebraic operator, called
PatternScan, using the full text index technology.
Sophie Cluet, Vincent Aguilera, Fanny Wattez, Pierangelo Veltri, Dan Vodislav-
2001
XML is becoming a standard to represent and exchange structured data over
the Web. Undoubtedly, more and more
data on the Web will be in XML form, as more and more DTDs will
be available to describe it. Given
this, the Xyleme project proposes to study and build a dynamic World Wide
XML distributed data warehouse, capable of storing a very large
number of XML data available on the Web. Among the many problems addressed by the Xyleme
project, in this paper, we focus on query processing in Xyleme.
In particular: (i) how to query an integrated view, (ii) how documents are
indexed in the repository, and (iii) how the algebraic plans are distributed
and evaluated.
Serge Abiteboul, Victor Vianu, Luc Segoufin-
IEEE Transactions on Knowledge and Data Engineering 2001
We study the representation and querying of XML with incomplete information. We consider a simple model for XML data and their DTDs, a very simple query language, and a representation system for incomplete information in the spirit of the representation systems developed by Imielinski and Lipski for relational databases. In the scenario we consider, the incomplete information about an XML document is continuously enriched by successive queries to the document. We show that our representation system can represent partial information about the source document acquired by successive queries, and that it can be used to intelligently answer new queries. We also consider the impact on complexity of enriching our representation system or query language with additional features. The results suggest that our approach achieves a practically appealing balance between expressiveness and tractability. The research presented here was motivated by the Xyleme project at INRIA, whose objective is to develop a data warehouse for Web XML documents.
Catriel Beeri, Bernd Amann, Michel Scholl, Anne-Marie Vercoustre, Irini Fundulaki-
2001
A Community Web portal is a set of tools for a community
of people who want to share information on a certain domain via the
Web. The backbone of this system is an ontology which is
understood by all members of the community and used as the common
interface component for integrating, querying, structuring and
exchanging information. This paper describes a simple but
nevertheless powerful language for integrating XML documents into a
Community Web system. More precisely, we describe how to add and
exploit XPath enabled Web servers by mapping standard XPath location
paths to conceptual paths in the system ontology. We present a query
rewriting algorithm using such mappings. It transforms a user query
into a set of XML queries based on XPath patterns for selecting XML
fragments. Finally, we describe some evaluation techniques for
reducing the size of XML data returned by the XML source for query
evaluation.
Philippe Chatalic, Laurent Simon-
IEEE Transactions on Knowledge and Data Engineering 2001 June:259--270
SatEx is a web site devoted to SAT experimentation.
Philippe Chatalic, Laurent Simon-
IEEE Transactions on Knowledge and Data Engineering 2001 June
SatEx is a web site devoted to SAT experimentation. It is not only a front end to a database gathering an exhaustive number of executions, but it also allows dynamic results synthesis as well as detailed explorations of experimentation results. Being dynamically generated and constantly updated and improved, this site can be considered as an almost always up-to-date SAT experimentation paper. At the time of writing, SatEx presents the results of more than 450 CPU days on a recent machine. In a few months, this site has been well received by the SAT community and has reached more than 20000 hits. The SatEx site is available at http://www.lri.fr/~simon/satex/satex.php3.
Laurent Simon-
IEEE Transactions on Knowledge and Data Engineering 2001
The SatEx web site is devoted to experimentation around SAT. It is not only a front end to a database gathering an exhaustive number of executions, but it also allows dynamic result synthesis as well as detailed exploration of all experiments. Being dynamically generated and constantly evolving, this site can be considered as an always up-to-date SAT experimentation paper. At this time, the first release of SatEx (http://www.lri.fr/~simon/satex/satex.php3) is already used by the SAT community, with more than 15000 hits in a few months. The new release will present the results of more than 450 CPU days on a recent machine and will incorporate some major improvements.
Dan Vodislav, Chantal Reynaud, Jean-Pierre Sirot-
IEEE Transactions on Knowledge and Data Engineering 2001 July
With the current explosion of data, retrieving and integrating
information from various sources is a critical problem. The designer has
to specify a mediated schema providing a homogeneous view on the sources.
In this paper, we report on initial work toward automatically generating mappings between elements in the sources and in the mediated schema. The information sources we are interested in are XML documents conforming to a Document Type Definition (DTD). We describe the Xyleme project, which is the context of this study. We present our approach, implemented in the SAMAG system, to automatically find mappings on the basis of semantic and structural criteria. Finally, we report the first results of an experiment in which SAMAG has been applied to XML documents in the cultural domain.
Luc Segoufin, Michael Benedikt, Leonid Libkin, Thomas Schwentick-
IEEE Transactions on Knowledge and Data Engineering 2001
We study relational calculi with support for string operations. Most prior proposals were based on adding the operation of concatenation to first-order logic. Such an extension is problematic as the relational calculus becomes computationally complete, which in turn implies strong limits on the ability to perform optimization and static analysis of properties such as query safety. In contrast, we look at extensions of relational calculus that have nice expressiveness, decidability, and safety properties, while corresponding to sets of string operations used in SQL. We start with an extension based on the string ordering and LIKE predicates. We then extend this basic model to include string length comparison. While both of these share some of the attractive properties of relational calculus (low data complexity for generic queries, effective syntax for safe queries, correspondence with an algebra), there is a large gap between these calculi in expressive power and complexity. The smaller `basic model' has low data complexity for all queries, but is quite limited in expressive power. The latter handles many natural string operations, but includes queries with high data complexity. We thus also explore the space between these two extremes. We consider two intermediate languages: the first extends our base language with functions that trim/add leading characters, and the other extends it by adding the full power of regular-expression pattern matching. We show that both these extensions inherit many of the attractive properties of the basic model: they both have corresponding algebras expressing safe queries, and low complexity of query evaluation.
Serge Abiteboul, Tova Milo, Omar Benjelloun-
2001
We present a data-oriented approach to web services integration. To this end, we introduce a new framework based on a model that embeds service calls within semistructured data. The framework captures various integration scenarios including mediation and warehousing, and provides fusion based on object identifiers. Moreover, by allowing service call parameters and responses to contain calls to other services, the framework enables distributed computation over the web. This report describes our ongoing work on the subject, mainly through examples.
Chantal Reynaud, Brigitte Safar-
IEEE Transactions on Knowledge and Data Engineering 2001 June:329-348
An information integration system provides a uniform query interface to a collection of autonomous and distributed sources, connected to each other through a global mediated schema called the domain ontology. The problem addressed in the paper is how to represent such an ontology in CARIN-ALN, a formalism combining classes and rules. We focus on the choices for representing classes, properties and constraints using the characteristics of the formalism. We also propose a two-step method for representing a domain ontology in the framework of a mediator. The first step is directed by the formalism and the functionalities of the mediator. The second step is an optimization phase guided by the way the functionalities of the mediator are implemented.
Sophie Cluet, Pierangelo Veltri, Dan Vodislav-
2001
We are interested in maintaining and querying views in a huge and
highly heterogeneous XML repository (Web scale). In this context,
views are very large and there is no apparent limitation to their size.
This raises interesting problems that we address in the paper: (i) how to
distribute views over several machines without having a negative impact
on the query translation process; (ii) how to quickly select the relevant
part of a view given a query; (iii) how to minimize the cost of communicating
potentially large queries to the machines where they will be evaluated.
The presented system of views is implemented in the Xyleme system.
Luc Segoufin, Martin Grohe, Thomas Schwentick-
IEEE Transactions on Knowledge and Data Engineering 2001
The evaluation of conjunctive queries is hard both with respect to its combined complexity (NP-complete) and its parameterized complexity (W[1]-complete). It becomes tractable (PTIME for combined complexity, FPT for parameterized complexity) when conjunctive queries have a bounded tree-width structure. In this paper we show that, in a sense, this is optimal both with respect to combined and parameterized complexity: evaluating conjunctive queries is tractable precisely for those classes of queries that have a bounded tree-width structure. A technical result of independent interest is that the colored grid homomorphism problem is NP-complete and, if parameterized by the grid size, W[1]-complete.
Laurent Mignet, Vincent Aguilera, Sebastien Ailleret, Pierangelo Veltri-
2001
In this paper we address the problem of loading data from the Web. We present the architecture of Xyro, a crawler designed to fetch data from the Web, and particularly XML data. We describe our experience in designing and implementing this robot, discussing some standard problems in crawler implementation and presenting our ways of overcoming some of them.
Bernd Amann, Michel Scholl, Vassilis Christophides, Anne-Marie Vercoustre, Irini Fundulaki, Sofia Alexaki, Greg Karvounarakis, Karsten Tolle-
2000
Marie-Christine Rousset, Bernard Ycart-
2000
We consider the set of sentences in a decidable fragment of first-order logic, restricted to unary and binary predicates, expressed on a finite set of objects. Probability distributions on that set are defined and studied. For large sets of objects, it is shown that under these distributions, random sentences typically have a very particular form, and that all monotone symmetric properties of these random sentences have a probability close to 0 or 1.
Stephane Grumbach, Leonardo Tininini-
2000
The paper presents a logical data model for statistical data with an explicit modeling of metadata, which makes it possible to perform automatic aggregation. The data are stored in standard relations from the relational model, while the metadata, defining the semantics of the relations, are represented by numerical dependencies which specify the way the summary values are defined in terms of micro-data, as well as the interrelationships among summary values. The present model supports standard relational languages such as SQL. Relations with numerical dependencies are then seen as statistical views over initial relations of micro-data. Queries can be asked either against the views or directly against the initial relations, and in the latter case answered, when possible, using the views. The numerical dependencies of the views are run together with the query to compute the answer to the query. This is handled in a completely automatic manner, with no need for the user to deal with the intricacy of metadata. The mechanism has been tested by a Prolog implementation of meaningful examples of queries and dependencies. It is shown in particular that various classical problems in the realm of statistical and multidimensional databases can be easily modeled and solved in the present framework. Finally, the proposed formalism is shown to be useful for statistical database schema design.
Sophie Cluet, Fanny Wattez, Veronique Benzaken, Guy Ferran, Christian Fiegel-
2000
In October 1997, we started what we thought would be a comprehensive
benchmarking of OQL [Catt97, Clue98] queries. Two years of hard work and many blunders later, we have tested one tenth of what we were shooting for.
The first part of this paper should be useful to would-be
benchmarkers: it explains how we could have avoided losing
time. In the second part, we show that our few results are in fact extremely
informative. Notably, we understand
now why O$_2$ does not cope well with large
associative accesses requiring duplicate I/Os and
how this could be seriously improved. Also, we have a better
grasp on the pros and cons of navigation versus joins in a real
system and believe we know how to efficiently evaluate queries over
hierarchical structures.
Chantal Reynaud-
IEEE Transactions on Knowledge and Data Engineering 2000
This paper deals with the automatic construction of problem-solving models from formally specified domain ontologies. It focuses in particular on reuse aspects. The objective is to show that the adopted approach differs from the usual reuse approaches, which rely on reusing generic models to be refined and adapted. Instead, we propose to build the reasoning model of an application by assembling predefined generic elementary components. An organization of libraries of such components is described. This approach has been implemented in ASTREE. It is based on matching knowledge specified in tasks and procedures with knowledge from a domain ontology. These matchings are facilitated by the precise syntax and unambiguous semantics of the ontology modeling language.
Brigitte Safar, Alain Bidault, Christine Froidevaux-
IEEE Transactions on Knowledge and Data Engineering 2000
In this paper, we study failing queries posed to a mediator in an information integration system and expressed in the logical formalism of the PICSEL information integration system. First, we present the notion of concept generalisation in a concept hierarchy that is used to repair failing queries. Then, we address two problems arising while rewriting a query using views. The first problem concerns queries that cannot be rewritten due to a lack of sources; the second concerns queries that have only unsatisfiable rewritings.
Stephane Grumbach, Philippe Rigaux, Pierangelo Veltri-
2000
We consider the problem of solving a large number of simple systems
of constraints. This problem occurs in the context of databases with very large
collections of data while the constraints are over a small number of
variables. The methodology we develop is based on a hierarchical evaluation of
the constraints which are first simplified, and replaced by constraints
approximating the initial ones.
We focus on systems of linear constraints over the reals, which model spatial
objects, and consider both geometric and topological approximations, defined
with very simple constraints. We show that these constraints can be used either
to solve the initial systems, or at least to filter out unsatisfiable systems.
More generally, we consider the manipulation of the spatial objects with
first-order queries. We show how the queries can be evaluated by taking
advantage of the approximations. In general, it is undecidable if a query can
be completely answered on approximated data. The main contribution of the paper
is the development of a set of rewriting rules, that allow the transformation
of queries into equivalent ones that make use of the approximation. As usual
with optimization problems, an optimal solution is out of reach, but the
rules are optimal for individual operations of relational algebra.
The efficiency of the technique is important for the range of applications
considered, where the probability of filtering from the approximated data is
very high. The technique is illustrated on a practical example.
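The filtering step can be pictured with a deliberately simple sketch, assuming a bounding-box approximation (the box abstraction and all names here are illustrative assumptions, not the paper's actual operators): a system whose approximation does not meet the query region is certainly unsatisfiable and can be discarded without exact solving.

    # Sketch: filter systems of linear constraints via bounding boxes.
    # A box is (xmin, xmax, ymin, ymax); disjoint boxes prove that the
    # conjunction of the two constraint systems is unsatisfiable.
    def boxes_intersect(a, b):
        return a[0] <= b[1] and b[0] <= a[1] and a[2] <= b[3] and b[2] <= a[3]

    def filter_candidates(systems, query_box):
        # systems: (box, system) pairs; only survivors need exact solving
        return [s for box, s in systems if boxes_intersect(box, query_box)]

    systems = [((0, 2, 0, 2), "S1"), ((10, 12, 0, 1), "S2")]
    print(filter_candidates(systems, (1, 3, 1, 3)))   # ['S1']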
Bernd Amann, Michel Scholl, Irini Fundulaki-
2000
In this paper we present a new approach for building metadata
schemas by integrating existing ontologies and structured
vocabularies (thesauri). This integration is based on the
specification of inclusion relationships between thesaurus terms and
ontology concepts and results in application-specific metadata
schemas incorporating the structural views of ontologies and the
deep classification schemes provided by thesauri. We will also show
how the result of this integration can be used for RDF schema
creation and metadata querying. In our context, (metadata) queries
exploit the inclusion semantics of term relationships, which
introduces some recursion. We will present a fairly simple
database-oriented solution for querying such metadata which avoids a
(recursive) tree traversal and is based on a linear encoding of
thesaurus hierarchies.
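One plausible reading of such a linear encoding (the classic interval scheme; the paper's exact encoding may differ) assigns each term an interval from a depth-first traversal, so that "all narrower terms of T" becomes a range predicate rather than a recursive traversal:

    # Sketch: interval encoding of a term hierarchy. The descendant test
    # becomes interval containment, so no recursion is needed at query time.
    def encode(tree, root, counter=None, intervals=None):
        # tree: dict mapping a term to its list of narrower terms
        if counter is None:
            counter, intervals = [0], {}
        left = counter[0]; counter[0] += 1
        for child in tree.get(root, []):
            encode(tree, child, counter, intervals)
        intervals[root] = (left, counter[0]); counter[0] += 1
        return intervals

    tree = {"art": ["painting", "sculpture"], "painting": ["fresco"]}
    iv = encode(tree, "art")
    lo, hi = iv["art"]
    print([t for t, (l, r) in iv.items() if lo < l and r < hi])
    # ['fresco', 'painting', 'sculpture'] -- the terms narrower than "art"

Stored in a table, the two interval columns let a plain range query stand in for the recursive tree traversal the authors want to avoid.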
Sophie Cluet-
2000
The Web is a data bank
Stephane Grumbach, Philippe Rigaux, Luc Segoufin-
2000
Data defined by interpolation is frequently found in new applications
involving, for instance, geographical concepts, moving objects, and
spatio-temporal data. This data leads to potentially infinite collections of
items (e.g., the elevation of any point in a map), whose definition is based
on the association of a collection of samples with an interpolation
function. We first argue that the manipulation of the data through direct
access to the samples and interpolation functions easily leads
to cumbersome or inaccurate queries. We therefore suggest hiding the
samples and the interpolation function away from the logical level, and
letting the system manipulate them at the physical level.
We propose to model such data conceptually using infinite relations (e.g.
the map with elevation yields an infinite ternary relation) which can be
manipulated through standard relational query languages (e.g. SQL), with no
mention of the interpolated definition. This approach is simple and
establishes a clear separation between logical and physical levels. It
requires standard relational query languages whose semantics is
well-defined. Moreover, we show that the evaluation of queries, which
includes the computation of the sampling collection and the interpolation
function of the output, can be done very efficiently, and we describe the
physical level algorithms. We show that
this approach can be easily supported by standard database techniques, and
demonstrate this with a prototype.
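A toy version of the proposal, with invented names (the paper's prototype is far richer), hides the samples and a linear interpolation function behind a relation-like interface, so elevations are queried with no mention of interpolation:

    # Sketch: Elevation(x, z) behaves like an infinite relation; physically
    # the system stores samples and interpolates on demand.
    import bisect

    class InterpolatedRelation:
        def __init__(self, samples):          # samples: [(x, z)] sorted by x
            self.xs = [x for x, _ in samples]
            self.zs = [z for _, z in samples]

        def z(self, x):                       # linear interpolation
            i = bisect.bisect_right(self.xs, x)
            if i == len(self.xs) and x == self.xs[-1]:
                return self.zs[-1]
            if i == 0 or i == len(self.xs):
                raise ValueError("x outside the sampled range")
            x0, x1 = self.xs[i - 1], self.xs[i]
            z0, z1 = self.zs[i - 1], self.zs[i]
            return z0 + (z1 - z0) * (x - x0) / (x1 - x0)

    elev = InterpolatedRelation([(0, 100.0), (10, 120.0), (20, 90.0)])
    print(elev.z(5))   # 110.0, answered as if Elevation were stored extensionally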
Philippe Chatalic, Laurent Simon-
IEEE Transactions on Knowledge and Data Engineering 2000 November:2--10
This paper presents a system based on new operators for handling sets of propositional clauses represented by means of ZBDDs. The high compression power of such data structures allows efficient encodings of structured instances. A specialized operator for the distribution of sets of clauses is introduced and used for performing multi-resolution on clause sets. Cut eliminations between sets of clauses of exponential size may then be performed using polynomial-size data structures. The ZRes system, a new implementation of the Davis-Putnam procedure of 1960, solves two hard problems for resolution that are currently beyond the reach of the best SAT provers.
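To make the multi-resolution operator concrete, here is a sketch of one Davis-Putnam elimination step on explicit clause sets; ZRes's point is that the same distribution is done on ZBDD-encoded sets, so exponentially large clause sets keep compact representations, which this toy version does not attempt:

    # Sketch: eliminate variable v by distributing the clauses containing v
    # against those containing -v (multi-resolution). Clauses are frozensets
    # of nonzero integers; -v denotes the negation of v.
    def eliminate(clauses, v):
        pos = {c for c in clauses if v in c}
        neg = {c for c in clauses if -v in c}
        rest = clauses - pos - neg
        for p in pos:
            for n in neg:
                r = (p - {v}) | (n - {-v})
                if not any(-lit in r for lit in r):   # drop tautologies
                    rest.add(frozenset(r))
        return rest

    cnf = {frozenset([1, 2]), frozenset([-1, 3]), frozenset([-2, 3])}
    print(eliminate(cnf, 1))   # clauses {2, 3} and {-2, 3} (order may vary)

Iterating the step over all variables decides satisfiability: the formula is unsatisfiable exactly when the empty clause is eventually produced.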
Stephane Grumbach, Leonardo Tininini-
2000
We consider the problem of answering queries using only materialized views. We first show that if the views subsume the query from the point of view of the information content, then the query can be answered using only the views, but the resulting query might be extremely inefficient. We then focus on aggregate views and queries over a single relation, which are fundamental in many applications such as data warehousing. We show that in this case, it is possible to guarantee that as soon as the views subsume the query, it can be completely rewritten in terms of the views in a simple query language. Our main contribution is the conception of various rewriting algorithms which run in polynomial time, and the proof of their completeness, which relies on combinatorial arguments. Finally, we discuss the choice of materializing or not ratio views such as average and percentage, important for the design of materialized views. We show that it has an impact on the information content, which can be used to protect data, as well as on the maintenance of views.
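A worked instance of the rewriting at stake (illustrative only, not one of the paper's algorithms): if SUM and COUNT are materialized per group, an AVG query over the same relation is completely rewritable in terms of the views, which is one reason the ratio view AVG need not be materialized itself.

    # Sketch: answer "AVG(salary) GROUP BY dept" using only materialized
    # SUM and COUNT views over the same base relation.
    sum_view   = {"sales": 300_000, "r&d": 420_000}   # SUM(salary) per dept
    count_view = {"sales": 3, "r&d": 4}               # COUNT(*) per dept

    avg = {d: sum_view[d] / count_view[d] for d in sum_view}
    print(avg)   # {'sales': 100000.0, 'r&d': 105000.0}

Conversely, materializing only the ratio view loses the ability to recover SUM and COUNT, which is the information-content effect the abstract mentions in connection with protecting data.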
Sophie Cluet, Vassilis Christophides, Jerome Simeon-
2000
Modern applications (portals, e-commerce, digital libraries, etc.)
require integrated access to various information sources (from
traditional RDBMS to semistructured Web repositories), fast
deployment and low maintenance cost in a rapidly evolving
environment. Because of its flexibility, there is an increasing
interest in using XML as a middleware model for such applications.
XML enables fast wrapping and declarative integration. However,
query processing in XML-based integration systems is still penalized
by the lack of an algebra with adequate optimization properties and
the difficulty to understand source query capabilities. In this
paper, we propose an algebraic approach to support efficient query
evaluation in XML integration systems. We define a general purpose
algebra suitable for semistructured or XML query languages. We show
how this algebra can be used, with appropriate type information, to
also wrap more structured query languages such as OQL or SQL.
Finally, we develop appropriate optimization techniques.
Victor Vianu, Luc Segoufin-
IEEE Transactions on Knowledge and Data Engineering 2000 - 61(2):270-301
The paper investigates the use of topological annotations (called topological invariants) to answer topological queries in spatial databases. The focus is on the translation of topological queries against the spatial database into queries against the topological invariant. The languages considered are first-order on the spatial database side, and fixpoint counting, fixpoint, and first-order on the topological invariant side. In particular, it is shown that fixpoint counting expresses precisely all the PTIME queries on topological invariants; if the regions are connected, fixpoint expresses all PTIME queries on topological invariants.
Sophie Cluet, Vincent Aguilera, Fanny Wattez, Pierangelo Veltri, Dan Vodislav-
2000
XML is a standard to exchange structured data over
the Web. It is being widely adopted. Undoubtedly, more and more
data on the Web will be in XML form, as more and more DTDs will
be available to describe it. Given
this, the Xyleme project proposes to study and build a dynamic World Wide
XML warehouse, i.e., a data warehouse capable of storing all the
XML data available on the planet.
Among the many problems addressed by the Xyleme team, we focus in
this paper on query processing.
Brigitte Safar, Alain Bidault-
IEEE Transactions on Knowledge and Data Engineering 2000 - 1:225-235
Un m
Philippe Chatalic, Laurent Simon-
IEEE Transactions on Knowledge and Data Engineering 2000 june:107--118
Cet article pr
Luc Segoufin, Michael Benedikt, Martin Grohe, Leonid Libkin-
2000
It is known that standard query languages for constraint databases
lack the power to express connectivity properties. Such properties are
important in the context of geographical databases, where one
naturally wishes to ask queries about connectivity (what are the
connected components of a given set?) or reachability (is there a path
from A to B that lies entirely in a given region?). No existing
constraint query languages that allow closed form evaluation can
express these properties.
In the first part of the paper, we show that in principle there is no
obstacle to getting closed languages that can express connectivity and
reachability queries. In fact, we show that adding any topological
property to standard languages like FO(<,+) and FO(<,+,*) results in a
closed language. In the second part of the paper, we look for
tractable closed languages for expressing reachability and
connectivity queries. We introduce path logic, which allows one
to state properties of paths with respect to given regions. We show
that it is closed, has polynomial time data complexity for linear and
polynomial constraints, and can express a large number of reachability
properties beyond simple connectivity. Query evaluation in the logic
involves obtaining a discrete abstraction of a
continuous path, and model-checking of temporal formulae on the
discrete structure.
Serge Abiteboul, Victor Vianu, Yelena Yesha, Brad S. Fordham-
2000
Electronic commerce is emerging as one of the major Web-supported applications
requiring database support. We introduce and study high-level
declarative specifications of business models, using an approach
in the spirit of active databases.
More precisely, business models are specified as relational
transducers that map sequences of input relations into
sequences of output relations.
The semantically meaningful trace of
an input-output exchange is kept as a sequence of log relations.
We consider problems motivated by electronic commerce applications,
such as log validation,
verifying temporal properties of transducers,
and comparing two relational transducers. Positive results are
obtained for a restricted
class of relational transducers called Spocus transducers
(for semi-positive outputs and cumulative state). We argue that
despite the restrictions, these
capture a wide range of practically significant business models.
Brigitte Safar, Alain Bidault, Christine Froidevaux-
IEEE Transactions on Knowledge and Data Engineering 2000
In this paper, we study unsatisfiable queries posed to a mediator
in an information integration system and expressed in a logical formalism.
First, we characterise conflicts as the minimal causes of the unsatisfiability
of a query. Then, thanks to our cooperative query-answering process, we produce its set of repairs. A repair is a query that generates no conflict, has a common generalisation with the initial query, and is semantically close to it.
Marie-Christine Rousset, Francois Goasdou-
2000
In databases, rewriting queries using views has received significant attention because of its relevance to several fields such as query optimization, data warehousing, and information integration. In those settings, data used to answer a query are restricted to be extensions of a set of predefined queries (views). The information integration context is typical of the need to rewrite queries using views in order to answer queries. Users of information integration systems do not pose queries directly to the (possibly remote) sources in which data are stored but to a set of virtual relations that have been designed to provide uniform and homogeneous access to a domain of interest. When the contents of the sources are described as views over the (virtual) domain relations, the problem of reformulating a user query into a query that refers directly to the relevant sources becomes a problem of rewriting queries using views. In this paper, we study the problem of rewriting conjunctive queries over DL expressions into conjunctive queries using a set of views that are a set of distinguished DL expressions, for three DLs allowing existential restrictions: FLE, ALE and ALEN. For FLE, we present an algorithm that computes a representative set of all the rewritings of a query. In the full version of this paper (cf. technical report), we show how to adapt it to deal with the negation of atomic concepts in the queries and in the views, in order to obtain a rewriting algorithm for ALE. Finally, we show that obtaining a finite set representative of all the rewritings of a query is not guaranteed in ALEN.
Serge Abiteboul, Bogdan Marinoiu, Pierre Bourhis-
IEEE Transactions on Knowledge and Data Engineering 2000
Marie-Christine Rousset, Francois Goasdou, V Latt-
2000
Picsel is an information integration system over sources that are distributed and possibly heterogeneous. The approach chosen in Picsel is to define an information server as a knowledge-based mediator in which Carin is used as the core logical formalism to represent both the domain of application and the contents of information sources relevant to that domain. In this paper, we describe the way the expressive power of the Carin language is exploited in the Picsel information integration system, while maintaining the decidability of query answering. We illustrate it with examples from the tourism domain, which is the first real case we have had to consider in Picsel, in collaboration with France Telecom (CNET) and the travel agency Degriftour.
Ioana Manolescu, Nicolaas Ruberg, Gabriela Ruberg-
IEEE Transactions on Knowledge and Data Engineering 2000
INRIA Research Report no. 5222
Veronique Ventos, Celine Rouveirol-
2000
In this paper we investigate a new language for learning, which combines two well-known representation formalisms, Description Logics and Horn clause logics. Our goal is to study the feasibility of learning in such a hybrid description / Horn clause language, namely CARIN-ALN [MCAI98a], in the presence of hybrid background knowledge including a Horn clause and a terminological component. After setting out our learning framework, we present algorithms for testing example coverage and subsumption between two hypotheses, based on the existential entailment algorithm studied in [MCAI98a]. While the hybrid language is more expressive than Horn clause logics alone, the complexity of these two steps for CARIN-ALN remains bounded by their respective complexity in Horn clause logics.
Philippe Chatalic, Laurent Simon-
IEEE Transactions on Knowledge and Data Engineering 2000 - 1831:449--454
ZRes is a propositional prover based on the original procedure of Davis and Putnam, as opposed to the modified version of Davis, Logemann and Loveland on which most current efficient SAT provers are based. On some highly structured SAT instances, such as the well-known Pigeon Hole and Urquhart problems, both proved hard for resolution, ZRes performs very well and surpasses all classical SAT provers by an order of magnitude.
Bernd Amann, Irini Fundulaki-
1999
In this paper we present a new approach for building RDF schemas by
integrating existing ontologies and structured vocabularies
(thesauri). We will present a simple mechanism based on the
specification of inclusion relationships between thesaurus terms and
ontology concepts and show how these relationships can be exploited to
create application-specific RDF schemas incorporating the structural
views of ontologies and deep classification schemes provided by
thesauri.
Serge Abiteboul, Bernd Amann, Sophie Cluet, Tova Milo, Anat Eyal, Laurent Mignet-
1999
Electronic commerce is emerging as a major Web-supported
application. In this paper we argue that database technology can,
and should, provide the backbone for a wide range of such
applications. More precisely, we present here the ActiveViews
system, which, relying on an extensive use of database features
including views, active rules (triggers), and enhanced mechanisms
for notification, access control and logging/tracing of users'
activities, provides the needed basis for electronic commerce.
Based on the emerging XML standards (DOM, query languages for XML,
etc.), the system offers a novel declarative view specification
language, describing the relevant data and activities of all actors
(e.g. vendors and clients) participating in electronic commerce
activities. Then, acting as an application generator, the system
generates an actual, possibly customized, Web application that
allows users to perform the given set of controlled activities and
to work interactively on the specified data in a standard distributed environment. The ActiveView system is developed at INRIA on top of Java and ArdentSoftware's XML repository.
Stephane Grumbach, Tova Milo-
1999
We study languages for manipulating partially ordered structures with
duplicates (e.g. trees, lists). As a general framework, we consider
the pomset (partially ordered multiset) datatype. We introduce an
algebra for pomsets, which generalizes traditional algebras for
(nested) sets, bags and lists. This paper is motivated by the study of
the impact of different language primitives on the expressive
power. We show that the use of partially ordered types increases the
expressive power significantly. Surprisingly, it turns out that the
algebra when restricted to both unordered (bags) and totally ordered
(lists) intermediate types, yields the same expressive power as
fixpoint logic with counting on relational databases. It therefore
constitutes a rather robust class of relational queries. On the other
hand, we obtain a characterization of PTIME queries on lists by
considering only totally ordered types.
Bouthinon, D., H. Soldano and V. Ventos-
IEEE Transactions on Knowledge and Data Engineering 1999 :145-152
Sophie Cluet-
1999
After a long period of self-sufficiency, information systems are opening up to the network. The emergence of the Web at the beginning of the last decade has certainly amplified a movement that the development of networks, the growth of digitized data, and the diversity of processing tasks had made inevitable. Today, computer systems manage, import, disseminate, exchange, and integrate large volumes of data, sometimes replicated, often in different formats.
Database management systems (DBMSs) enable the efficient and consistent management of large volumes of data, queried and updated concurrently by many users. They are therefore obviously at the heart of the problem. Originally designed as closed systems for regular data and business applications, they are evolving to become the building block of new network-oriented applications such as digital libraries, Intra/Internet systems, Web servers, electronic commerce, etc. These applications raise a wealth of new problems for databases. This thesis addresses several of these problems. It is organized around one central idea and its corollary: the use of declarative languages and their optimization.
Serge Abiteboul, Oliver Duschka-
IEEE Transactions on Knowledge and Data Engineering 1999
We study the complexity of the problem of answering queries using materialized views. This problem has attracted a lot of attention recently because of its relevance in data integration. Previous work considered only conjunctive view definitions. We examine the consequences of allowing more expressive view definition languages. The languages we consider for view definitions and user queries are: conjunctive queries with inequality, positive queries, datalog, and first-order logic. We show that the complexity of the problem depends on whether views are assumed to store all the tuples that satisfy the view definition, or only a subset of it. Finally, we apply the results to the view consistency and view self-maintainability problems which arise in data warehousing.
Philippe Chatalic, Laurent Simon-
IEEE Transactions on Knowledge and Data Engineering 1999 :9--16
Cet article pr
Daniel Chan-
1999
It is one of the distinguishing characteristics of workflow
management systems to integrate automated processes with
human processes in an application system. The way a human
process is modelled in a workflow system is by capturing
the significant events of the process. These events include
notifying the user who is assigned to perform the process,
disseminating information that is necessary for the user to
carry out the process, and collecting information that is
generated as the result of the user process. Such events
are often presented to the user or captured from the user
using forms.
This paper presents a form management system that is part
of the Vicomte workflow management system. The
major components include a form specification language,
a facility for dynamically generating HTML pages with client-side
validation to be used over the internet, as well as processing
modules for the generated HTML pages on the server side.
The architecture of the form management system is rather
general and therefore is applicable for many systems
requiring an effective form processing facility.
Stephane Grumbach, Gianni Mecca-
1999
We study the problem of rediscovering the schema of nested relations that have been encoded as strings for storage purposes. We consider various classes of encoding functions, in particular mark-up encodings, which allow the schema to be found without knowledge of the encoding function, under reasonable assumptions on the input data. Depending upon the encoding of empty sets, we propose two polynomial on-line algorithms (with different buffer sizes) solving the schema-finding problem. We also prove that, with high probability, both algorithms find the schema after examining a fixed number of tuples, thus leading in practice to a linear time behavior with respect to the database size for wrapping the data. Finally, we show that the proposed techniques are well suited for practical applications, such as structuring and wrapping HTML pages and Web sites.
Jerome Simeon-
1999
We design novel techniques and tools to simplify the exploitation of
heterogeneous data sources. So far, two main approaches have been
competing. Structured integration systems take advantage of database
techniques to offer efficient query evaluation over large sources.
However, they lack the flexibility required in a truly heterogeneous
environment. Semistructured systems rely on schema-less data models in
order to provide simple and fast integration over any set of
sources. Yet, the absence of a schema entails an increased complexity
to interpret the data. It is also a major disadvantage during query
evaluation.
We introduce a type system for the semistructured data model and we
define YATL, a declarative integration language. YATL is able to
resolve structural conflicts between sources and features high-level
primitives for the manipulation of collections and references. Type
information serves as a guide during the specification of integration programs and enables verification procedures through type-checking. Moreover, YATL makes it possible to preserve as much flexibility as possible by exploiting genericity properties. The YAT system is a
prototype which implements these tools. Its architecture is designed
to minimize the work required for the integration process. It has been
used successfully to build various applications.
We propose an operational model suitable to describe the evaluation of
YATL programs. It contains an algebra for semistructured data, whose
equivalences allow us to recover all the major standard optimization
techniques. It also opens some new optimization opportunities in the
semistructured context. Finally, a precise description of source query
capabilities within the operational model can be used to obtain an
efficient distributed evaluation.
Chantal Reynaud, Francois Goasdou-
IEEE Transactions on Knowledge and Data Engineering 1999 May:121--138
The aim of this paper is to present an approach and automated tools for designing knowledge bases describing the contents of information sources in PICSEL knowledge-based mediators. We address two problems, the abstraction problem and the representation problem, when information sources are relational databases. In a first part, we present an architectural overview of the PICSEL mediator showing its main knowledge components. Then, we describe our approach and the tools we have implemented to identify, by means of an abstraction process, the main relevant concepts, called semantic concepts, in an Entity-Relationship model, and to help represent these concepts using CARIN, a logical language combining description logics and Datalog rules, and using specific terms in the application domain model.
Stephane Grumbach, Philippe Rigaux, Luc Segoufin-
1999
The optimization of queries over constraint databases is a
prerequisite to the use of this methodology for data intensive
applications. In this paper, we present some preliminary
results, together with promising directions of research to address this fundamental issue.
Stephane Grumbach, Philippe Rigaux, Luc Segoufin-
1999
One of the most important advantages of constraint databases is their ability
to represent and to manipulate data in arbitrary dimension within a uniform
framework. Although the complexity of querying such databases by standard
means such as first-order queries has been shown to be tractable for reasonable
constraints (e.g. polynomial), it depends badly (roughly speaking
exponentially) upon the dimension of the data. A precise analysis of the
trade-off between the dimension of the input data and the complexity of the
queries reveals that the complexity strongly depends upon the use the input makes of its dimensions. We introduce the concept of orthographic dimension, which, for a convex object $O$, corresponds to the dimension of the (component) objects $O_1, \ldots, O_n$ such that $O = O_1 \times \cdots \times O_n$. We study properties of databases with bounded orthographic dimension in a general setting of o-minimal structures, and provide a syntactic characterization of first-order orthographic-dimension-preserving queries.
The main result of the paper concerns linear constraint databases. We prove
that orthographic dimension preserving Boolean combination of conjunctive
queries can be evaluated independently of the global dimension, with operators
limited to the orthographic dimension, in parallel on the components. This
results in an extremely efficient optimization mechanism, very easy to use in
practical applications.
Serge Abiteboul, J. Widom, T. Lahiri-
1999
In this paper, we extend the standard object database model ODMG and
the query language OQL with the ability to handle semistructured data
as found, for instance, in XML documents. In this extension, a very
structured portion of the data, e.g., a relation or a standard ODMG
object, may contain entrypoints to semistructured portions of the data
and vice versa. The management of such "hybrid" data, often present in modern database applications, constitutes the main contribution of our work.
Although our modeling of semistructured data is largely inspired by OEM and Lorel, it is in some aspects richer. In particular, we consider semistructured lists and distinguish between components of semistructured data objects and references. These are essential features to properly handle XML data. Also, since semistructured data here lives in the context of structured data, we have to provide a refined control of typing: typing is strictly enforced on the regular portion of the data and, as in Lorel, coercion is used extensively on the semistructured portion. Finally, we introduce the notion of proxy, which allows data that is in reality structured in the database to be viewed as semistructured (and so untyped).
We present in the paper the Ozone system, which is built as a wrapper on top of the ODMG-compliant O$_2$ system. Ozone fully supports the extensions of the ODMG model and of OQL that we introduce.
Stephane Grumbach, Maurizio Rafanelli, Leonardo Tininini-
1999
We introduce a first-order language with real polynomial arithmetic
and aggregation operators (count, iterated sum and multiply), which is
well suited for the definition of aggregate queries involving complex
statistical functions. It offers a good trade-off between expressive
power and complexity, with a tractable data complexity.
Interestingly, some fundamental properties of first-order with real
arithmetic are preserved in the presence of aggregates. In
particular, there is an effective quantifier elimination for formulae
with aggregation.
We consider the problem of querying data that has already been
aggregated in aggregate views, and focus on queries with an
aggregation over a conjunctive query. Our main conceptual contribution
is the introduction of a new equivalence relation among conjunctive
queries, the isomorphism modulo a product. We prove that the
equivalence of aggregate queries such as for instance averages reduces
to it. Deciding if two queries are isomorphic modulo a product is
shown to be NP-complete. We then show that the problem of complete
rewriting of count queries using count views is also
NP-complete. Finally, we introduce new rewriting techniques based on
the isomorphism modulo a product to recover the values of counts by
complex arithmetical computation from the views. We conclude by showing
how these techniques can be used to perform automatic aggregation.
Michel Scholl, Stephane Grumbach, Philippe Rigaux, Luc Segoufin-
1999
This paper introduces DEDALE, one of the first implementations of
a database system based on the linear constraint model. DEDALE is a
long term project started in 1996 as a cooperation between the VERSO
group at INRIA and the VERTIGO group at CNAM. DEDALE is intended to
demonstrate the practical relevance of this model to handle geometric
applications in various areas such as GIS or spatio-temporal
databases. The system represents and manipulates multi-dimensional
pointsets as linear constraint relations which are nested into
classical relational tuples. Although the (global) dimension is arbitrary, the orthographic dimension (orthodim) is at most 2. This restriction allows for a trade-off between efficiency of evaluation and representation power.
Sophie Cluet, Olga Kapitskaia, Divesh Srivastava-
1999
LDAP (Lightweight Directory Access Protocol)
directories have recently proliferated with the growth
of the Internet, and are being used in a wide variety of
network-based applications to store data such as personal
profiles, address books, and network and service policies.
These systems provide a means for managing heterogeneity
in a way far superior to what conventional relational or
object-oriented databases can offer.
To achieve fast performance for declarative query answering,
it is desirable to use client caching based on semantic
information (instead of individual directory entries).
We formally consider the problem of reusing cached LDAP
directory entries for answering declarative LDAP queries.
A semantic LDAP directory cache contains directory entries,
which are semantically described by a set of query templates.
We show that, for conjunctive queries and LDAP directory
caches with positive templates, the complexity of
cache-answerability is NP-complete in the size of the query.
For this case, we design a sound and complete algorithm
for cache-answerability based on a suite of query
transformations that capture the semantics of LDAP queries.
We demonstrate the practicality of this algorithm for real
applications with a performance evaluation, based on sample
queries from a directory enabled application at AT&T Labs.
When the query templates in the cache contain negation, we
show that the complexity of cache-answerability of
conjunctive LDAP queries is co-NP complete in the size of
the schema and query templates in the semantic description
of the cache.
Finally, we identify natural restrictions on the nature of
the semantic descriptions for polynomial-time
cache-answerability.
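The flavour of cache-answerability can be conveyed by a deliberately simplified check (the real problem, as the abstract notes, is NP-complete and handled through a suite of query transformations): with purely conjunctive equality filters, a query is answerable from the cache whenever some positive template's conditions are a subset of the query's, since the cached entries then contain every possible answer.

    # Sketch: a semantic cache described by positive conjunctive templates.
    # Conditions are (attribute, value) pairs; a template subsumes a query
    # when the query only adds further conditions.
    def answerable(query_conds, templates):
        return any(t <= query_conds for t in templates)

    templates = [frozenset({("ou", "research")}),
                 frozenset({("l", "paris"), ("objectclass", "person")})]
    query = {("ou", "research"), ("title", "engineer")}
    print(answerable(query, templates))   # True: the first template subsumes it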
Serge Abiteboul-
1999
The notion of views is essential in databases. It allows various
users to see data from different viewpoints. In the present paper, we
informally present works of the author on the topic. Instead of
addressing the issue of views in a classical database setting, the
paper focuses on XML and looks at these various works in
the context of a system of views for XML.
Serge Abiteboul, Bernd Amann, Sophie Cluet, Tova Milo, Cassio Souza Dos Santos, Anne-Marie Vercoustre, Brendan Hills, Laurent Mignet, Vincent Aguilera, Sebastien Ailleret, Frederic Hubert, Jean-Claude Mamou, Amelie Marian, Bruno Tessier-
1999
The goal of this demonstration is to present the main features of (i)
Axielle, an XML repository developed by Ardent Software
on top of the O$_2$ object-oriented DBMS and (ii) the ActiveView
system which has been built by the Verso project at
INRIA on top of Axielle.
The demonstration is based on a simple electronic commerce application
which will be described in Section 2. Electronic commerce
is emerging as a major Web-supported application. It involves
handling and exchange of data (e.g. product catalogs, yellow pages,
etc.) and must provide (i) database functionalities (query language,
transactions, concurrency control, distribution and recovery) for the
efficient management of large data volumes and hundreds of users as well
as (ii) standard data storage and exchange formats (e.g. XML, SGML)
for the easy integration of existing software and data.
The ActiveView system combined with the Axielle XML repository enables
a fast deployment of electronic commerce applications based on a new
high-level declarative specification language (AVL), advanced
database technology (object-oriented data model, XML query language,
notifications), Web standards (HTTP, HTML) and other Internet
compliant technologies (Java, RMI).
Daniel K.C. Chan-
1998
It can be argued that report generation is the most frequently performed task in database applications. Therefore the efficacy and efficiency of a database report generation mechanism have a significant impact on productivity. This paper introduces a document-driven approach, as opposed to the traditional schema-driven approach, to report generation, resulting in more flexibility. In this approach, generating a report involves specifying it in terms of a user-defined SGML-based report model. The contents of a report can be specified using a transformation language together with queries that retrieve data from different databases. A report is output as an SGML document which can be further edited as well as translated to other formats, for instance to HTML to be viewed using an Internet browser. This paper presents the approach using an example and discusses the features and usage of the transformation language, which is a small but expressive language. Despite the fact that we have only investigated transformations against relational databases, we believe that this approach can unify report generation for different database models.
Anne-Marie Vercoustre, Fran Paradis-
1998
The Web is creating exciting new possibilities for direct and instantaneous publishing of information. However, the apparent ease with which one can publish documents on the Web hides more complex issues such as updating and maintaining Web pages. We believe one of the crucial requirements for document delivery is the ability to extract and reuse information from other documents or sources. In this paper we present a descriptive language that allows users to write virtual documents, where dynamic information can be retrieved from various sources, transformed, and included along with static information in HTML documents. The language uses a tree-like structure for the representation of information, and defines a database-like query language for extracting and combining information without complete knowledge of the structure or the types of information. The data structures and the syntax of the language are presented along with examples.
Serge Abiteboul, Sophie Cluet, Tova Milo-
1998
Structured data stored in files can benefit from standard database
technology. In particular, we show here how such data can be queried
and updated using declarative database languages. We introduce the
notion of structuring schema which consists of a grammar
annotated with database programs. Based on a structuring schema, a
file can be viewed as a database structure, queried and updated as
such.
For queries, we show that almost standard database
optimization techniques can be used to answer queries without having
to construct the entire database. For updates, we study in
depth the propagation to the file of an update specified on the
database view of this file. The problem is infeasible in general and
we present a number of negative results.
The positive results consist of techniques that allow updates to be propagated efficiently under some reasonable locality conditions on the structuring schemas.
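The notion of a structuring schema can be pictured with a toy reading (invented grammar and relation names, far simpler than the paper's formalism): a grammar parses the file, the annotation attached to a production emits database tuples, and the file is then queried like a relation.

    # Sketch: a one-production grammar over lines "name|year", annotated
    # with an action that loads a tuple into the relation Pub(name, year).
    import re

    LINE = re.compile(r"(?P<name>[^|]+)\|(?P<year>\d{4})")

    def as_relation(lines):
        pub = []
        for line in lines:
            m = LINE.fullmatch(line.strip())
            if m:                                  # the production fires
                pub.append((m["name"], int(m["year"])))   # annotation: emit tuple
        return pub

    pub = as_relation(["Cluet|1998", "Abiteboul|1999"])
    print([n for n, y in pub if y >= 1999])        # query the file as a database

The optimization point above is precisely to avoid materializing all of the database view when a query only needs part of it.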
Daniel Chan, Jochem Vonk, Gabriel Sanchez, Paul Grefen, Peter Apers-
1998
The workflow paradigm is advancing the scope of
process modelling by supporting complex control flows, rich process
structures, and the integration of heterogeneous and legacy systems.
However, studies of the paradigm have been seriously hindered by the
lack of a concrete framework. This paper presents such a framework in
the form of a workflow specification language. Unlike many other
proposals, the specification language captures all the fundamental
elements of the workflow paradigm at an abstraction level that is
suitable for workflow application designers. More specifically, it
provides a rich organisation model that can capture distribution using
domains, an information model that also records presentation details,
a sophisticated process model that supports active rules and workflow
transactions, and the relationships between the three models. This
paper presents the essential elements of the specification language
and illustrates them with a comprehensive example. Workflow
application developers should find the specification language a useful
and compact means to capture and investigate design details. Workflow
system developers would discover the language a good vehicle to study
the interaction between different features as well as facilitate the
development of more advanced features. Others would attain a better
understanding of the workflow paradigm and could use the language as a
basis of evaluation for the functionality of workflow systems.
Anne-Marie Vercoustre, Fran Paradis, Brendan Hills-
1998
The importance of reuse of information is well recognised for
electronic publishing. However, it is rarely achieved
satisfactorily because of the complexity of the task:
integrating different formats, handling updates of information,
addressing document authors' need for intuitiveness and simplicity, etc. An approach which addresses these problems is to dynamically generate and update documents through a descriptive definition of virtual documents.
In this paper we present a document interpreter that allows
gathering information from multiple sources, and combining
it dynamically to produce a virtual document.
Two strengths of our approach are: the generic information
objects that we use, which enables access to distributed,
heterogeneous data sources; and the interpreter's evaluation
strategy, which permits a minimum of re-evaluation of the
information objects from the data sources.
Serge Abiteboul, Victor Vianu, Bernd Amann, Sophie Cluet, Tova Milo-
1998
Electronic commerce is emerging as a major Web-supported application. We use object database technology as the backbone for a wide range of electronic commerce applications. More precisely, we present a system named ActiveView built on the O2 object database system. The ActiveView system makes extensive use of database features such as views and active rules (triggers). It provides a sophisticated access control mechanism and the means of logging certain specified events to trace the activity of an application. Also, the ActiveView system uses the Spocus framework to capture the workflow component, based on a simple and limited set of rules. We use examples to show that the system thereby obtained is appropriate for electronic commerce applications involving interactions between individuals over the network. More precisely, we show how such an application can be declaratively specified, how it is run by the system, and how its activity can be controlled and traced.
Serge Abiteboul, O. Duschka-
1998
We study the data complexity of the problem of computing
certain answers given view definitions, an instance of the views, and
a query. We consider conjunctive queries, conjunctive queries with
inequality, positive queries, datalog, and first order queries as
query and view definition languages. We show that the choice between
the assumption that views are complete and the assumption that some
tuples might be missing has a considerable impact on the complexity of
the problem. Our results imply that in many cases datalog programs
are not expressive enough to answer queries using views in a best
possible way. Finally, we examine the complexity of two related
problems: view self-maintainability and view consistency.
Michel Scholl, Philippe Picouet, Vincent Oria, Jean-Marc Saglio, Oliver Guenther-
1998
This paper addresses the issue of benchmarking spatial joins. For this purpose, we first present a Web-based benchmark generator to produce sets of rectangles. Second, using this generator and a set of statistical models, we define several tests to compare the performance of three spatial join algorithms. Our results show that the relative performance of the different techniques mainly depends on sample size and selectivity of the join predicate.
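As a hedged sketch of the benchmark's basic ingredients (the distributions and names are invented; the actual generator is parameterized by statistical models), random rectangles are generated and the join predicate is plain rectangle overlap, whose selectivity the rectangle size controls:

    # Sketch: random rectangles and the spatial-join predicate, evaluated
    # here with a naive nested loop as a baseline.
    import random

    def rand_rect(max_side=0.1):
        x, y = random.random(), random.random()
        return (x, x + random.uniform(0, max_side),
                y, y + random.uniform(0, max_side))

    def intersects(a, b):   # do rectangles a and b overlap?
        return a[0] <= b[1] and b[0] <= a[1] and a[2] <= b[3] and b[2] <= a[3]

    R = [rand_rect() for _ in range(1000)]
    S = [rand_rect() for _ in range(1000)]
    pairs = [(r, s) for r in R for s in S if intersects(r, s)]
    print(len(pairs))   # grows with max_side, i.e., with join selectivity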
Sophie Cluet, Claude Delobel, Sihem Amer-Yahia-
1998
We present a framework for designing, in a
declarative and flexible way, efficient migration programs and an
undergoing implementation of a migration tool called RelOO whose
targets are any ODBC compliant system on the relational side and the
O2 system on the object side. The framework consists of (i) a
declarative language to specify the migration process and (ii) an
algebra-based program rewriting technique. The main originality of
the language resides in the fact that it does not only provide the
means to specify the database transformations from relations to
objects, but also physical properties we want to achieve on the object
side (such as clustering and sorting). The originality of the rewriting technique lies in its ability to optimize the migration processing time while taking into account physical properties and transaction decomposition. To achieve this, we introduce equivalences of a new kind that consider side-effects on the object database.
Anne-Marie Vercoustre, Craig A. Lindley-
1998
The FRAMES project is developing a system for video database search, content-based retrieval, and virtual video program synthesis. For dynamic synthesis applications, a video program is specified at a high level using a virtual video prescription. The prescription is a document expressed in a semi-formal language specifying the video structure, including formal specifications for direct reference to specific video components, components specified and retrieved by parametric query, and components generated as associative chains based upon an initial specification. This paper describes the syntax and semantics of the specification language for associative chaining. The language allows the specification of entity types for association generation, the specification of conjunctions and disjunctions of entity types, initial values, fixed constraints, and weights upon entity types. A short example is presented to illustrate the use of the language and how it may be used to generate different video sequences from a common database of video components.
Sophie Cluet-
1998
This paper tells the story of OQL, the standard query language of
the Object Database Management Group (ODMG).
The story starts in 1988, at INRIA, in the Altaïr project.
Serge Abiteboul, Svetlazar Nestorov, Rajeev Motwani-
1998
Of late, especially with the advent of the world-wide web, there has been an increasing interest in semistructured data in the database community. Semistructured data is characterized by the lack of any fixed and rigid schema, although typically the data has some implicit structure. While the lack of fixed schema makes extracting semistructured data fairly easy and an attractive goal, presenting and querying such data is greatly impaired. Thus, a critical problem is the discovery of the structure implicit in semistructured data and, subsequently, the recasting of the raw data in terms of this structure.
In this paper, we consider a very general form of semistructured data based on labeled, directed graphs. We show that such data can be typed using the greatest fixpoint semantics of monadic datalog programs. We present an algorithm for approximate typing of semistructured data. We establish that the general problem of finding an optimal such typing is NP-hard, but present some heuristics and techniques based on clustering that allow efficient and near-optimal treatment of the problem. We also present some preliminary experimental results.
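Since the abstract describes typing via the greatest fixpoint of monadic datalog programs, a bisimulation-style refinement makes the idea concrete. The sketch below is only an illustration under assumptions, not the paper's algorithm: data is encoded as a dict mapping each node to its (label, target) edges, and types are refined until stable.

```python
# Hedged sketch: type labeled graph data by iterative refinement, in the
# greatest-fixpoint spirit described above (not the paper's algorithm).
# Assumed encoding: node -> list of (label, target) pairs.

def infer_types(graph):
    type_of = {node: 0 for node in graph}  # start with one type for all nodes
    while True:
        # A node's signature: its current type plus the labels and types
        # of its successors; including the current type ensures the
        # partition only ever refines, so the loop terminates.
        sigs = {node: (type_of[node],
                       frozenset((label, type_of[dst]) for label, dst in edges))
                for node, edges in graph.items()}
        renumber, new_type = {}, {}
        for node in graph:
            new_type[node] = renumber.setdefault(sigs[node], len(renumber))
        if new_type == type_of:
            return type_of  # stable: the coarsest typing for this signature
        type_of = new_type

# Tiny irregular example: m1 and m2 share a type; all leaves collapse.
data = {"db": [("movie", "m1"), ("movie", "m2")],
        "m1": [("title", "t1"), ("year", "y1")],
        "m2": [("title", "t2"), ("year", "y2")],
        "t1": [], "t2": [], "y1": [], "y2": []}
print(infer_types(data))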
Serge Abiteboul, Yannis Papakonstantinou, Hector Molina, Ramana Yerneni-
1998
Fusion queries search for information integrated from distributed, autonomous sources over the Internet. In this context, data is not cleanly fragmented as in traditional distributed databases, and the number of sources participating in a typical query is large. We investigate techniques for efficient processing of fusion queries. First, we focus on a very wide class of query plans that capture the spirit of many techniques usually considered in existing systems. We show how to efficiently find, under various realistic scenarios, good query plans within this large class. We evaluate the performance of these plans and provide additional heuristics that, by considering plans outside our target class of plans, yield further performance improvements.
Serge Abiteboul, Janet Wiener, Jason Mc Hugh, Michael Rys, Vassilis Vassalos-
1998
Semistructured data is not strictly typed like relational or object-oriented data and may be irregular or incomplete. It often arises in practice, e.g., when heterogeneous data sources are integrated or data is taken from the World Wide Web. Views over semistructured data can be used to filter the data and to restructure (or provide structure to) it. To achieve fast query response time, these views are often materialized. This paper studies incremental maintenance techniques for materialized views over semistructured data. We use the graph-based data model OEM and the query language Lorel, developed at Stanford, as the framework for our work. We propose a new algorithm that produces a set of queries that compute the changes to the view based upon a change to the source. We develop an analytic cost model and compare the cost of executing our incremental maintenance algorithm to that of recomputing the view. We show that for nearly all types of database updates, it is more efficient to incrementally maintain the view than to recompute it.
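The paper's algorithm works on OEM graphs and Lorel views; as a much simpler relational analogue of the same delta idea, a materialized selection view can absorb each source change directly instead of being recomputed. The class and data below are invented for illustration.

```python
# Toy analogue of incremental view maintenance: apply a delta per update
# rather than recomputing the view (not the paper's OEM/Lorel algorithm).

class FilterView:
    def __init__(self, source, predicate):
        self.predicate = predicate
        self.rows = {row for row in source if predicate(row)}  # materialize once

    def on_insert(self, row):
        if self.predicate(row):   # delta: only the new row can enter the view
            self.rows.add(row)

    def on_delete(self, row):
        self.rows.discard(row)    # delta: only the removed row can leave

source = {("paper", 1997), ("paper", 1998), ("report", 1998)}
view = FilterView(source, lambda r: r[1] == 1998)
view.on_insert(("demo", 1998))   # O(1) maintenance, no recomputation
view.on_delete(("paper", 1998))
print(sorted(view.rows))         # [('demo', 1998), ('report', 1998)]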
Daniel Chan, Karl Leung, Ching Ying Yan, Keith Chan-
1998
The workflow paradigm is advancing the scope of process
modelling by supporting rich process structures as well
as resource allocation through the use of an organisation
model. Its initial success has attracted a lot of
interests in applying it to
applications that are more complex than what can be
handled by the current generation of workflow management
systems. This paper introduces a new workflow model called
Liaison that is designed to address the needs of
these novel applications. More specifically, it provides a rich
organisation model,
sophisticated activity assignment constraints, an information
model that also records presentation details, and a dynamic
process model. An implemented prototype is described together
with an examination of desirable improvements.
Serge Abiteboul, S. Nestorov, R. Motwani, S. Tsur, J. D. Ullman, C. Clifton, A. Rosenthal-
1998
Association-rule mining has proved a highly successful technique for extracting useful information from very large databases. This success is attributed not only to the appropriateness of the objectives, but to the fact that a number of new query-optimization ideas, such as the "a-priori" trick, make association-rule mining run much faster than might be expected. In this paper we see that the same tricks can be extended to a much more general context, allowing efficient mining of very large databases for many different kinds of patterns. The general idea, called "query flocks," is a generate-and-test model for data-mining problems. We show how the idea can be used either in a general-purpose mining system or in a next generation of conventional query optimizers.
Victor Vianu, Luc Segoufin-
1998
The paper investigates the use of topological annotations (called
topological invariants) to answer topological queries in spatial
databases. The focus is on the translation of topological queries
against the spatial database into queries against the topological
invariant. The languages considered are first-order on the spatial
database side, and fixpoint and first-order on the topological
invariant side. In particular, it is shown that fixpoint expresses
precisely the ptime queries on topological invariants.
Serge Abiteboul, Victor Vianu, C.H. Papadimitriou-
1998
We propose a model of database programming with reflection (dynamic generation of queries within the host programming language), called the reflective relational machine, and characterize the power of this machine in terms of known complexity classes. In particular, the polynomial-time restriction of the reflective relational machine is shown to express PSPACE, and to correspond precisely to uniform circuits of polynomial depth and exponential size. This provides an alternative, logic-based formulation of the uniform circuit model, that may be more convenient for problems naturally formulated in logic terms, and establishes that reflection allows for more "intense" parallelism, which is not attainable otherwise (unless P = PSPACE). We also explore the power of the reflective relational machine subject to restrictions on the number of variables used, emphasizing the case of sublinear bounds.
Serge Abiteboul, Victor Vianu, Brad Fordham, Yelena Yesha-
1998
Electronic commerce is emerging as one of the major Web-supported applications requiring database support. We introduce and study high-level declarative specifications of business models, using an approach in the spirit of active databases. More precisely, business models are specified as relational transducers that map sequences of input relations into sequences of output relations. The semantically meaningful trace of an input-output exchange is kept as a sequence of log relations. We consider problems motivated by electronic commerce applications, such as log validation, verifying temporal properties of transducers, and comparing two relational transducers. Positive results are obtained for a restricted class of relational transducers called Spocus transducers (for semi-positive outputs and cumulative state). We argue that despite the restrictions, these capture a wide range of practically significant business models.
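A minimal sketch of what a Spocus-style transducer could look like under this description, with cumulative state and a semi-positive output rule; the relation names and the business rule are invented, not taken from the paper.

```python
# Hedged sketch of a relational transducer with "semi-positive outputs
# and cumulative state": inputs accumulate, outputs are positive queries.

class SpocusTransducer:
    def __init__(self):
        self.state = {"ordered": set(), "paid": set()}  # cumulative state
        self.log = []  # trace of the input-output exchange

    def step(self, inputs):
        # Cumulative state: input tuples are only ever added, never removed.
        for name, tuples in inputs.items():
            self.state[name] |= tuples
        # Semi-positive output rule (invented): ship(x) <- ordered(x), paid(x).
        output = {"ship": self.state["ordered"] & self.state["paid"]}
        self.log.append((inputs, output))
        return output

t = SpocusTransducer()
print(t.step({"ordered": {"book"}, "paid": set()}))   # {'ship': set()}
print(t.step({"ordered": set(), "paid": {"book"}}))   # {'ship': {'book'}}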
Serge Abiteboul, S. Chawathe, J. Widom-
1998
Semistructured data may be irregular and incomplete and does not
necessarily conform to a fixed schema. As with structured data, it is
often desirable to maintain a history of changes to data, and to query
over both the data and the changes. Representing and querying changes
in semistructured data is more difficult than in structured data due
to the irregularity and lack of schema. We present a model for
representing changes in semistructured data and a language for
querying over these changes. We discuss implementation strategies for
our model and query language. We also describe the design and
implementation of a "query subscription service" that permits standing
queries over changes in semistructured information sources.
Anne-Marie Vercoustre, Fran Paradis-
1998
As the WWW becomes a major source of information, a lot of
interest has arisen, not only for searching for information, but
for reusing this information in new pages or directly from
applications. Unfortunately HTML tags do not provide a
significant level of structure for identifying and extracting
information, since they are mostly used for presentation issues.
Moreover the simple link mechanism of the Web does not support
the controlled traversal of links to related pages. Particularly
promising is the proposal for a new standard, XML, which could
bring the power of SGML to the Web while keeping the simplicity
of HTML. In this paper we present a system and a language that
allow reusing information from various sources, including
databases and SGML-like documents, by combining it dynamically
to produce a virtual document. The language uses a tree-like
structure for the representation of information objects as well
as links between objects. The paper focuses on the selection and
the traversal of XML links to extract information from linked
pages. The strength of our approach is to be an SGML-compliant
solution, which makes it ready to take full advantage of XML for
reusing information from the Web as soon as it is widely used.
Stephane Grumbach, Philippe Rigaux, Luc Segoufin-
IEEE Transactions on Knowledge and Data Engineering 1998
Most spatial information systems are limited to a fixed dimension (generally 2) which is not extensible. On the other hand, the emerging paradigm of constraint databases allows the representation of data of arbitrary dimension, together with abstract query languages. The complexity of evaluating queries, though, might be costly if the dimension of the objects is really arbitrary. In this paper, we present a data model, based on linear constraints, dedicated to the representation and manipulation of multidimensional data. In order to preserve a low complexity for query evaluation, we introduce the structural dimension of an object $O$, as the dimension of the components $O_1, \ldots, O_n$ such that $O = O_1 \times \cdots \times O_n$. This allows queries to be processed independently on each component, therefore achieving a satisfying trade-off between design simplicity, expressive power of the query language and efficiency of query evaluation. We illustrate these concepts in the context of spatio-temporal databases where space and time are the natural components. This data model has been implemented in the dedale system and a spatio-temporal application, with structural dimension 2, is currently running, thus showing the practical relevance of the approach.
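To make the structural-dimension idea concrete on a small assumed encoding: a spatio-temporal object is stored as separate spatial and temporal components, each a conjunction of linear constraints, and a point query is evaluated on each component independently. Data and encodings below are invented.

```python
# Hedged illustration: O = O_space x O_time stored componentwise, with
# each component a conjunction of linear constraints a . x <= b.

def satisfies(constraints, point):
    # Each constraint (a, b) encodes a . point <= b for coefficient tuple a.
    return all(sum(ai * xi for ai, xi in zip(a, point)) <= b
               for a, b in constraints)

# Unit square in space, interval [0, 10] in time (invented data).
space_part = [((1, 0), 1), ((-1, 0), 0), ((0, 1), 1), ((0, -1), 0)]
time_part = [((1,), 10), ((-1,), 0)]

def contains(space_c, time_c, x, y, t):
    # Componentwise evaluation: no constraint mixes space and time.
    return satisfies(space_c, (x, y)) and satisfies(time_c, (t,))

print(contains(space_part, time_part, 0.5, 0.5, 3))   # True
print(contains(space_part, time_part, 0.5, 0.5, 12))  # False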
Michel Scholl, Vassilis Christophides, Anne-Marie Vercoustre, Alain Michard, Mike Stapleton, Dale Sutcliffe-
1998
Aquarelle is a three-year project supported by the Telematics Applications Programme of the European Union, aiming at
designing a resource discovery system on the Internet, applied to cultural heritage documentation. The system relies on
the Z39.50 protocol to support access to heterogeneous databases, including SGML document repositories. Its most
original features are direct linking from SGML documents to database records, an advanced link management facility, and query broadcasting to dynamically selected databases.
Stephane Grumbach, Philippe Rigaux, Luc Segoufin-
1998
This paper presents Dedale, a spatial database system intended to
overcome some limitations of current systems by providing an
abstract and non-specialized data model and query language for the
representation and manipulation of spatial objects. Dedale relies
on a logical model based on linear constraints, which generalizes
the constraint database model of [KKR90]. While in the
classical constraint model, spatial data is always decomposed into
its convex components, in Dedale holes are allowed to fit the need
of practical applications. The logical representation of spatial
data although slightly more costly in memory, has the advantage of
simplifying the algorithms. Dedale relies on nested relations, in
which all sorts of data (thematic, spatial, etc.) are stored in a
uniform fashion. This new data model supports declarative query
languages, which allow an intuitive and efficient manipulation of
spatial objects. Their formal foundation constitutes a basis for
practical query optimization. We describe several evaluation rules
tailored for geometric data and give the specification of an
optimizer module for spatial queries. Except for the latter module,
the system has been fully implemented upon the O2 DBMS, thus
proving the effectiveness of a constraint-based approach for the
design of spatial database systems.
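The representational point about holes can be illustrated on a tiny assumed encoding: a region kept as a convex outline plus convex holes, rather than decomposed into convex cells. The constraint tuples and coordinates below are invented.

```python
# Hedged sketch: a non-convex region as outline minus holes, each a
# conjunction of linear constraints a*x + b*y <= c.

def in_convex(constraints, x, y):
    return all(a * x + b * y <= c for a, b, c in constraints)

outline = [(1, 0, 4), (-1, 0, 0), (0, 1, 4), (0, -1, 0)]   # 4x4 square
hole = [(1, 0, 3), (-1, 0, -1), (0, 1, 3), (0, -1, -1)]    # 2x2 square inside

def in_region(outline, holes, x, y):
    # A point is in the region when it satisfies the outline's
    # constraints and falls in no hole.
    return in_convex(outline, x, y) and not any(in_convex(h, x, y) for h in holes)

print(in_region(outline, [hole], 0.5, 0.5))  # True: inside, outside the hole
print(in_region(outline, [hole], 2, 2))      # False: inside the hole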
Sophie Cluet, Jerome Simeon-
1998
Integration of heterogeneous data sources in a Web environment has
become a major concern of the database community. Architectures,
data models and query languages have been proposed but the
complementary problem of data conversion has been less studied. The
YAT system provides a means to build software components based on
data conversion, such as wrappers or mediators, in a simple and
declarative way. We show that the YAT system can also be used to
create integrated Web views over heterogeneous data sources very
easily. Only minor changes were required for YAT to provide data
integration (as opposed to data conversion) in a Web environment.
Finally, we report on our experience while building the
Verso Web
site using YAT.
Daniel Chan, Karl Leung-
1998
The workflow paradigm is advancing the scope of
process modelling by supporting complex control flows, rich process
structures, and the integration of heterogeneous and legacy systems.
However, studies of the paradigm have been seriously hindered by the
lack of a concrete framework. In order to improve understanding of
the paradigm, enable studying of interaction between different
features, and facilitate the development of more advanced
features, this paper presents a framework in the form of a workflow
language called Valmont. Unlike many other proposals, Valmont captures
all the fundamental elements of the workflow paradigm at an
abstraction level that is suitable for workflow application designers.
More specifically, it provides a rich organisation model that can
capture distribution using domains, an information model that also
records presentation details, as well as a sophisticated process model that
supports active rules and workflow transactions. This paper presents
an overview of Valmont by providing a solution to the ISPW-6 Software
Process Example. It also serves to provide an insight into the needs
of novel application domains some of which remain to be addressed by
the next generation of workflow management systems.
Anne-Marie Vercoustre, Craig Lindley-
1998
To facilitate maximum reuse of video data, well-specified and systematic methods are required for
the incorporation of digital video components into multiple delivery products.
This paper presents an approach to the intelligent synthesis of digital video products
by the representation of virtual videos using SGML. Virtual videos are
defined using a Virtual Video Prescription to specify a generic model
of a video production that includes embedded queries for the dynamic retrieval
of video content from underlying databases. An architecture for virtual
video synthesis is described, and a detailed example is presented to demonstrate the prescription
language syntax and its underlying concepts. Virtual Video Prescriptions for various video types
and genres can include increasingly complex video models and composition rules, leading to
a synthesis of SGML-based document technology with knowledge base systems technology.
Sophie Cluet, Claude Delobel, Jerome Simeon, Katarzyna Smaga-
1998
Due to the development of the World Wide Web, the integration of
heterogeneous data sources has become a major concern of the database
community. Appropriate architectures and query languages have been
proposed. Yet, the problem of data conversion which is essential for
the development of mediators/wrappers architectures has remained
largely unexplored.
In this paper, we present the
YAT system for data conversion. This
system provides tools for the specification and the implementation of
data conversions among heterogeneous data sources. It relies on a
middleware model, a declarative language, a customization mechanism
and a graphical interface.
The model is based on named trees with ordered and labeled nodes. Like
semistructured data models, it is simple enough to facilitate the
representation of any data. Its main originality is that it allows
reasoning at various levels of representation. The YAT conversion
language (called YATL) is declarative, rule-based and features
enhanced pattern matching facilities and powerful restructuring
primitives. It allows one to preserve or reconstruct the order of
collections. The customization mechanism relies on program
instantiations: an existing program may be instantiated into a more
specific one, and then easily modified. We also present the
architecture, implementation and practical use of the YAT prototype,
currently under evaluation within the OPAL project.
Anne-Marie Vercoustre, F. Paradis-
1997
The importance of reuse is well recognised for electronic document
writing. However, it is rarely achieved satisfactorily because of the
complexity of the task: integrating different formats, handling
updates of information, addressing document authors' need for
intuitiveness and simplicity, etc. In this paper, we present a
language for information reuse that allows users to write virtual
documents, where dynamic information objects can be retrieved from
various sources, transformed, and included along with static
information in SGML documents. The language uses a tree-like
structure for the representation of information objects, and allows
querying without a complete knowledge of the structure or the types of
information. The data structures and the syntax of the language are
presented through an example application. A major strength of our
approach is to treat the document as a non-monolithic set of reusable
information objects.
Daniel Chan, Phil Trinder-
1997
In order to execute database queries efficiently, they
are usually translated into an algebra: a small set of
operations together with some equivalences. The
operations are implemented very efficiently by the
database engine, and the equivalences are used to optimise
the query. Since query languages for object-oriented
databases are more complex than their relational counterparts,
the algebra and translation, or "processing framework",
is more elaborate. This paper describes the processing
framework for object comprehensions, a powerful
object-oriented query language. Both operations and
equivalences of the algebra are defined, together with a
translation into it. The framework is illustrated by
translating and optimising an example object comprehension
query.
Michel Scholl, Philippe Picouet, Oliver Gunther, Vincent Oria, Jean-Marc Saglio-
1997
Abstract Spatial joins are join operations that involve spatial data
types and operators. Spatial access methods are often used to speed up
the computation of spatial joins. This paper addresses the issue of
benchmarking spatial join operations. For this purpose, we first
present a WWW-based tool to produce sets of
rectangles. Experimenters can use a standard Web browser to specify
the number of rectangles, as well as the statistical distributions of
their sizes, shapes, and locations. Second, using the rectangle
generator and a well-defined set of statistical models we defined
several test suites to compare the performance of three spatial join
algorithms: nested loop, scan-and-index, and synchronized tree
traversal. We also added a real life data set, the Sequoia 2000
storage benchmark. Our results confirm that the use of spatial indices
leads to performance gains of several orders of magnitude. The tests
also show that highly selective join predicates enjoy greater
performance gains, and vice versa. All of the statistical models and
algorithms are available on the Web, which allows for easy
verification and modification of our experiments.
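For concreteness, here is the nested-loop baseline such a benchmark measures the indexed algorithms against, together with the rectangle-overlap predicate they all rely on; the rectangle encoding and the data are assumptions, not the benchmark's actual code.

```python
# Baseline nested-loop spatial join over axis-aligned rectangles.
# Assumed encoding: (xmin, ymin, xmax, ymax).

def overlaps(r, s):
    # Two rectangles intersect iff they overlap on both axes.
    return r[0] <= s[2] and s[0] <= r[2] and r[1] <= s[3] and s[1] <= r[3]

def nested_loop_join(R, S):
    # O(|R| * |S|): the yardstick the indexed algorithms are measured against.
    return [(r, s) for r in R for s in S if overlaps(r, s)]

R = [(0, 0, 2, 2), (5, 5, 6, 6)]
S = [(1, 1, 3, 3), (7, 7, 8, 8)]
print(nested_loop_join(R, S))  # [((0, 0, 2, 2), (1, 1, 3, 3))]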
Serge Abiteboul, Sophie Cluet, Tova Milo-
1997
A primary motivation for new database technology is to provide support
for the broad spectrum of multimedia data available notably through
the network. These data are stored under different formats: SQL or
ODMG (in databases), SGML or LaTex (documents), DX formats (scientific
data), Step (CAD/CAM data), etc. Their integration is a very active
field of research and development (see for instance, for a very small
sample, [10, 6, 7, 9, 8, 12, 19, 20]). In this paper, we provide a
formal foundation to facilitate the integration of such heterogeneous
data and the maintenance of heterogeneous replicated data.
Victor Vianu-
1997
Databases provide one of the main concrete scenarios for finite-model
theory within computer science. This paper presents an informal
overview of database theory aimed at finite-model theorists,
emphasizing the specificity of the database area. It is argued that the
area of databases is a rich source of questions and vitality for
finite-model theory.
Michel Scholl, Stephane Grumbach, Philippe Rigaux, Luc Segoufin-
1997
This paper presents a first prototype of a constraint database for
spatial information, dedale. Implemented on top of the O2 DBMS, data
is stored in an object-oriented framework, with spatial data
represented using linear constraints over a dense domain. The query
language is the standard OQL, with special functions for constraint
solving and geometric operations. A simple geographical application
from the French Institute for Geography, IGN, is running on
dedale. The data initially in vector mode was loaded into the database
after a translation to constraint representation. Although it is too
early to speak of performance since not all operations have been
optimized yet, our experience with dedale demonstrates already the
advantages of the constraint approach for spatial manipulation.
Serge Abiteboul, B. Fordham, Y. Yesha-
1997
Many complex and dynamic database applications such as product
modeling and negotiation monitoring require a number of features that
have been adopted in semantic models and databases such as active
rules, constraints, inheritance, etc. Unfortunately, each feature has
largely been considered in isolation. Furthermore, in a commercial
negotiation, participants staking their financial well-being will
never accept a system they cannot gain a precise behavioral
understanding of. We attack these problems with a rich and extensible
database model, evolving databases, with a clear and precise semantics
based on evolving algebras [5]. We also briefly describe a prototype
implementation of the model [12].
Victor Vianu, Philippe Picouet-
1997
The expressiveness and complexity of several active database
prototypes are formally studied. First, a generic framework for the
specification of active databases is developed. This factors out the
common aspects of the prototypes considered, and allows studying
various active database features independently of any specific
prototype. Furthermore, each of the prototypes can be specified by
specializing certain parameters of the framework. The prototypes
considered are ARDL, HiPAC, Postgres, Starburst, and Sybase. Using
their formal specifications, the prototypes are compared to each other
with respect to expressive power. The results provide insight into the
programming paradigm of active databases, the interplay of various
features, and their impact on expressiveness and complexity.
Serge Abiteboul, S. Nestorov, R. Motwani-
1997
When dealing with semistructured data such as that available on the
Web, it becomes important to infer the inherent structure, both for
the user (e.g., to make querying easier) and for the system (e.g., to
optimize access). In this paper, we consider the problem of identifying some
underlying structure in large collections of semistructured data. Since
we expect the data to be fairly irregular, this structure consists of
an approximate classification of objects into a hierarchical
collection of types. We propose a notion of a type hierarchy for such
data, an algorithm for deriving the type hierarchy, and rules for
assigning types to data elements.
Serge Abiteboul, J. Widom, J. McHugh, R. Goldman, D. Quass-
1997
Lore (for Lightweight Object Repository) is a DBMS designed
specifically for managing semistructured information. Implementing Lore
has required rethinking all aspects of a DBMS, including storage
management, indexing, query processing and optimization, and user
interfaces. This paper provides an overview of these aspects of the
Lore system, as well as other novel features such as dynamic
structural summaries and seamless access to data from external
sources.
Sophie Cluet-
1997
Since its creation in 1989, the Web has known tremendous success due
to its capacity to deliver information very inexpensively worldwide.
Serge Abiteboul-
1997
In this paper, we discuss some aspects of database support for digital
libraries. From a DL perspective, database systems, and in particular,
object database systems provide a nice basis for future DL
systems. More generally, database research provides solutions to many
DL issues even if these are partial or fragmented. When possible, work
should not be duplicated and good software and ideas should be
reused. From a DB perspective, we want to stress that digital
libraries propose beautiful applications and challenges to DBMS
technology. They suggest a number of improvements to DBMSs that could
be beneficial beyond DL applications.
Claude Delobel, Zoe Lacroix, Philippe Breche-
1997
We present a formal data model for views in Object DataBase Systems
(ODBS) which is a continuation of the work presented in [AK89, AB91,
dSDA94, dS95a]. The main contribution of this paper is the
formalization of views as a transformation mechanism for databases
using a generalization of referent proposed in [dS95b]. We define an
IQL-like language adapted to manipulate referents. The view-based
transformation is achieved in two steps: an extension of the source
base followed by a projection of the extended base. The extension and
projection can be carried out using four object algebraic operators
that specify both the virtual schema and its corresponding virtual
instance. This simple algebra can express all the view operators
proposed in [KR96, LS92, dS95b, KRR95, CTR96, KR96].
Cassio Souza Dos Santos, Philippe Breche, Sihem Amer-Yahia-
1997
We present a mechanism for updating views in Object DataBase Systems
(ODBS). Starting from an existing view data model, chronologically in
line with [1, 2, 14], we extend it to provide data update
ability. Updates to data views are performed in such a way that
consistency with root data is preserved. The main contribution of this
paper is to detail such a mechanism together with its corresponding
data model. This is, to the best of our knowledge, the first
implementation of such a mechanism in the ODBS framework (as opposed
to the RDBS framework). Since this view mechanism is ODMG 93 compliant
- views are specified with OQL queries extended by operations on
attributes (attribute hiding, renaming and derivation) - it may be
implemented on top of various ODBSs. We discuss issues on
implementation on top of the O2 ODBS and consider related
work. Various applications of updatable views conclude the article.
Claude Delobel, Zoe Lacroix, Philippe Breche-
1997
We present a formal data model for views in Object DataBase Systems
(ODBS) which is a continuation and a formalization of the work
presented in [AK89, AB91, ADdS94, dS95a]. The main contribution of
this paper is the formalization of the views as a transformation
mechanism for databases using a generalization of referent proposed in
[dS95b]. The view-based transformation is achieved in two steps: an
extension of the source base followed by a projection of the extended
base. The extension and projection can be carried out using five
object algebraic operators which are described in detail in this
paper. Equally important, this paper highlights the challenges
involved in supporting our view mechanism. We show how this basic model
for views points to new perspectives, namely, the extension of OQL, the
translation of a query on the view into a query on the source schema
and base, and finally the characterization of the class of
view-transformations our algebra captures.
Stephane Grumbach, Zoe Lacroix-
1997
Non-deterministic computation is not really computation, but the
difference with real computation is blurred. We study in detail various
levels of non-determinism in computations on non-deterministic Turing
machines with polynomial bounds on the resources. Meanwhile, we
consider numerous query languages, implicit logic, choice logic, order
invariant logic, and restrictions of second-order logic, and establish
correspondences between all these formalisms for both deterministic
and non-deterministic queries. To the degrees of non-determinism in
the computations, correspond levels of non-determinism in the logical
definitions. Our main contribution is to characterize various
complexity classes between Ptime and Pspace, by several logical means,
thus translating open questions in complexity theory to open questions
in logic related to the use of the non-determinism.
Sophie Cluet, Vassilis Christophides, Guido Moerkotte, Jerome Simeon-
1997
Query languages for object bases have been enriched by generalized path
expressions that allow for attribute and path variables. Optimizing
queries containing generalized path expressions has attracted some
interest. However, many interesting queries still require a full scan
over the whole object base. This unbearable situation can be remedied
best by utilizing index structures. However, traditional database
indexes fail to support generalized path expressions. We propose to
use a flexible mix of database and full text indexes in order to
evaluate efficiently queries containing generalized path
expressions. We introduce an algebraic framework for the optimization
of such queries using full text indexes, and we report on a first
prototype implementation including some performance results.
Serge Abiteboul, Victor Vianu-
1997
The paper introduces a model of the Web as an infinite, semistructured
set of objects. We reconsider the classical notions of genericity and
computability of queries in this new context and relate them to styles
of computation prevalent on the Web, based on browsing and
searching. We revisit several well-known declarative query languages
(first-order logic, Datalog, and Datalog with negation) and consider
their computational characteristics in terms of the notions introduced in
this paper. In particular, we are interested in languages or fragments
thereof which can be implemented by browsing, or by browsing and
searching combined. Surprisingly, stratified and well-founded semantics
for negation turn out to have basic shortcomings in this context,
while inflationary semantics emerges as an appealing alternative.
Stephane Grumbach, Jianwen Su-
1997
In this paper, we study the expressive power and the complexity of
first-order logic with arithmetic, as a query language over relational
and constraint databases. We consider constraints over various domains
(N, Z, Q, and R), and with various arithmetical operations (≤, +,
×, etc.).
We first consider the data complexity of first-order queries. We prove
in particular that linear queries can be evaluated in AC0 over finite
integer databases, and in NC1 over linear constraint databases. This
improves previously known bounds. We also show that over all domains,
enough arithmetic leads to arithmetical queries, thereby showing the
frontiers of constraints for database purposes.
We then tackle the problem of the expressive power, with the
definability of the parity and the connectivity, which are the most
classical examples of queries not expressible in first-order logic
over finite structures. We prove that these two queries are
first-order definable in presence of (enough)
arithmetic. Nevertheless, we show that they are not definable with
constraints of interest for constraint databases such as linear
constraints for instance. Finally, we develop reduction techniques
for queries over constraint databases, that allow us to draw
conclusions with respect to their undefinability in various constraint
query languages.
Serge Abiteboul, Sophie Cluet, Tova Milo, Vassilis Christophides, Guido Moerkotte, Jerome Simeon-
1997
We consider the problem of storing and accessing documents (SGML and
HTML, in particular) using database technology.
To specify the database image of documents, we use structuring schemas
that consist in grammars annotated with database programs. To query
documents, we introduce an extension of OQL, the ODMG standard query
language for object databases. Our extension (named OQL-doc) allows
querying documents without a precise knowledge of their structure, using
in particular generalized path expressions and pattern matching. This
allows us to introduce, in a declarative language (in the style of SQL
or OQL), navigational and information-retrieval styles of accessing
data.
Query processing in the context of documents and path expressions
leads to challenging implementation issues. We extend an object
algebra with new operators to deal with generalized path
expressions. We then consider two essential complementary optimization
techniques: 1. we show that almost standard database optimization
techniques can be used to answer queries without having to load the
entire document into the database. 2. we also consider the interaction
of full-text indexes (e.g., inverted files) with standard database
collection indexes (e.g., B-trees) that provide important speedup.
The paper is an overview of a research project at
INRIA-Rocquencourt. Some particular aspects are detailed in [ACM93,
CACS94, ACM95, CCM96].
Serge Abiteboul-
1997
The amount of data of all kinds available electronically has increased
dramatically in recent years. The data resides in different forms,
ranging from unstructured data in file systems to highly structured data in
relational database systems. Data is accessible through a variety of
interfaces including Web browsers, database query languages,
application-specific interfaces, or data exchange formats. Some of
this data is raw data, e.g., images or sound. Some of it has structure
even if the structure is often implicit, and not as rigid or regular
as that found in standard database systems. Sometimes the structure
exists but has to be extracted from the data. Sometimes it also exists
but we prefer to ignore it for certain purposes such as browsing. Here we
call semi-structured this data that is (from a particular
viewpoint) neither raw data nor strictly typed, i.e., neither
table-oriented as in a relational model nor sorted-graph as in object
databases.
Serge Abiteboul, Victor Vianu-
1997
The evaluation of path expression queries on semistructured data in a
distributed asynchronous environment is considered. The focus is on
the use of local information expressed in the form of path constraints
in the optimization of path expression queries. In particular,
decidability and complexity results on the implication problem for
path constraints are established.
Daniel Chan, Karl Leung-
1997
It is a general consensus that automated support for software
development is essential to harness the ever increasing complexity
of today's software. Many software development models, tools, and
environments have been introduced to address such a need; however, they
are usually methodology-specific and impose a rather authoritarian
policy on the way software is developed. This paper advocates the
use of workflow systems to enact the process of software development.
Besides being more general and flexible, the workflow paradigm
supports useful features lacking in other approaches. Also, it helps
to reduce development complexity by allowing both the software
development process and the software themselves to be captured using
the very same paradigm. This paper introduces a workflow system
being developed to support the software development process
by presenting a solution to the ISPW-6 Software Process Example
expressed in its specification language. This paper therefore serves
two purposes: (1) to introduce a new and more general approach to
software process enactment, and (2) to identify new requirements for
the workflow paradigm, such as event dependency, that are applicable
to many other advanced applications.
Anne-Marie Vercoustre, B. Hills-
1997
This paper explores the issue of representing textual information in
the form of virtual documents that include data and fragments of
documents from remote sources -- especially from databases and SGML
document databases. Virtual documents are dynamically generated, and
therefore always present up-todate information when they are
instantiated.. The benefit of this paradigm is that it allows
information to be shared, reused, and adapted for various contexts. A
Virtual document specification defines how to find and retrieve
information objects from databases or from existing documents, and how
to assemble it into another document. Virtual documents can be used
to create HTML pages that contain information from one or several
remote or local databases, to assemble parts of existing documents
into a new one, or to define various views of the same information
according to various needs. This paper focuses on the prototype
implementation of virtual documents from the perspectives of document
authoring and architecture. We propose an SGML syntax for Information
Objects that includes OQL queries for retrieving fragments of existing
documents, transformations on an intermediate tree representation, and
output mapping to the virtual document structures.
Serge Abiteboul, J. McHugh, R. Goldman, V. Vassalos, Y. Zhuge-
1997
Defining a view over a semistructured database introduces many new
problems. In this paper we propose a view specification language and
consider the problem of answering queries posed over views. The two
main approaches, query rewriting and view materialization, are
outlined with focus on the difficult problems caused by the
semistructured nature of the data.
Philippe Breche-
1996
Currently, existing Object Database Systems (ODBSs) perform schema
changes by means of primitives closely related to their respective
data model. Software Engineering (SE) applications, Object
Methodologies (OM) and designers building up object database schemata
require a more abstract level. This paper addresses new facilities for
updating a schema filling the gap between object design and object
programming. A set of advanced primitives, so-called "High-Level
Primitives", is presented to cope with these requirements. The
semantics of these new primitives and how to maintain a schema
consistent after such a schema update are the main contributions of
this paper. We discuss issues on implementation on top of the
commercial ODBS O2 and consider related work.
Sophie Cluet, Vassilis Christophides, Guido Moerkotte-
1996
In the past few years, query languages featuring generalized path
expressions have been proposed. These languages allow the
interrogation of both data and structure. They are powerful and
essential for a number of applications. However, until now, their
evaluation has relied on a rather naive and inefficient approach.
Marie-Jo Bellosta, Robert Wrembel, Genevieve Jomier-
1996
An object database is defined by a schema and an instance of the
schema. The schema describes the structure of the data stored in the
database, including classes, types associated with classes,
inheritance relationships between classes, and method interfaces. The
instance of the schema contains all objects stored in the database,
and method implementations. The expressiveness of object database models
is the main reason for their wide success in CAD/CAM, CSCW and CASE
applications. Moreover, these applications require database versioning
in order to view and manage different states of the modeled
universe. Such databases are called multiversion. The main problem of
multiversion databases is to recognize which versions of different
multiversion entities, i.e., classes and/or objects, are consistent,
i.e., go together. As shown in [CJ90], a user is quickly lost if he
needs to manage explicitly consistency between all versions of both
classes and objects without some efficient support from the system.
Serge Abiteboul, Yannis Papakonstantinou, Hector Garcia-Molina-
1996
One of the main tasks of mediators is to fuse information from
heterogeneous information sources. This may involve, for example,
removing redundancies, and resolving inconsistencies in favor of the
most reliable source. The problem becomes harder when the sources are
unstructured/semistructured and we do not have complete knowledge of
their contents and structure. In this paper we show how many common
fusion operations can be specified non-procedurally and
succinctly. The key to our approach is to assign semantically
meaningful object ids to objects as they are "imported" into the
mediator. These semantic ids can then be used to specify how various
objects are combined or merged into objects "exported" by the
mediator. In this paper we also discuss the implementation of a
mediation system based on these principles. In particular, we present
key optimization techniques that significantly reduce the processing
costs associated with information fusion.
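A hedged sketch of the semantic-id idea described above: assign each imported object an id derived from its content, so that objects denoting the same entity merge, with the more reliable source winning conflicts. The id function, the sources and the data are invented.

```python
# Hedged sketch of fusion via semantic object ids (not the paper's
# mediator system): ids come from content, not from the source.

def semantic_id(obj):
    return ("person", obj["name"].lower())  # invented id rule

def fuse(sources):
    fused = {}
    # Sources are listed from least to most reliable: later wins on conflicts.
    for source in sources:
        for obj in source:
            fused.setdefault(semantic_id(obj), {}).update(obj)
    return fused

web = [{"name": "Smith", "affiliation": "unknown"}]
db = [{"name": "smith", "affiliation": "INRIA"}]
print(fuse([web, db]))  # one object: redundancy removed, db wins the conflict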
Michel Scholl, Gilles Grangeret, Xavier Rehse-
1996
Geographic Information Systems (GIS) utilize Database Management
Systems (DBMS) for storing and retrieving geo-referenced data. Such
data is described by means of attributes defined on domains such as
Integer, String, etc., as well as spatial attributes that represent
the data geometry and location on earth. Like conventional DBMSs, GISs
have to efficiently answer user queries related to the objects'
location (geometry) and expressed in a spatial query language. Thus,
one crucial challenge of GIS design is spatial query optimization.
Serge Abiteboul, Laurent Herr, Jan Van den Bussche-
1996
A database history can be modeled as a (finite) sequence of instances
discretely ordered by time. Similarly, the behavior of a system such
as an operating system or a reactive system can be modeled by an
infinite such sequence. One can view the sequence as one single
database where each relation has an additional column holding the time
instant of validity of each tuple. The temporal database can then be
queried using standard relational calculus (first-order logic) on this
"timestamp" representation. One may alternatively use an implicit
access to time and express queries in temporal logic. It is known that
these two approaches yield the same expressive power in the
propositional case. Their comparison in the predicate/database context
remained open. We prove here that there are first-order logic queries
on the timestamp representation that are not expressible in (extended)
temporal logic. The proof technique is novel and is based on
communication complexity.
Serge Abiteboul, Dallan Quass, Jason McHugh, Jennifer Widom, Janet Wiener-
1996
We present the Lorel language, designed for querying semistructured
data. Semistructured data is becoming more and more prevalent, e.g.,
in structured documents such as HTML and when performing simple
integration of data from multiple sources. Traditional data models and
query languages are inappropriate, since semistructured data often is
irregular, some data is missing, similar concepts are represented
using different types, heterogeneous sets are present, or object
structure is not fully known. Lorel is a user-friendly language in the
SQL/OQL style for querying such data effectively. For wide
applicability, the simple object model underlying Lorel can be viewed
as an extension of ODMG and the language as an extension of OQL.
The main novelties of the Lorel language are: (i) extensive use of
coercion to relieve the user from the strict typing of OQL, which is
inappropriate for semistructured data; and (ii) powerful path
expressions, which permit a flexible form of declarative navigational
access and are particularly suitable when the details of the structure
are not known to the user. Lorel also includes a declarative update
language.
Lorel is implemented as the query language of the Lore prototype
database management system at Stanford (see
http://www-db.stanford.edu/lore). In addition to presenting the Lorel
language in full, this paper briefly describes the Lore system and
query processor. We also discuss how Lorel could be implemented on top
of a conventional object-oriented database management system.
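Not Lore's implementation, but a loose Python rendering of the two novelties named above, coercion and flexible path expressions, over an assumed nested-dict encoding of OEM-like data (the "%" wildcard syntax is also an assumption):

```python
# Hedged sketch: path expressions with a wildcard over irregular data,
# with coercion so that a single value and a set of values behave alike.

def children(node):
    items = node.items() if isinstance(node, dict) else []
    # Coercion: one value and a list of values are traversed the same way.
    return [(k, v) for k, vs in items
            for v in (vs if isinstance(vs, list) else [vs])]

def follow(nodes, step):
    return [v for node in nodes for k, v in children(node)
            if step == "%" or k == step]

def query(root, path):
    nodes = [root]
    for step in path.split("."):
        nodes = follow(nodes, step)
    return nodes

biblio = {"paper": [{"author": "Abiteboul", "year": 1996},
                    {"author": ["Quass", "Widom"]}]}  # irregular: set vs value
print(query(biblio, "paper.author"))  # ['Abiteboul', 'Quass', 'Widom']
print(query(biblio, "%.year"))        # [1996]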
Serge Abiteboul, Sophie Cluet, Tova Milo-
1995
Database systems are concerned with structured data. Unfortunately,
data is still often available in an unstructured manner (e.g., in
files) even when it does have a strong internal structure (e.g.,
electronic documents or programs). In a previous paper [2], we
focussed on the use of high-level query languages to access such files
and developed optimization techniques to do so. In this paper, we
consider how structured data stored in files can be updated using
database update languages.
Victor Vianu, Sergio Lifschitz-
1995
We explore an approach to developing Datalog parallelization
strategies that aims at good expected rather than worst-case
performance. To illustrate, we consider a very simple parallelization
strategy that applies to all Datalog programs. We prove that this has
very good expected performance under equal distribution of
inputs. This is done using an extension of 0-1 laws adapted to this
context. The analysis is confirmed by experimental results on randomly
generated data.
Stephane Grumbach, Fariza Tahi-
1995
DNA sequences contain a high percentage of redundancies, such as
"palindromic" repetitions, that universal compression algorithms fail
to detect. We present the algorithm "Biocompress-2" to compress
nucleotide sequences, which, to the best of our knowledge, is the most
efficient algorithm on our corpus of DNA sequences. The rates are not
satisfactory, but this technique leads to the discovery of relevant
regularities, and presents a connection with the techniques of
construction of phylogenetic trees and prediction of secondary
structures of RNAs.
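The kind of regularity exploited here is easy to demonstrate, the actual Biocompress-2 algorithm aside: the sketch below finds k-mers whose reverse complement reappears later in the sequence, the "palindromic" repetitions that a generic compressor ignores. The sequence and the window size are invented.

```python
# Hedged sketch: detect reverse-complement ("palindromic") repeats in DNA.

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(s):
    return s.translate(COMPLEMENT)[::-1]

def palindromic_repeats(seq, k):
    # Report (i, j) when the k-mer at i reappears reverse-complemented at j.
    kmers = {}
    for i in range(len(seq) - k + 1):
        kmers.setdefault(seq[i:i + k], []).append(i)
    return [(i, j) for i in range(len(seq) - k + 1)
            for j in kmers.get(reverse_complement(seq[i:i + k]), []) if j > i]

print(palindromic_repeats("AACGTTGGAACGTT", 6))  # [(0, 8)]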
Stephane Grumbach, Zoe Lacroix-
1995
We consider rational linear constraint databases, and study the
problem of evaluating efficiently first-order queries with linear
constraints. We show that although queries admit low data complexity
(NC), their naive evaluation is rather inefficient. The computation
paradigm is based on relational algebra and constraint solving. We
focus on the former, and show that the query processing should differ
drastically from classical relational databases. In particular,
optimization strategies used for relational queries may be
inadequate. We describe the problem and propose some preliminary
optimization principles.
Serge Abiteboul, Jan Van den Bussche-
1995
We revisit the notion of deep equality among objects in an object
database from a formal point of view. We present three natural
formalizations of deep equality: one based on the infinite value-trees
associated with objects, one based on the greatest fixpoint of an
operator on equivalence relations among objects, and one based on
indistinguishability of objects using observations of atomic values
reachable from the objects. These three definitions are then shown to
be equivalent. The characterization in terms of greatest fixpoints
also yields a polynomial-time algorithm for checking deep equality. We
also study the expressibility of deep equality in deductive database
languages.
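The greatest-fixpoint characterization translates directly into a naive but polynomial refinement procedure: start by assuming all objects equivalent, then separate objects whose records disagree. The object encoding below (records mapping attributes to atoms or oids, where an oid is anything that is a key of the store) is an assumption.

```python
# Hedged rendering of deep equality as a greatest fixpoint: refine an
# equivalence on oids until stable. Cycles are handled for free.

def deep_equal_classes(store):
    eq = {o: 0 for o in store}  # greatest start: everything equivalent
    while True:
        def sig(o):
            # Distinguish oids from atoms by membership in the store
            # (an assumed encoding convention).
            return frozenset(
                (attr, ("oid", eq[v]) if v in store else ("atom", v))
                for attr, v in store[o].items())
        sigs = {o: (eq[o], sig(o)) for o in store}
        renumber, new = {}, {}
        for o in store:
            new[o] = renumber.setdefault(sigs[o], len(renumber))
        if new == eq:
            return eq
        eq = new

# Two cyclic lists alternating atoms a, b: deeply equal despite distinct oids.
store = {"x": {"val": "a", "next": "y"}, "y": {"val": "b", "next": "x"},
         "u": {"val": "a", "next": "v"}, "v": {"val": "b", "next": "u"}}
classes = deep_equal_classes(store)
print(classes["x"] == classes["u"])  # True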
Cassio Souza Dos Santos-
1995
We discuss the design and implementation of the O2Views
object-oriented database view mechanism, which allows the redefinition
of both the structure and the behavior of objects stored in a
database. The data model extended with views is given and the
functionalities of the prototype implementing it are presented. The
paper focuses on the requirements for the implementation of an
object-oriented view mechanism, ranging from the conception of a view
definition language to optimization strategies for querying and
updating through views.
Sophie Cluet, Guido Moerkotte-
1995
A new method for efficiently evaluating queries with aggregate
functions is presented. More specifically, we introduce a class of
aggregate queries where traditional query evaluation strategies in
general require O(n^2) time and space in the size of the (at most two)
input relations. For this class of aggregate queries our approach
needs at most O(n log n) time and linear space. Further, our approach
deals not only with relations but with general bulk types like sets,
bags, and lists.
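As a plausible member of such a class (the paper's exact class is not reproduced here), consider summing, for each R-tuple, the S-values with smaller keys: nested evaluation is quadratic, while sorting plus prefix sums brings it to O(n log n). Relation names and data are invented.

```python
# Hedged sketch of the sort-based strategy: one sort, prefix sums, and
# a binary search per R-tuple replace a full scan of S per R-tuple.

import bisect

def agg_below(R, S):
    S_sorted = sorted(S)                    # (key, value) pairs
    keys = [k for k, _ in S_sorted]
    prefix = [0]
    for _, v in S_sorted:
        prefix.append(prefix[-1] + v)       # prefix[i] = sum of first i values
    # For each r, one binary search instead of an O(n) scan of S.
    return {r: prefix[bisect.bisect_left(keys, r)] for r in R}

R = [5, 15]
S = [(3, 10), (7, 20), (12, 30)]
print(agg_below(R, S))  # {5: 10, 15: 60}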
Michel Scholl, J. P. Peloux, G. Reynal de St Michel-
1995
We report a measurement experiment for evaluating the access
performance of point queries with several spatial data structures
implemented with O2C, the development language of the object-oriented
database management system (DBMS) O2 [O291]. Four data structures were
implemented and tested: the first one is a quadtree structure [Sam90],
the second one is an extension which turned out to be very close to
Hierarchical Excell [Tam83]. The third and the fourth data structures
are the well known tree structures called R*tree [BKSS90] and R+tree
[SRF87]. The objective we had in mind was twofold: (i) contribute to
the choice of spatial access methods to be implemented in geographic
DBMS in order to accelerate spatial queries among which point and
window queries are the most common; (ii) understand where the
performance bottlenecks are when implementing such structures with
commercial object-oriented DBMS. Our goal was not to find the best
data structure but to understand the behavior of each data structure
with an actual implementation.
Stephane Grumbach, Zoe Lacroix, Steven Lindell-
1995
We propose a natural generalization of the concept of implicit
definitions over finite structures, allowing non-determinism at an
intermediate level of a (deterministic) definition. These generalized
implicit definitions offer more expressive power than classical implicit
definitions. Moreover, their expressive power can be characterized
over unordered finite structures in terms of the complexity class NP ∩
co-NP. Finally, we investigate a subclass of these where the
non-determinism is restricted to the choice of a unique relation with
respect to an implicit linear order, and prove that it captures UP ∩
co-UP also over the class of all finite structures. These results shed
some light on the expressive power of non-deterministic primitives.
Philippe Breche, M. Worner-
1995
In schema design for object database systems (ODBS) a designer
requires advanced facilities to change a schema through its life
cycle. Removing information embedded in a complex class hierarchy is
one of these advanced schema update primitives. Consequently to such a
schema change, databases might need updates, guided by designers'
requests. Most of the ODBSs disallow class deletions or restrict it to
leaf classes without objects or lose objects after the schema update
so far. The main contribution of this paper is to present a
declarative primitive, Remove, which copes with these requirements.
Serge Abiteboul, Cassio Souza Dos Santos-
1995
Object-oriented databases have brought major improvements in data
modeling by introducing notions such as inheritance or
methods. Extensions in many directions are now considered with
introductions of many concepts such as versions, views or roles. These
features bring the risk of creating monster data models with a number
of incompatible appendixes. We do not propose here any new extension
or any novel concept. We show more modestly that many of these
features can be formally and (we believe) cleanly combined in a
coherent manner.
Laurent Herr-
1995
Les requ
Michel Scholl, Philippe Rigaux-
1995
We study the impact of scale on data representation from both the
modelling and querying points of view. While our starting point was
geographical applications, statistical databases also address this
problem of data representation at various levels of abstraction. From
these requirements, we propose a model which allows: (i) database
querying without exact knowledge of the data abstraction level, (ii)
the computation of multiple representations of data, one per
abstraction level, and (iii) its application to the computation of
statistical summaries. The model has been partially implemented with
the DBMS O2 by means of tree-structured domains: we give some examples
which illustrate the above features.
Claude Delobel, Cassio Souza Dos Santos, Didier Tallot-
1995
This paper investigates the problem of integration between relational
and object-oriented databases. We discuss an approach based on class
and attribute mappings and show how OQL, the ODMG query language, can
be embedded in a set of primitives of a mapping language to serve as a
basis for object-relational data integration. We then describe the
technique used in the implementation of a prototype built on top of an
object-oriented view mechanism, which allows the construction of an
object view of a relational database.
Sophie Cluet, Guido Moerkotte-
1995
Producing optimal left-deep trees is known to be NP-complete for
general join graphs and a quite complex cost function counting disk
accesses for a special block-wise nested-loop join [2]. Independent of
any cost function is the dynamic programming approach to join
ordering. The number of alternatives this approach generates is known
as well [5]. Further, it is known that for some cost functions (those
fulfilling the ASI property [4]) the problem can be solved in
polynomial time for acyclic query graphs, i.e., tree queries [2,
3]. Unfortunately, some cost functions like sort merge could not be
treated so far. We do so by a slight detour, showing that this cost
function (and others too) is optimized if and only if the sum of the
intermediate result sizes is minimized. This validates the database
folklore that minimizing intermediate result sizes is a good
heuristic. Then we show that summing the intermediate result sizes
has the ASI property. This further motivates us to restrict the
subsequent investigations to this cost function called Cout for which
we show that the problem remains NP-complete in the general
case. Then, we concentrate on the main topic of the paper: the
complexity of producing left-deep processing trees possibly containing
cross products. Considering cross products is known to possibly result
in cheaper plans [5]. More specifically, we show that the problem
(LD-X-Star) of generating optimal left-deep processing trees possibly
containing cross products is NP-complete for star queries. Further, we
give an algorithm for treating star queries which is more efficient.
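A brute-force reference point for this discussion: enumerate left-deep orders, cross products allowed (as in LD-X-Star), and score each by Cout, the sum of intermediate result sizes. Cardinalities and pairwise selectivities are invented inputs.

```python
# Hedged sketch: score left-deep join orders by Cout and pick the best.

from itertools import permutations

def cout(order, card, sel):
    size, joined, total = card[order[0]], {order[0]}, 0
    for rel in order[1:]:
        # Each applicable predicate shrinks the product; with none, this
        # step is a pure cross product (factor stays 1.0).
        factor = 1.0
        for prev in joined:
            factor *= sel.get(frozenset((prev, rel)), 1.0)
        size *= card[rel] * factor
        joined.add(rel)
        total += size          # Cout sums every intermediate result size
    return total

card = {"R": 1000, "S": 10, "T": 100}
sel = {frozenset(("R", "S")): 0.01, frozenset(("R", "T")): 0.5}
best = min(permutations(card), key=lambda o: cout(o, card, sel))
print(best, cout(best, card, sel))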
Stephane Grumbach, Christophe Tollu-
1995
We investigate the expressive power of various extensions of
first-order, inductive, and infinitary logic with counting
quantifiers. We consider in particular a LOGSPACE extension of
first-order logic, and a PTIME extension of fixpoint logic with
counters. Counting is a fundamental tool of algorithms. It is
essential in the case of unordered structures. Our aim is to
understand the expressive power gained with a limited counting
ability. We consider two problems: (i) unnested counters, and (ii)
counters with no free variables. We prove a hierarchy result based on
the arity of the counters under the first restriction. The proof is
based on a game technique that is introduced in the paper. We also
establish results on the asymptotic probabilities of sentences with
counters under the second restriction. In particular, we show that
first-order logic with equality of the cardinalities of relations has
a 0/1 law.
Serge Abiteboul, Catriel Beeri-
1995
Various models and languages for describing and manipulating
hierarchically structured data have been proposed. Algebraic,
calculus-based and logic-programming oriented languages have all been
considered. This paper presents a general model for complex values
(i.e., values with hierarchical structures), and languages for it
based on the three paradigms. The algebraic language generalizes those
presented in the literature; it is shown to be related to the
functional style of programming advocated by Backus. The notion of
domain independence familiar from relational databases is defined, and
syntactic restrictions (referred to as safety conditions) on calculus
queries are formulated that guarantee domain independence. The main
results are: The domain-independent calculus, the safe calculus, the
algebra, and the logic-programming oriented language have equivalent
expressive power. In particular, recursive queries, such as the
transitive closure, can be expressed in each of the languages. For
this result, the algebra needs the powerset operation. A more
restricted version of safety is presented, such that the restricted
safe calculus is equivalent to the algebra without the powerset. The
results are extended to the case where arbitrary functions and
predicates are used in the languages.
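The role of the powerset operation can be made concrete: transitive
closure is the least superset of a relation closed under composition,
and a powerset-style evaluation finds it by scanning all subsets. The
Python sketch below is our naive rendering, not the algebra itself; its
exponential search also hints at the cost of powerset.

    from itertools import chain, combinations

    R = {(1, 2), (2, 3)}  # a small binary relation

    def powerset(s):
        s = list(s)
        return (set(c) for c in chain.from_iterable(
            combinations(s, k) for k in range(len(s) + 1)))

    def transitive_closure(r):
        """Least superset of r closed under composition, found by scanning
        the powerset of the full binary relation on the active domain."""
        dom = {x for pair in r for x in pair}
        closed = (s for s in powerset({(a, b) for a in dom for b in dom})
                  if r <= s and all((a, d) in s
                                    for (a, b) in s
                                    for (c, d) in s if b == c))
        return min(closed, key=len)  # the unique minimal closed superset

    print(sorted(transitive_closure(R)))  # [(1, 2), (1, 3), (2, 3)]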
Victor Vianu-
1995
The paper presents a survey of the main formal rule-based languages
and semantics. Both procedural (fixpoint) and declarative
(model-theoretic) semantics are defined and discussed, including
inflationary and noninflationary fixpoint semantics, and the
semipositive, stratified and well-founded semantics. The relative
expressive power and complexity of the various languages are
provided. Nondeterministic rule-based languages are also discussed,
and it is shown how nondeterminism can circumvent some difficulties
concerning the expressive power of the deterministic
languages. Finally, languages with value invention (in the spirit of
object-creation in oodbs) are presented and issues of expressive power
specific to such languages are discussed.
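As a concrete instance of the fixpoint semantics discussed, here is a
minimal inflationary evaluation loop in Python (the rule and facts are
our toy example): facts are only ever added, so the iteration must
terminate.

    def inflationary_fixpoint(edb, rule):
        """Add derived facts, never retract, until nothing new appears."""
        idb = set(edb)
        while True:
            new = rule(idb) - idb
            if not new:
                return idb
            idb |= new

    # Rule: anc(x, z) <- anc(x, y), anc(y, z), starting from parent facts.
    parent = {("ann", "bob"), ("bob", "eve")}
    step = lambda facts: {(x, z) for (x, y) in facts
                          for (y2, z) in facts if y == y2}
    print(sorted(inflationary_fixpoint(parent, step)))
    # [('ann', 'bob'), ('ann', 'eve'), ('bob', 'eve')]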
Bernd Amann, Michel Scholl, Antoine Rizk-
1995
Modern hypertext applications require new system support for hypertext
authoring and user navigation through large sets of documents
connected by links. This system support must be based on advanced,
typed data models for describing the information structure in different
application domains. Schema based structuring through strongly typed
documents and links has already been proposed and put to practical use
in a multitude of hypertext applications. Systems such as Multicard/O2
and MORE have moreover exploited conceptual schemas for querying the
resulting hyperdocuments in a more structured way. In this paper, we
show how hypertext schemas and query languages can be utilized for
designing hypertext authoring and browsing environments for large
hypertexts. We illustrate our mechanisms using the Gram data model and
describe their implementation on top of the Multicard hypermedia
system connected to the O2 object-oriented database management system.
Victor Vianu, Philippe Picouet-
1995
A formal framework is introduced for studying the semantics and
expressiveness of active databases. The power of various abstract
trigger languages is characterized, and related to several major
active database prototypes such as ARDL, HiPAC, Postgres, Starburst,
and Sybase. This allows a formal comparison of the expressiveness of the
prototypes. The results provide insight into the programming paradigm
of active databases, the interplay of various features, and their
impact on expressiveness and complexity.
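For readers unfamiliar with trigger languages, the skeleton of an
event-condition-action (ECA) execution loop looks roughly as follows.
This Python sketch is our illustration of the paradigm, not a model of
any of the prototypes above.

    # Each trigger fires on an event, tests a condition on the database,
    # and runs an action that may itself raise new events.
    db = {"stock": 5}
    queue = [("sale", 2)]

    def on_sale(db, qty):
        db["stock"] -= qty
        if db["stock"] < 4:              # condition
            return [("reorder", 10)]     # action raises a new event
        return []

    def on_reorder(db, qty):
        db["stock"] += qty
        return []

    TRIGGERS = {"sale": on_sale, "reorder": on_reorder}

    while queue:                         # process events to quiescence
        event, arg = queue.pop(0)
        queue.extend(TRIGGERS[event](db, arg))

    print(db)  # {'stock': 13}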
Philippe Breche, Fabrizio Ferrandina, Martin Kuklok-
1995
In this paper we concentrate on the description of change management
functionalities that have been added to the commercially available O2
ODBMS to yield a platform suited for applications such as SDEs. A
description of declarative high level schema update primitives is
presented as well as mechanisms to propagate schema changes to the
database. Particular emphasis is given to a mechanism that allows
designers first to simulate schema and database modifications by means
of object views, and second to materialize them to be propagated onto
the real schema and base.
Serge Abiteboul, Gerd Hillebrand-
1995
We consider evaluation strategies for database queries expressed in
three functional query languages: the complex value algebra, the
simply typed lambda calculus, and method schemas. Each of these query
languages derives its expressive power from a different primitive: the
complex value algebra from the powerset operator, the simply typed
lambda calculus from list iteration, and method schemas from
recursion. We show that "natural" evaluation strategies for these
primitives may lead to very inefficient space usage, but that with
some simple optimizations many queries can be evaluated with little or
no space overhead. In particular, we show: (1) In the complex value
algebra, all expressions with set nesting depth at most 2 can be
evaluated in pspace, and this set of expressions is sufficient to
express all queries in the polynomial hierarchy; (2) In the simply
typed lambda calculus with equality and constants, all query terms of
order at most 5 (where "query term" is a syntactic condition on types)
can be evaluated in pspace, and this set of terms expresses exactly
the pspace queries; (3) There exists a set of second-order method
schemas (with no simple syntactic characterization) that can be
evaluated in pspace, and this set of schemas is sufficient to express
all pspace queries.
Serge Abiteboul, Laurent Herr, Jan Van den Bussche-
1995
Some temporal query languages work directly on a timestamp
representation of the temporal database, while others provide a more
implicit access to the flow of time by means of temporal
connectives. We study the differences in expressive power between
these two approaches. We first consider first-order logic (i.e., the
relational calculus). We show that first-order future temporal logic
is strictly less powerful than first-order past-future temporal logic,
and that the latter is strictly less powerful than the relational
calculus with explicit timestamps. We also consider extensions of the
relational calculus with iteration constructs such as least fixpoints
or while-loops. We again compare augmentations of these languages with
temporal left and right moves on the one hand, and with explicit
timestamps on the other hand. For example, we show that a version of
fixpoint logic with left and right moves lies between the explicit
timestamp versions of first-order and fixpoint logic, respectively.
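The two styles can be contrasted on a toy example. In the sketch below
(data and names are ours), the same query is phrased once with explicit
timestamps as ordinary variables and once through a "previously"
connective that hides the timeline.

    # Query: employees present now who were also present at the
    # previous instant.
    timestamped = {("ann", 0), ("ann", 1), ("bob", 1)}
    NOW = 1

    # Explicit-timestamp style: timestamps are query variables.
    explicit = {e for (e, t) in timestamped
                if t == NOW and (e, NOW - 1) in timestamped}

    # Temporal-connective style: states[t] is the database at instant t.
    states = [{"ann"}, {"ann", "bob"}]

    def previously(prop, t):
        return t > 0 and prop(t - 1)

    implicit = {e for e in states[NOW]
                if previously(lambda t, e=e: e in states[t], NOW)}

    print(explicit == implicit == {"ann"})  # True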
Stephane Grumbach, Fariza Tahi-
1994
Universal data compression algorithms fail to compress genetic
sequences. It is due to the specificity of this particular kind of
"text". We analyze in some details the properties of the sequences,
which cause the failure of classical algorithms. We then present a
lossless algorithm, biocompress-2, to compress the information
contained in DNA and RNA sequences, based on the detection of
regularities, such as the presence of palindromes. The algorithm
combines substitutional and statistical methods and, to the best of
our knowledge, leads to the highest compression of DNA. The results,
although not satisfactory, give insight into the necessary correlation
between compression and comprehension of genetic sequences.
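The kind of regularity exploited here is easy to state: a biological
"palindrome" is a factor equal to the reverse complement of an earlier
factor, so it can be replaced by a back-reference. The Python detector
below is our sketch of the idea, not the published biocompress-2
algorithm.

    COMPLEMENT = str.maketrans("ACGT", "TGCA")

    def reverse_complement(s):
        return s.translate(COMPLEMENT)[::-1]

    def find_palindromic_repeat(seq, min_len=4):
        """Longest factor whose reverse complement occurs earlier, as
        (earlier_position, position, length), or None."""
        best = None
        for i in range(len(seq)):
            for j in range(i + min_len, len(seq) + 1):
                at = seq[:i].find(reverse_complement(seq[i:j]))
                if at >= 0 and (best is None or j - i > best[2]):
                    best = (at, i, j - i)
        return best

    dna = "ACGTTTAAACGT"
    print(find_palindromic_repeat(dna))
    # (0, 6, 6): 'AAACGT' at 6 is the reverse complement of 'ACGTTT' at 0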
Sophie Cluet, Guido Moerkotte-
1994
Many declarative query languages for object-oriented (oo) databases
allow nested subqueries. This paper contains a complete classification
of oo nested queries and appropriate unnesting optimization strategies
based on algebraic rewriting. We adapt some known relational
techniques and introduce new ones that use and are concerned with
features specific to object-oriented queries. In particular, we
introduce two new and powerful grouping operators which will form the
basis for our unnesting techniques.
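The payoff of unnesting can be seen on a toy query. Below (data and
names are ours) the nested formulation rescans the employees once per
department, while a single grouping pass produces the same result, in
the spirit of the grouping operators introduced in the paper.

    from collections import defaultdict

    depts = ["toys", "books"]
    emps = [("ann", "toys", 60), ("bob", "toys", 40), ("eve", "books", 70)]

    # Nested-loop evaluation: one scan of emps per department.
    nested = {d: [e for (e, dep, sal) in emps if dep == d and sal > 50]
              for d in depts}

    # Unnested evaluation: one grouping pass over emps.
    groups = defaultdict(list)
    for e, dep, sal in emps:
        if sal > 50:
            groups[dep].append(e)
    unnested = {d: groups[d] for d in depts}

    print(nested == unnested)  # True, but the second plan scans emps once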
Serge Abiteboul, Victor Vianu-
1994
We study two important extensions of first-order logic (FO) with
iteration, the fixpoint and while queries. The main result of the
paper concerns the open problem of the relationship between fixpoint
and while: they are the same iff ptime = pspace. These and other
expressibility results are obtained using a powerful normal form for
while which shows that each while computation over an unordered domain
can be reduced to a while computation over an ordered domain via a
fixpoint query. The fixpoint query computes an equivalence relation on
tuples which is a congruence with respect to the rest of the
computation. The same technique is used to show that equivalence of
tuples and structures with respect to FO formulas with a bounded number
of variables is definable in fixpoint.
Generalizing fixpoint and while, we consider more powerful languages
which model arbitrary computation interacting with a database using a
finite set of FO queries. Such computation is modeled by a relational
machine called Loosely Coupled Generic Machine, GMloose. GMloose
consists of a Turing Machine augmented with a finite set of
fixed-arity relations forming a relational store. The connection with
while is emphasized by a result showing that GMloose is equivalent to
while augmented with integer variables and arithmetic. The normal form
for while extends to these languages. We study the expressive power of
GMloose and its ptime and pspace restrictions. We argue that
complexity measures based on the size of the input may not be best
suited for database computation. Instead, we suggest an alternative
measure based on the discerning power of the machine, i.e. its ability
to distinguish between tuples given its
Cassio Souza Dos Santos-
1994
We discuss the design and implementation of the O2Views
object-oriented database view mechanism, which allows the redefinition
of both the structure and the behavior of objects stored in a
database. The data model extended with views is first given and then
the functionalities of the prototype implementing it are
presented. The paper focuses on the requirements for the
implementation of an object-oriented view mechanism, ranging from the
conception of a view definition language to optimization strategies
for querying and updating through a view such as view materialization
and consistency maintenance.
Stephane Grumbach, Jianwen Su-
1994
We study classes of infinite but finitely representable databases
based on constraints, motivated by new database applications such as
geographical databases. The mathematical framework is based on
classical decidable first-order theories. We investigate the theory of
finitely representable models and prove that it differs strongly from
both classical model theory and finite model theory. In particular, we
show that most of the well-known theorems of either one fail
(compactness, completeness, locality, 0/1 laws, etc.). An immediate
consequence is the lack of tools to consider the definability of
queries in the relational calculus over finitely representable
databases. We illustrate this very challenging problem through some
classical examples.
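The core representation trick is simple to state: an infinite relation
is stored as a formula, and query evaluation composes formulas rather
than enumerating tuples. A minimal Python sketch (our illustration,
with a relation over the reals given as text plus a predicate):

    R = ("x < y", lambda x, y: x < y)
    S = ("y < 10", lambda x, y: y < 10)

    def conjoin(r, s):
        """Intersection of two constraint relations: conjoin formulas."""
        return (f"({r[0]}) and ({s[0]})",
                lambda x, y: r[1](x, y) and s[1](x, y))

    Q = conjoin(R, S)
    print(Q[0])        # (x < y) and (y < 10): a finite representation
    print(Q[1](3, 7))  # True: (3, 7) lies in the infinite answer set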
Serge Abiteboul, Michel Scholl, Sophie Cluet, Vassilis Christophides-
1994
Structured documents (e.g., SGML) can benefit a lot from database
support and more specifically from object-oriented database (OODB)
management systems. This paper describes a natural mapping from SGML
documents into OODB's and a formal extension of two OODB query
languages (one SQL-like and the other calculus) in order to deal with
SGML document retrieval.
Although motivated by structured documents, the extensions of query
languages that we present are general and useful for a variety of
other OODB applications. A key element is the introduction of paths as
first-class citizens. The new features allow querying data (and to
some extent schema) without exact knowledge of the schema in a simple
and homogeneous fashion.
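The idea of paths as first-class citizens can be illustrated directly:
a query can ask for every path leading to a given element type without
knowing the schema depth. The document tree and tag names below are our
toy example, not the paper's SGML mapping.

    doc = {"article": {"front": {"title": "Complex values"},
                       "section": {"title": "Algebra", "para": "..."}}}

    def paths_to(tree, tag, prefix=()):
        """Yield (path, value) for every node whose last step is tag."""
        for key, sub in tree.items():
            path = prefix + (key,)
            if key == tag:
                yield path, sub
            if isinstance(sub, dict):
                yield from paths_to(sub, tag, path)

    for path, value in paths_to(doc, "title"):
        print("/".join(path), "->", value)
    # article/front/title -> Complex values
    # article/section/title -> Algebra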
Vassilis Christophides, Antoine Rizk-
1994
Hierarchical logical structure and hypertext links are complementary
and can be combined to build more powerful document management systems
[28, 25, 24, 13]. Previous work exploits this complementarity for
building better document processors, browsers and editing tools, but
not for building sophisticated querying mechanisms. Querying in
hypertext has been a requirement since [19] and has already been
elaborated in many hypertext systems [11, 7, 4, 21], but has not yet
been used for hypertext systems superimposed on an underlying
hierarchical logical structure. In this paper we use the model and the
SQL-like query language of [10] in order to manage structured
documents with hypertext links. The model represents a structured
document with typed links as a complex object, and uses paths through
the document structure, as first class citizens in formulating
queries. Several examples of queries illustrate, from a practical
point of view, the expressive power of the language to retrieve
documents, even without exact knowledge of their structure in a simple
and homogeneous fashion. It must be stressed that the proposed model
and language implement the equivalent of the HyTime [1] Location Address
Module. In fact, the language is more powerful than the corresponding
HyQ query facilities. The implementation and the description
throughout the paper use the SGML standard [2] to represent the
document structure and the object-oriented DBMS O2 [12] to implement
the query language and the storage module.
Bernd Amann, Michel Scholl, Antoine Rizk-
1994
Due to the growing complexity of modern hypertext applications,
current hypertext systems require new mechanisms to support authoring
and user navigation through large sets of documents connected by
links. A general solution is to extend hypertext systems to cater for
semantics of application domains. This requires new hypertext models
providing strongly typed documents and links. Such models have been
proposed and put to use in systems such as HDM and MacWeb to
facilitate authoring of large hypertexts. In addition, Gram and MORE
use typing and graph-based hypertext schemas for querying
hyperdocuments. In this paper, we will show how query languages could
be further exploited for designing sophisticated general query-based
navigation mechanisms. We illustrate our examples using the Gram model
and describe an implementation with the hypermedia system Multicard
connected to the object-oriented database management system O2.
Serge Abiteboul, Victor Vianu, Christos H. Papadimitriou-
1994
A model of database programming with reflection, called reflective
relational machine is introduced and studied. The reflection consists
here of dynamic generation of queries in a host programming
language. The main results characterize the power of the machine in
terms of known complexity classes. In particular, the PTIME
restriction of the machine is shown to express PSPACE, and to
correspond precisely to uniform circuits of polynomial depth and
exponential size. This provides an alternative, logic-based formulation
of the uniform circuit model, more convenient for problems naturally
formulated in logic terms. Since time in the polynomially-bounded
machine coincides with time in the uniform circuit model, this also
shows that reflection allows for more "intense" parallelism, not
attainable otherwise. Other results concern the power of the
reflective relational machine subject to restrictions on the number of
variables used.
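Reflection in this sense means that queries are data: the host program
builds the text of a query whose shape depends on the input, then
evaluates it. The Python sketch below (the store and the generated
syntax are our invention) generates a conjunctive path query of
runtime-chosen width.

    edge = {(1, 2), (2, 3), (3, 4)}
    nodes = {x for e in edge for x in e}

    def path_query(k):
        """Build, as text, a query asking: is there a walk of length k?"""
        gens = " ".join(f"for x{i} in nodes" for i in range(k + 1))
        conds = " and ".join(f"(x{i}, x{i+1}) in edge" for i in range(k))
        return f"any(True {gens} if {conds})"

    for k in (2, 3, 4):
        q = path_query(k)   # the query is an ordinary string up to here
        print(k, eval(q))   # 2 True, 3 True, 4 False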
Serge Abiteboul, Claude Delobel, Cassio Souza Dos Santos-
1994
We propose the notions of virtual schemas and virtual bases as a
coherent way of integrating various features in OODB views. A virtual
schema is defined based on some existing (real) schema. A virtual base
is obtained when a (real) base is attached to a virtual schema. We
study the consequences of this simple assumption. In particular, we
observe the differences between a real schema and a virtual one. We
also consider an extension (that we call generic schemas) where it is
necessary to specify several real bases to attach data to a virtual
schema. We show how the flexibility provided by virtual schemas can be
used to cope with various dynamic features of database systems.
Stephane Grumbach, Christophe Tollu, Guy Fayolle-
1993
We study the impact of adding certain families of generalized
quantifiers to first-order logic (FO) on the asymptotic behavior of
sentences. All our results are stated and proved for languages
disallowing free variables in the scope of generalized quantifiers. For
a class K of finite structures closed under isomorphism, the quantifier
QK is said to be strongly monotonic, sm, if membership in the class is
preserved under a loose form of extensions. Our first theorem (0/1 law
for FO with any set of sm quantifiers) subsumes a previous criterion
for proving that almost no graphs satisfy a given property [BH79]. We
also establish a 0/1 law for FO with Härtig quantifiers
(equicardinality quantifiers) and a limit law for a fragment of FO with
Rescher quantifiers (expressing inequalities of cardinalities). The
proofs of these last two results combine standard combinatorial
enumerations with more sophisticated techniques from complex
analysis. We also prove that the 0/1 law fails for the extension of FO
with Härtig quantifiers if the above syntactic restriction is
relaxed. We therefore get the best upper bound for the existence of a
0/1 law for FO with Härtig quantifiers.
Stephane Grumbach, Fariza Tahi-
1993
We propose a lossless algorithm to compress the information contained
in DNA sequences. None of the available universal algorithms compress
such data. This is due to the specificity of genetic information. Our
method is based on regularities, such as the presence of palindromes,
in the DNA. The results we obtain, although not satisfactory, are far
beyond classical algorithms.
Serge Abiteboul, Victor Vianu-
1993
One of the exciting developments in complexity theory is the discovery
of a very intimate connection between computational complexity and
logic. This intimate connection was first discovered by Fagin, who
showed that the complexity class NP coincides with the class of
properties expressible in existential 2nd-order logic [Fag74]
(cf. [JS74]). Another demonstration of this connection was shown by
Immerman and Vardi, who discovered tight relationships between the
complexity class p and inflationary fixpoint logic [Imm86, Var82] and
between the class pspace and noninflationary fixpoint logic. This
initiated, during the 1980s, a new branch of complexity theory, which
focuses on the descriptive complexity of problems, i.e. the complexity
of describing problems in some logical formalism [Imm87b].
Bernd Amann, Michel Scholl-
1993
In [AS92] a database model was defined which uses regular expressions
over data types to formulate queries. We discuss its expressive power
and utilize it as an underlying data model for designing general
database navigation mechanisms. The application we have in mind is
hypertext navigation and querying. However we believe that these
mechanisms are also applicable on top of other data models such as the
relational or object-oriented database models.
Bernd Amann, Michel Scholl, Vassilis Christophides-
1993
We describe an integration of the hypermedia system HyperPATH with the
object-oriented DBMS (OODBMS) O2. Providing persistence to a hypertext
system was the first motivation of this work. More generally, we were
interested in a better understanding of the connection of hypertext
systems with OODBMS. One of our goals was to define an abstract
interface between HyperPATH and a variety of OODBMS's. The solution
adopted shows that opening HyperPATH to different OODBMS's is not
possible without major rewriting of existing C++ code.
Sophie Cluet, Guido Moerkotte-
1993
Many declarative query languages for object-oriented databases allow
nested subqueries. This paper contains the first (to our knowledge)
proposal to optimize them. A two-phase approach is used to optimize
nested queries in the object-oriented context. The first phase--called
dependency-based optimization--transforms queries at the query
language level in order to treat common subexpressions and independent
subqueries more efficiently. The transformed queries are translated to
nested algebraic expressions. These entail nested loop evaluation
which may be very inefficient. Hence, the second phase unnests nested
algebraic expressions to allow for more efficient evaluation.
Serge Abiteboul, Sophie Cluet, Tova Milo-
1993
We show how structured data stored in files can benefit from standard
database technology and in particular be queried and updated using
database languages. We introduce the notion of structuring schema
which consists of a grammar annotated with database programs and of a
database schema. We study the translation from structured strings to
databases, and the converse. We adapt optimization techniques from
relational databases to our context.
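A toy structuring schema, in our own notation rather than the paper's:
a grammar for one line of a structured file, annotated with an action
that emits database tuples, gives the file-to-database translation.

    import re

    # Grammar for one record of the (invented) file format.
    LINE = re.compile(r"(?P<name>\w+);(?P<year>\d{4});(?P<title>.+)")

    def parse(text):
        """Translate the structured string into a relation."""
        papers = set()
        for line in text.splitlines():
            m = LINE.fullmatch(line)
            if m:  # the annotation: each successful parse emits a tuple
                papers.add((m["name"], int(m["year"]), m["title"]))
        return papers

    text = "Abiteboul;1995;Complex values\nVianu;1995;Rule-based languages"
    print(sorted(parse(text)))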
Stephane Grumbach, Tova Milo-
1993
Bags, i.e. sets with duplicates, are often used to implement relations
in database systems. In this paper, we study the expressive power of
algebras for manipulating bags. The algebra we present is a simple
extension of the nested relation algebra. Our aim is to investigate
how the use of bags in the language extends its expressive power, and
increases its complexity. We consider two main issues, namely (i) the
impact of the depth of bag nesting on the expressive power, and (ii)
the complexity and the expressive power induced by the algebraic
operations. We show that the bag algebra is more expressive than the
nested relation algebra (at all levels of nesting), and that the
difference may be subtle. We establish a hierarchy based on the
structure of algebra expressions. This hierarchy is shown to be highly
related to the properties of the powerset operator.
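The difference between bags and sets is visible already in Python's
Counter type, which implements the multiset operations (our
illustration of the setting, not the algebra of the paper):

    from collections import Counter

    r = Counter({"a": 2, "b": 1})   # bag {a, a, b}
    s = Counter({"a": 1, "c": 1})   # bag {a, c}

    print(r + s)   # additive union keeps multiplicities: a:3, b:1, c:1
    print(r - s)   # monus (proper subtraction): a:1, b:1
    print(r & s)   # intersection by minimum multiplicity: a:1
    print(set(r) | set(s))  # duplicate elimination forgets them all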
Sophie Cluet, Claude Delobel-
1992
The goal of this work is to integrate in a general framework the
different query optimization techniques that have been proposed in the
object-oriented context. As a first step, we focus essentially on the
logical aspect of query optimization. In this paper, we propose a
formalism (i) that unifies different rewriting formalisms, (ii) that
allows easy and exhaustive factorization of duplicated subqueries, and
(iii) that supports heuristics in order to reduce the optimization
rewriting phase.
Bernd Amann, Michel Scholl-
1992
We present a model for data organized as graphs with typed nodes and
edges. Regular expressions over the types of the node and edge labels
are used to select walks in the hypertext graph. An outline of the
application of this model and its query language to the implementation
of hypertext documents is given by using an extended example of a
travel agency application.
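A Python sketch of this style of query (the graph, types, and regular
expression are our toy travel-agency data): walks are enumerated and
their strings of edge types matched against a regular expression.

    import re

    EDGES = [  # (source, edge_type, target)
        ("Paris", "flight", "Athens"),
        ("Athens", "ferry", "Santorini"),
        ("Athens", "flight", "Rome"),
    ]

    def walks(node, max_len):
        """Enumerate (type_string, end_node, path) for walks from node."""
        stack = [("", node, [node])]
        while stack:
            types, cur, path = stack.pop()
            yield types, cur, path
            if len(path) <= max_len:
                for s, t, d in EDGES:
                    if s == cur:
                        stack.append((types + t + ".", d, path + [d]))

    # Query: destinations reachable by one flight then any ferries.
    pattern = re.compile(r"flight\.(ferry\.)*")
    for types, end, path in walks("Paris", 3):
        if pattern.fullmatch(types):
            print(" -> ".join(path))
    # Paris -> Athens
    # Paris -> Athens -> Santorini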
Serge Abiteboul, Victor Vianu, Moshe Y. Vardi-
1992
Most recursive extensions of relational calculus converge around two
central classes of queries: fixpoint and while. Infinitary logic (with
finitely many variables) is a very powerful extension of these
languages which provides an elegant unifying formalism for a wide
variety of query languages. However, neither the syntax nor the
semantics of infinitary logic are effective, and its connection to
practical query languages has been largely unexplored. We relate
infinitary logic to another powerful extension of fixpoint and while,
called relational machine, which highlights the computational style of
these languages. Relational machines capture the kind of computation
occurring when a query language is embedded in a host programming
language, as in C+SQL. The main result of this paper is that
relational machines correspond to the natural effective fragment of
infinitary logic. Other well-known query languages are related to
infinitary logic using syntactic restrictions formulated in
language-theoretic terms. For example, it is shown that while
corresponds to infinitary logic formulas which can be described by a
regular language. As a side effect to these results, we obtain
interesting normal forms for infinitary logic formulas.
Serge Abiteboul-
1992
These notes were written for a tutorial in JICSLP (Washington, 1992)
and are not intended to be a comprehensive survey of the field.
We are concerned with dooD, i.e. deductive object-oriented databases
(systems). We briefly review ooD systems (object-oriented database
management systems) and deductive database systems (the deductive
ones). A main purpose is to try to clarify these concepts. As we shall
see, although there is already a conference entitled dooD and a large
body of literature on the topic, the meaning of dooD is not yet
clear. The issues are not precise and we are still looking forward to
the first real implementation of a dooD system. We will try to present
the motivations, the main achievements so far, and the open questions.
Serge Abiteboul, Victor Vianu, Moshe Y. Vardi-
1992
We establish a general connection between fixpoint logic and
complexity. On one side, we have fixpoint logic, parameterized by the
choices of 1st-order operators (inflationary or noninflationary) and
iteration constructs (deterministic, nondeterministic, or
alternating). On the other side, we have the complexity classes
between P and EXPTIME. Our parameterized fixpoint logics capture the
complexity classes P, NP, PSPACE, and EXPTIME, but equality is
achieved only over ordered structures.
There is, however, an inherent mismatch between complexity and logic:
while computational devices work on encodings of problems, logic is
applied directly to the underlying mathematical structures. To
overcome this mismatch, we develop a theory of relational complexity,
which bridges the gap between standard complexity and fixpoint
logic. On one hand, we show that questions about containments among
standard complexity classes can be translated to questions about
containments among relational complexity classes. On the other hand,
the expressive power of fixpoint logic can be precisely characterized
in terms of relational complexity classes. This tight three-way
relationship among fixpoint logics, relational complexity and standard
complexity yields, in a uniform way, logical analogs to all containments
among the complexity classes P, NP, PSPACE, and EXPTIME. The logical
formulation shows that some of the most tantalizing questions in
complexity theory boil down to a single question: the relative power
of inflationary vs. noninflationary 1st-order operators.
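The inflationary/noninflationary distinction that the last question
hinges on is easy to demonstrate: the inflationary iteration
accumulates and always converges, while the noninflationary one
replaces and may cycle. A small Python sketch, with a toy non-monotone
operator of our choosing:

    def iterate(f, inflationary, max_steps=20):
        x = frozenset()
        for _ in range(max_steps):
            nxt = x | f(x) if inflationary else frozenset(f(x))
            if nxt == x:
                return x          # fixpoint reached
            x = nxt
        return None               # no fixpoint found (the iteration cycles)

    f = lambda x: {0} - x         # flips membership of 0: not monotone

    print(iterate(f, inflationary=True))   # frozenset({0}): converges
    print(iterate(f, inflationary=False))  # None: oscillates {} <-> {0}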
Bernd Amann, Michel Scholl-
1992
We present a model for data organized as graphs. Regular expressions
over the types of the node and edge labels are used to qualify
connected subgraphs. An algebraic language based on these regular
expressions and supporting a restricted form of recursion is
introduced. A natural application of this model and its query language
is hypertext querying.
Serge Abiteboul, Allen Van Gelder-
1992
A method to perform nonmonotonic relational rule computations is
presented, called the split technique. The goal is to avoid redundant
computations with rules that can insert and delete sets of tuples
specified by the rule body. The method is independent of the control
strategy that governs rule firing. Updatable relations are
partitioned, as the computation progresses, into blocks of tuples such
that tuples within a block are indiscernible from each other based on
the computation so far. Results of previous rule firings are
remembered as "relational equations" so that a new rule firing does
not recompute parts of the result that can be determined from the
existing equations. Seminaive evaluation falls out as a special case
when all rules specify inserts. The method is amenable to
parallelization.
Serge Abiteboul, Victor Vianu, Kevin Compton-
1992
The optimization of a large class of queries is explored, using a
powerful normal form recently proven. The queries include the fixpoint
and while queries, and an extension of while with arithmetic. The
optimization method is evaluated using a probabilistic analysis. In
particular, the average complexity of fixpoint and while is considered
and some surprising results are obtained. They suggest that the
worst-case complexity is sometimes overly pessimistic for such
queries, whose average complexity is often much more reasonable than
the provably rare worst case. Some computational properties of queries
are also investigated. A probabilistic notion of boundedness is
defined, and it is shown that all programs in the class considered are
bounded almost everywhere. An effective way of using this fact is
provided.
Serge Abiteboul, Victor Vianu-
1991
Recent research on query languages and their expressive power is
discussed. Several query languages are described, emphasizing
recursive extensions of the first-order queries based on three
paradigms: logic, algebraic, and logic programming. Many of these
languages converge around two central classes of queries, the fixpoint
queries and the while queries. The relative expressive power of these
languages is examined, as well as their connection to complexity
classes of queries. The focus is on the inability of languages to
express low complexity classes of queries, ptime and below. We
consider several ways to circumvent this difficulty. The first is to
introduce an ordering of the domain which leads to a trade-off between
complexity and the data independence principle. The cost of computing
without order is formalized by defining non-standard complexity
classes based on a model of database computation called Generic
Machine. As an alternative to order, it is shown that the difficulty
of expressing low complexity classes can be circumvented by
introducing non-deterministic constructs in query
languages. Expressiveness above np is also discussed.
Serge Abiteboul, Eric Simon-
1991
Fundamental properties of deterministic and nondeterministic
extensions of Datalog from [AV88] are studied. The extensions involve
the use of negative literals both in bodies and heads of
rules. Negative literals in heads are interpreted as deletions. A
deterministic semantics is obtained by firing in parallel all
applicable rules. The nondeterministic semantics results from firing
(nondeterministically) one rule at a time. In the nondeterministic
case, programs do not describe functions but relations between
database states. In both cases, the result is an increase in
expressive power over Datalog. The price for it is that programs do
not always terminate. We study when a program (i) is such that on a
given input, all its successful computations reach a unique fixpoint,
(ii) yields at least one output on every input and (iii) has only
loop-free computations. We also show how to simulate programs
containing loops by loop-free programs.
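A minimal Python rendering of the two semantics (rules and facts are
our toy example): each rule maps the database to a set of insertions
and a set of deletions; the deterministic semantics fires all rules in
parallel, the nondeterministic one fires a single rule at a time.

    def rule_a(db):  # on(x) -> delete on(x), insert off(x)
        return ({("off", x) for (p, x) in db if p == "on"},
                {("on", x) for (p, x) in db if p == "on"})

    def rule_b(db):  # off(x) -> insert done(x)
        return ({("done", x) for (p, x) in db if p == "off"}, set())

    def parallel_step(db, rules):
        ins = set().union(*(r(db)[0] for r in rules))
        dels = set().union(*(r(db)[1] for r in rules))
        return (db - dels) | ins

    db = frozenset({("on", 1)})
    print(parallel_step(db, [rule_a, rule_b]))
    # {('off', 1)}: in the parallel step, rule_b saw no 'off' fact yet

    # One nondeterministic run: fire rule_a, then rule_b.
    ins, dels = rule_a(db); db = (db - dels) | ins
    ins, dels = rule_b(db); db = (db - dels) | ins
    print(db)  # {('off', 1), ('done', 1)}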
Serge Abiteboul, Victor Vianu-
1991
There is a fundamental mismatch between the hardness of database
queries and their Turing complexity. For example, the even query on a
set has low Turing complexity but is by all accounts a hard query. The
mismatch is due to the abstract, generic nature of database
computation: data items which are indistinguishable by logical
properties are treated uniformly. The issues specific to generic
computation are obscured in the Turing model. Two models of generic
computation are proposed. They are extensions of Turing Machines with
a relational store. The machines differ in the interaction between the
Turing component and the relational store. The first, simpler machine
allows limited communication with the relational store. However, it is
not complete. Nonetheless, it subsumes many query languages which have
emerged as central, including the fixpoint and while queries. We prove
a normal form for the machine, which provides a key technical
tool. The normal form specialized to the while queries allows
resolving the open problem of the relation of fixpoint and while: they
are equivalent iff ptime = pspace. The second machine allows extended
communication between the relational store and the Turing component,
and is complete. The machine involves parallelism. Based on it, we
define complexity measures for generic mappings. We focus on notions of
polynomial time and space complexity for generic mappings, and show
their robustness. The results point to a trade-off between complexity
and computing with an abstract interface.
Serge Abiteboul, Anthony Bonner-
1991
Object-oriented databases have been introduced primarily to ease the
development of database applications. However, the difficulties
encountered when, for instance, trying to restructure data or
integrate databases demonstrate that the models being used still lack
flexibility. We claim that the natural way to overcome these
shortcomings is to introduce a sophisticated view mechanism. This
paper presents such a mechanism, one which allows a programmer to
restructure the class hierarchy and modify the behavior and structure
of objects. The mechanism allows a programmer to specify attribute
values implicitly, rather than storing them. It also allows him to
introduce new classes into the class hierarchy. These virtual classes
are populated by selecting existing objects from other classes and by
creating new objects. Fixing the identity of new objects during
database updates introduces subtle issues into view design. Our
presentation, mostly informal, leans on a number of illustrative
examples meant to emphasize the simplicity of our mechanism.
Serge Abiteboul, Paris Kanellakis, Gosta Grahne-
1991
We represent a set of possible worlds using an incomplete information
database. The representation techniques that we study range from the
very simple Codd-table (a relation over constants and uniquely
occurring variables called nulls) to much more complex mechanisms
involving views of conditioned-tables (programs applied to Codd-tables
augmented by equality and inequality conditions). (1) We provide
matching upper and lower bounds on the data-complexity of testing
containment, membership, and uniqueness for sets of possible
worlds. We fully classify these problems with respect to our
representations. (2) We investigate the data-complexity of querying
incomplete information databases for both possible and certain
facts. For each fixed positive existential query on conditioned-tables
we present a polynomial time algorithm solving the possible fact
problem. We match this upper bound by two NP-completeness lower
bounds, when the fixed query contains either negation or recursion and
is applied to Codd-tables. Finally, we show that the certain fact
problem is coNP-complete, even for a fixed first order query applied
to a Codd-table.
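A Codd-table and the possible/certain distinction fit in a few lines of
Python (the table, domain, and nulls are our toy example): each
valuation of the nulls yields one possible world, and a fact is
possible if some world contains it, certain if all do.

    from itertools import product

    TABLE = [("ann", "x1"), ("bob", "paris")]   # x1 is a null
    NULLS = ["x1"]
    DOMAIN = ["paris", "rome"]

    def worlds():
        for values in product(DOMAIN, repeat=len(NULLS)):
            v = dict(zip(NULLS, values))
            yield {(a, v.get(b, b)) for (a, b) in TABLE}

    def possible(fact):
        return any(fact in w for w in worlds())

    def certain(fact):
        return all(fact in w for w in worlds())

    print(possible(("ann", "rome")))  # True: the valuation x1 -> rome
    print(certain(("ann", "rome")))   # False: x1 -> paris also allowed
    print(certain(("bob", "paris")))  # True: holds in every world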
Serge Abiteboul-
1990
A language for databases with sets, tuples, lists, object identity and
structural inheritance is proposed. The core language is logic-based
with a fixpoint semantics. Methods with overloading and methods
evaluated externally providing extensibility of the language are
considered. Other important issues such as updates and the
introduction of explicit control are discussed.
Serge Abiteboul, Catriel Beeri-
1988
Various models and languages for describing and manipulating
hierarchically structured data have been proposed. Algebraic,
calculus-based and logic-programming oriented languages have all been
considered. This paper presents a general model for complex objects,
and languages for it based on the three paradigms. The algebraic
language generalizes those presented in the literature; it is shown to
be related to the functional style of programming advocated by
Backus. The notion of domain independence familiar from relational
databases is defined, and syntactic restrictions (referred to as
safety conditions) on calculus queries are formulated that guarantee
domain independence. The main results are: The domain-independent
calculus, the safe calculus, the algebra, and the logic-programming
oriented language have equivalent expressive power. In particular,
recursive queries, such as the transitive closure, can be expressed in
each of the languages. For this result, the algebra needs the powerset
operation. A more restricted version of safety is presented, such that
the restricted safe calculus is equivalent to the algebra without the
powerset. The results are extended to the case where arbitrary
functions and predicates are used in the languages.
Serge Abiteboul, Victor Vianu-
1991
The use of non-determinism in logic-based languages is motivated using
pragmatic and theoretical considerations. Non-deterministic database
queries and updates occur naturally, and there exist non-deterministic
implementations of various languages. It is shown that non-determinism
resolves some difficulties concerning the expressive power of
deterministic languages: there are non-deterministic languages
expressing low complexity classes of queries/updates, whereas no such
deterministic languages are known. Various mechanisms yielding
non-determinism are reviewed. The focus is on two closely related
families of non-deterministic languages. The first consists of
extensions of Datalog with negations in bodies and/or heads of rules,
with non-deterministic fixpoint semantics. The second consists of
non-deterministic extensions of first-order logic and fixpoint logics,
using the witness operator. The expressive power of the languages is
characterized. In particular, languages expressing exactly the
(deterministic and non-deterministic) queries/updates computable in
polynomial time are exhibited, whereas it is conjectured that no
analogous deterministic language exists. The connection between
non-deterministic languages and determinism is also explored. Several
problems of practical interest are examined, such as checking
(statically or dynamically) if a given program is deterministic,
detecting coincidence of deterministic and non-deterministic semantics,
and verifying termination for non-deterministic programs.
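Informally, the witness operator returns an arbitrary element
satisfying a formula. The Python sketch below (our illustration) uses
such an operator to compute an arbitrary linear order of an unordered
domain, the classic example of a nondeterministic query that is cheap
to compute but has no known deterministic, generic counterpart.

    import random

    def witness(phi, domain):
        """Return some element satisfying phi, chosen arbitrarily."""
        sat = [x for x in domain if phi(x)]
        return random.choice(sat) if sat else None

    def arbitrary_order(domain):
        remaining, order = set(domain), []
        while remaining:
            x = witness(lambda y: y in remaining, remaining)
            order.append(x)
            remaining.remove(x)
        return order

    print(arbitrary_order({"a", "b", "c"}))  # one of the six orders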