Alignement d'ontologies base sur des ressources complementaires : illustration sur TaxoMap

Chantal Reynaud, Brigitte Safar- IEEE Transactions on Knowledge and Data Engineering 2009

Pour identifier des mappings entre les concepts de deux ontologies, de nombreux travaux récents portent sur l'utilisation de connaissances complémentaires dites de "background" ou de support, représentés sous la forme d'une mème ontologie. Leur objectif commun est de compléter les techniques classiques d'appariement qui exploitent la structure ou la richesse du langage de représentation des ontologies, et qui ne s'appliquent plus quand les ontologies à apparier sont faiblement structurées ou se limitent à de simples taxonomies. Cet article comporte deux parties. La première présente une étude de différents travaux utilisant des connaissances de support, en commençant par leur schéma général commun, suivi par une analyse des travaux en fonction du type de connaissance de support utilisée. Une seconde partie est consacrée au système d'alignement TaxoMap. Nous présentons le système et son contexte d'utilisation. Nous décrivons ensuite l'utilisation de WordNet comme connaissance de support ainsi que les résultats d'expérimentation obtenus.

An Elementary Chromatic Reduction for Gain Graphs and Special Hyperplane Arrangements

Veronique Ventos, Raul Cordovil, David Forge, Thomas Zaslavsky- IEEE Transactions on Knowledge and Data Engineering 2009 September - 16(1 (R121)):1 - - 31

A gain graph is a graph whose edges are labelled invertibly by gains from a group. Switching is a transformation of gain graphs that generalizes conjugation in a group. A weak chromatic function of gain graphs with gains in a fixed group satisfies three laws: deletion-contraction for links with neutral gain, invariance under switching, and nullity on graphs with a neutral loop. The laws are analogous to those of the chromatic polynomial of an ordinary graph, though they are different from those usually assumed of gain graphs or matroids. The three laws lead to the weak chromatic group of gain graphs, which is the universal domain for weak chromatic functions. We find expressions, valid in that group, for a gain graph in terms of minors without neutral-gain edges, or with added complete neutral-gain subgraphs, that generalize the expression of an ordinary chromatic polynomial in terms of monomials or falling factorials. These expressions imply relations for all switching-invariant functions of gain graphs, such as chromatic polynomials, that satisfy the deletion-contraction identity for neutral links and are zero on graphs with neutral loops. Examples are the total chromatic polynomial of any gain graph, including its specialization the zero-free chromatic polynomial, and the integral and modular chromatic functions of an integral gain graph. We apply our relations to some special integral gain graphs including those that correspond to the Shi, Linial, and Catalan arrangements, thereby obtaining new evaluations of and new ways to calculate the zero-free chromatic polynomial and the integral and modular chromatic functions of these gain graphs, hence the characteristic polynomials and hypercubical lattice-point counting functions of the arrangements. The proof involves gain graphs between the Catalan and Shi graphs whose polynomials are expressed in terms of descending-path vertex partitions of the graph of (-1)-gain edges. We also calculate the total chromatic polynomial of any gain graph and especially of the Catalan, Shi, and Linial gain graphs.

Combining a Logical and a Numerical Method for Data Reconciliation

Nathalie Pernelle, Marie-Christine Rousset, Fatiha Sais- IEEE Transactions on Knowledge and Data Engineering 2009 - 12:66,94

Concept Learning from (Very) Ambiguous Examples

Veronique Ventos, Henry Soldano, Dominique Bouthinon- IEEE Transactions on Knowledge and Data Engineering 2009 July - 5632:465- -478

Construction automatique d adaptateurs guidee par une ontologie

Chantal Reynaud and Brigitte Safar- IEEE Transactions on Knowledge and Data Engineering 2009 - 28(2)

Le travail décrit dans ce papier porte sur l'intégration de sources d'informations hétérogènes XML au sein d'un serveur d'information selon une approche mixte combinant médiation et entrepôt de données. Ce serveur dispose d'un schéma, ou ontologie, utilisé pour l'accès tant aux sources externes qu'aux données locales. La méthode que nous proposons est une méthode unifiée qui s'appuie sur l'ontologie et permet de réaliser à la fois l'intégration de sources et de données. Notre contribution est double. Elle porte d'une part sur la génération automatisée de mises en correspondance, ou mappings, entre l'ontologie et une nouvelle source à intégrer, d'autre part sur la construction automatique d'adaptateurs (wrappers en anglais) allant de la description du contenu abstrait de cette nouvelle source jusqu'à l'extraction des données. Des expérimentations ont été réalisées sur des données réelles dans le domaine du tourisme.

Data Extraction, Transformation and Integration guided by an Ontology

Nathalie Pernelle, Marie-Christine Rousset, Chantal Reynaud, Brigitte Safar, Fatiha Sais- IEEE Transactions on Knowledge and Data Engineering 2009 August

This chapter deals with integration of XML heterogeneous sources into a data warehouse with data defined in terms of a global abstract schema or ontology. The authors present an approach supporting the acquisition of data from a set of external sources available for an application of interest including data extraction, data transformation and data integration or reconciliation. The integration middleware that the authors propose extracts data from external XML sources which are relevant according to an RDFS+ ontology, transforms returned XML data into RDF facts conformed to the ontology and reconciles RDF data in order to resolve possible redundancies.

DL-liteR in the Light of Propositional Logic for Decentralized Data Management

Marie-Christine Rousset, Francois Goasdou, Nada Abdallah- IEEE Transactions on Knowledge and Data Engineering 2009

This paper provides a decentralized data model and associated algorithms for peer data management systems (PDMS) based on the DL-liteR description logic. Our approach relies on reducing query reformulation and consistency checking for DL-liteR into reasoning in propositional logic. This enables a straightforward deployment of DL-liteR PDMSs on top of SomeWhere, a scalable propositional peer-to-peer inference system. We also show how to use the state-of-the-art Minicon algorithm for rewriting queries using views in DL-liteR in the centralized and decentralized cases.

Efficient Maintenance Techniques for Views over Active Documents

Serge Abiteboul, Bogdan Marinoiu, Pierre Bourhis- IEEE Transactions on Knowledge and Data Engineering 2009

Many Web applications are based on dynamic interactions between Web components exchanging flows of information. Such a situation arises for instance in mashup systems or when monitoring distributed autonomous systems. Our work is in this challenging context that has generated recently a lot of attention; see Web 2.0. We introduce the axlog formal model for capturing such interactions and show how this model can be supported efficiently. The central component is the axlog widget defined by one tree-pattern query or more, over an active document (in the Active XML style) that includes some input streams of updates. A widget generates a stream of updates for each query, the updates that are needed to maintain the view corresponding to the query. We exploit an array of known technologies: datalog optimization techniques such as Differential or MagicSet, constraint query languages, and efficient XML filtering (YFilter). The novel optimization technique we propose is based on fundamental new notions: a relevance (different than that of MagicSet), satisfiability and provenance for active documents. We briefly discuss an implementation of an axlog engine, an application that we used to test the approach, and results of experiments.

MESAM: A PROTEGE plug-in for the Specialization of Models

Chantal Reynaud, Yolaine Bourda, Fabrice Popineau, Nadjet Zemirline- IEEE Transactions on Knowledge and Data Engineering 2009

Nowadays, several efforts are focused on re-using generic platforms to create new systems, in order to make the design process easier and faster. Often, the designer has his own models and resources and would like to reuse the generic system over his resources. That means, he has to integrate his models and resources in the system, and then to directly reuse the generic system. But many problems occur. One of them is that the designer needs to translate his models into the specific format that understood by the system and to use the vocabulary specific to that system. Furthermore, he also needs to translate all the instantiations of his models (i.e. the resources and their metadata). We think that this task is tedious and time-consuming and we want to avoid it. Our objective is to allow the designer to reuse his models (his vocabulary) and his models' instantiations without any change of format or vocabulary. For example, a generic Adaptive Hypermedia System (AHS) is made of a generic adaptation model relying on generic user and domain models. The designer would like to integrate his models and instances in the generic models in order to reuse the generic adaptation engine. Specific systems can be obtained by specializing the generic models. However, this specialization process is not always easy to perform. It has to be supported to make the design process easier and faster. This paper focuses on assisting designers to specialize generic models using their own models. We aim to automate this process which has been so far entirely manual. Our objectives are twofold: to create a support for defining mappings between elements in generic models and elements in the designer's personal models and to help creating consistent and relevant models integrating the generic and specific ones and taking into account the mappings between them. The proposed approach relies on OWL1, a W3C standard and SWRL2, a W3C proposal.

Partitionnement d'ontologies pour le passage a l'echelle des techniques d'alignement

Chantal Reynaud, Brigitte Safar, Haifa Zargayouna, Faycal Hamdi- IEEE Transactions on Knowledge and Data Engineering 2009 Janvier:409-420

L'alignement d'ontologies est une tâche importante dans les systèmes d'intégration puisqu'elle autorise la prise en compte conjointe de ressources décrites par des ontologies différentes, en identifiant des appariements entre concepts. Avec l'apparition de très grandes ontologies dans des domaines comme la médecine ou l'agronomie, les techniques d'alignement, qui mettent souvent en oeuvre des calculs complexes, se trouvent face à un défi : passer à l'échelle. Pour relever ce défi, nous proposons dans cet article deux méthodes de partitionnement, conçues pour prendre en compte, le plus tôt possible, l'objectif d'alignement. Ces méthodes permettent de décomposer les deux ontologies à aligner en deux ensembles de blocs de taille limitée et tels que les éléments susceptibles d'être appariés se retoruvent concentrés dans un ensemble minimal de blocs qui seront effectivement comparés. Les résultats des tests effectués avec nos deux méthodes sur différents couples d'ontologies montrent leur efficacité.

Reasoning about XML update constraints

Serge Abiteboul, Tova Milo, Bogdan Cautis- IEEE Transactions on Knowledge and Data Engineering 2009

We introduce in this paper a class of constraints for describing how an XML document can evolve, namely XML update constraints. For these constraints, we study the implication problem, giving algorithms and complexity results for constraints of varying expressive power. Besides classical constraint implication, we also consider an instance-based approach in which we take into account data. More precisely, we study implication with respect to a current tree instance, resulting from a series of unknown updates. The main motivation of our work is reasoning about data integrity under update restrictions in contexts where owners may lose control over their data, such as in publishing or exchange.

Satisfiability and relevance for query over active documents

Serge Abiteboul, Bogdan Marinoiu, Pierre Bourhis- IEEE Transactions on Knowledge and Data Engineering 2009

The AXML Artifact Modely

Serge Abiteboul, Bogdan Marinoiu, Pierre Bourhis, Alban Galland- IEEE Transactions on Knowledge and Data Engineering 2009

Towards a data-centric workflow approach, we introduce an artifact model to capture data and workflow management activities in distributed settings. The model is built on Active XML, i.e., XML trees including Web service calls. We argue that the model captures the essential features of business artifacts as described informally in [1] or discussed in [2]. To illustrate, we briefly consider the monitoring of distributed systems and the verification of temporal properties for them.

A Pattern and Rule-based Approach for Reusing Adaptive Hypermedia Creator

Chantal Reynaud, Yolaine Bourda, Fabrice Popineau, Nadjet Zemirline- IEEE Transactions on Knowledge and Data Engineering 2008

The design of Adaptive Hypermedia is a difficult task which can be made easier if generic systems and AH creators' models are reused. We address the design problem in the setting of the GLAM platform only made up of generic components. In this paper, we assume the GLAM platform is used to create a specific adaptive hypermedia. We present a pattern and a rule-based approach helping a AH creator in reusing its user and domain models and instances in order to make them taken into account. This semi-automatic approach takes the creators' models as specialisations of GLAM generic models and requires the creator to express a minimum set of mappings between his models and the generic ones. The process results in a merged model consisting of the generic and the corresponding specific model, being fully compliant with the GLAM adaptation model. A plug-in and experimentations in the e-learning domain have been partially designed.

A probabilistic trust model for semantic peer to peer systems

Marie-Christine Rousset, Philippe Chatalic, Gia-Hien Nguyen- IEEE Transactions on Knowledge and Data Engineering 2008 march:59-65

Semantic peer to peer (P2P) systems are fully decentralized overlay networks of people or machines (called peers) sharing and searching varied resources (documents, videos, photos, data, services) based on their semantic annotations using ontologies. They provide a support for the emergence of open and decentralized electronic social networks, in which no central or external authority can control the reliability of the peers participating to the network. In this paper, we propose a probabilistic model to handle trust in a P2P setting. It supports a local computation and a simple form of propagation of the trust of peers into classes of other peers. We claim that it is well appropriate to the dynamics of P2P networks and to the freedom of each peer within the network to have different viewpoints towards the peers with which it interacts.

A probabilistic trust model for semantic peer to peer systems

Marie-Christine Rousset, Philippe Chatalic, Gia-Hien Nguyen- IEEE Transactions on Knowledge and Data Engineering 2008 july:881-882

Semantic peer to peer (P2P) systems are fully decentralized overlay networks of people or machines (called peers) sharing and searching varied resources (documents, videos, photos, data, services) based on their semantic annotations using ontologies. They provide a support for the emergence of open and decentralized electronic social networks, in which no central or external authority can control the reliability of the peers participating to the network. In this paper, we propose a probabilistic model to handle trust in a P2P setting. It supports a local computation and a simple form of propagation of the trust of peers into classes of other peers. We claim that it is well appropriate to the dynamics of P2P networks and to the freedom of each peer within the network to have different viewpoints towards the peers with which it interacts.

Active documents: satisfiability of queries and view maintenance

Serge Abiteboul, Bogdan Marinoiu, Pierre Bourhis- IEEE Transactions on Knowledge and Data Engineering 2008 March

We consider the evolution of active documents, and, more precisely, of Active XML documents with incoming streams of data. A main contribution of the paper is a study of "document satisfiability" defined as follows. A formula is satisfiable for a document if it may hold in some future (depending on data brought by the incoming streams). We show how to evaluate document satisfiability of tree-pattern queries using non-recursive datalog. We characterize the complexity of this problem for important variants of the document model and the query language. We use this notion together with known datalog techniques (Differential and MagicSet) for maintaining views over active documents. We also exhibit optimization techniques that consist in pushing filters to the incoming streams and garbage collecting data and streams that are useless with respect to the views of interest.

Assisting in Reuse of Adaptive Hypermedia Creator's Models

Chantal Reynaud, Yolaine Bourda, Fabrice Popineau, Nadjet Zemirline- IEEE Transactions on Knowledge and Data Engineering 2008 July

The design of Adaptive Hypermedia is a difficult task which can be made easier if generic systems and AH creators' models are reused. We address this design problem in the setting of the GLAM platform only made up of generic components. We present a rule-based approach helping an AH creator in reusing its user and domain models to create a specific adaptive hypermedia. This semi-automatic approach takes the creators' models as specialisations of GLA generic models and requires the creator to express a minimum set of mappings between his models and the generic ones. The process results in a merged model consisting of the generic and the corresponding specific model. This merged model can be used by the adaptation model.

Automatic discovery of similar words

Pierre Senellart, Vincent Blondel- IEEE Transactions on Knowledge and Data Engineering 2008 jan

Calcul de consequences dans un systeme d ?inference pair-a-pair propositionnel (revisite)

Francois Goasdou, Nada Abdallah- IEEE Transactions on Knowledge and Data Engineering 2008

Dans cet article, nous étudions le calcul de conséquences dans les systèmes d'inférence pair-à-pair (P2PIS) propositionnels avec des mappings orientés. Dans ces systèmes, un mapping allant d'un pair vers un autre spécifie un ensemble de connaissances que le premier pair doit observer, ainsi que les connaissances qu'il doit notifier au second pair si les connaissances observées sont satisfaites. Ces nouveaux P2PIS pouvant modéliser de nombreuses applications réelles, il est important de les doter d'inférences clés de l'IA, en l'occurrence du calcul de conséquences. Nos contributions sont doubles. Nous définissons tout d'abord le premier cadre logique pour représenter des P2PIS propositionnels avec des mappings orientés. Nous étudions ensuite le calcul de conséquences dans ce nouveau cadre. En particulier, nous proposons un algorithme totalement décentralisé pour ce problème.

Calcul de consequences pour le test d extension conservative dans un systeme pair-a-pair

Francois Goasdou, Nada Abdallah- IEEE Transactions on Knowledge and Data Engineering 2008

Dans un système d'inférence pair-à-pair (P2PIS), un pair étend sa base de connaissances (KB) avec celle des autres pairs afin d'utiliser leurs connaissances pour répondre aux requêtes qui lui sont posées. Toutefois, l'extension d'une KB n'est pas nécessairement conservative. Une extension conservative garantit que le sens d'une KB est le même lorsqu'elle est considérée seule ou avec son extension. En revanche, une extension non conservative peut changer radicalement le sens d'une KB au sein de la théorie résultante. Il est par conséquent crucial pour un pair de savoir si un P2PIS est une extension conservative de sa KB. En effet, si ce n'est pas le cas (i) ses utilisateurs n'ont plus la bonne interprétation de sa KB, donc des requêtes en termes de sa KB et des réponses à celles-ci et (ii) les connaissances qu'il fournit aux autres pairs ne sont pas celles escomptées. Cet article est le premier à s'intéresser à la notion d'extension conservative dans le cadre des P2PIS. Notre contribution est d'étudier théoriquement et algorithmiquement le problème de tester si un P2PIS propositionnel est une extension conservative d'un pair donné. En particulier, nous recourons au calcul de conséquences afin de reformuler ce problème et d'élaborer un algorithme décentralisé pour le résoudre.

Distributed Monitoring of Peer to Peer Systems (demonstration)

Serge Abiteboul, Pierre Bourhis, Bogdan Marinoiu- IEEE Transactions on Knowledge and Data Engineering 2008

Observing highly dynamic Peer-to-Peer systems is essential for many applications such as fault management or business processing. We demonstrate P2PMonitor, a P2P system for monitoring such sys- tems. Alerters deployed on the monitored peers are designed to detect particular kinds of local events. They generate streams of XML data that form the primary sources of information for P2PMonitor. The core of the system is composed of processing components implementing the operators of an algebra over data streams. From a user viewpoint, monitoring a P2P system can be as simple as querying an XML document. The document is an ActiveXML document that aggregates a (possibly very large) number of streams generated by alerters on the monitored peers. Behind the scene, P2PMonitor compiles the monitoring query into a distributed monitoring plan, deploys alerters and stream algebra processors and issues notifications that are sent to users. The system functionalities are demonstrated by simulating the supply chain of a large company.

Experimental Assessment of the TARGET Adaptive Ontology-based Web Search Framework

Chantal Reynaud, Nicolas Guelfi, Cedric Pruski- IEEE Transactions on Knowledge and Data Engineering 2008

Finding relevant information on the Web is often difficult for most of the users. Although Web search applications are improving, they still need to be more intellignet to adapt to the search domain targeted by users, the evolution of this domain and users' characteristics. In this paper, we present an experimental assessment of the TARGET framework for improving the relevance of the documents when users are searching the Web by using adaptive ontologies. This is done first by introducing the TARGET approach. We will briefly present the used ontologies and their ability to adapt to domain evolution. We will then detail the TARGET tool used in our experimentations. This includes its architecture, its ability to carry out the ontology adaptation process as well as the way it searches the Web and ranks the returned results. Finally, we discuss the results obtained using the tool through the presentation of our case study devoted to the retrieval of scientific articles.

Guiding the Ontology Matching Process with reasoning in a PDMS

Chantal Reynaud, Francois-Elie Calvier- IEEE Transactions on Knowledge and Data Engineering 2008 June

This article focuses on ontology matching in a decentralized setting. The work takes place in the MediaD project. Ontologies are the description of peers data belonging to the peer data management system SomeRDFS. First, we present some particularities of ontology matching in a peer-to-peer setting, the data model and query answering in SomeRDFS PDMSs. Then, we show how to take advantage of query answering in order to help discovering new mappings between ontologies, either mapping shortcuts corresponding to a composition of pre-existent mappings or mappings which can not be inferred from the network but yet relevant. This work has been partly implemented. We present the results of some experiments which confirm the interest of our approach.

Guiding the Ontology Matching Process with reasoning in a PDMS

Chantal Reynaud, Francois-Elie Calvier- IEEE Transactions on Knowledge and Data Engineering 2008 October

This article focuses on ontology matching in a decentralized setting. The work takes place in the MediaD project. Ontologies are the description of peers data belonging to the peer data management system SomeRDFS. We show how to take advantage of query answering in order to help discovering new mappings between ontologies, either mapping shortcuts corresponding to a composition of pre-existent mappings or mappings which can not ne inferred from the network but yet relevant.

Modeling the Mashup Space

Serge Abiteboul, Tova Milo, Ohad Greenshpan- IEEE Transactions on Knowledge and Data Engineering 2008

We introduce a formal model for capturing the notion of mashup in its globality. The basic component in our model is the mashlet. A mashlet may query data sources, import other mashlets, use externalWeb services, and specify complex interaction patterns between its components. A mashlet state is modeled by a set of relations and its logic specified by datalog-style active rules. We are primarily concerned with changes in a mashlet state relations and rules. The interactions with users and other applications, as well as the consequent effects on the mashlets composition and behavior, are captured by streams of changes. The model facilitates dynamic mashlets composition, interaction and reuse, and captures the fundamental behavioral aspects of mashups.

On the Complexity of Deriving Schema Mappings from Database Instances

Pierre Senellart, Georg Gottlob- IEEE Transactions on Knowledge and Data Engineering 2008 jun

We introduce a theoretical framework for discovering relationships between two database instances over distinct and unknown schemata. This framework is grounded in the context of data exchange. We formalize the problem of understanding the relationship between two instances as that of obtaining a schema mapping so that a minimum repair of this mapping provides a perfect description of the target instance given the source instance. We show that this definition yields "intuitive" results when applied on database instances derived from each other by basic operations. We study the complexity of decision problems related to this optimality notion in the context of different logical languages and show that, even in very restricted cases, the problem is of high complexity.

Ontology Matching Supported by Query Answering in a P2P System

Chantal Reynaud, Francois-Elie Calvier- IEEE Transactions on Knowledge and Data Engineering 2008

In this article we propose methods to automate the extraction of alignments or mappings between ontologies by using query answering in the peer data management system (PDMS) SomeRDFS which uses a data model based on RDF(S). Query answering is composed of a rewriting and an evaluation phase. We show that query answering provides information that offer automated support for discovering new mappings. It is used to generate either mapping shortcuts corresponding to mapping compositions or mappings which can not be inferred from the network but yet relevant. The strategy followed by a peer and ?ltering criteria de?ned by the administrator of a peer are used to select the most relevant mappings to be represented.

OptimAX: Efficient Support for Data-Intensive Mash-Ups (demo)

Serge Abiteboul, Ioana Manolescu, Spyros Zoupanos- IEEE Transactions on Knowledge and Data Engineering 2008

OptimAX: Optimizing Distributed AXML Applications

Serge Abiteboul, Ioana Manolescu, Spyros Zoupanos- IEEE Transactions on Knowledge and Data Engineering 2008

Path Summaries and Path Partitioning in Modern XML Databases

Ioana Manolescu, Angela Bonifati, Andrei Arion, Andrea Pugliese- IEEE Transactions on Knowledge and Data Engineering 2008 february - 11(1)

Performance Evaluation in Database Research: Principles and Experience (tutorial)

Ioana Manolescu, Stefan Manegold- IEEE Transactions on Knowledge and Data Engineering 2008

Semantic Web take-off in a European Industry Perspective

Francois Goasdou, Jean Charlet, Alain Leger, Johannes Heinecke, Lyndon J.B. Nixon, Pavel Shvaiko, Paola Hobson- IEEE Transactions on Knowledge and Data Engineering 2008

Semantic Web technology is being increasingly applied in a large spectrum of applications in which domain knowledge is conceptualized and formalized (e.g., by means of an ontology) in order to support diversified and automated knowledge processing (e.g., reasoning) performed by a machine. Moreover, through an optimal combination of (cognitive) human reasoning and (automated) machine processing (mimicking reasoning); it becomes possible for humans and machines to share more and more complementary tasks. The spectrum of applications is extremely large and to name a few: corporate portals and knowledge management, e-commerce, e-work, e-business, healthcare, e-government, natural language understanding and automated translation, information search, data and services integration, social networks and collaborative filtering, knowledge mining, business intelligence and so on. From a social and economic perspective, this emerging technology should contribute to growth in economic wealth, but it must also show clear cut value for everyday activities through technological transparency and efficiency. The penetration of Semantic Web technology in industry and in services is progressing slowly but accelerating as new success stories are reported. In this chapter we present ongoing work in the cross-fertilization between industry and academia. In particular, we present a collection of application fields and use cases from enterprises which are interested in the promises of Semantic Web technology. The use cases are focused on the key knowledge processing components that will unlock the deployment of the technology in industry. The chapter ends with the presentation of the current state of the technology and future trends as seen by prominent actors in the field.

Static Analysis of Active XML Systems

S. Abiteboul, L. Segoufin, V.Vianu- IEEE Transactions on Knowledge and Data Engineering 2008

Active XML is a high-level specification language tailored to data-intensive, distributed, dynamic Web services. Active XML is based on XML documents with embedded function calls. The state of a document evolves depending on the result of internal function calls (local computations) or external ones (interactions with users or other services). Function calls return documents that may be active, so may activate new subtasks. The focus of the paper is on the verification of temporal properties of runs of Active XML systems, specified in a tree-pattern based temporal logic, Tree-LTL, that allows expressing a rich class of semantic properties of the application. The main results establish the boundary of decidability and the complexity of automatic verification of Tree-LTL properties.

Systèmes pair-à-pair sémantiques et extension non conservative d'une base de connaissances

Francois Goasdou, Nada Abdallah- IEEE Transactions on Knowledge and Data Engineering 2008

Cet article montre en quoi la notion d'extension non conservative d'une base de connaissances (KB) est importante dans les sytèmes d'inférence pair-à-pair (P2PIS), aussi connus sous le nom de systèmes pair-à-pair sémantiques. Cette notion est utile à un pair afin de détecter si (une partie de) sa KB est corrompue par un P2PIS ou pour apprendre du P2PIS de nouvelles connaissances sur son propre domaine d'application. Cette notion est d'autant plus importante qu'elle a des liens étroits avec la confidentialité des connaissances d'un pair au sein d'un P2PIS et avec la qualité de service fournie par un P2PIS. Nous étudions ici, du point de vue théorique et de l'algorithmique décentralisée, les deux problèmes suivants : (i) décider si un P2PIS est une extension conservative d'un pair donné et (ii) calculer les témoins d'une possible corruption de la KB d'un pair donné par un P2PIS, de sorte à pouvoir l'empêcher. Nous considérons des P2PIS passant à l'échelle d'un millier de pairs et dont l'utilité a déjà été démontrée en Intelligence Artificielle et en Bases de Données.

TaxoMap in the OAEI 2008 alignment contest

Chantal Reynaud, Brigitte Safar, Haifa Zargayouna, Faycal Hamdi- IEEE Transactions on Knowledge and Data Engineering 2008

TaxoMap is an alignment tool which aim is to discover rich correspondences between concepts. It performs an oriented alignment (from a source to a target ontology) and takes into account labels and sub-class descriptions. Our participation in last year edition of the competition have put the emphasis on certain limits. TaxoMap2 is a new implementation of taxoMap that reduces significantly runtime and enables parametrization by specifying the ontology language and different thresholds used to extract different mapping relations. The new implementation stresses on terminological techniques, it takes into account synonymy, and multi-label description of concepts. Special effort was made to handle large-scale ontologies by partitioning input ontologies into modules to align. We conclude the paper by pointing out the necessary improvements that need to be made.

The repeatability experiment of SIGMOD 2008

Ioana Manolescu, Andrei Arion, Pierre Senellart, Loredana Afanasiev, Neoklis Polyzotis, Karl Schnaitter, Spyros Zoupanos, Jens Dittrich, Stefan Manegold, Dennis Shasha- IEEE Transactions on Knowledge and Data Engineering 2008 March - 37(1)

Towards micro benchmarking XQuery

Ioana Manolescu, Philippe Michiels, Cedric Miachon- IEEE Transactions on Knowledge and Data Engineering 2008 - 33(2):182-202

Une aide a la decouverte de mappings dans SomeRDFS

Chantal Reynaud, Francois-Elie Calvier- IEEE Transactions on Knowledge and Data Engineering 2008

Dans cet article, nous nous intéresson sà la découverte de mises en correspondance entre ontologies distribuées modélisant les connaissances de pairs du système de gestion de données P2P SomeRDFS. Plus précisément, nous montrons comment exploiter les mécanismes de raisonnement mis en oeuvre dans SomeRDFS pour aider à découvrir des mappings entre ontologies. Ce travail est réalisé dans le cadre du projet MediaD en partenariat avec France Telecom.

Web Page Rank Prediction with Markov Models

Michalis Vazirgiannis, Pierre Senellart, Dimitris Drosos, Akrivi Vlachou- IEEE Transactions on Knowledge and Data Engineering 2008 apr

In this paper we propose a method for predicting the ranking position of a Web page. Assuming a set of successive past top-k rankings, we study the evolution of Web pages in terms of ranking trend sequences used for Markov Models training, which are in turn used to predict future rankings. The predictions are highly accurate for all experimental setups and similarity measures.

WebContent: Efficient P2P Warehousing of Web Data

Serge Abiteboul, Benjamin Nguyen, Ioana Manolescu, Francois Goasdou, Philippe Chatalic, Spyros Zoupanos, Gabriel Vasile, Georges Gardarin, Tristan Allard, Anca Ghitescu, Mohamed Ouazara, Aditya Somani, Nicolas Travers- IEEE Transactions on Knowledge and Data Engineering 2008

We present the WebContent platform for managing distributed repositories of XML and semantic Web data. The platform allows integrating various data processing building blocks (crawling, translation, semantic annotation, full-text search, structured XML querying, and semantic querying), presented as Web services, into a large-scale efficient platform. Calls to various services are combined inside ActiveXML documents, which are XML documents including service calls. An ActiveXML optimizer is used to: (i) efficiently distribute computations among sites; (ii) perform XQuery-specific optimizations by leveraging an algebraic XQuery optimizer; and (iii) given an XML query, chose among several distributed indices the most appropriate in order to answer the query.

XML Processing in DHT networks

Serge Abiteboul, Ioana Manolescu, Nicoleta Preda, Neoklis Polyzotis, Chong Sun- IEEE Transactions on Knowledge and Data Engineering 2008

A Calculus and Algebra for Distributed Data Management

S. Abiteboul- IEEE Transactions on Knowledge and Data Engineering 2007 :1-11

The sharing of content by communities of users (e.g., scientists) in a P2P context remains cumbersome. We argue that main reasons for this is the lack of calculus and algebra for distributed data management. We present the ActiveXML language that extends the XML language with features to handle distribution. More precisely, ActiveXML documents are XML documents with a special syntax for specifying the embedding of Web service calls, e.g. XML queries such as XQueries. We also present ActiveXML algebra that extends ActiveXML notably with explicit control of data exchanges. ActiveXML algebra allows describing query plans, and exchanging them between peers.

Approche logique pour la réconciliation de references

Nathalie Pernelle, Marie-Christine Rousset, Fatiha Sais- IEEE Transactions on Knowledge and Data Engineering 2007 january

Le problème de réconciliation de références consiste à décider si deux descriptions provenant de sources distinctes réfèrent ou non à la même entité du monde réel. Dans cet article, nous étudions ce problème quand le schéma des données est décrit en RDFS étendu par certaines primitives de OWL-DL. Nous décrivons et montrons l'intérêt d'une approche logique basée sur des règles de réconciliation qui peuvent être générées automatiquement à partir des axiomes du schéma. Ces règles traduisent de façon déclarative les dépendances entre réconciliations qui découlent de la sémantique du schéma. Les premiers résultats ont été obtenus sur des données réelles dans le cadre du projet PICSEL 3 en collaboration avec France Telecom R&D.

Comprendre le Web caché. Understanding the Hidden Web

Pierre Senellart- IEEE Transactions on Knowledge and Data Engineering 2007 dec

The hidden Web (also known as deep or invisible Web), that is, the part of the Web not directly accessible through hyperlinks, but through HTML forms or Web services, is of great value, but difficult to exploit. We discuss a process for the fully automatic discovery, syntactic and semantic analysis, and querying of hidden-Web services. We propose first a general architecture that relies on a semi-structured warehouse of imprecise (probabilistic) content. We provide a detailed complexity analysis of the underlying probabilistic tree model. We describe how we can use a combination of heuristics and probing to understand the structure of an HTML form. We present an original use of a supervised machine-learning method, namely conditional random fields, in an unsupervised manner, on an automatic, imperfect, and imprecise, annotation based on domain knowledge, in order to extract relevant information from HTML result pages. So as to obtain semantic relations between inputs and outputs of a hidden-Web service, we investigate the complexity of deriving a schema mapping between database instances, solely relying on the presence of constants in the two instances. We finally describe a model for the semantic representation and intensional indexing of hidden-Web sources, and discuss how to process a user's high-level query using such descriptions.

Constant-memory validation of streaming XML documents against DTDs

Luc Segoufin, Cristina Sirangelo- IEEE Transactions on Knowledge and Data Engineering 2007

In this paper we investigate the problem of validating, with constant memory,streaming XML documents with respect to a DTD. Such constant memory validations can only be performed for some but not all DTDs. This paper gives a non trivial interesting step towards characterizing those DTDs for which a constant-memory on-line algorithm exists.

Decouverte de correspondances entre ontologies distribuees

Chantal Reynaud, Francois-Elie Calvier- IEEE Transactions on Knowledge and Data Engineering 2007

Dans cet article, nous nous intéressons à la découverte de mises en correspondance entre ontologies modélisant les connaissances des pairs d'un système pair à pair de gestion de données (PDMS). Ce travail est réalisé dans le cadre du projet MediaD au sein de la plate-forme SomeRDFS dans le contexte duquel les ontologies, les données et les mappings sont exprimées en RDF(S). Nous présentons ce contexte dans une première partie. Nous précisons ensuite les spécificités du processus de mise en correpsondance dans le cadre de PDMS par opposition au contexte centralisé. Nous montrons enfin comment le raisonnement effectué dans un PDMS peut aider à découvrir des mappings entre ontologies et décrivons quelques techniques de découverte de mappings utilisables.

Determinacy and Rewriting of Conjunctive Queries Using Views: A Progress Report

Victor Vianu, Luc Segoufin, Alan Nash- IEEE Transactions on Knowledge and Data Engineering 2007

Suppose we are given a set of exact conjunctive views V and a conjunctive query Q. Suppose we wish to answer Q using V, but the classical test for the existence of a conjunctive rewriting of Q using V answers negatively. What can we conclude: (i) there is no way Q can be answered using V, or (ii) a more powerful rewriting language may be needed. This has been an open question, with conventional wisdom favoring (i). Surprisingly, we show that the right answer is actually (ii). That is, even if V provides enough information to answer Q, it may not be possible to rewrite Q in terms of V using just conjunctive queries -- in fact, no monotonic language is sufficiently powerful. We also exhibit several well-behaved classes of conjunctive views and queries for which conjunctive rewritings remain sufficient. This continues a previous investigation of rewriting and its connection to semantic determinacy, for various query and view languages.

Distributed Access Control: A Privacy-conscious Approach

Bogdan Cautis- IEEE Transactions on Knowledge and Data Engineering 2007 June

With more and more information being exchanged or published on the Web or in peer-to-peer, and with the significant growth in numbers of distributed, heterogeneous data sources, issues like access control and data privacy are becoming increasingly complex and difficult to manage. Very often, when dealing with sensitive information in such settings, the specification of access control policies and their enforcement are no longer handled by the actual data sources, and are (partially) delegated to third-parties. Besides practical reasons, this is the case when decisions regarding access depend on factors which overpass the scope and knowledge of some of the entities involved. More specifically, policies may depend on \emph{private} aspects concerning users (accessing data) or data owners. In this case, the only solution is to entrust some third-party authority with all the information needed to apply access policies. However, as the policies themselves depend on sensitive information, this outsourcing raises new privacy issues, that were not present in centralized environments. In particular, information leaks may occur during access control enforcement. In this paper, we consider these issues and, starting from non-conventional digital signatures, we take a first step towards an implementation solution for such settings where both data and access policies are distributed. Our approach involves rewriting user queries into forms which are authorized, and we illustrate this for both structured (relational) and semi-structured (XML) data and queries.

Distributed Monitoring of Peer to Peer Systems

Serge Abiteboul, Bogdan Marinoiu- IEEE Transactions on Knowledge and Data Engineering 2007

In this paper, we are concerned with the distributed monitoring of P2P systems. We ntroduce the P2P Monitor system and a new declarative language, namely P2PML, for specifying monitoring tasks. A P2PML subscription is compiled into a distributed algebraic plan which is described using algebra over XML streams. The operators of this algebra are first alerters in charge of detecting specific events and acting as stream sources. Other operators process the streams or publish them. We introduce a filter for streams of XML documents that scales by processing first simple conditions and then, if still needed, evaluating complex queries. We also show how particular tasks can be supported by identifying subtasks that are already provided by existing streams.

EDOS Distribution System: a P2P architecture for open-source content dissemination

Serge Abiteboul, Dan Vodislav, Radu Pop, Itay Dar, Gabriel Vasile- IEEE Transactions on Knowledge and Data Engineering 2007 June - 234:209,215

The open-source software communities currently face an increasing complexity of managing the software content among theirs developers and contributors. This is mainly due to the continuously growing size of the software, the high frequency of the updates, and the heterogeneity of the participants. We propose a distribution system that tackles two main issues in the software content management: efficient content dissemination through a P2P system architecture, and advanced information system capabilities, using a distributed index for resource location.

Exploiting WordNet as Background Knowledge

Chantal Reynaud, Brigitte Safar- IEEE Transactions on Knowledge and Data Engineering 2007

A lot of alignment systems providing mappings between the concepts of two ontologies rely on an additional source, called background knowledge, represented most of the time by a third ontology. The objective is to complement others current matching techniques. In this paper, we present the difficulties encountered when using WordNet as background knowledge and we show how the TaxoMap system we implemented can avoid those difficulties.

Finding Related Pages Using Green Measures: An Illustration with Wikipedia

Pierre Senellart, Yann Ollivier- IEEE Transactions on Knowledge and Data Engineering 2007 jul

We introduce a new method for finding nodes semantically related to a given node in a hyperlinked graph: the Green method, based on a classical Markov chain tool. It is generic, adjustment-free and easy to implement. We test it in the case of the hyperlink structure of the English version of Wikipedia, the on-line encyclopedia. We present an extensive comparative study of the performance of our method versus several other classical methods in the case of Wikipedia. The Green method is found to have both the best average results and the best robustness.

Incremental View Maintenance for Active Documents

Serge Abiteboul, Bogdan Marinoiu, Pierre Bourhis- IEEE Transactions on Knowledge and Data Engineering 2007

In this paper, we develop algorithmic datalogbased foundations for the incremental processing of tree-pattern queries over active documents, i.e. document with incoming streams of data. We define one query satisfiability for such documents based on a logic with 3-values: "true", "false forever", and "false for now". Also, given an active document and a query, part of the document (and in particular, some incoming streams) may become irrelevant for the query of interest. We introduce an incremental algorithm for detecting such useless data and streams, essential for implementing garbage collection. We also provide complexity analysis for the problems we study.

L2R: a Logical method for Reference Reconciliation

Fatiha Sais, Nathalie Pernelle and Marie-Christine Rousset- IEEE Transactions on Knowledge and Data Engineering 2007 July:329--334

The reference reconciliation problem consists in deciding whether different identifiers refer to the same data, i.e., correspond to the same world entity. The L2R system exploits the semantics of a rich data model, which extends RDFS by a fragment of OWL-DL and SWRL rules. In L2R, the semantics of the schema is translated into a set of logical rules of reconciliation, which are then used to infer correct decisions both of reconciliation and no reconciliation. In contrast with other approaches, the L2R method has a precision of 100% by construction. First experiments show promising results for recall, and most importantly significant increases when rules are added.

Large Scale P2P Distribution of Open-Source Software

Serge Abiteboul, Dan Vodislav, Nicoleta Preda, Radu Pop, Itay Dar, Gabriel Vasile- IEEE Transactions on Knowledge and Data Engineering 2007

Open-source software communities currently face an increasing complexity in managing and distributing software content among their developers and contributors. This is mainly due to the continuously growing size of the software, of the community, the high frequency of updates, and the heterogeneity of the participants. We propose a large scale P2P distribution system that tackles two main issues in software content management: efficient content dissemination and advanced information system capabilities.

Les ontologies pour la recherche ciblee d'information sur le Web

Chantal Reynaud, Nicola Guelfi, Cedric Pruski- IEEE Transactions on Knowledge and Data Engineering 2007

L'utilisation des technologies du Web Sémantique s'est généralisée au cours de ces dernières années. Ceci est vrai, en particulier, pour les langages de définition d'ontologies nécessitant l'adaptation des outils basés sur l'exploitation de ce type de ressources terminologiques pour qu'ils fonctionnent à partir de connaissances représentées dans le langage standard OWL. Cet article concerne la recherche d'information sur le Web basée sur l'utilisation d'ontologies. Notre contribution est double. Dans un premier temps, nous étudions comment exploiter les connaissances représentées en OWL dans O3, une approche de recherche de documents sur le Web, et nous proposons d'étendre ce langage en intégrant de nouvelles relations sémantiques. Dans un deuxième temps, nous montrons comment utiliser les ontologies construites en OWL et son extension pour l'expansion de requêtes afin d'améliorer la précision des résultats lors de recherche de documents sur le Web. Quelques résultats d'expérimentation obtenus à l'aide du prototype TARGET sont présentés.

Monitoring Peer to Peer Systems

Serge Abiteboul, Bogdan Marinoiu- IEEE Transactions on Knowledge and Data Engineering 2007

In this paper, we are concerned with the distributed monitoring of P2P systems. We introduce the P2P Monitor system and a new declarative language, namely P2PML, for specifying monitoring tasks. A subscription is compiled into a distributed algebraic plan which is described using an algebra over XML streams. We introduce a Filtler for streams of XML documents that scales by processing first simple conditions and then, if still needed, evaluating complex queries. We also show how particular tasks can be supported by identifying subtasks already provided by existing streams.

On the Complexity of Managing Probabilistic XML Data

Serge Abiteboul, Pierre Senellart- IEEE Transactions on Knowledge and Data Engineering 2007 Jun

In [3], we introduced a framework for querying and updating probabilistic information over unordered labeled trees, the probabilistic tree model. The data model is based on trees where nodes are annotated with conjunctions of probabilistic event variables. We briefly described an implementation and scenarios of usage. We develop here a mathematical foundation for this model. In particular, we present complexity results. We identify a very large class of queries for which simple variations of querying and updating algorithms from [3] compute the correct answer. A main contribution is a full complexity analysis of queries and updates. We also exhibit a decision procedure for the equivalence of probabilistic trees and prove it is in co-RP. Furthermore, we study the issue of removing less probable possible worlds, and that of validating a probabilistic tree against a DTD. We show that these two problems are intractable in the most general case.

On-Line Index Selection for Shifting Workloads

Serge Abiteboul, Tova Milo, Neoklis Polyzotis, Karl Schnaitter- IEEE Transactions on Knowledge and Data Engineering 2007

This paper introduces COLT (Continuous On-Line Tuning), a novel framework that continuously monitors the workload of a database system and enriches the existing physical design with a set of effective indices. The key idea behind COLT is to gather performance statistics at different levels of detail and to carefully allocate profiling resources to the most promising candidate configurations. Moreover, COLT uses effective heuristics to self-regulate its own performance, lowering its overhead when the system is well tuned and being more aggressive when the workload shifts and it becomes necessary to re-tune the system. We describe an implementation of the proposed framework in the PostgreSQL database system and evaluate its performance experimentally. Our results validate the effectiveness of COLT and demonstrate its ability to modify the system configuration in response to changes in the query load.

OptimAX: optimizing distributed continuous queries

Serge Abiteboul, Ioana Manolescu, Spyros Zoupanos- IEEE Transactions on Knowledge and Data Engineering 2007

The system we propose to present, OptiMAX, applies the principles of distributed query optimization to the problem of distributed evaluation of continuous XML queries. OptiMAX is an optimizer for Active XML documents (AXML in short). It is implemented as a module which can be used next to an AXML peer, and it may be invoked whenever users ask queries on their AXML documents. The optimizer draws up an initial query plan, and then attempts to rewrtite it using a combination of heuristics and cost information in order to improve the plan's performance estimates. An interesting feature is that all plans are AXML documents themselves. When the optimizer has retained a plan, it hands it to the AXML peer, which evaluates it directly following the decisions taken by the optimizer.

P2PTester: a tool for measuring P2P platform performance

Benjamin Nguyen, Ioana Manolescu, Florin Dragan, Nicoleta Preda, Radu Pop, Bogdan Butnaru, Georges Gardarin, Laurent Yeh- IEEE Transactions on Knowledge and Data Engineering 2007 April

The current abundance and complexity of P2P architectures makes it extremely difficult to assess their performance. P2PTester is the first tool devised to interface with, and measure the performance of, existing P2P data management platforms. We isolate basic components present in current P2P platforms, and insert "hooks" for P2PTester to capture, analyze and trace the interactions taking place in the underlying distributed system.

Passage a l echelle de la reconciliation de concepts et de la reconciliation de references :

Nathalie Pernelle, Fatiha Sais- IEEE Transactions on Knowledge and Data Engineering 2007 january

L'intégration de données provenant de différentes sources se confronte à deux problèmes majeurs : l'hétérogénéité des schémas et les variations dans les descriptions des instances. Nous avons comparé ces deux problèmes en se focalisant sur l'existence de méthodes permettant de faire face au passage à l'échelle et donc au traitement de données nombreuses et éventuellement distribuées.

Probing the Hidden Web. Research Internship Report.

Avin Mittal- IEEE Transactions on Knowledge and Data Engineering 2007 jul

Hidden databases on the web are an important source of information, which are not usually indexed by traditional search engines. Documents on the hidden web are not reachable by following the hyperlinked structure of the graph, and so, the need is being felt for a system which can automatically discover, understand and index hidden web documents. In this article, we present a system which analyzes the structure of HTML forms, annotates the various fields in the form with concepts from the knowledge domain, probes the fields with domain specific queries, and analyzes the response pages to find out whether or not they contain any results, to further confirm the annotations. The system also wraps the given HTML form into a web service described by a WSDL document, whose inputs are labeled by the domain concepts and which can be used to query the form to get response pages.

Querying data sources that export infinite sets of views (Technical Report)

Bogdan Cautis, Alin Deutsch, Nicola Onose- IEEE Transactions on Knowledge and Data Engineering 2007 Avril

We study the problem of querying data sources that accept only alimited set of queries, such as sources accessible by Web serviceswhich can implement very large (potentially infinite) families ofqueries. We revisit a classical setting in which theapplication queries are conjunctive queries and the source acceptsfamilies of (possibly parameterized) conjunctivequeries specified as the expansions of a(potentially recursive) Datalog program with parameters.We say that query $Q$ is{em expressible} by the program calP if it is equivalent to someexpansion of calP. $Q$ is {em supported} by calP if it has anequivalent rewriting using some finite set of calP's expansions.We present the first study of expressibility and support for sourcesthat satisfy integrity constraints, which is generally the casein practice.We start by closing a problem left open by prior work even whileignoring constraints, namely the precise relationship betweenexpressibility and support: surprisingly, we show them to beinter-reduciblein PTIME in both the absence and the presence of constraints. This enablesus to employ the same algorithm for solving both problems.We identify practically relevant restrictions on theprogram specifying the views that ensure decidability under a mix ofkey and weakly acyclic foreign key constraints, and beyond.We show that these restrictions are as permissive as possible,since their slightest relaxation leads to undecidability.We present an algorithm that is guaranteed to be sound when appliedto unrestricted input and in addition complete under the restrictions.As a side-effect of ourinvestigation, we improve the previously best known upper bound fordeciding support in the constraint-free case.

Reasoning about XML Update Constraints

Serge Abiteboul, Tova Milo, Bogdan Cautis- IEEE Transactions on Knowledge and Data Engineering 2007 June

We introduce in this paper a class of constraints for describing how an XML document can evolve, namely \emph{XML update constraints}. For these constraints, we study the implication problem, giving algorithms and complexity results for constraints of varying expressive power. Besides classical constraint implication, we also consider an instance-based approach. More precisely, we study implication with respect to a current tree instance, resulting from a series of unknown updates. The main motivation of our work is reasoning about data integrity under update restrictions in contexts where owners may lose control over their data, such as in publishing or exchange.

Reconciliation de references : une approche adaptee aux grands volumes de donnees

Fatiha Sais, Nathalie Pernelle and Marie-Christine Rousset- IEEE Transactions on Knowledge and Data Engineering 2007 June:521--532

Le problème de réconciliation de références est un problème majeur pour l'intégration ou la fusion de données provenant de plusieurs sources. Il consiste à décider si deux descriptions provenant de sources distinctes réfèrent ou non à la même entité du monde réel. Cependant, la tâche de réconciliation de références est souvent confrontée à des grandes quantités de données mais aussi à des contraintes de performances en temps d'exécution. En vue d'un passage à l'échelle, nous présentons dans cet article, des techniques que nous intégrons dans notre approche logique L2R de réconciliation de références.

Regular tree languages definable in FO and FOmod

Luc Segoufin, Michael Benedikt- IEEE Transactions on Knowledge and Data Engineering 2007

We consider regular languages of labeled trees. We give an algebraic characterization of the regular languages over such trees that are definable in first-order logic in the language of labeled graphs. These languages are the analog on trees of the ``locally threshold testable'' languages on strings. We show that this characterization yields a decision procedure for determining whether a regular collection of trees is first-order definable: the procedure is polynomial time in the minimal automaton presenting the regular language. We also give a characterization of definability in first-order logic supplemented with modular quantifiers, and a corresponding decision procedure.

Scalability Evaluation of a P2P Content Distribution System

Serge Abiteboul, Dan Vodislav, Radu Pop, Gabriel Vasile- IEEE Transactions on Knowledge and Data Engineering 2007

Scalability is a key issue in current content distribution systems for web communities, which have to disseminate increasingly large amounts of data to increasingly large communities of users. We believe that peer-to-peer (P2P) architectures provide good solutions for scalable distribution of large amounts of data over the web. This paper presents EDOS, a P2P system for open-source software distribution, able to disseminate gigabytes of data to thousands of users on the web. We focus on the publishing and querying functionalities and we present several experiments realized on the Grid'5000 network, that confirm the system's scalability.

Snapshot on the EDOS Distribution System

Serge Abiteboul, Dan Vodislav, Radu Pop, Itay Dar, Gabriel Vasile- IEEE Transactions on Knowledge and Data Engineering 2007 February

The open-source software communities currently face an increasing complexity of managing the software content among theirs developers and contributors. This is mainly due to the continuously growing size of the software, the high frequency of the updates, and the heterogeneity of the participants. We propose a distribution system that tackles two main issues in the software content management: efficient content dissemination through a P2P system architecture, and advanced information system capabilities, using a distributed index for resource location. We believe that both aspectsare essential in improving the performances of the general system and each particular choice in the system's architecture completes the overall functionalities.

SomeRDFS in the Semantic Web

Marie-Christine Rousset, Francois Goasdou, Philippe Adjiman- IEEE Transactions on Knowledge and Data Engineering 2007 - 8

The Semantic Web envisions a world-wide distributed architecture where computational resources will easily inter-operate to coordinate complex tasks such as query answering. Semantic marking up of web resources using ontologies is expected to provide the necessary glue for making this vision work. Using ontology languages, (communities of) users will build their own ontologies in order to describe their own data. Adding semantic mappings between those ontologies, in order to semantically relate the data to share, gives rise to the Semantic Web: data on the web that are annotated by ontologies networked together by mappings. In this vision, the Semantic Web is a huge semantic peer data management system. In this paper, we describe the SomeRDFS peer data management systems that promote a "simple is beautiful" vision of the Semantic Web based on data annotated by RDFS ontologies.

TaxoMap in the OAEI 2007 alignment contest

Chantal Reynaud, Brigitte Safar, Haifa Zargayouna- IEEE Transactions on Knowledge and Data Engineering 2007

This paper presents our first participation in the OAEI 2007 campaign. It describes an approch to align taxonomies which relies on terminological and structural techniques applied sequentially. We performed our method with various taxonomies using our prototype, taxoMap. Previous experimental results were encouraging and demonstrate the relevance of this alignment approcah. In this paper, we evaluate the strengths and the weaknesses of taxoMap int he context of the OAEI campaign where the ontologies to align are different from taxonomies we are used to deal with.

Techniques d alignement d ontologies basees sur la structure d une ressource complementaire

Chantal Reynaud, Brigitte Safar, Francois-Elie Calvier- IEEE Transactions on Knowledge and Data Engineering 2007 0ctober

Pour identifier des mappings entre les concepts de deux ontologies, de nombreux travaux récents portent sur l'utilisation de connaissances complémentaires dites de "background" ou de support, représentées le plus souvent sous la forme d'une 3ème ontologie. Leur objectif commun est de compléter les techniques classiques d'appariement qui exploitent la structure ou la richesse du langage de représentation des ontologies, et qui ne s'appliquent plus quand les ontologies à apparier sont faiblement structurées ou se limitent à de simples hiérarchies de classification. Cet article présente une analyse comparative des travaux utilisant des connaissances de support, en commençant par leur schéma général commun, suivi par une analyse des travaux en fonction du type de connaissance de support utilisée. Nousétudins ensuite les probèmes rencontrés lorsque la connaissance de support est WordNet puis nous montrons comment notre système taxoMap résout ces difficultés.

Techniques structurelles d'alignement pour portails web

Chantal Reynaud, Brigitte Safar- IEEE Transactions on Knowledge and Data Engineering 2007

Le travail décrit dans cet article a pour objectif d'unifier l'accès aux documents d'un domaine d'application. Il permet en particulier d'augmenter le nombre de documents accessibles à partir de portails Web sans en modifier l'interface d'interrogation. L'accès aux documents est supposé s'appuyer sur des taxonomies que nous proposons d'aligner. Cet article porte spécifiquement sur les techniques structurelles mises en oeuvre, qui sont originales et particulières dans la mesure o� elles sont adaptées au traitement de taxonomies dont les structures sont hétérogènes et dissymétriques. Nous présentons et analysons ensuite les résultats de trois expérimentations effectuées sur diverses taxonomies, des taxonomies réelles qui ont motivé notr approche ainsi que des taxonomies test mises à disposition des chercheurs de la communauté.

The Data Ring: Community Content Sharing

Serge Abiteboul, Neoklis Polyzotis- IEEE Transactions on Knowledge and Data Engineering 2007

Information ubiquity has created a large crowd of users (most notably scientists), who could employ DBMS technology to process and share their data more effectively. Still, this user base prefers to keep its data in files that can be easily managed by applications such as spreadsheets, rather than deal with the complexity and rigidity of modern database systems. In this paper, we propose a vision for enabling non-experts, such as scientists, to build \emph{content sharing communities} in a true database fashion: declaratively. The proposed infrastructure, called the {\em data ring}, enables users to manage and share their data with minimal effort; the user points to the data that should be shared, and the data ring becomes responsible for automatically indexing the data (to make it accessible), replicating it (for availability), and reorganizing its physical storage (for better query performance). We outline the salient features of our proposal, and outline research challenges that must be addressed in order to realize this vision.

The Data Ring: Community Content Sharing

Serge Abiteboul, Neoklis Polyzotis- IEEE Transactions on Knowledge and Data Engineering 2007

Information ubiquity has created a large crowd of users (most notably scientists), who could employ DBMS technology to process and share their data more effectively. Still, this user base prefers to keep its data in files that can be easily managed by applications such as spreadsheets, rather than deal with the complexity and rigidity of modern database systems. In this paper, we propose a vision for enabling non-experts, such as scientists, to build \emph{content sharing communities} in a true database fashion: declaratively. The proposed infrastructure, called the {\em data ring}, enables users to manage and share their data with minimal effort; the user points to the data that should be shared, and the data ring becomes responsible for automatically indexing the data (to make it accessible), replicating it (for availability), and reorganizing its physical storage (for better query performance). We outline the salient features of our proposal, and outline research challenges that must be addressed in order to realize this vision.

Towards a Characterization of Order-Invariant Queries over Tame Structures

Luc Segoufin, Michael Benedikt- IEEE Transactions on Knowledge and Data Engineering 2007

This work deals with the expressive power of logics on finite structures with access to an additional ``arbitrary'' linear order. The queries that can be expressed this way are the order-invariant queries for the logic. For thestandard logics used in computer science, such as first-order logic, it is known that access to an arbitrary linear order increases the expressiveness of the logic. However, when we lookat the separating examples, we find that they have satisfying models whose Gaifman Graph is complex -- unbounded in valence and in treewidth. We thus explore the expressiveness of order-invariant queries over graph-theoretically well-behaved structures. We prove that first-order order-invariant queries over strings andtrees have no additional expressiveness over first-order logic in the original signature. We also prove new upper bounds on order-invariant queries over bounded treewidth and boundedvalence graphs. Our results make use of a new technique of independent interest: the application of algebraic characterizations of definability to show collapse results. Was presented at CSL 2005.

Towards the Adaptive Web using Metadata Evolution

Chantal Reynaud, Nicolas Guelfi, Cedric Pruski- IEEE Transactions on Knowledge and Data Engineering 2007

The evolution of Web information is of utmost importance in the design of good Web Information Systems applications. New emerging paradigms, like the Semantic Web, use ontologies for describing metadata and are defined, in part to aid in Web evolution. In this chapter, we survey techniques for ontology evolution. After identifying the different kinds of evolution the Web is confronted with, we detail the various existing languages and techniques devoted to web data evolution, with particular attention to semantic Web concepts, and how these languages and techniques can be adapted to evolving data in order to improve the quality of Web Information Systems applications.

Understanding and Supporting Ontology Evolution by Observing the WWW Conference

Chantal Reynaud, Nicolas Guelfi, Cedric Pruski- IEEE Transactions on Knowledge and Data Engineering 2007

Ontologies which represent domain knowledge in information systems are efficient to enhance information retrieval. However, information is evolving over time and thus it should be also expressible at ontology level. Unfortunately, we consider that ontology evolution is barely study and its basic principles have not been yet precisely defined according to our notion of evolution. In this paper, we have followed a bottom-up approach consisting in a rigourous analysis of the evolution of a particular domain over a significant period of time (namely the WWW series of conference over a decade) to highlight concrete domain knowledge evolutions. We then have generalized and we present a precise set of evolution features that should be offered by ontology metamodels. We also evaluate the modelling capabilities of OWL to represent these features and finally, we show the contribution of ontology evolution support to improve Web information retrieval.

Utilisation de connaissances supplementaires pour la decouverte de mappings dans le systeme TaxoMap

Chantal Reynaud, Brigitte Safar- IEEE Transactions on Knowledge and Data Engineering 2007

Après avoir décrit l'approche générale mise en oeuvre dans TaxoMap, nous présenterons tout d'abord les aspects favorisant son application à grande échelle. Nous nous focaliserons ensuite sur une des techniques de TaxoMap, spécifique au rapprochement de modèle sontologiques pauvres sémantiquement. Plus pécisément, nous traiterons de l'utilisation de connaissances supplémentairespour pallier l'insuffisance de connaissances contenues dans les modèles alignés. Enfin, nous effectuerons uen analyse comparative de cette technique avec l'état de l'art

XQueC: A Query-Conscious Compressed XML Database

Ioana Manolescu, Angela Bonifati, Andrei Arion, Andrea Pugliese- IEEE Transactions on Knowledge and Data Engineering 2007 5 - 7(2)

XML compression has gained prominence recently because it counters the disadvantage of the verbose representation XML gives to data. In many applications, such as data exchange and data archiving, entirely compressing and decompressing a document is acceptable. In other applications, where queries must be run over compressed documents, compression may not be beneficial since the performance penalty in running the query processor over compressed data outweighs the data compression benefits. While balancing the interests of compression and query processing has re- ceived significant attention in the domain of relational databases, these results do not immediately translate to XML data. In this article, we address the problem of embedding compression into XML databases without degrading query performance. Since the setting is rather different from relational databases, the choice of compression granularity and compression algorithms must be revisited. Query execution in the compressed domain must also be rethought in the framework of XML query processing due to the richer structure of XML data. Indeed, a proper storage design for the compressed data plays a crucial role here. The XQueC system (XQuery Processor and Compressor) covers a wide set of XQuery queries in the compressed domain and relies on a workload-based cost model to perform the choices of the compression granules and of their corresponding compression algorithms. As a consequence, XQueC provides efficient query processing on compressed XML data. An extensive experimental assessment is presented, showing the effectiveness of the cost model, the compression ratios, and the query execution times.

A Compact Representation for Least Common Subsumers in the description logic ALE.

Marie-Christine Rousset., Chan Le Duc, Nhan Le Thanh- IEEE Transactions on Knowledge and Data Engineering 2006 - 19(3):239--273

This paper introduces a compact representation which helps to avoid the exponential blow-up in space of the Least Common Subsumer (lcs) of two ALE-concept descriptions. Based on the compact representation we define a space of specific graphs which represents all ALE-concept descriptions including the lcs. Next, we propose an algorithm exponential in time and polynomial in space for deciding subsumption between concept descriptions represented by graphs in this space. These results provide better understanding of the double exponential blow-up of the approximation of ALC-concept descriptions by ALE-concept descriptions: double exponential size of the approximation in the ordinary representation is unavoidable in the worst case.

Reasoning with Inconsistencies in Propositional Peer-to-Peer Inference Systems.

Philippe Chatalic, Gia-Hien Nguyen, Marie-Christine Rousset.- IEEE Transactions on Knowledge and Data Engineering 2006 August:352--357

In a peer-to-peer inference system, there is no centralized control or hierarchical organization: each peer is equivalent in functionality and cooperates with other peers in order to solve a collective reasoning task. Since peer theories model possibly different viewpoints, even if each local theory is consistent, the global theory may be inconsistent. We exhibit a distributed algorithm detecting inconsistencies in a fully decentralized setting. We provide a fully distributed reasoning algorithm, which computes only well-founded consequences of a formula, i.e., with a consistent set of support.

Somewhere in the Semantic Web.

Marie-Christine Rousset, Philippe Adjiman, Philippe Chatalic, Francois Goasdoué, Laurent Simon.- IEEE Transactions on Knowledge and Data Engineering 2006 January

In this paper, we describe the SomeWhere semantic peer-to-peer data management system that promotes a "small is beautiful" vision of the Semantic Web based on simple personalized ontologies (e.g., taxonomies of classes) but which are distributed at a large scale. In this vision of the Semantic Web, no user imposes to others his own ontology. Logical mappings between ontologies make possible the creation of a web of people in which personalized semantic marking up of data cohabits nicely with a collaborative exchange of data. In this view, the Web is a huge peer-to-peer data management system based on simple distributed ontologies and mappings.

A Framework for Distributed Data Management

Serge Abiteboul, Ioana Manolescu, Emanuel Taropa- IEEE Transactions on Knowledge and Data Engineering 2006

Algebra-Based Identification of Tree Patterns in XQuery.

Yannis Papakonstantinou, Veronique Benzaken, Ioana Manolescu, Andrei Arion, Ravi Vijay- IEEE Transactions on Knowledge and Data Engineering 2006 :13-25

Query processing performance in XML databases can be greatly enhanced by the usage of materialized views whose content has been stored in the database. This requires a method for identifying query subexpressions matching the views, a process known as view-based query rewriting. This process is quite complex for relational databases, and all the more daunting on XML databases. Current XML materialized view proposals are based on tree patterns, since query navigation is conceptually close to such patterns. However, the existing algorithms for extracting tree patterns from XQuery do not detect patterns across nested query blocks. Thus, complex, useful tree pattern views may be missed by the rewriting algorithm. We present a novel tree pattern extraction algorithm from XQuery queries, able to identify larger patterns than previous methods.

Alignement de taxonomies pour l'integration de sources d'information heterogenes

Chantal Reynaud, Brigitte Safar, Hassen Kefi- IEEE Transactions on Knowledge and Data Engineering 2006

Integrer des sources d'information heterogenes permet un acces unifie sans modification du contenu. Les schemas des sources sont mis en correspondance de facon a ce que qu'il soit possible d'acceder a tout un ensemble de documents provenant de sources multiples, a partir d'un systeme d'interrogation unique. La specification de ces mises en correspondance, ou mappings, est ainsi au coeur de l'integration. Nous proposons d'utiliser differentes techniques d'alignement de taxonomies pour automatiser leur generation. Ces techniques ont ete implementees dans un outil logiciel TaxoMap qui recherche les mappings et, en cas d'echec, donne des indications pour aider les utilisateurs a les specifier eux-memes. Nous presentons et discutons des resultats issus d'experiences realisees dans le domaine de la microbiologie. Nous testons TaxoMap sur differentes taxonomies issues de domaines varies.

Data Ring: Let Us Turn the Net into a Database!

Serge Abiteboul, Neoklis Polyzotis- IEEE Transactions on Knowledge and Data Engineering 2006

Because of information ubiquity, one observes an important trend towards transferring information management tasks from database systems to networks. We introduce the notion of Data Ring that can be seen as a network version of a database or a content warehouse. A main goal is to achieve better performance for content management without requiring the acquisition of explicit control over information resources. We discuss the main traits of Data Rings and argue that Active XML provides an appropriate basis for such systems. The collaborating peers that form the Data Ring are autonomous, heterogeneous and their capabilities may greatly vary, e.g., from a sensor to a large database. To support effectively this paradigm of loose integration, the Data Ring enforces a seamless transition between data and metadata and between explicit and intentional data. It does not distinguish between data provided by web pages and data provided by web services, between local (extensional) data and external data obtained via a Web service call. This is achieved using the Active XML technology that is based on exchanging XML documents with embedded service calls both for the logical and physical data model.

Decouverte de relations candidates a l enrichissement d'entrepot thematiques

Nathalie Pernelle, Fatiha Sais, Hélène Gagliardi, Ollivier Haemmerlé- IEEE Transactions on Knowledge and Data Engineering 2006 june

Ce travail a pour objectif d'enrichir automatiquement un entrepôt théematique de donnéees `a partir de documents de format divers provenant du Web et comportant des tableaux. Cet enrichissement est guidée par une ontologie comportant un ensemble de termes et de relations. Le caract`ere automatique de notre approche nous a conduits `a utiliser un format flexible permettant de repréesenter les donnéees (termes ou relations) dont l'identification est partielle, incertaine ou multiple. Cet article se focalise sur la déecouverte dans les tableaux de relations non identifiéees ainsi que sur la déecouverte d'attributs qui peuvent enrichir ou modifier l'interpréetation de relations identifiées.

Distributed Reasoning in a Peer-to-Peer Setting: Application to the Semantic Web

Marie-Christine Rousset, Francois Goasdou, Philippe Adjiman, Philippe Chatalic, Laurent Simon- IEEE Transactions on Knowledge and Data Engineering 2006 - 25:269,314

In a peer-to-peer inference system, each peer can reason locally but can also solicit some of its acquaintances, which are peers sharing part of its vocabulary. In this paper, we consider peer-to-peer inference systems in which the local theory of each peer is a set of propositional clauses defined upon a local vocabulary. An important characteristic of peer-to-peer inference systems is that the global theory (the union of all peer theories) is not known (as opposed to partition-based reasoning systems). The main contribution of this paper is to provide the first consequence finding algorithm in a peer-to-peer setting: it is anytime and computes consequences gradually from the solicited peer to peers that are more and more distant. We exhibit a sufficient condition on the acquaintance graph of the peer-to-peer inference system for guaranteeing the completeness of this algorithm. Another important contribution is to apply this general distributed reasoning setting to the setting of the Semantic Web through the SomeWhere semantic peer-to-peer data management system. The last contribution of this paper is to provide an experimental analysis of the scalability of the peer-to-peer infrastructure that we propose, on large networks of 1000 peers.

Expressive power of pebble automata

Thomas Schwentick, Mathias Samuelides, Mikolaj Bojanczyk, Luc Segoufin - IEEE Transactions on Knowledge and Data Engineering 2006

Two variants of pebble tree-walking automata on trees are considered that were introduced in the literature. It is shown that for each number of pebbles, the two models have the same expressive power both in the deterministic case and in the nondeterministic case. Furthermore, nondeterministic (resp. deterministic) tree-walking automata with n+1 pebbles can recognize more languages than those with n pebbles. Moreover, there is a regular tree language that is not recognized by any tree-walking automaton with pebbles. As a consequence, FO+posTC is strictly included in MSO over trees.

GLAM: A Generic Layered Adaptation Model for Adaptive Hypermedia Systems

Chantal Reynaud, Cedric Jacquiot, Yolaine Bourda, Fabrice Popineau, Alexandre Delteil- IEEE Transactions on Knowledge and Data Engineering 2006 Juny - LNCS:131--140

This paper introduces GLAM, a system based on situation calculus and meta-rules, which is able to provide adaptation by means of selection of actions. It is primarily designed to provide adaptive navigation. The different levels of conception, related to different aspects of the available metadata, are split in different layers in GLAM, in order to ease the conception of the adaptation system as well as to increase the potential use of complex adaptation mechanisms. GLAM uses meta-rules to handle these layers.

Mappings pour l'integration de documents XML

Chantal Reynaud, Brigitte Safar- IEEE Transactions on Knowledge and Data Engineering 2006 Janvier

Ce travail se situe dans le domaine de l'integration de sources d'informations heterogenes. Il vise l'integration d'une nouvelle source XML a un serveur d'information en exploitant une ontologie, c'est-a-dire une description explicite et declarative de la semantique d'un domaine d'application. Nous proposons d'automatiser la generation des mises en correspondance, ou mappings, entre l'ontologie et la source a integrer. Nous montrons que ce processus repose non seulement sur l'exploitation des schemas des modeles a connecter mais aussi sur les donnees associees. Un prototype, DetectHybridemap, developpe en java, est a l'origine d'experimentations sur des donnees reelles fournies par France Telcom R&D dans le domaine du tourisme.

Path summaries and path partitioning in modern XML databases (Poster)

Ioana Manolescu, Angela Bonifati, Andrei Arion, Andrea Pugliese- IEEE Transactions on Knowledge and Data Engineering 2006 :1077-1078

Querying and Updating Probabilistic Information in XML

Serge Abiteboul, Pierre Senellart- IEEE Transactions on Knowledge and Data Engineering 2006 Mar

We present in this paper a new model for representing probabilistic information in a semi-structured (XML) database, based on the use of probabilistic event variables. This work is motivated by the need of keeping track of both confidence and lineage of the information stored in a semi-structured warehouse. For instance, the modules of a (Hidden Web) content warehouse may derive information concerning the semantics of discovered Web services that is by nature not certain. Our model, namely the fuzzy tree model, supports both querying (tree pattern queries with join) and updating (transactions containing an arbitrary set of insertions and deletions) over probabilistic tree data. We highlight its expressive power and discuss implementation issues.

Representing and Querying XML with Incomplete Information

Serge Abiteboul, Victor Vianu, Luc Segoufin- IEEE Transactions on Knowledge and Data Engineering 2006 - 31(1):208--254

e study the representation and querying of XML with incomplete information. We consider a simple model for XML data and their DTDs, a very simple query language, and a representation system for incomplete information in the spirit of the representations systems developed by Imielinski and Lipski for relational databases. In the scenario we consider, the incomplete information about an XML document is continuously enriched by successive queries to the document. We show that our representation system can represent partial information about the source document acquired by successive queries, and that it can be used to intelligently answer new queries. We also consider the impact on complexity of enriching our representation system or query language with additional features. The results suggest that our approach achieves a practically appealing balance between expressiveness and tractability.

SomeWhere: a scalable P2P infrastructure for querying distributed ontologies.

Francois Goasdou, Laurent Simon, Marie-ChristineRousset, P.hilippe Adjiman, Philippe Chatalic - IEEE Transactions on Knowledge and Data Engineering 2006 October:698--703

In this invited talk, we present the SomeWhere approach and infrastructure for building semantic peer-to-peer data management systems based on simple personalized ontologies distributed at a large scale. Somewhere is based on a simple class-based data model in which the data is a set of resource identifiers (e.g., URIs), the schemas are (simple) definitions of classes possibly constrained by inclusion, disjunction or equivalence statements, and mappings are inclusion, disjunction or equivalence statements between classes of different peer ontologies. In this setting, query answering over peers can be done by distributed query rewriting, which can be equivalently reduced to distributed consequence finding in propositional logic. It is done by using the message-passing distributed algorithm that we have implemented for consequence finding of a clause w.r.t a set of distributed propositional theories. We summarize its main properties (soundness, completeness and termination), and we report experiments showing that it already scales up to a thousand of peers. Finally, we mention ongoing work on extending the current data model to RDF(S) and on handling possible inconsistencies between the ontologies of different peers.

Structural techniques for alignment of structurally dissymmetric taxonomies

Chantal Reynaud, Brigitte Safar, Hassen Kefi- IEEE Transactions on Knowledge and Data Engineering 2006 October:39-40

This paper deals with taxonomy alignment and presents the structural techniques of an alignment method suitable with a dissymmetry in the structure of the mapped taxonomies. The aim is to allow a uniform access to ducuments belonging to a same application domain, assuming retrieval of documents is supported by taxonomies. We applied our method to various taxonomies using our prototype TaxoMap.

Structural techniques for Alignment of taxonomies: experiments and evaluation

Chantal Reynaud, Brigitte Safar- IEEE Transactions on Knowledge and Data Engineering 2006 June

This paper deals with taxonomy alignment and presents structural techniques of an alignment method suitable with a dissymmetry in the structure of the mapped taxonomies. The aim is to allow a uniform access to documents belonging to a same application domain, assuming retrieval of documents is supported by taxonomies. We applied our method to various taxonomies using our prototype, TaxoMap. Experimental results are given and analyzed to demonstrate the effectiveness of the alignment approach. We also give comments about test cases and sketch some ideas to make improvements in order to widen the scope of the approach.

Structured Materialized Views for XML Queries

Yannis Papakonstantinou, Veronique Benzaken, Ioana Manolescu, Andrei Arion- IEEE Transactions on Knowledge and Data Engineering 2006

The performance of XML database queries can be greatly enhanced by employing materialized views. We present containment and rewriting algorithms for tree pattern queries that correspond to a large and important subset of XQuery, in the presence of a structural summary of the database (i.e., in the presence of a Dataguide). The tree pattern language captures structural identi?ers and optional nodes, which allow us to translate nested XQueries into tree patterns. We character- ize the complexity of tree pattern containment and rewriting, under the constraints expressed in the structural summary, whose enhanced form also entails integrity constraints. Our approach is implemented in the ULoad [5] prototype and we present a performance analysis.

Techniques structurelles pour l'alignement de taxonomies sur le Web

Chantal Reynaud, Brigitte Safar, Hassen Kefi- IEEE Transactions on Knowledge and Data Engineering 2006

Ce papier porte sur la g

The Semantic Web from an Industrial Perspective

Francois Goasdou, Jean Charlet, Alain Leger, Johannes Heinecke, Lyndon J.B. Nixon, Pavel Shvaiko, Paola Hobson- IEEE Transactions on Knowledge and Data Engineering 2006

Semantic Web technology is being increasingly applied in a large spectrum of applications in which domain knowledge is conceptualized and formalized (e.g., by means of an ontology) in order to support diversified and automated knowledge processing (e.g., reasoning) performed by a machine. Moreover, through an optimal combination of (cognitive) human reasoning and (automated) machine reasoning and processing, it is possible for humans and machines to share complementary tasks. The spectrum of applications is extremely large and to name a few: corporate portals and knowledge management, e-commerce, e-work, e-business, healthcare, e-government, natural language understanding and automated translation, information search, data and services integration, social networks and collaborative filtering, knowledge mining, business intelligence and so on. From a social and economic perspective, this emerging technology should contribute to growth in economic wealth, but it must also show clear cut value for everyday activities through technological transparency and efficiency. The penetration of Semantic Web technology in industry and in services is progressing slowly but accelerating as new success stories are reported. In this paper and lecture we present ongoing work in the cross-fertilization between industry and academia. In particular, we present a collection of application fields and use cases from enterprises which are interested in the promises of Semantic Web technology. The use cases are detailed and focused on the key knowledge processing components that will unlock the deployment of the technology in the selected application field. The paper ends with the presentation of the current technology roadmap designed by a team of Academic and Industry researchers.

Trading with plans and patterns in XQuery optimization

Yannis Papakonstantinou, Veronique Benzaken, Ioana Manolescu, Andrei Arion- IEEE Transactions on Knowledge and Data Engineering 2006 12

We have described a method for producing, during algebraic XQuery optimization, tree patterns equivalent to increasingly larger algebraic plans. The benefit of producing such patterns is to bridge the gap between the seemingly disparate formalisms, and to enable translation of results on containment and equivalence under structural constraints, as well as twig cardinality estimation, directly to algebraic plans. Our approach is fully implemented in ULoad [8] and we have demonstrated its low overhead.

Wen usual structural alignment techniques don't apply

Chantal Reynaud, Brigitte Safar- IEEE Transactions on Knowledge and Data Engineering 2006

This paper deals with taxonomy alignment. It presents structural techniques of an alignment method suitable with a dissymmetry in the structure of the mapped taxonomies. The aim is to allow a uniform access to documents belonging to a same application domain, assuming retrieval of documents is supporting by taxonomies.

A phylogenetic tree for the sat 2002 contest.

D., P., L.- IEEE Transactions on Knowledge and Data Engineering 2005 (43)

A semantic enrichment of data tables applied to food risk assessment

Nathalie Pernelle, Ollivier Haemmerl, Fatiha Sais, Helene Galiardi- IEEE Transactions on Knowledge and Data Engineering 2005 october:374--376

A Unified Tuple-Based Algebra for XQuery

Yannis Papakonstantinou, Ioana Manolescu- IEEE Transactions on Knowledge and Data Engineering 2005 April

A User-centric Framework for Accessing Biological Sources and Tools

Christine Froidevaux, Sarah Cohen-Boulakia, Susan Davidson- IEEE Transactions on Knowledge and Data Engineering 2005

Biologists face two problems in interpreting their experiments: the integration of their data with information from multiple heterogeneous sources and data analysis with bioinformatics tools. It is difficult for scientists to choose between the numerous sources and tools without assistance. Following a thorough analysis of scientistsx{2019} needs during the querying process, we found that biologists express preferences concerning the sources to be queried and the tools to be used. Interviews also showed that the querying process itself the strategy followed differs between scientists. In response to these findings, we have introduced a user-centric framework allowing to specify various querying processes. Then we have developed the BioGuide system which helps the scientists to choose suitable sources and tools, find complementary information in sources, and deal with divergent data. It is generic in that it can be adapted by each user to provide answers respecting his/her preferences, and obtained following his/her strategies.

Active Context-Free Games

Luc Segoufin, Thomas Schwentick, Anca Muscholl- IEEE Transactions on Knowledge and Data Engineering 2005

An Active Context-Free Game is a game with two players I and II on strings over a finite alphabet. In each move, I selects a position of the current word and II rewrites the corresponding letter according to a rule of a context-free grammar. I wins if a string of the regular target language is reached. The complexity of deciding winning strategies for I is investigated, depending on properties of the grammar, of the target language, and on restrictions on the strategy.

Alpha Galois Lattices: an overview

Ventos, V., H. Soldano- IEEE Transactions on Knowledge and Data Engineering 2005 february:to appear

Alpha Galois Lattices: an overview

Veronique Ventos, Henry Soldano- IEEE Transactions on Knowledge and Data Engineering 2005 (3403):298-313

An automatic ontology-based approach to enrich tables semantically

Nathalie Pernelle, Ollivier Haemmerl, Fatiha Sais, Helene Galgliardi- IEEE Transactions on Knowledge and Data Engineering 2005 july:64--71

Building an Active Content Warehouse

Serge Abiteboul, Benjamin Nguyen, Gabriela Ruberg- IEEE Transactions on Knowledge and Data Engineering 2005

A chapter in the book Managing and Processing Complex Data for Decision Support, J. Darmont and O. Boussa

Complementing deterministic tree-walking automata

Luc Segoufin, Anca Muscholl, Mathias Samuelides- IEEE Transactions on Knowledge and Data Engineering 2005

We consider various kinds of deterministic tree-walking automata, with and without pebbles, over ranked and unranked trees. For each such kind of automata we show that there is an equivalent one which never loops. The main consequence of this result is the closure under complementation of the various types of automata we consider with a focus on the number of pebbles used in order to complement the automata.

Constructing and Querying Peer-to-Peer Warehouses of XML Resources

Serge Abiteboul, Ioana Manolescu, Nicoleta Preda- IEEE Transactions on Knowledge and Data Engineering 2005 April:1122-1123

Construction automatis

Gloria-Lucia Giraldo-Gomez- IEEE Transactions on Knowledge and Data Engineering 2005

Diagnosis of Asynchronous Discrete event systems. Datalog to the rescue!

Serge Abiteboul, T. Milo, Z. Abrams, S. Haar- IEEE Transactions on Knowledge and Data Engineering 2005

We study in this paper the diagnosis of distributed telecommunication systems. We show that (i) the problem can be modeled using Datalog programs, and (ii) it can benefit from the large battery of optimization techniques developed for Datalog. In particular, we show some surprising relationships between techniques used for solving efficiently the diagnosis problem of telecommunication systems and a known Datalog optimization technique, namely Query-sub-query (QSQ). We adapt QSQ to a distributed setting. We show that a simple ``generic'' use of this extended QSQ achieves an optimization as good as that previously provided by the dedicated diagnosis algorithms. Furthermore, we show that it allows solving efficiently a much larger class of system analysis problems.

EDOS: Environment for the Development and Distribution of Open Source Software

Serge Abiteboul, Tova Milo, Xavier Leroy, Boris Vrdoljak, Roberto Di Cosmo, St Fermigier, St Lauri, Fr Lepied, Radu Pop, Florent Villard, Jean-Paul Smets, Ciar Bryce, Klaus R. Dittrich, Assaf Sagi, Yotam Shtossel, Eleonora Panto- IEEE Transactions on Knowledge and Data Engineering 2005

The open-source software community is now comprised of a very large and growing number of contributors and users. The GNU/Linux operating system for instance has an estimated 18 million users worldwide and its contributing developers can be counted by thousands. The critical mass of contributors taking part in various opensource projects has helped to ensure high quality for open source software. However, despite the achievements of the open-source software industry, there are issues in the production of large scale open-source software (OSS) such as the GNU/Linux operating system that have to be addressed as the numbers of users, of contributors, and of available applications grow. EDOS is a European project supported by IST started October 2004 and ending in 2007, whose objective is to provide a new generation of methodologies, theoretical models, technical tools and quality models specifically tailored to OSS engineering and to software distribution over the Internet.

Efficient mining of high branching factor attribute trees

Marie-Christine Rousset, Alexandre Termier, Mich Sebag, Kouzou Ohara, Takashi Washio, Hiroshi Motoda- IEEE Transactions on Knowledge and Data Engineering 2005

In this paper, we present a new tree mining algorithm, DRYADEPARENT, based on the hooking principle first introduced in DRYADE. In the experiments, we demonstate that the branching factor and depth of the frequent patterns to find are key factors of complexity for tree mining algorithms. We whow that DRYADEPARENT outperforms the current fastest algorithm, CMTreeMiner, by orders of magnitude on datasets where the frequent patterns have a high branching factor.

Enriching a relational datawarehouse by integrating XML data : report ont the edot project

Nathalie Pernelle, Marie-Christine Rousset, H Gagliardi, Ollivier Haemmerl, Fatiha Sais, Damiano Migliori- IEEE Transactions on Knowledge and Data Engineering 2005

Enrichissement s

Nathalie Pernelle, H Gagliardi, Ollivier Haemmerl, Fatiha Sais- IEEE Transactions on Knowledge and Data Engineering 2005 january - 2:407--419

Ce travail a pour objectif la construction automatique d'un entrepot thematique de donnees, a partir de documents de format divers provenant du Web. L'exploitation de cet entrepot est assuree par un moteur d'interrogation fonde sur une ontologie. Notre attention porte plus precisement sur les tableaux extraits de ces documents et convertis au format XML, aux tags exclusivement syntaxiques. Cet article presente la transformation de ces tableaux, sous forme XML, en un formalisme enrichi semantiquement dont la plupart des tags et des valeurs sont des termes construits a partir de l'ontologie.

Exchanging Intensional XML Data

Serge Abiteboul, Bernd Amann, Tova Milo, Omar Benjelloun, Fred Dang Ngoc- IEEE Transactions on Knowledge and Data Engineering 2005 March - 30(1):1-40

XML is becoming the universal format for data exchange between applications. Recently, the emergence of Web services as standard means of publishing and accessing data on the Web introduced a new class of XML documents, which we call intensional documents. These are XML documents where some of the data is given explicitly while other parts are defined only intensionally by means of embedded calls to Web services. When such documents are exchanged between applications, one has the choice of whether or not to materialize the intensional data (i.e., to invoke the embedded calls) before the document is sent. This choice may be influenced by various parameters, such as performance and security considerations. This article addresses the problem of guiding this materialization process. We argue that

Filtering Web Documents for a Thematic Warehouse, Case Study : eDot a Food Risk Data (extended)

Amar-Djalil Mezaour- IEEE Transactions on Knowledge and Data Engineering 2005 June 13-16 - ??:269-278

Ordinary sources, like databases and general-pupose document collections, seems to be insufficient and inadequate to scale the needs and the requirements of the new generation of warehouses: thematic data warehouses. Knowing that more and more online thematic data is available, the web can be considered as a useful data source for populating thematic data warehouses. To do so, the warehouse data supplier must be able to filter the heterogeneous web content to keep only the documents corresponding to the warehouse topic. Therefore, building efficient automatic tools to characterize web documents dealing with a given thematic is essential to challenge the warehouse data acquisition issue. In this paper, we present our filtering approach implemented in an automatic tool called "eDotFilter". This tool is used to filter crawled documents to keep only the documents dealing with food risk. These documents are then stored in a thematic warehouse called "eDot". Our filtering approach is based on "WeQueL", a declarative web query langage that improves the expressive power of keyword-based queries.

Identifying Websites with Flow Simulation

Pierre Senellart- IEEE Transactions on Knowledge and Data Engineering 2005 April

We present in this paper a method to discover the set of webpages contained in a logical website, based on the link structure of the Web graph. Such a method is useful in the context of Web archiving and website importance computation. To identify the boundaries of a website, we combine the use of an online version of the preflow

Identifying Websites with Flow Simulation

Pierre Senellart- IEEE Transactions on Knowledge and Data Engineering 2005 July:124--129

We present in this paper a method to discover the set of webpages contained in a logical website, based on the link structure of the Web graph. Such a method is useful in the context of Web archiving and website importance computation. To identify the boundaries of a website, we combine the use of an online version of the preflow-push algorithm, an algorithm for the maximum flow problem in traffic networks, and of the Markov CLuster (MCL) algorithm. The latter is used on a crawled portion of the Web graph in order to build a seed of initial webpages, a seed which is extended using the former. An experiment on a subsite of the INRIA Website is described.

Integration of SYSTRAN MT systems in an open workflow

Pierre Senellart, Matts Attn, Jean Senellart- IEEE Transactions on Knowledge and Data Engineering 2005 September

A general mature rule-based MT system is bound to reach a saturation point because of the intrinsic complexity of the natural language description. For such systems, maintenance is a complex task and customization is expensive and time-consuming. Therefore, improving the system's interaction with the linguistic rules has proven to be more productive than endlessly adding new rules and exceptions to reach the theoretical accuracy level of 100%. In this paper, we describe our strategy to "open up" such a system and provide practical details on how this was done on SYSTRAN's classical engines. This approach is the foundation of the SYSTRAN version 5 translation engines. We show the immediate benefits of the evolution of our architecture and the new extension potentiality.

Knowledge Management by Querying Relational Views of XML data: application to Microbiology

Marie-Christine Rousset, Ollivier Haemmerl, Damiano Migliori- IEEE Transactions on Knowledge and Data Engineering 2005

In this paper we present a method for integrating and querying XML data in a relational setting. thouigh this method is generic, it has been motivated and validated by a knowledge management application on Microbiology: the e.dot projet.

Logical Design of Data Warehouses from XML

Boris Vrdoljak, Marko Banek, Zoran Skocir- IEEE Transactions on Knowledge and Data Engineering 2005 - 1:289-295

Data warehouse is a database that collects and integrates data from heterogeneous sources in order to support a decision making process.Data exchanged over the Internet and intranets has recently become an important data source, having XML as a standard format for exchange. The possibility of integrating available XML data into data warehouses plays an important role in providing enterprise managers with up-to-date and relevant information about their business domain. We have developed a methodology for data warehouse design from the source XML Schemas and conforming XML documents. As XML data is semi-structured, data warehouse design from XML brings many particular challenges. In this paper the final steps of deriving a conceptual multidimensional scheme are described, followed by the logical design, where a set of tables is created according to the derived conceptual scheme. A prototype tool has been developed to test and verify the proposed methodology.

MemBeR: A Micro-Benchmark Repository for XQuery

Ioana Manolescu, Loredana Afanasiev, Philippe Michiels- IEEE Transactions on Knowledge and Data Engineering 2005 September - 3671:144-161

Model-Driven Design and Deployment of Service-Enabled Web Applications

Stefano Ceri, Marco Brambilla, Ioana Manolescu, Sara Comai and Piero Fraternali- IEEE Transactions on Knowledge and Data Engineering 2005 August - 5(3)

Path Summaries and Path Partitioning in Modern XML Databases

Ioana Manolescu, Angela Bonifati, Andrei Arion, Andrea Pugliese- IEEE Transactions on Knowledge and Data Engineering 2005 November

We study the applicability of XML path summaries in the context of current-day XML databases. We find that summaries provide an excellent basis for optimizing data access methods, which furthermore mixes very well with path-partitioned store. We provide practical algorithm for building and exploiting summaries, and prove its benefits through extensive experiments .

Pr

Amar-Djalil Mezaour, Alphonse, Ahmed Amrani, Jerome Aze, Thomas Heitz, Mathieu Roche- IEEE Transactions on Knowledge and Data Engineering 2005 june

Querying and Updating Probabilistic Information in XML

Serge Abiteboul, Pierre Senellart- IEEE Transactions on Knowledge and Data Engineering 2005 Dec

We present in this paper a new model for representing probabilistic information in a semi-structured (XML) database, based on the use of probabilistic event variables. This work is motivated by the need of keeping track of both confidence and lineage of the information stored in a semi-structured warehouse. For instance, the modules of a (Hidden Web) content warehouse may derive information concerning the semantics of discovered Web services that is by nature not certain. Our model supports both querying (tree pattern queries with join) and updating (transactions containing an arbitrary set of insertions and deletions) over probabilistic tree data. We show that the model is as expressive as the impractical possible worlds model and that it is more expressive than a simple model where probability (confidence) is stored at each node of the tree. We briefly discuss implementation issues.

Recherche cibl

Amar-Djalil Mezaour- IEEE Transactions on Knowledge and Data Engineering 2005

Regular Rewriting of Active XML and Unambiguity

Serge Abiteboul, Tova Milo, Omar Benjelloun- IEEE Transactions on Knowledge and Data Engineering 2005

We consider here the exchange of Active XML (AXML) data, i.e., XML documents where some of the data is given explicitly while other parts are given only intensionally as calls to Web services. Peers exchanging AXML data agree on a {em data exchange schema} that specifies in particular which parts of the data are allowed to be intensional. Before sending a document, a peer may need to {em rewrite} it to match the agreed data exchange schema, by calling some of the services and {em materializing} their data. Previous works showed that the rewriting problem is undecidable in the general case and of high complexity in some restricted cases. We argue here that this difficulty is somewhat artificial. Indeed, we study what we believe to be a more adequate, from a practical view point, rewriting problem that is (1) in the spirit of standard 1-unambiguity constraints imposed on XML schema and (2) can be solved by a single pass over the document with a computational device not stronger than a finite state automaton. Following previous works, we focus on the core of the problem, i.e., on the problem on {em words}. The results may be extended to (A)XML trees in a straightforward manner.

Regular tree languages definable in FO

Luc Segoufin, Michael Benedikt- IEEE Transactions on Knowledge and Data Engineering 2005

The conference version was incorrect and has been removed. The file is the journal version.

Scalability Study of Peer-to-Peer Consequence Finding

Marie-Christine Rousset, Francois Goasdou, Philippe Adjiman, Philippe Chatalic, Laurent Simon- IEEE Transactions on Knowledge and Data Engineering 2005

In peer-to-peer inference systems, each peer can reason locally but also solicit some of its acquaintances, sharing part of its vocabulary. This paper studies both theoretically and experimentally the problem of computing proper prime implicates for propositional peer-to-peer systems, the global theory (union of all peer theories) of which is not known (as opposed to partition-based reasoning).

Secure Exchange of Modifiable Data and Queries

Serge Abiteboul, Tova Milo, Haim Kaplan, Bogdan Cautis, Amos Fiat- IEEE Transactions on Knowledge and Data Engineering 2005

We consider the secure exchange of information in a multi-party environment, where intermediaries can modify data and queries in a controlled way, e.g., by filtering out some information or strengthening some query. We propose a data exchange model that permits the specification of admissible updates, along with an array of cryptography-based techniques to validate updates and to guarantee the provenance of data and queries in different scenarios. In particular, we consider the cases where one insists on exposing or hiding the identity of peers that performed changes, and the changes they performed. We also consider the packaging of signatures with the data (to be able to verify it) and interactive scenarios where validity is proved via message exchange. The underlying thesis developed in the paper is that a newly proposed model, consisting of XML documents with embedded calls to Web services (namely Active XML), is an appropriate paradigm to manage secured data exchange in a distributed setting. We show that Active XML enables addressing in a uniform manner, in a dynamic and multi-party setting, issues such as provenance, authenticity of (portions of) documents, access control, and query processing; issues typically ignored or considered separately by traditional data exchange models.

Sharing Content in Structured P2P Networks

Serge Abiteboul, Ioana Manolescu, Nicoleta Preda- IEEE Transactions on Knowledge and Data Engineering 2005 October

The current development of Web standards and technologies has brought new opportunities for large-scale integration of Web resources: data sources (such asXML, HTML, or PDF files), distributed applications(typically accessed via Web Services), and semantic information (represented in formats such as OWL orRDF). At the same time, new peer-to-peer platforms are being developed, and increasingly used for data managementat the network scale. We present KADOP, a model, architecture, and system for sharing content in a peer-to-peer environment. By building on distributed hash tables technology, XML indexing and query optimization techniques, and on the paradigm of active documents, KADOP enables the publication, discovery, and efficient exploitation of content in a P2P environment. In some sense, with KADOP, the network can be seen as a content warehouse (for a domain of interest) as well as an entry point to external resources, in mediator style.The system presented here as been implemented and is being tested. A novel aspect of the work is the use of active documents, to: (i) automatically maintain and refresh the distributed index, (ii) greatly reduce its overall space occupancy, (iii) provide bridges to other data repositories, such as databases or other peer-to-peernet works, and (iv) dynamically combine data-centric Web services to answer complex queries. Keywords: XML, ActiveXML, data integration, Web services, DHT, intensional indexing, XML queryprocessing.

SomeWhere in the Semantic Web

Marie-Christine Rousset, Francois Goasdou, Philippe Adjiman, Philippe Chatalic, Laurent Simon- IEEE Transactions on Knowledge and Data Engineering 2005

In this paper, we describe the SomeWhere semantic peer-to-peer data management system that promotes a "small is beautiful" vision of the Semantic Web based on simple personalized ontologies (e.g., taxonomies of classes) but which are distributed at a large scale. In this vision of the Semantic Web, no user imposes to others his own ontology. Logical mappings between ontologies make possible the creation of a web of people in which personalized semantic marking up of data cohabits nicely with a collaborative exchange of data. In this view, the Web is a huge peer-to-peer data management system based on simple distributed ontologies and mappings.

SYSTRAN Translation Stylesheets: Machine Translation driven by XSLT

Pierre Senellart, Jean Senellart- IEEE Transactions on Knowledge and Data Engineering 2005 November

We present the SYSTRAN Translation Stylesheets (STS) used by the SYSTRAN machine translation engines that power the most prominent online translation services on the Web today (including Google, Yahoo!, AltaVista's BabelFish, and AOL). SYSTRAN develops and markets the world's most widely used machine translation software which powers products and solutions for the desktop, network infrastructures and the Internet that facilitate communication in 40 language combinations and in 20 specialized domains. The choice of leading global corporations, search engines and governments, SYSTRAN's software is applied across diverse best-practice solutions for intra-company communications, content management, online customer support, eCommerce, email systems, chat, and more. SYSTRAN's expertise spans over three decades of building customized translation solutions through open and robust architectures. XSL Transformation stylesheets are usually used to transform a document described in an XML formalism into another XML formalism, to modify an XML document, or to publish content stored into an XML document to a publishing format (XSL-FO, (X)HTML...). SYSTRAN Translation Stylesheets (STS) use XSLT to drive and control the machine translation of XML documents (native XML document formats or XML representations - such as XLIFF - of other kinds of document formats). STS do not only provide a simple way to indicate which part of the document text is to be translated, but also enables the fine-tuning of translation, especially by using the structure of the document to help disambiguate natural language semantics and determine proper context. For instance, the phrase "Access From Front Door" is to be analyzed as "The access from front door" within a title, and as "Do access (something) from front door" in the text body. In that case, the STS would pass a title option to the translation engine. The stylesheet can activate specialized domain dictionaries for some parts of the document and can mark some expressions as not to be translated, in the same manner. Another key application of STS is to consider machine translation as part of the authoring and publishing process: source documents can be annotated with natural language markup produced by the author, markup which will be processed by STS to improve the quality of translation, the gateway to the automatic publishing of a multilingual website from a monolingual (annotated) source. This composition, inside the publishing process, is a real breakthrough for Web content translation. Traditionally, machine translation is applied after publishing and does not have access to the original structure of the document, but only to its HTML representation. The mechanism is implemented through XSLT extension functions. In particular, the stylesheet uses a systran:translate function to translate an XML fragment, and systran:getValue/systran:pushValue/systran:popValue functions for consulting and for setting linguistics options in the translation engine. Proper management of character properties is also provided so that, for instance, the translation of a phrase in bold font will appear in bold font, even if the phrase has moved within the translated sentence. This process is highly customizable by the addition of new templates into the stylesheets. Because the translation is driven by the document structure, it is easier to leverage this structure during the translation process and to keep it in the translated document, than with traditional document filters, which process the entire document linearly. STS are part of SYSTRAN's commercial version 5.0 product line and are also used for and by specific corporate-customers.

The Active XML project: an overview

Serge Abiteboul, Tova Milo, Omar Benjelloun- IEEE Transactions on Knowledge and Data Engineering 2005 10

This paper provides an overview of the Active XML project developed at INRIA over the past three years. Active XML (AXML, for short), is a declarative framework that harnesses web services for data integration, and is put to work in a peer-to-peer architecture. An AXML document is an XML document that may contain calls to Web services. An AXML "peer" is a repository of AXML documents. It acts both as a client, invoking the Web services embedded in the documents, and as a server, providing Web service defined as queries over the active documents it contain. The approach gracefully combines stored information with data defined in an intentional manner as well as dynamic information. This simple rather classical idea leads to a number of technically challenging problems, both theoretical and practical, which will be overviewed in this paper. We will describe the AXML model and language, overview the research results obtained in the course of the project, and show how the various pieces are assembled together in the system implementation.

The Lowell database research self-assessment

Serge Abiteboul, Hector Garcia-Molina, J. Widom, J. D. Ullman, Stefano Ceri, Rakesh Agrawal, Philip A. Bernstein, Michael J. Carey, W. Bruce Croft, David J. DeWitt, Michael J. Franklin, Dieter Gawlick, Jim Gray, Laura M. Haas, Alon Y. Halevy, Joseph M. Hellerstein, Yannis E. Ioannidis, Martin L. Kersten, Michael J. Pazzani, Michael Lesk, David Maier, Jeffrey F. Naughton, Hans-J Schek, Timos K. Sellis, R. Silberschatz, M. Stonebraker, R. T. Snodgrass, G. Weikum, S. Zdonik- IEEE Transactions on Knowledge and Data Engineering 2005 - 48(5):111-118

A group of senior database researchers gathers every few years to assess the state of database research and to point out problem areas that deserve additional focus. This report summarizes the discussion and conclusions of the sixth ad-hoc meeting held May 4-6, 2003 in Lowell, Mass. It observes that information management continues to be a critical component of most complex software systems. It recommends that database researchers increase focus on: integration of text, data, code, and streams; fusion of information from heterogeneous data sources; reasoning about uncertain data; unsupervised data mining for interesting correlations; information privacy; and self-adaptation and repair.

The Parallel Complexity of XML Typing and XPath Query Evaluation

Luc Segoufin, Georg Gottlob, Christoph Koch, Reinhard Pichler- IEEE Transactions on Knowledge and Data Engineering 2005 - 52(2):284--335

We study the complexity of two central XML processing problems. The first is XPath 1.0 query processing, which has been shown to be in PTime in previous work. We prove that boththe data complexity and the query complexity of XPath 1.0 fall into lower (highly parallelizable) complexity classes, while thecombined complexity is PTime-hard. Subsequently, we study the sources of this hardness and identify a large and practicallyimportant fragment of XPath 1.0 for which the combined complexity is LogCFL-complete and, therefore, in the highly parallelizablecomplexity class NC2.The second problem is the complexity of validating XML documents against various typing schemes like Document Type Definitions (DTDs),XML Schema Definitions (XSDs), and tree automata, both with respect to data and to combined complexity.For data complexity, we prove that validation is in LogSpace and depends crucially on how XML data is represented. For the combined complexity, we show that the complexity ranges from LogSpace to LogCFL, depending on the typing scheme.

The SAT 2002 competition.

D. Le Berre- IEEE Transactions on Knowledge and Data Engineering 2005 :343--378

SAT Competition 2002 held in March-May 2002 in conjunction with SAT 2002 (the Fifth International Symposium on the Theory and Applications of Satisfiability Testing). About 30 solvers and 2300 benchmarks took part in the competition, which required more than 2 CPU years to complete the evaluation. In this report, we give the results of the competition, try to interpret them, and give suggestions for the future competitions.

Treillis de Galois Alpha

Veronique Ventos, Henry Soldano- IEEE Transactions on Knowledge and Data Engineering 2005 (19):799--827

ULoad: Choosing the Right Storage for Your XML Application

Veronique Benzaken, Ioana Manolescu, Andrei Arion, Ravi Vijay- IEEE Transactions on Knowledge and Data Engineering 2005 August:1330-1333

Un modele de stockage de donnees XML base sur les sequences

Ioana Manolescu, Angela Bonifati, Andrei Arion, Andrea Pugliese- IEEE Transactions on Knowledge and Data Engineering 2005 Octobre - 2

Une expérience de représentation d'une ontologie dans le mediateur PICSEL

Chantal Reynaud, Brigitte Safar, H Gagliardi- IEEE Transactions on Knowledge and Data Engineering 2005

Building an information server, specific to a givendomain, over distributed and heterogeneous information sources,is based on knowledge describing the server domain, calleddomain ontology. We address the issue of representing suchontologies in the framework of the PICSEL research project.Two steps are described. The first one is directed by therepresentation formalism and the functionalities of theinformation server. The second one aims at refining andoptimizing knowledge obtained in step one. It is guided by theway functionalities of the server are implemented. The examplesin the paper are coming from the tourism products domain.

Views and Queries: Determinacy and Rewriting

Victor Vianu, Luc Segoufin- IEEE Transactions on Knowledge and Data Engineering 2005

We investigate the question of whether a query Q can be answered using a set V of views. We first define the problem in information-theoretic terms: we say that V determines Q if V provides enough information to uniquely determ ine the answer to Q. Next, we look at the problem of rewriting Q in terms of V using a specif ic language. Given a view language VV and query language QQ, we say that a rewriting language RR is complete for VV-to-QQ rewritings if every Q in QQ can be rewritten in terms of V in VV using a query in RR, whenever V determines Q. While query rewriting using views has been extensively investigated for some specific languages, the connection to the info rmation-theoretic notion of determinacy, and the question of completeness of a rewriting language, have received little a ttention. In this paper we investigate systematically the notion of determinacy and its co nnection to rewriting. The results concern decidability of determinacy for various view and query languages, as well as the power requir ed of complete rewriting languages. We consider languages ranging from first-order to conjunctive queries.

XML Access Modules: Towards Physical Data Independence in XML Databases

Veronique Benzaken, Ioana Manolescu, Andrei Arion- IEEE Transactions on Knowledge and Data Engineering 2005 June

XML data integration with identification

Serge Abiteboul, Antonella Poggi- IEEE Transactions on Knowledge and Data Engineering 2005

Data integration is the problem of combining data residing at different sources, and providing the user with a virtual view, called global schema, which is independent from the model and the physical origin of the sources. Whereas many data integration systems and theoretical works have been proposed for relational data, not much investigation has focused yet on XML data integration. Our goal is therefore to address some of its related issues. In particular, we highlight two major issues that emerge in the XML context: (i) the global schema may be characterized by a set of constraints, expressed by means of a DTD and XML integrity constraints, (ii) the concept of node identity requires to introduce semantic criteria to identify nodes coming from different sources. We propose a formal framework for XML data integration systems based on an expressive XML global schema, a set of XML data sources and a set of mappings specified by means of a simple tree language. Then, we define an identification function that aims at globally identifying nodes coming from different sources. Finally, we propose algorithms to answer queries under different assumptions for the mappings.

XML Warehousing Meets Sociology

Benjamin Nguyen, Ioana Manolescu, Francois-Xavier Dudouet, Pierre Senellart- IEEE Transactions on Knowledge and Data Engineering 2005 October

In this article, we describe a novel application of XML and Web based technologies: a sociological study of the W3C standardization process. We propose a new methodology and tools, to be used by sociologists to study the standardization process, illustrated by the W3C XQuery Working Group. The novelty of our approach has many facets. Information Technology (IT) has received little attention from sociologists, yet the standardization of the Web is a crucial issue, both economical and political, based on the use of a semi-structured content warehouse. We introduce a modeling and querying approach of an XML content warehouse, and show it produces high added-value information. This information is used to conduct a preliminary sociological analysis of the XQuery standardization process.

XML Warehousing Meets Sociology... Introducing the W3C XQuery Working Group

Benjamin Nguyen, Ioana Manolescu, Francois-Xavier Dudouet, Pierre Senellart- IEEE Transactions on Knowledge and Data Engineering 2005 April

In this article, we describe a novel application of XML and Web based technologies: a sociological study of the W3C standardization process. We propose a new methodology and tools, to be used by sociologists to study the standardization process, illustrated by the W3C XQuery Working Group. The novelty of our approach has many facets. Information Technology (IT) has received little attention from sociologists, yet the standardization of the Web is a crucial issue, both economical and political, based on the use of a semi-structured content warehouse. We introduce a modeling and querying approach of an XML content warehouse, and show it produces high added-value information. This information is used to conduct a preliminary sociological analysis of the XQuery standardization process.

XQuery Midflight: Emerging Database-Oriented Paradigms and a Classification of Research Advances

Yannis Papakonstantinou, Ioana Manolescu- IEEE Transactions on Knowledge and Data Engineering 2005 April:1143-1143

A Test Platform for the INEX Heterogeneous Track

Serge Abiteboul, Ioana Manolescu, Nicoleta Preda, Benjamin Nguyen- IEEE Transactions on Knowledge and Data Engineering 2004 December - 3493:358-371

This article presents our work within the INEX 2004 Heterogeneous Track. We focused on taming the structural diversity within the INEX heterogeneous bibliographic corpus. We demonstrate how semantic models and associated inference techniques can be used to solve the problems raised by the structural diversity within a given XML corpus. The first step automatically extracts a set of concepts from each class of INEX heterogeneous documents. An unified set of concepts is then computed, which synthesizes the interesting concepts from the whole corpus. Individual corpora are connected to the unified set of concepts via conceptual mappings. This approach is implemented as an application of the KadoP platform for peer-to-peer warehousing of XML documents. While this work caters to the structural aspects of XML information retrieval, the extensibility of the KadoP system makes it an interesting test platform in which components developed by several INEX participants could be plugged, exploiting the opportunities of peer-to-peer data and service distribution.

Active Context-Free Games

Luc Segoufin, Thomas Schwentick, Anca Muscholl- IEEE Transactions on Knowledge and Data Engineering 2004

An Active Context-Free Game is a game with two players (Romeo and Juliet) on strings over a finite alphabet. Its rules aredefined by a context-free grammar. In each move, Juliet selects a position of the current word and Romeo rewrites thecorresponding letter according to a rule of the grammar. Juliet wins if a string of the regular target language is reached. Weconsider the complexity of deciding and extracting winning strategies for Juliet depending on properties of the grammar, ofthe regular target language, and on restrictions on the strategy.

Active XML and active query answers

Serge Abiteboul, Tova Milo, Omar Benjelloun- IEEE Transactions on Knowledge and Data Engineering 2004 - 3055:17--27

XML and Web services are revolutioning the automatic management of distributed information, somewhat in the same way that HTML, Web browser and search engines modified human access to world wide information. We argue in this paper that the combination of XML and Web services allows for a novel query answering paradigm, where the exchanged information mixes materialized and intensional, active, information. We illustrate the flexibility of this approach by presenting Active XML that is based on embedding Web service calls in XML data. We discuss advantages of an approach of query answering based on intensional and active data.

Active XML, Security and Access Control

Serge Abiteboul, Tova Milo, Omar Benjelloun, Bogdan Cautis- IEEE Transactions on Knowledge and Data Engineering 2004

XML and Web services are revolutioning the automatic management of distributed information, somewhat in the same way that HTML, Web browsers and search engines modified human access to world wide information. We argue in this paper that the combination of XML and Web services allows for a novel distributed data management paradigm, where the exchanged information mixes materialized and intensional, active, information. We illustrate the flexibility of this approach by presenting Active XML, a language that is based on embedding Web service calls in XML data. We focus on two particular issues, namely security and access control.

Active XML: A data-centric perspective on Web services

Omar Benjelloun- IEEE Transactions on Knowledge and Data Engineering 2004 September

This thesis introduces Active XML (AXML, for short), a declarative framework that harnesses Web services for distributed data management, and is put to work in a peer-to-peer architecture. An AXML document is an XML document that may contain embedded calls to Web services, whose invocation enriches the document. An AXML service is a Web service that exchanges AXML documents. An AXML "peer" is a repository of AXML documents. On the one hand, it acts as a client, by invoking the service calls embedded in its documents. On the other hand, a peer acts as a server, by providing AXML services that can be declaratively specified as queries or updates over the AXML documents of its repository. The AXML approach allows for gracefully combining stored information with data defined in an intensional manner (as service calls). The fact that AXML peers can exchange a mix of materialized and intensional data (via AXML documents) leads to a very powerful distributed data management paradigm. The AXML approach leads to a number of important problems that are studied in the thesis. First, we address the issue of controlling the exchange of AXML data. We propose to use declarative schema specifications, and provide algorithms to statically enforce them. Second, we propose techniques for the "lazy evaluation" of queries on AXML documents, that detect which embedded service calls may contribute to query answers. An implementation of AXML peers compliant with W3C standards is also described in the thesis.

Active XML: A Data-Centric Perspective on Web Services

Serge Abiteboul, Tova Milo, Omar Benjelloun, Ioana Manolescu, Roger Weber- IEEE Transactions on Knowledge and Data Engineering 2004 (3-540-40676-X)

We propose in this chapter a peer-to-peer architecture that allows for the integration of distributed data and web services. It relies on a language, Active XML, where (1) documents embed calls to web services that are used to enrich them, and (2) new web services may be defined by XQuery queries on such active documents. Embedding calls to functions or even to web services inside data is not a new idea. Our contribution, however, is to turn them into a powerful tool for data and services integration. In particular, the language includes linguistic features to control the timing of service call activations. Various scenarios are captured, such as mediation, data warehousing, and distributed computation. A first prototype is also described.

Alpha Galois Lattices

Veronique Ventos, Henry Soldano, Thibaut Lamadon- IEEE Transactions on Knowledge and Data Engineering 2004 november:555-558

In many applications there is a need to represent a large number of data by clustering them in a hierarchy of classes described at an appropriate level of abstraction. Our basic representation is a Galois lattice, i.e. a lattice in which the terms of a representation language are partitioned into equivalence classes w.r.t. their extent (the extent of a term is the part of the instance set that satisfies the term). The advantage of such a structure is that it exhaustively represents the whole set of concepts that are distinguishable given the instance set and the representation language. However the size of the lattice can be very large when dealing with real-world problems. What we propose here is to reduce the size of the lattice, and thus to simplify our view of the data, still conserving its formal structure and exhaustivity. For that purpose we use a preliminary partition of the instance set, representing the association of a "type" to each instance. By redefining the notion of extent of a term in order to cope, to a certain degree (denoted as alpha), with this partition, we define a particular family of Galois lattices denoted as particular Galois lattices denoted as Alpha Galois lattices. We also discuss the related implication rules defined as inclusion of such alpha-extents.

An Electronic Patient Record on Steroids: Distributed, Peer-to-Peer, Secure and Privacy-conscious

Serge Abiteboul, Tova Milo, Omar Benjelloun, Irini Fundulaki, Bogdan Cautis, Bogdan Alexe, Arnaud Sahuguet- IEEE Transactions on Knowledge and Data Engineering 2004

Answering Queries using Views: a KRDB Perspective for the Semantic Web

Marie-Christine Rousset, Francois Goasdou- IEEE Transactions on Knowledge and Data Engineering 2004 - 4(3):255,288

In this paper, we investigate a first step towards the long-term vision of the Semantic Web by studying the problem of answering queries posed through a mediated ontology to multiple information sources whose content is described as views over the ontology relations. The contributions of this paper are twofold. We first offer a uniform logical setting which allows us to encompass and to relate the existing work on answering and rewriting queries using views. In particular, we make clearer the connection between the problem of rewriting queries using views and the problem of answering queries using extentions of views. Then we focus on an instance of the problem of rewriting conjunctive queries using views through an ontology expressed in a description logic, for which we exhibit a complete algorithm.

Building scalable mediator systems

Chantal Reynaud- IEEE Transactions on Knowledge and Data Engineering 2004

This paper deals with PICSEL mediator systems which integrate services. We propose a scalable approach which exploits standardized specifications provided by normalization organisms. The paper focuses on the use of such specifications to automate the construction of the mediator. An illustration in the tourism domain with OTA specifications is given.

Constructing and querying peer-to-peer warehouses of XML resources

Serge Abiteboul, Ioana Manolescu, Nicoleta Preda- IEEE Transactions on Knowledge and Data Engineering 2004

We present KadoP, a distributed infrastructure for warehousing XML resources in a peer-to-peer framework. KadoP allows users to build a shared, distributed repository of resources such as XML documents, semantic information about such documents, Web services, and collections of such items. KadoP builds on distributed hash tables as a peer communication layer, and ActiveXML as a model for constructing and querying the resources in the peer network. We describe KadoP's data model, query language, and query processing paradigm.

Construction d

Nicoleta Preda- IEEE Transactions on Knowledge and Data Engineering 2004 september

Nous presentons KadoP, une infrastructure distribuee pour entreposer des ressources XML dans un environement pair-a-pair. KadoP permet aux utilisateurs de construire un entreot partage et distribue de ressources telles que des documents XML, des informations semantiques sur ces documents, des services Web et des collections de tels objets. KadoP utilise des tables de hachage pour la communication entre les pairs et ActiveXML comme modele pour construire et interroger les ressources du reseau de pairs.Nous decrivons le modele de donnees de KadoP, le langage d'interrogation et le paradigme de traitement des requetes.

Declarative Specification of Web Applications exploiting Web Services and Workflows

Ioana Manolescu, Stefano Ceri, Marco Brambilla, Sara Comai, Piero Fraternali, Marco Dario- IEEE Transactions on Knowledge and Data Engineering 2004

Distributed information management with XML and Web services

Serge Abiteboul- IEEE Transactions on Knowledge and Data Engineering 2004

XML and Web services are revolutioning the automatic management of distributed information, somewhat in the same way HTML, Web browser and search engines modified human access to world wide information. To illustrate, we present Active XML that is based on embedding Web service calls inside XML documents. We mention two particular applications that take advantage of this new technology and novel research issues in this setting. This paper is based primarily on research at INRIA-Futurs in the Gemo Group, around XML and Web services (in particular, the Xyleme, Active XML and Spin projects).

Distributed Reasoning in a Peer-to-peer Setting

Marie-Christine Rousset, Francois Goasdou, Philippe Adjiman, Philippe Chatalic, Laurent Simon- IEEE Transactions on Knowledge and Data Engineering 2004

In a peer-to-peer inference system, each peer can reason locally but can also solicit some of its acquaintances, which are peers sharing part of its vocabulary. In this paper, we consider peer-to-peer inference systems in which the local theory of each peer is a set of propositional clauses defined upon a local vocabulary. An important characteristic of peer-to-peer inference systems is that the global theory (the union of all peer theories) is not known (as opposed to partition-based reasoning systems). The contribution of this paper is twofold. We provide the first consequence finding algorithm in a peer-to-peer setting: it is anytime and computes consequences gradually from the solicited peer to peers that are more and more distant. We exhibit a sufficient condition on the acquaintance graph of the peer-to-peer inference system for guaranteeing the completeness of this algorithm. We also present first experimental results that are promising.

DRYADE: a new approach for discovering closed frequent trees in heterogeneous tree databases.

Marie-Christine Rousset, Alexandre Termier, Mich Sebag- IEEE Transactions on Knowledge and Data Engineering 2004 :543--546

In this paper, we present a novel algorithm for discovering tree paterrns in a tree database. This algorithmuses a relaxed tree inclusion definition, making the problem complex (checking tree inclusion is NP-complete), butallowing to mine highly heterogeneous databases. To obtain good performances, our DRYADE algorithm discovers only closed frequent tree patterns.

Efficient Query Evaluation over Compressed XML Data

Ioana Manolescu, Angela Bonifati, Andrei Arion, Gianni Costa, Andrea Pugliese- IEEE Transactions on Knowledge and Data Engineering 2004 March

Extraction d'arbres fr

Alexandre Termier- IEEE Transactions on Knowledge and Data Engineering 2004 April

La d

Fifty-five solvers in vancouver: The sat 2004 competition.

D. Berre, L.- IEEE Transactions on Knowledge and Data Engineering 2004 to appear

For the third consecutive year, a SAT competition was organized as a joint event with the SAT conference. With 55 solvers from 25 author groups, the competition was a clear success. One of the noticeable facts from the 2004 competition is the superiority of incomplete solvers on satisfiable random k-SAT benchmarks. It can also be pointed out that the complete solvers awarded this year, namely Zchaff, jerusat, satzoo, kcnfs and march-eq, participated in the SAT 2003 competition (or at least former versions of those solvers). This paper is not reporting exhaustive competition results, already available in details online, but rather focuses on some remarkable results derived from the competition dataset. The SAT 2004 competition is ending a 3-year take-off period that attracted new SAT researchers and provided many new benchmarks and solvers to the community. The good participation rate of this year (despite the addition of the anti-blackbox rule) establishes the competition as an awaited yearly event. Some new directions for a new way of thinking about the competition are discussed at the end of the paper.

Filtering Web Documents for eDot, a food risk warehouse

Amar-Djalil Mezaour- IEEE Transactions on Knowledge and Data Engineering 2004 december:249--252

Highlighting latent structure in documents

Nathalie Pernelle, Marie-Christine Rousset, Alexandre Termier, Helka Folch, Benoit Habert, Michele Jardino- IEEE Transactions on Knowledge and Data Engineering 2004 - 4:1131,1334

Extensible Markup Language (XML) is playing an increasingly important role in the exchange of a wide variety of data on the Web and elsewhere. It is a simple, very flexible text format, used to annotate data by means of markup. XML documents can be checked for syntactic well-formedness and semantic coherence through DTD and schema validation which makes their processing easier. In particular, data with nested structure can be easily represented with embedded tags. This structured representation should be used in information retrieval models which take structure into account. As such, it is meta-data and therefore a contribution to the Semantic Web. However, nowadays, there exists huge quantities of raw texts and the issue is how to find an easy way to provide these texts with sensible XML structure. Here we present an automatic method to extract tree structure from raw texts. This work has been supported by the Paris XI University (BQR2002 project, Paris-XI University).

Int

Marie-Christine Rousset, Francois Goasdou- IEEE Transactions on Knowledge and Data Engineering 2004

L'int

Introduction au Web semantique

Chantal Reynaud, Jean Charlet, Philippe Laublet- IEEE Transactions on Knowledge and Data Engineering 2004 - 4(2):07-20

Ce document a pour objectif de donner une vue synthetique des principaux aspects du Web semantique. Il est produit par les membres du groupe de travail et de reflexion mis en place dans le cadre de l'action specifique "Web semantique" (decembre 2001). Les premieres discussions, dans ce groupe, ont permis de choisir un certain nombre de themes apparus essentiels a discuter et a analyser afin de mieux comprendre les projets, les realites et les perspectives ouvertes par le projet du Web semantique. Les chapitres suivants de ce numero special presentent synthetiquement chacun des themes : - les langages du Web semantique - Meta-donnees et annotations dans le Web semantique - Les ontologies pour le Web semantique - L'integration de sources de donnees - Adaptation et personnalisation dans le Web semantique - Les Web services semantiques - Applications du Web semantique.

L'integration de sources de donnees

Chantal Reynaud, Mohand-Sa Hacid- IEEE Transactions on Knowledge and Data Engineering 2004 - 4(2)

La diversite des sources d'information distribuees et leur heterogeneite est une des principales difficultes rencontrees par les utilisateurs du Web aujourd'hui. L'infrastructure du Web semantique doit permettre leur integration donnant ainsi l'impression a l'utilisateur qu'il utilise un systeme homogene. Les solutions a l'integration d'information proposees dans le cadre du Web semantique tireront parti des recherches concernant les approches mediateurs et les entrepots de donnees. Les premieres realisations sont en cours. Un des premiers verrous scientifiques a lever concerne le passage a l'echelle du Web. Parmi les travaux futurs, dont le developpement doit etre favorise, figurent la mise en oeuvre de systemes de mediation decentralises l'etude des problemes lies a l'integration de donnees multimedias, l'integration temps reel et egalement la prise en compte de la complexite croissante des donnees a integrer, signe d'une evolution vers une integration de connaissances.

Lazy Query Evaluation for Active XML

Serge Abiteboul, Tova Milo, Omar Benjelloun, Ioana Manolescu, Nicoleta Preda, Bogdan Cautis- IEEE Transactions on Knowledge and Data Engineering 2004 June

In this paper, we study query evaluation on Active XML documents (AXML for short), a new generation of XML documents that has recently gained popularity. AXML documents are XML documents whose content is given partly extensionally, by explicit data elements, and partly intensionally, by embedded calls to Web services, which can be invoked to generate data. A major challenge in the efficient evaluation of queries over such documents is to detect which calls may bring data that is relevant for the query execution, and to avoid the materialization of irrelevant information. The problem is intricate, as service calls may be embedded anywhere in the document, and service invocations possibly return data containing calls to new services. Hence, the detection of relevant calls becomes a continuous process. Also, a good analysis must take the service signatures into consideration. We formalize the problem, and provide algorithms to solve it. We also present an implementation that is compliant with XML and Web services standards, and is used as part of the AXML system.Finally, we experimentally measure the performance gains obtained by a careful filtering of the service calls to be triggered.

OntoRefiner, a user query refinement interface usable for Semantic Web Portals

Chantal Reynaud, Brigitte Safar, Hassen Kefi- IEEE Transactions on Knowledge and Data Engineering 2004

We present a user interface, the OntoRefiner system, for helping the user to navigate numerous retrieved documents after a search querying a semantic portal which integrates a very important number of documents. Retrieved answers are filtered and the user could be provided only with the answers which are, according to im, the most relevant. The refinement process is based on two technologies, dynamic clustering close to Galois lattice structure combined to the use of a domain ontology. The Galois lattice structure provides a sound basis for the query refinement process. However, its construction as a whole is a very costly process. So, we propose an approach based on the use of a domain ontology, avoiding the cosntruction of the whole Galois lattice. In the paper, we present the algorithm and experimental results.

Order independent temporal properties

Luc Segoufin, Sandra de Amo, Nicole Bidoit- IEEE Transactions on Knowledge and Data Engineering 2004 - 14(2):277--298

The paper investigates temporal properties that are invariant with respect to the temporal ordering and that are expressible by temporal query languages either explicit like FO(less) or implicit like TL. In the case of an explicit time representation, these ``order invariant'' temporal properties are simply those expressible in the language FO(=). In the case of an implicit time representation, we introduce a new language, TL(E^i) that captures exactly these properties. The expressive power of the language TL(E^i) is characterized via a game a la Ehrenfeucht-Fraisse. This provides another proof, using a more classical technique, that the implicit temporal language TL is strictly less expressive than the explicit temporal language FO(less). This alternative proof is interesting by itself and opens new perspectives in the investigation of results of the same kind for more expressive implicit temporal languages than TL.

Path Sequence Based XML Query Processing

Ioana Manolescu, Angela Bonifati, Andrei Arion, Andrea Pugliese- IEEE Transactions on Knowledge and Data Engineering 2004 october

We present the path sequence storage model, a new logical model for storing XML documents. This model partitions XML data and content according to the document paths; and uses ordered sequences as logical and physical structures. Its main advantages are: fast and selective data access plans, and intelligent combination of such plans, based on a summary of the document paths. We validate these benefits through extensive experiments.

Peer-to-peer warehousing of XML resources

Serge Abiteboul, Ioana Manolescu, Nicoleta Preda- IEEE Transactions on Knowledge and Data Engineering 2004

We present KadoP, a distributed infrastructure for warehousing XML resources in a peer-to-peer framework. KadoP allows users to build a shared, distributed repository of resources such as XML documents, semantic information about such documents, Web services, and collections of such items. KadoP2P builds on distributed hash tables as a peer communication layer, and ActiveXML as a model for constructing and querying the resources in the peer network. We describe KadoP's data model, query language, and query processing paradigm.

Positive Active XML

Serge Abiteboul, Tova Milo, Omar Benjelloun- IEEE Transactions on Knowledge and Data Engineering 2004 June

The increasing popularity of XML and Web services have given rise to a new generation of documents, called Active XML documents (AXML), where some of the data is given explicitly while other parts are given intensionally, by means of embedded calls to Web services. Web services in this context can exchange intensional information, using AXML documents as parameters and results. The goal of this paper is to provide a formal foundation for this new generation of AXML documents and services, and to study fundamental issues they raise. We focus on Web services that are (1) monotone and (2) defined declaratively as conjunctive queries over AXML documents. We study the semantics of documents and queries, the confluence of computations, termination and lazy query evaluation.

Raisonnement distribu

Marie-Christine Rousset, Francois Goasdou, Philippe Adjiman, Philippe Chatalic, Laurent Simon- IEEE Transactions on Knowledge and Data Engineering 2004

Dans un syst

Recherche cibl

Amar-Djalil Mezaour- IEEE Transactions on Knowledge and Data Engineering 2004 January - 2:491-502

Les langages de requtes mots-cls pour le web manquent souvent de prcision lorsqu'il s'agit de rechercher des documents particuliers difficilement caractrisables par de simples mots-cls (exemple : des cours java ou des photos de formule 1). Nous proposons un langage multi-critres de type attribut-valeur pour augmenter la prcision de la recherche de documents sur le web. Nous avons exprimentalement montr le gain de prcision de la recherche de documents bas sur ce langage.

Reminiscences of Influential Papers

Bertram Ludascher and Zack Ives and Ioana Manolescu- IEEE Transactions on Knowledge and Data Engineering 2004 September - 33(3)

Report on the first XIME-P workshop

Yannis Papakonstantinou, Ioana Manolescu- IEEE Transactions on Knowledge and Data Engineering 2004 September - 33

Small Can Be Beautiful in the Semantic Web

Marie-Christine Rousset- IEEE Transactions on Knowledge and Data Engineering 2004 - 3298:6--16

In 1984, Peter Patel-Schneider published a paper entitled "Small can be Beautiful in Knowledge Representation" in which he advocated for limiting the expressive power of knowledge representation formalisms in order to guarantee good computational properties and thus make knowledge-based systems usable as part of larger systems. In this paper, I aim at showing that the same argument holds for the Semantic Web: if we want to give a chance for the Semantic Web to scale up and to be broadly used, we have to limit the expressive power of the ontologies serving as semantic marking up of resources. In addition, due to the scale of the Web and the disparity of its users, it is unavoidable to have to deal with distributed heterogeneous ontologies. In this paper, I will argue that a peer-to-peer infrastructure enriched with simple distributed ontologies (e.g., taxonomies) reconciled through mappings is appropriate and scalable for supporting the future Semantic Web.

Sous-ensembles flous d

Rallou Thomopoulos, Patrice Buche, Ollivier Haemmerl- IEEE Transactions on Knowledge and Data Engineering 2004 Janvier:147--158

Les sous-ensembles flous peuvent

The second qbf solvers comparative evaluation.

D., L., M., A. - IEEE Transactions on Knowledge and Data Engineering 2004 to appear

Thesus, a closer view on Web Content management enhanced with link semantics

Maria Halkidi, Iraklis Varlamis, Michalis Vazirgiannis- IEEE Transactions on Knowledge and Data Engineering 2004 June - 16(6):685--700

With the unstoppable growth of the WWW, the great success of Web Search Engines, such as Google and Alta-vista, users now turn to the Web whenever looking for information. However, many users are neophytes when it comes to computer science, yet they are often specialists of a certain domain. These users would like to add more semantics to guide their search through WWW material, whereas currently most search features are based on raw lexical content. We show in this article how the use of the incoming links of a page can be used efficiently to classify a page in a concise manner. This enhances the browsing and querying of web pages. In this article, we focus on the tools needed in order to manage the links, and their semantics. We further process these links using a hierarchy of concepts, akin to an ontology, and a thesaurus. This work is demonstrated by an prototype system, called THESUS, that organizes thematic web documents into semantic clusters. Our contributions in this article are the following: 1- a model and language to exploit link semantics information, 2- the THESUS prototype system, 3- its innovative aspects and algorithms, more specifically the novel similarity measure, between web documents applied to different clustering schemes (DB-Scan and COBWEB), 4- a thorough experimental evaluation proving the value of our approach.

Towards cost-based optimization for data-intensive Web service computations

Ioana Manolescu, Nicolaas Ruberg, Gabriela Ruberg- IEEE Transactions on Knowledge and Data Engineering 2004 October

Towards Efficient Exchange of Active XML Data

Tova Milo, Omar Benjelloun, Shobhit Raj Mathur- IEEE Transactions on Knowledge and Data Engineering 2004 Dec

We study the problem of exchanging Active XML documents (AXML for short), an increasingly popular breed of XML documents, where some parts of the data are only given {em intensionally}, as calls to Web services. As with standard XML, the parties exchanging AXML data can decide on a {em data exchange schema}, to constrain the structure of exchanged documents. For AXML data, the schema also specifies which parts of the data should be explicitly present in the document and which can be intensional. In order to make sure the data conforms to the exchange schema, the sender may need to {em materialize} some of its intensional parts, by invoking the relevant service calls. The challenge is to perform the document analysis, and the consequent service invocations efficiently, to avoid a bottleneck in data exchange. We formalize the problem and provide efficient algorithms to solve it. We also present an implementation, and show that our solution is by far superior to previously proposed algorithms.

Towards flexible querying of XML imprecise data in a dataware house opened on the Web

Patrice Buche, Ollivier Haemmerl, Juliette Dibie-Barth, Mounir Houhou- IEEE Transactions on Knowledge and Data Engineering 2004 june

This paper describes a new subsystem of the Sym'Previus knowledge base. This knowledge base contains information useful to help experts in the field of predictive microbiology. Information has several specific properties: it is incomplete, imprecise and heterogeneous. In the pre-existing Sym'Previus knowledge base, stable data are stored in a relational database and data which do not fit the relational structure are stored in a conceptual graph knowledge base. The MIEL language permits to scan simultaneously both bases in a transparent way for the user, using fuzzy queries. The new subsystem described in the paper contains information found on the Web to complete the knowledge base. This information is stored in XML format. Firstly, we extend the XML model of the knowledge base to represent imprecise data as possibility distributions. Secondly, we present the mapping process used to translate a MIEL query into an XML query to scan the XML knowledge base.

Treillis de Galois Alpha

Veronique Ventos, Henry Soldano, Thibaut Lamadon- IEEE Transactions on Knowledge and Data Engineering 2004 June:175-190

Dans beaucoup d'applications il est utile de repr

ULoad: Choosing the Right Storage for your XML Application

Veronique Benzaken, Ioana Manolescu, Andrei Arion- IEEE Transactions on Knowledge and Data Engineering 2004 November

Validation de graphes conceptuels

Ollivier Haemmerl, Juliette Dibie-Barth, Eric Salvat- IEEE Transactions on Knowledge and Data Engineering 2004 Janvier:135--146

Les travaux men

Vers un optimiseur g

Andrei Arion- IEEE Transactions on Knowledge and Data Engineering 2004

L'objectif de ce stage est d'etudier l'optimisation des requetes independamment du stockage et des methodes d'indexation proposees par tel ou tel systeme XML. Ce travail propose donc d'identifier des structures persistantes de stockage XML pertinentes pour repondre a des requetes XQuery.Nous cherchons a explorer les problemes specifiques qui apparaissent et de degager les limites en terme de performances que la genericite impose.

XML Query Processing: Storage and Query Model Interplay

Ioana Manolescu- IEEE Transactions on Knowledge and Data Engineering 2004 September

Active XML Primer

The Active XML Team- IEEE Transactions on Knowledge and Data Engineering 2003 July

This paper is a general overview of the Active XML language and system.

Adaptive On-Line Page Importance Computation

Serge Abiteboul, Mihai Preda, Gr Cobena- IEEE Transactions on Knowledge and Data Engineering 2003 May

The computation of page importance in a huge dynamic graph has recently attracted a lot of attention because of the web. Page importance, or page rank is defined as the fixpoint of a matrix equation. Previous algorithms compute it off-line and require the use of a lot of extra CPU as well as disk resources (e.g. to store, maintain and read the link matrix). We introduce a new algorithm OPIC that works on-line, and uses much less resources. In particular, it does not require storing the link matrix. It is on-line in that it continuously refines its estimate of page importance while the web/graph is visited. Thus it can be used to focus crawling to the most interesting pages. We prove the correctness of OPIC. We present Adaptive OPIC that also works on-line but adapts dynamically to changes of the web. A variant of this algorithm is now used by Xyleme. We report on experiments with synthetic data. In particular, we study the convergence and adaptiveness of the algorithms for various scheduling strategies for the pages to visit. We also report on experiments based on crawls of significant portions of the web.

An application of the mediator approach to services over the Web

Chantal Reynaud, Gloria Giraldo- IEEE Transactions on Knowledge and Data Engineering 2003 Juillet - 1

This document is about mediation systems integrating services over the Web? First, we give an overview of the mediator approach. In section 3 we focus on integration of services over the Web. We explain which problems arise and we give principles to make the mediator approach applicable in this context. The first principle is a solution to semantic heterogeneity. It exploits results in standardization relative to business transactions. The second principle concerns the scalability of the mediator approach. It emphasizes two points : de-coupling of the various components and strong mediation. In the last section, we present an architecture of a mediation system integrating services, designed in the setting of the PICSEL 2 project. We illustrate with an application of e-commerce in the tourism domain.

Apport d'une ontologie du domaine pour affiner une requ

Brigitte Safar, Hassen Kefi- IEEE Transactions on Knowledge and Data Engineering 2003

In this paper we study how to help used to refine his query when the search for documents in ressources has produced too many answers. Using a domain ontology and a set of ressources indexed with terms of this ontology, we show how we can use the immediate readability and intelligibility of the Galois LAttice clustering structure without paying the cost of its whole building. We also show how to build a refined query in interaction with the user.

Automatic construction and refinement of a class hierarchy over semi-structured data

Nathalie Pernelle, Veronique Ventos, Henri Soldano- IEEE Transactions on Knowledge and Data Engineering 2003

Alpha-Galois lattices for Conceptual Clustering Authors : Nathalie Pernelle, Henri Soldano, Veronique Ventos Universit

Building a Constraint-Based Spatial Database System: Model, Languages, and Implementation

Michel Scholl, Stephane Grumbach, Philippe Rigaux, Luc Segoufin- IEEE Transactions on Knowledge and Data Engineering 2003 - 28(6):563,595

This paper presents dedale, a spatial database system which providesan abstract and non-specialized data model and query language for representating and manipulating geometric data in arbitrarydimension. dedale relies on a logical model based on linearconstraints. The main features of the constraint model are (1) a uniformrepresentation of all sorts of data, including geometric,spatio-temporal or elevation data, (2) an algebraic query language whose formalfoundations constitute a basis for practical query optimization. We show the practical relevance of the approach by describing animplementation which builds on standard technology for datastorage, database indexing and on the parsing and optimization ofSQL.dedale validates the linear constraint model over various applications,proposes a user query language based on SQL which allows to query thedatabase in a purely declarative way, and gives some first steps towards queryoptimization. We believe that this experience is a fruitful step towardsound and consistent database models which hide the complexity ofarbitrary geometric data, while keeping manipulation languages intuitive and efficient.

Challenges in the qbf arena: the sat'03 evaluation of qbf solvers.

D. Berre, L., A.- IEEE Transactions on Knowledge and Data Engineering 2003 June - 2919:468--485

Construction de Classes de Documents Web

Benjamin Nguyen, Maria Halkidi, Iraklis Varlamis, Michalis Vazirgiannis- IEEE Transactions on Knowledge and Data Engineering 2003

Dans cet article, nous proposons une nouvelle mesure de similarit\'e, qui s'accompagne de l'utilisation d'algorithmes de clustering utilis\'es dans le domaine des bases de donn\'ees spatiales, afin de permettre le regroupement (clustering) de documents qui seraient caract\'eris\'es par un ensemble de mots dans une hi\'erarchie (ontologie). Les documents web correspondent \`a ce cas de figure. Le probl\`eme principal que nous abordons dans cet article est la d\'efinition d'une mesure de similarit\'e qui poss\`ede une s\'emantique ad\'equate, et qui se calcule assez facilement en terme de complexit\'e. L'algorithme de 'clustering' que nous utilisons est DB-SCAN, que nous avons modifi\'e un tout petit peu afin de construire une hi\'erarchie d'ensembles (clusters). Le syst\`eme global, THESUS, qui utilise les r\'esultats de cet article, est impl\'ement\'e, et nous donnons \'egalement des r\'esultats d'exp\'eriences pour illustrer la pertinence de notre approche.

Construction et maintenance d'un entrep

Benjamin Nguyen- IEEE Transactions on Knowledge and Data Engineering 2003

DBGlobe: a service-oriented P2P system for global computing

Serge Abiteboul, Michalis Vazirgiannis, Evaggelia Pitoura, Dieter Pfoser, George Samaras- IEEE Transactions on Knowledge and Data Engineering 2003

The challenge of peer-to-peer computing goes beyond simple file sharing. In the DBGlobe project, we view the multitude of peers carrying data and services as a superdatabase. Our goal is to develop a data management system for modeling, indexing and querying data hosted by such massively distributed, autonomous and possibly mobile peers. We employ a service-oriented approach, in that data are encapsulated in services. Direct querying of data is also supported by an XML-based query language. In this paper, we present our research results along the following topics: (a) infrastructure support, including mobile peers and the creation of context-dependent communities, (b) metadata management for services and peers, including locationdependent data, (c) filters for efficiently routing path queries on hierarchical data, and (d) querying using the AXML language that incorporates service calls inside XML documents

Definable Relations and First-Order Query Languages over Strings

Luc Segoufin, Michael Benedikt, Leonid Libkin, Thomas Schwentick- IEEE Transactions on Knowledge and Data Engineering 2003 - 50:694-751

We study analogs of classical relational calculus in the context of strings.We start by studying string logics. Taking a classical model-theoretic approach,we fix a set of string operations and look at the resulting collection of definablerelations. These form an algebra- a class of n-ary relations for every n, closed under projection andBoolean operations. We show that by choosing the string vocabulary carefully,we get string logics thathave desirable properties: computable evaluation and normal forms.We identify five distinct models and study the differences in their model-theoryand complexity of evaluation. We identifya subset of these models which have additional attractive properties, such as finite VC dimensionand quantifier elimination.Once you have a logic, the addition of free predicate symbolsgives you a string query language. The resulting languages have attractive closureproperties from a database point of view: while SQL does not allow thefull composition of string pattern-matching expressions with relationaloperators, theses logics yield compositional query languagesthat can capture common string-matching queries while remaining tractable.For each of the logics studied in the first part of the paper, we studyproperties of the corresponding query languages. We give bounds on thedata complexity of queries, extend the normal form results fromlogics to queries, and show that the languages have correspondingalgebras expressing safe queries.

Different Kinds of Comparisons between Fuzzy Conceptual Graphs

Rallou Thomopoulos, Patrice Buche, Ollivier Haemmerl- IEEE Transactions on Knowledge and Data Engineering 2003 July - 2746:54--68

In the context of a microbiological application, our study proposes to extend the Conceptual Graph Model in order to allow one: (i) to represent imprecise data and queries that include preferences, by using fuzzy sets (from fuzzy set theory) in concept vertices, in order to describe either an imprecise concept type or an imprecise referent; (ii) to query a conceptual graph that may include imprecise data (factual graph) using a conceptual graph that may include preferences (query graph). This is performed in two steps: firstly by extending the projection operation to fuzzy concepts, secondly by defining a comparison operation characterised by two matching degrees: the possibility degree of matching and the necessity degree of matching between two graphs, and particularly between a query graph and a factual graph.

Domain Ontology and Galois Lattice Structure for Query Refinement

Brigitte Safar, Hassen Kefi- IEEE Transactions on Knowledge and Data Engineering 2003 November:597--601

We present a system for interactively querying a topical ressources repository and for navigating the set of responses obtained from the repository. Our system makes use of a domain ontology for dynamically constructing a Gallois lattice. Nodes in this structure are resources clusters which are interactively examined by the used during the construction process.

Dynamic XML Documents with Distribution and Replication

Serge Abiteboul, Tova Milo, Gr Cobena, Ioana Manolescu, Angela Bonifati- IEEE Transactions on Knowledge and Data Engineering 2003

The advent of XML as a universal exchange format and of Web services as a basis for distributed computing, has fostered the apparition of a new class of documents: {\em dynamic XML documents}. These are XML documents where some of the data is given explicitly while other parts are given only intentionally by means of embedded calls to web services, that can be called to generate the required information. By the sole presence of Web services, dynamic documents already include inherently some form of distributed computation. A higher level of distribution that also allows (fragments of) dynamic documents to be distributed and/or replicated over several sites is highly desirable in today's Web architecture, and in fact is also relevant for regular (non dynamic) documents. The goal of this paper is to study new issues raised by the distribution and replication of dynamic XML data. Starting from a data model and a query language, we describe a complete framework for distributed and replicated dynamic XML documents. We provide a comprehensive cost model for query evaluation and show how it applies to user queries and service calls. Finally, we describe an algorithm that, for a given peer, chooses data and services that the peer should replicate to improve the efficiency of maintaining and querying its dynamic documents.

Exchanging Intensional XML Data

Serge Abiteboul, Bernd Amann, Tova Milo, Omar Benjelloun, Fred Dang Ngoc- IEEE Transactions on Knowledge and Data Engineering 2003 June:289--300

XML is becoming the universal format for data exchange between applications. Recently, the emergence of Web services as standard means of publishing and accessing data on the Web introduced a new class of XML documents, which we call {em intensional} documents. These are XML documents where some of the data is given explicitly while other parts are defined only intensionally by means of embedded calls to Web services. When such documents are exchanged between applications, one has the choice to materialize the intensional data (i.e. to invoke the embedded calls) or not, before the document is sent. This choice may be influenced by various parameters, such as performance and security considerations. This paper addresses the problem of guiding this materialization process. We argue that, just like for regular XML data, schemas (ala DTD and XML Schema) may be used to control the exchange of intensional data and, in particular, to determine which data should be materialized before sending a document, and which should not. We formalize the problem and provide algorithms to solve it. We also present an implementation that complies with real life standards for XML data, schemas, and Web services.

Exploring the combined potential of Web sites and Web services

Ioana Manolescu, Stefano Ceri, Marco Brambilla, Pierro Fratemali, Sara Comai- IEEE Transactions on Knowledge and Data Engineering 2003

Recent web service standards, based on WSDL have been adopted as a process-to-process communication paradigm, enabling large-scale, inter-organization application interaction. In this paper, we extend a declarative model and language for specifying data-intensive Web applications, in order to model complex interactions between the application (driven by the user) and remote processes (represented as services). Thus, we demonstrate how a declarative web site specification method can be effectively turned into a tool for web service usage and composition, particularly suited for interactions between human users and web services.

Focused Search on the Web using WeQueL

Amar-Djalil Mezaour- IEEE Transactions on Knowledge and Data Engineering 2003 September:63--74

Keyword-based web query languages suffer from a lack of precision when searching for a precise kind of documents. Indeed, some documents cannot be simply characterized by a list of keywords. For example, searching for on-line pictures dealing with formula one using only simple keywords with general-purpose search-engines gives imprecise answers. This imprecision is due to the method that considers that a relevant document to a query is one that contains a lot of query keywords occurrences. This method is totally unefficient for poor textual-content documents like pictures, video streams... On the other hand, keyword based languages are often not powerful enough for expressing sophisticated document search like on-line java or c++ tutorials. We propose "WeQueL" a multi-criteria query langage for a better characterization of documents. The aim is to increase the precision of document retrieval on the web. In our experiments, we show the gain in accuracy for web document searching using our language.

Integration of heterogeneous, imprecise and incomplete data: an application to the microbiological..

Rallou Thomopoulos, Patrice Buche, Ollivier Haemmerl- IEEE Transactions on Knowledge and Data Engineering 2003 October - 2871:98--107

This paper presents an information system developed to help the assessment of the microbiological risk in food products. UQS (Unified Querying System) is composed of two distinct bases (a relational database and a conceptual graph knowledge base) which are integrated by means of a uniform querying language. The specificity of the system is that both bases include fuzzy data. Moreover, UQS allows the expression of preferences into the queries, by means of the fuzzy set theory.

Knowledge Representation for Information Integration

Marie-Christine Rousset, Chantal Reynaud- IEEE Transactions on Knowledge and Data Engineering 2003 - 29:03,22

An information integration system provides a uniform query interface to a collectionof distributed and heterogeneous information sources, giving users or other agents theillusion that they interrogate a centralized andhomogeneous information system.In this paper, we focus on the use of knowledge representation techniques for building mediators for information integration. A mediator is based on the specification of a single mediated schema describing a domain of interest, and on a set of source descriptions expressing how the content of each source available to the system is related to the domain of interest.These source descriptions, also called mappings because they model the correspondence between the mediated schema and the schemas of the data sources, play a central role in the query answering process. We present two recent information integration systems, namely Picsel and Xyleme, which are illustrative of two radically different choices concerning the expressivity of the mediated schema.

Logical Interpretation of Fuzzy Conceptual Graphs

Rallou Thomopoulos, Patrick Bosc, Patrice Buche, Ollivier Haemmerl- IEEE Transactions on Knowledge and Data Engineering 2003 July:173--178

In previous studies, we have extended the conceptual graph model, which is a knowledge representation model belonging to the family of semantic networks, to be able to represent fuzzy values. The basic conceptual graph model has a logical interpretation in first-order logic. In this paper, we focus on the logical interpretation of the conceptual graph model extended to fuzzy values: we use logical implications stemming from fuzzy logic, so as to extend the logical interpretation of the model to fuzzy values and to comparisons between fuzzy conceptual graphs.

Managing an XML Warehouse in a P2P Context

Serge Abiteboul- IEEE Transactions on Knowledge and Data Engineering 2003 - 2681:4--13

The paper is about warehousing XML content in a P2P environment. The role of an XML warehouse is to offer a centralizer entry point to access a variety of information in an enterprise and provide functionalities to acquire, enrich, monitor and maintain this information. We consider a number of reasons to use a P2P approach to such warehouses, e.g., reducing the cost or sharing the management of information. We mention research issues that are raised.

Managing Distributed Workspaces with Active XML

Serge Abiteboul, Tova Milo, Gr Cobena, Ioana Manolescu, Angela Bonifati, J Baumgarten, Cosmin Cremarenco, Florin Dragan, Nicoleta Preda- IEEE Transactions on Knowledge and Data Engineering 2003

Mediation de services sur le Web

Chantal Reynaud, Gloria Giraldo- IEEE Transactions on Knowledge and Data Engineering 2003 :59--68

This document is about mediation systems integrating services over the Web to make them usable to final users. As services on the Web are numerous and heterogeneous, as they must sometimes be combined to satisfy final users requirements, interfaces giving the illusion of a convivial unique centralized and homogeneous access are necessary. In this paper we propose to use a mediator approach to design such an interface.First, we give an overview of the mediator approach. Then Section 3 is about problems specific to mediation systems integrating services on the Web. We explain which problems arise and we give principles to make the mediator approach applicable in this context. The first principle is a solution to semantic heterogeneity. It exploits results in standardization relative to business transactions. The second principle concerns the scalability of the mediator approach. It emphasizes two points : de-coupling of the various components and strong mediation. In the last section, we present an architecture of a mediation system integrating services, designed in the setting of the PICSEL 2 project. We illustrate with an application of e-commerce in the tourism domain.

Model, Design and Construction of a Service-Oriented Web-Warehouse

Serge Abiteboul, Gr Cobena, Benjamin Nguyen, Antonella Poggi- IEEE Transactions on Knowledge and Data Engineering 2003

We propose a new methodology, a language and tools for the design and construction of Web data warehouses. Our approach is Service Oriented, in that our framework makes an extensive use of Web Services and semi-structured data (XML) to define the data structures, the services and the connections between them. We present an experimental version of the system that has been built using this framework. It uses external Web Services such as Google

Organization of Web Document Collections Based on Link Semantics

Gr Cobena, Benjamin Nguyen, Michalis Vazirgiannis- IEEE Transactions on Knowledge and Data Engineering 2003

Poster at ECDL 2003

Picsel and Xyleme: two illustrative information integration agents

Marie-Christine Rousset, Chantal Reynaud- IEEE Transactions on Knowledge and Data Engineering 2003

An information integration agent provides a uniform query interface to a collection of distributed and heterogeneous information sources, giving users or other agents the illusion that they interrogate a centralized and homogeneous information system. In this chapter we focus on integration information agents that follow a mediator approach. A mediator is based on the specification of a single mediated schema describing a domain of interest, and on a set of source descriptions expressing how the content of each source available to the system is related to the domain of interest. These source descriptions, also called mappings because they model the correspondence between the mediated schema and the schemas of the data sources, play a central role in the query answering process. We present two recent information integration agents, namely Picsel and Xyleme, which are illustrative of two radically different choices concerning the expressivity of the mediated schema.

Querying Distributed Data through Distributed Ontologies: a Simple but Scalable Approach

Marie-Christine Rousset, Francois Goasdou- IEEE Transactions on Knowledge and Data Engineering 2003 - 18(5):60,65

In this paper, we define a simple but scalable framework for peer-to-peer data sharing systems, in which the problem of answering queries over a network of semantically related peers is always decidable. Our approach is characterized by a simple class-based language for defining peer schemas as hierarchies of atomic classes, and mappings as inclusions of logical combinations of atimic classes. We provide an anytime and incremental method for computing all the certain answers to a query posed to a given peer such that the answers are orderes from the ones involving peers close to the queried peer to the ones involving more distant peers.

Reachability and Connectivity Queries in Constraint Databases

Luc Segoufin, Michael Benedikt, Martin Grohe, Leonid Libkin- IEEE Transactions on Knowledge and Data Engineering 2003 - 66(1):169-206

It is known that standard query languages for constraint databaseslack the power to express connectivity properties. Such properties areimportant in the context of geographical databases, where onenaturally wishes to ask queries about connectivity (what are theconnected components of a given set?) or reachability (is there a pathfrom A to B that lies entirely in a given region?). No existingconstraint query languages that allow closed form evaluation canexpress these properties.In the first part of the paper, we show that in principle there is noobstacle to getting closed languages that can express connectivity andreachability queries. In fact, we show that adding any topologicalproperty to standard languages like FO LIN and FO POLY results in aclosed language. In the second part of the paper, we look fortractable closed languages for expressing reachability andconnectivity queries. We introduce path logic, which allows one tostate properties of paths with respect to given regions. We show thatit is closed, has polynomial time data complexity for linear andpolynomial constraints, and can express a large number of reachabilityproperties beyond simple connectivity. Query evaluation in the logicinvolves obtaining a discrete abstraction of a continuous path, andmodel-checking of temporal formulae on the discrete structure.

Representation of weakly structured imprecise data for fuzzy querying

Rallou Thomopoulos, Patrice Buche, Ollivier Haemmerl- IEEE Transactions on Knowledge and Data Engineering 2003 October - 140(1):111--128

Schema-driven Customization of Web Services

Serge Abiteboul, Tova Milo, Omar Benjelloun, Bernd Aman, J Baumgarten, Frederic Dang Ngoc- IEEE Transactions on Knowledge and Data Engineering 2003 September:1093--1096

Semantic Integration in Xyleme: a Uniform Tree-Based Approach

Claude Delobel, Dan Vodislav, Marie-Christine Rousset, Chantal Reynaud, Jean-Pierre Sirot- IEEE Transactions on Knowledge and Data Engineering 2003 March - 44(3):267,298

Xyleme is a huge warehouse integrating XML data of the web. Xyleme considers a simple data model with data trees and tree types for describing the data sources, and a simple query language based on tree queries with boolean conditions. The main components of the data model are a mediated schema modeled by an abstract tree type, as a view of a set of tree types associated with actual data trees, called concrete tree types, and a mapping expressing the connection between the mediated schema and the concrete tree types. The first contribution of this paper is formal: we provide a declarative model-theoretic semantics for Xyleme tree queries, a way of checking tree query containment, and a characterization of tree queries as a composition of branch queries. The other contributions are algorithmic and handle the potentially huge size of the mapping relation which is a crucial issue for semantic integration and query evaluation in Xyleme. First, we propose a method for pre-evaluating queries at comile time by storing some specific meta-information about the mapping into map translation tables. These map translation tables summarize the set of all the branch queries that can be generated from the mediated schema and the set of all the mappings. Then, we propose different operators and strategies for relaxing queries which, having an empty map translation table, will have no answer if they are avaluated against the data. Finally, we present a method for semi-automatically generating the mapping relation.

Specification and design of workflow-driven hypertexts

Ioana Manolescu, Stefano Ceri, Marco Brambilla, Sara Comai, Pierro Fraternali- IEEE Transactions on Knowledge and Data Engineering 2003

While the Web consolidates as the ubiquitous application delivery platform, the features of Web applications evolve to cover new requirements, like the capability of managing complex workflows spanning multiple users and organizations. This scenario challenges the Web engineering methods to address a broader class of applications. This paper introduces Workflow Driven Hypertexts, defined as Web-enabled hypertextual applications serving the workflow of multiple users, and proposes a design method integrating process, data, and hypertextual modeling, expressly conceived to support the development of this class of applications.

Specification and Design of Workflow-Driven Hypertexts

Ioana Manolescu, Stefano Ceri, Marco Brambilla, Sara Comai, Piero Fraternali- IEEE Transactions on Knowledge and Data Engineering 2003

While the Web consolidates as the ubiquitous application delivery platform, the features of Webapplications evolve to cover new requirements, like the capability of managing complex workflowsspanning multiple users and organizations. This scenario challenges the Web engineering methods toaddress a broader class of applications. This paper introduces workflow-driven hypertexts, defined asWeb-enabled hypertextual applications serving the workflow of multiple users, and proposes a designmethod integrating data, hypertext, and workflow modeling concepts for modeling lightweight Webenabledworkflows; this approach extends the benefits of high-level conceptual modeling and automaticcode generation to a much broader class of Web applications.

The essentials of the sat'03 competition.

D. Berre, L.- IEEE Transactions on Knowledge and Data Engineering 2003 - 2919:452--467

The SAT 2003 Competition ran in February - May 2003, in conjunction with SAT'03 (the Sixth Fifth International Symposium on the Theory and Applications of Satisfiability Testing). One year after the SAT 2002 competition, it was not clear that significant progress could be made in the area in such a little time. The competition was a success -

The nuts and bolts of DBMS construction: building your own prototype

Ioana Manolescu, Alberto Lerner- IEEE Transactions on Knowledge and Data Engineering 2003

THESUS: Organizing Web Document Collections Based On Semantics And Clustering

Benjamin Nguyen, Maria Halkidi, Iraklis Varlamis, Michalis Vazirgiannis- IEEE Transactions on Knowledge and Data Engineering 2003 November - 12(4):320--332

The requirements for effective search and management of the WWW content are stronger than ever. Document retrieval in the WWW is problematic with regards to three aspects: i. Not taking adequately into account link semantics for characterizing documents ii. Basing document similarity on binary keyword matching, thus the matching of different yet semantically close keywords would return false. iii. Presenting search results in terms of lengthy URL lists. The previous observations lead us to the adoption of an ontology and a thesaurus to exploit links, in order to provide a better characterization of web documents. The enhancement of the characterization of the documents makes operations such as clustering become very interesting. We have devised a system called THESUS to achieve these goals. The system deals with an initial set of web pages, extracts keywords from all pages' incoming links and converts them to semantics by mapping them to a domain's ontology. Subsequently, a clustering algorithm is applied to discover document clusters. The clustering results are used for: i. Web pages classification ii. speeding up (in terms of pruning) keyword based queries. The system is fully functional and testing proved the quality of its results. The innovations lie in the integrated use of: i. incoming link semantics as the fundamental document characterization mechanism, ii. the conversion from keywords to semantics and iii. the clustering of web content using the semantic similarity between documents.

Typing and querying XML documents: some complexity bounds

Luc Segoufin- IEEE Transactions on Knowledge and Data Engineering 2003

We study the complexity bound of validating XML documents, viewed as labeled unranked ordered trees, against various typing systems like DTDs, XML schemas, tree automata... We also consider query evaluation complexities for various fragments of XPath. For both problems, validation and query evaluation, we consider data and combined complexity bounds.

Website Identification

Pierre Senellart- IEEE Transactions on Knowledge and Data Engineering 2003 September

I present in this paper a method to discover the set of webpages contained in a logical website, based on the link structure of the Web graph. Such a method is useful to identify the boundaries of what to crawl, in the context of Web archiving. For this purpose, I combine the use of an online version of the preflow-push algorithm, an algorithm for the maximum flow problem in traffic networks, and of the Markov CLuster (MCL) algorithm. The latter is used on a crawled portion of the Web graph in order to build a seed of initial webpages, a seed which is extended by the former. Experiments on subsites of the INRIA Website, which give satisfactory results, are described.

XQueC: Pushing Queries to Compressed XML Data

Ioana Manolescu, Angela Bonifati, Andrei Arion, Gianni Costa, Andrea Pugliese, Sandra dAguanno- IEEE Transactions on Knowledge and Data Engineering 2003 :1065--1068

ZooM : Alpha Galois Lattices for Conceptual Clustering

Nathalie Pernelle, Veronique Ventos, Henry Soldano- IEEE Transactions on Knowledge and Data Engineering 2003 October

The main drawback of Galois lattices is that their size can be very large when dealing with real-world data sets. One way to reduce the number of nodes is to use a language with a weak expressivity. In this paper, we propose an approach based on the new notion of alpha extension where the number of nodes is reduced according to a change of the extensional function.

A comparative study for XML change detection

Gr Cobena, Talel Abdessalem, Yassine Hinnach- IEEE Transactions on Knowledge and Data Engineering 2002

Change detection is an important part of version management for databases and document archives. The success of XML has recently renewed interest in change detection on trees and semi-structured data, and various algorithms have been proposed. We study here different algorithms and representations of changes based on their formal definition and on experiments conducted over XML data from the Web. Our goal is to provide an evaluation of (i) the quality of the results (ii) the performance of the tools. We also consider in which context each of these algorithms can be best used.

A comparison of Labeling Schemes for Ancestor Queries

Tova Milo, Haim Kaplan, Ronen Shabo- IEEE Transactions on Knowledge and Data Engineering 2002

Motivated by a recent application in XML search engines we study the problem of labeling the nodes of a tree (XML file) such that given the labels of two nodes one can determine whether one node is an ancestor of the other. We describe several new prefix-based labeling schemes, where an ancestor query roughly amounts to testing whether one label is a prefix of the other. We compare our new schemes to a simple interval-based scheme currently used by search engines, as well as, to schemes with the best theoretical guarantee on the maximum label length. We performed our experimental evaluation on real XML data and on some families of random trees.

A First Experience in Archiving the French Web

Serge Abiteboul, Gr Cobena, Julien Masanes, Gerald Sedrati- IEEE Transactions on Knowledge and Data Engineering 2002

The web is a more and more valuable source of information and organizations are involved in archiving (portions of) it for various purposes, e.g., the Internet Archive (www.archive.org). A new mission of the French National Library (BnF) is the ``d

A hash-tree based algorithm for subset detection: analysis and experiments

Serge Abiteboul, Benjamin Nguyen- IEEE Transactions on Knowledge and Data Engineering 2002 January

In this paper, we present both an analysis, and experimental results of an efficient algorithm for subset detection. The data structure used in this algorithm is a hash-tree. The algorithm was used as the core of the subscription system in an XML data warehouse. This context sets the range of the parameters of the problem. Both t\ he analysis, and experimental results concur to show that the system can process thousands of documents per second, on a data base of millions of subscriptions.

Active XML: A Data-Centric Perspective on Web Services

Serge Abiteboul, Tova Milo, Omar Benjelloun, Ioana Manolescu, Roger Weber- IEEE Transactions on Knowledge and Data Engineering 2002

We present Active XML (AXML, for short), a declarative framework thatharnesses web services for data integration, and is put to work in apeer-to-peer architecture. It is centered around \emph{AXMLdocuments} which are XML documents that may contain calls to webservices. The language allows both the specification of suchdocuments and the definition of new web services based on them.While documents with embedded calls were already proposed before, AXMLis first to actually turn calls to web services embedded in XMLdocuments into a powerful tool for web-scale data integration. Inparticular, the language includes linguistic features to control thetiming of service call activations, the lifespan of data, and theexchange of extensional and intensional data. Various scenarios arecaptured, such as mediation, data warehousing, and distributedcomputation. A first prototype is described.

Active XML: Peer-to-Peer Data and Web Services Integration

Serge Abiteboul, Tova Milo, Omar Benjelloun, Ioana Manolescu, Roger Weber- IEEE Transactions on Knowledge and Data Engineering 2002

We demonstrate how the Active XML framework can be used to build a distributed auctioning system.

Aide

Alain Bidault- IEEE Transactions on Knowledge and Data Engineering 2002

Aide a la formulation de requêtes dans un mediateur

Chantal Reynaud, Brigitte Safar- IEEE Transactions on Knowledge and Data Engineering 2002 Janvier - 1:201-210

Information integration over existing sources that are distributed and possibly heterogeneous is complex. Mediator systems have in common the use of a domain model, also called ontology.The paper addresses the benefits coming from using such an ontology in order to helpusers to pose queries and present the inference mechanisms which have been developed in the context of the PICSEL project. We illustrate these works with examples coming the tourism domain.

Compilation and Approximation of Conjunctive Queries by Concept Descriptions

Marie-Christine Rousset, Francois Goasdou- IEEE Transactions on Knowledge and Data Engineering 2002 :267-271

In this paper, we characterize the logical correspondence between conjunctive queries and concept descriptions. We exhibit a necessary and sufficient condition for the compilation of a conjunctive query into an equivalent ALE concept description. We provide a necessary and sufficient condition for the approximation of conjunctive queries by maximally subsumed ALN concept descriptions.

Comprendre XSLT

Bernd Amann, Philippe Rigaux- IEEE Transactions on Knowledge and Data Engineering 2002

De m

Computing web page importance without storing the graph of the web

Serge Abiteboul, Mihai Preda, Gr Cobena- IEEE Transactions on Knowledge and Data Engineering 2002 - 25(1):27--33

The computation of page importance in a huge dynamic graph has recently attracted a lot of attention because of the web. Page importance or page rank is defined as the fixpoint of a matrix equation. Previous algorithms compute it off-line and require the use of a lot of extra CPU as well as disk resources in particular to store and maintain the link matrix of the web. We briefly discuss a new algorithm that works on-line, and uses much less resources. In particular, it does not require storing the link matrix. It is on-line in that it continuously refines its estimate of page importance while the web/graph is visited. When the web changes, page importance changes as well. We modify the algorithm so that it adapts dynamically to changes of the web. We report on experiments on web data and on synthetic data.

Computing web page importance without storing the graph of the web (extended abstract)

Serge Abiteboul, Mihai Preda, Gr Cobena- IEEE Transactions on Knowledge and Data Engineering 2002 March - 25

The computation of page importance in a huge dynamic graph has recently attracted a lot of attention because of the web. Page importance or page rank is defined as the fixpoint of a matrix equation. Previous algorithms compute it off-line and require the use of a lot of extra CPU as well as disk resources in particular to store and maintain the link matrix of the web. We briefly discuss a new algorithm that works on-line, and uses much less resources. In particular, it does not require storing the link matrix. It is on-line in that it continuously refines its estimate of page importance while the web/graph is visited. When the web changes, page importance changes as well. We modify the algorithm so that it adapts dynamically to changes of the web. We report on experiments on web data and on synthetic data.

Construction and Maintenance of a Set of Pages of Interest (SPIN)

Serge Abiteboul, Gr Cobena, Benjamin Nguyen, Antonella Poggi- IEEE Transactions on Knowledge and Data Engineering 2002 October

Dans cet article, nous nous int

Construction de Médiateurs pour Integrer des Sources d'information multiples et heterogenes

Marie-Christine Rousset, Chantal Reynaud, Francois Goasdou, Brigitte Safar, H Gagliardi, Alain Bidault, Christine Froidevaux- IEEE Transactions on Knowledge and Data Engineering 2002 - 2:09--59

Le nombre croissant de donnees accessibles via des reseaux (intranet, internet, etc.) pose le probleme de l'integration de sources d'information pre-existantes, souvent distantes et heterogenes, afin de faciliter leur interrogation par un large public. Une des premieres approches proposees en integration d'informations prone la construction de mediateurs. Un mediateur joue un role d'interface de requetes entre un utilisateur et des sources de donnees. Il donne a l'utilisateur l'illusion d'interroger un systeme homogene et centralise en lui evitant d'avoir a trouver les sources de donnees pertinentes pour sa requete, de les interroger une a une, et de combiner lui-meme les informations obtenues. L'objectif de cet article est de presenter le projet PICSEL qui offre un environnement declaratif de construction de mediateurs. PICSEL se distingue des systemes d'integration d'informations existants par la possibilite d'exprimer le schema du mediateur dans un langage (Carin) combinant le pouvoir d'expression d'un formalisme a base de regles et d'un formalisme a base de classes (la logique de description ALN). PICSEL integre un module d'affinement de requetes, premiere brique d'un module de dialogue cooperatif entre un mediateur et ses utilisateurs.

Construction semi-automatique d'ontologies a partir de DTDs relatives a un meme domaine

Chantal Reynaud, Gloria Giraldo- IEEE Transactions on Knowledge and Data Engineering 2002 Mai

A mediator system is composed of a global schema, or domain ontology, whose role is twofold. First, it conects all the autonomous and distributed sources to each other. Second, it allows users to pose queries in terms of the mediated schema rather than directly in terms of the sources schemas. In this paper, we focus on the automation of the construction of an ontology in the context of a domain-specific mediator querying XML documents.The ontology that we want to build is a schema based on classes. The approach is based on the DTDs associated to the XML documents. The paper describes it and presents results coming from the first experiments.

Correspondence and Translation for Heterogeneous Data

Serge Abiteboul, Sophie Cluet, Tova Milo- IEEE Transactions on Knowledge and Data Engineering 2002 - 275(1-2):179-213

A primary motivation for new database technology is to provide support for the broad spectrum of multimedia data available notably through the network. These data are stored under different formats: relational or ODMG model (in databases), SGML or LaTex (documents), DX formats (data exchange formats in scientific data), Step (CAD/CAM data), etc. Their integration is a very active field of research and development. In this paper, we provide a formal foundation to facilitate the integration of such heterogeneous data and the maintenance of heterogeneous replicated data.

Detecting Changes in XML Documents

Serge Abiteboul, Amelie Marian, Gr Cobena- IEEE Transactions on Knowledge and Data Engineering 2002

We present a 'diff' algorithm for XML data. This work is motivated by the support for change control in the context of the Xyleme project that is investigating warehouses capable of storing massive volume of XML data found on the web. Because of the context, the algorithm has to be very efficient in terms of speed and memory space even at the cost of some loss of ''quality''. Also, it considers, besides insertions, deletions and updates (standard in diffs), a 'move' operation on subtrees that is essential in the context of XML. Intuitively, our 'diff' algorithm uses signatures to match (large) subtrees that were left unchanged between the old and new versions. Such exact matchings are then possibly propagated to ancestors and descendants to obtain more matchings. We provide a performance analysis of the algorithm. We show that it runs in average in linear time vs. quadratic time for previous algorithms. We present experiments on synthetic data that confirm the analysis. This linear time is obtained by trading some quality. We present experiments (again on synthetic data) that show that the output of the algorithm is reasonably close to the optimal in terms of quality. Finally we present experiments on a small set of versions of XML pages found on the Web. Keywords: XML, Diff, Delta, Tree Pattern Matching, Change Control, Version.

Issues in Monitoring Web Data

Serge Abiteboul- IEEE Transactions on Knowledge and Data Engineering 2002

The web is turning from a collection of static documents to a global source of dynamic knowledge. First, HTML is increasingly complemented by the more structured XML format augmented with mechanisms for providing semantics such as RDF. Also, we believe that the very passive vision of web sites promoted by the current HTTP will soon turn into a much more active one based on pub/sub and web services. Thus, the monitoring of the web is likely to raise new challenges for many years. We consider here some applications involving web monitoring and some issues they raise.

Labeling Dynamic XML Trees

Tova Milo, Haim Kaplan, Edith Cohen- 2002

We present algorithms to label the nodes of an XML tree which is subject to insertions and deletions of nodes. The labeling is done such that (1)~we label each node immediately when it is inserted and this label remains unchanged, and (2)~from a pair of labels alone, we can decide whether one node is an ancestor of the other. This problem arises in the context of XML databases that support queries on the structure of the documents as well as on the changes made to the documents over time. We prove that our algorithms assign the shortest possible labels (up to a constant factor) which satisfy these requirements. We also consider the same problem when "clues" that provide guarantees on possible future insertions are given together with newly inserted nodes. Such clues can be derived from the DTD or from statistics on similar XML trees. We present algorithms that use the clues to assign shorter labels. We also prove that the length of our labels is close to the minimum possible.

Mining XML Data with Frequent Trees

Marie-Christine Rousset, Alexandre Termier, Mich Sebag- 2002

The paper aims at extracting common tree structures in heterogeneous XML data. This approach extends Frequent Item Set techniques to tree structure representation. We provide an algorithm which extracts all subtrees embedded in the data above a given frequency threshold. As a result, this technique provides both a clustering of the data and a symbolic characterization of each cluster. This characterisation is made of the maximally specific subtrees embedded in all the data trees of each cluster. The scalability of the algorithm is investigated on artificial medium-sized data.

On first-order topological queries

Luc Segoufin, Martin Grohe- IEEE Transactions on Knowledge and Data Engineering 2002 - 3(3):336-358

One important class of spatial database queries is the class oftopological queries, i.e. queries invariant under homeomorphisms. We study topological queries expressible in the standard querylanguage on spatial databases, first-order logic with various amountsof arithmetic. Our main technical result is a combinatorialcharacterization of the expressive power of topological first-orderlogic on regular spatial databases.

Ontology-Based Integration of XML Web Resources

Catriel Beeri, Bernd Amann, Michel Scholl, Irini Fundulaki- 2002

This paper deals with modeling issues encountered in the context of the integration of heterogeneous XML sources in the C-Web project. The goal of this project is to design and implement tools for building semantic portals for communities of users sharing XML Web resources in a common domain of interest. The backbone of such a portal is a domain ontology which is used as the common interface component or integrating, querying and exchanging information.

Pattern Tree Matching for XML Queries

Sophie Cluet, Vincent Aguilera, Fr Boiscuvier, Bruno Koechlin- 2002

The main paradigm underlying a query language for XML is that of selecting data by matching patterns described by path expressions. These path expressions describe traversals along sub-elements with variables getting bound to nodes along these paths. Hence, a key operation in query processing over XML is to produce a set of bindings for variables, given a pattern consisting of several path expressions. In this paper we focus on the implementation of a special purpose algebraic operator, called PatternScan, that captures this paradigm. A PatternScan filters a forest of XML documents according to some pattern and returns a set with one tuple per matching tree structure found somewhere in the forest. Given an inverted index based on a numbering scheme for tags/words, we show that a PatternScan can be evaluated as a sequence of joins between posting lists, without the need for intermediate data structures. Furthermore, the evaluation time is in most cases linear both in the size of the pattern and the average number of postings per pattern node. We believe that those properties are particularly interesting when considering applications such as web-scale XML search engines, where the time spent to compute the first results and the memory overhead of a single query must be as low as possible. Experimental results from our system, based on documents and queries from the XML Benchmark Project XmlBenchmark, show that the proposed algorithm can dramatically improve the execution time of a large class of XML queries, together with a relatively low overhead in terms of space needed by the index.

Proximit

Brigitte Safar, Alain Bidault, Christine Froidevaux- IEEE Transactions on Knowledge and Data Engineering 2002 - 2:653-662

La recherche d'informations au sein de sources multiples et heterogenes est complexe. Les approches mediateur proposent des solutions qui exploitent une ontologie du domaine. dans ce papier, nous decrivons comment l'ontologie peut aider a evaluer la proximite de deux requetes et permet, lorsque la requete posee par l'utlisateur est en echec, de choisir une requete proche. le travail presente a ete effectue au sein du projet PICSEL. Les exemples cites sont issus du domaine des produits du tourisme, domaine d'experimentation choisi au sein du projet.

Querying XML Sources using an Ontology-based Mediator

Catriel Beeri, Bernd Amann, Michel Scholl, Irini Fundulaki- IEEE Transactions on Knowledge and Data Engineering 2002

In this paper we propose a mediator architecture for the querying and integration of Web-accessible XML data sources. Our contributions are (i) the definition of a simple but expressive mapping language, following the local as view approach and describing XML resources as local views of some global schema, and (ii) efficient algorithms for rewriting user queries according to existing source descriptions. The approach has been validated by the StyX prototype.

Representation of Ontologies for Information Integration

Chantal Reynaud, Brigitte Safar- IEEE Transactions on Knowledge and Data Engineering 2002 Octobre

An information integration system provides a uniform query interface to a collection of autonomous and distributed sources, connected to each other thanks to a global mediated schema, called domain ontology. The problem addressed in the paper is how to represent such an ontology into CARIN-ALN, a formalism combining classes and rules. We focus on the choices for representing classes, properties and constraints using the characteristics of the formalism. We also propose a method in two steps for representing a domain ontology in the framework of a mediator. The first step is directed by the formalism and the functionalities of the mediator. The second step is an optimization phase guided by the way functionalities of the mediator are implemented.

Similarity Between Queries in a Mediator

Brigitte Safar, Alain Bidault, Christine Froidevaux- IEEE Transactions on Knowledge and Data Engineering 2002 :235-239

Answering queries on remote and heterogeneous sources is a complex task. Mediators provide techniques that exploit a domain ontology. To propose a cooperative approach to repair queries without answers, we introduce the notion of solution which is a query close to the user's query.. we describe how ontologies can be used to evaluate the similarity of two predicates ans then of two queries. The work presented has been developed in the PICSEL project. Examples come from the domain of tourism, the experimentation domain chosen for this project.

STYX : Connecting the XML World to the World of Semantics

Catriel Beeri, Bernd Amann, Michel Scholl, Anne-Marie Vercoustre, Irini Fundulaki- IEEE Transactions on Knowledge and Data Engineering 2002

The Semantic Web Needs Languages for Representing (Complex) Mappings Between (Simple) Ontologies

Marie-Christine Rousset- IEEE Transactions on Knowledge and Data Engineering 2002 - 17(2)

The main issue for a Web ontology language is less the choice of a language for describing ontologies than the choice of a language for representing mappings between ontologies.

The Xyleme Project

Serge Abiteboul, Sophie Cluet, Guy Ferran, Marie-Christine Rousset- IEEE Transactions on Knowledge and Data Engineering 2002 - 39:225-238

The current development of the Web and the generalization of XML technonology provides a major opportunity to radically change the management of distributed data. We have developed a prototype of a dynamic warehouse for XML data, namely Xyleme. In the present paper, we briefly present some motivation and important aspects of the work performed in the framework of the Xyleme project.

TreeFinder: a First Step towards XML Data Mining

Marie-Christine Rousset, Alexandre Termier, Mich Sebag- 2002

In this paper, we consider the problem of searching frequent trees from a collectionof tree-structured data modeling XML data. The TreeFinder algorithm aims at finding trees, such that their exact or perturbed copies are frequent in a collection of labelled trees. To cope with complexity issues, TreeFinder is correct but not complete: it finds a subset of the actually frequent trees. The default of completeness is experimentally investigated on artificial medium size datasets; it is shown that TreeFinder reaches completeness or falls short to it for a range of experimental settings.

Validating Streaming XML Documents

Victor Vianu, Luc Segoufin- IEEE Transactions on Knowledge and Data Engineering 2002

This paper investigates the on-line validation of streaming XML documents withrespect to a DTD, under memory constraints. We first consider validation usingconstant memory, formalized by a finite-state automaton (fsa). We examinetwo flavors of the problem, depending on whether or not the XML document isassumed to be well-formed. The main results of the paper provide conditions onthe DTDs under which validation of either flavor can be done using an fsa. For DTDs that cannot be validated by an sc fsa, we investigate twoalternatives. The first relaxes the constant memory requirement by allowing astack bounded in the depth of the XML document, while maintaining thedeterministic, one-pass requirement. The second approach consists in refiningthe DTD to provide additional information that allows validation by an fsa.

Vers l'automatisation de la construction de systemes de mediation pour le commerce electronique

Chantal Reynaud, Gloria Giraldo- IEEE Transactions on Knowledge and Data Engineering 2002

Views in a Large Scale XML Repository

Sophie Cluet, Tova Milo, Vincent Aguilera, Pierangelo Veltri, Dan Vodislav- IEEE Transactions on Knowledge and Data Engineering 2002

We are interested in defining and querying views in a huge andhighly heterogeneous XML repository (Web scale). In this context, view definitions are very large, involving lots of sources, and there is no apparent limitation totheir size. This raises interesting problems that we address in thepaper: (i) how to distribute views over several machines withouthaving a negative impact on the query translation process; (ii) howto quickly select the relevant part of a view given a query; (iii) howto minimize the cost of communicating potentially large queries to the machines where they will be evaluated. The solution that we propose is based on a simple view definition language that allows for automatic generation of views. The language maps paths in the view abstract DTD to paths in theconcrete source DTDs. It enablesa distributed implementation of the view system that is scalable both in termsof data and load. In particular, the query translation algorithm is shown to have a good (linear) complexity.

Web services and data integration

Serge Abiteboul, Tova Milo, Omar Benjelloun- IEEE Transactions on Knowledge and Data Engineering 2002

The developments of XML and Web services (e.g. SOAP) are changing radically the art of data integration. We briefly describe Web services and consider some of their impact on data integration. We argue that XML and Web services provide the proper infrastructure for data integration at the Web scale. This is illustrated by some work going on at INRIA on Active XML, that is XML extended by allowing the embedding of calls to Web services.

Web services and data integration

Serge Abiteboul, Tova Milo, Omar Benjelloun- IEEE Transactions on Knowledge and Data Engineering 2002

Zoom : a nested Galois lattices-based system for conceptual clustering

Nathalie Pernelle, Marie-Christine Rousset, Veronique Ventos, Henri Soldano- IEEE Transactions on Knowledge and Data Engineering 2002 september - 14(2):157--187

This paper deals with the representation of multi-valued data by clustering them in a small number of classes organized in a hierarchy and described at an appropriate level of abstraction. The contribution of this paper is three fold. First we investigate a partial order namely nesting, relating Galois lattices. A Nested Galois lattice is obtained by reducing (through projections) the original lattice. As a consequence it makes coarser the equivalence relations defined on extents and intents. Second, we investigate the intensional and extensional aspects of the languages used in our system Zoom. In particular we discuss the notion of alpha-extension of terms of a class language L. We also present our most expressive language L3, close to a descritpion logic, and which expresses optionality or/and multi-valuation of attributes. Finally, the nesting order between the Galois lattices corresponding to various languages and extensions is exploited in the interactive system ZooM. Typically, a ZooM session starts from a propositional language L2 and a coarse view of the data (trough alpha-extension). Then, the user selects two ordered nodes in the lattice and ZooM constructs a fine-grained lattice between the antecedents of these nodes. So, the general purpose of ZooM is to give a general view of concepts addressing a large data set, then focussing on part of this coarse taxonomy.

A Data-Centric Perspective on Web Services (Preliminary Report)

Serge Abiteboul, Tova Milo, Omar Benjelloun- 2001

We propose a peer-based architecture that allows for the integration of distributed data and web services. It relies on a language, Active XML, where (1) documents embed calls to web services that are used to enrich them, and (2) new web services may be defined by XQuery queries on such active documents. Embedding calls to functions or even to web services inside data is not a new idea. Our contribution, however, is in turning them into a powerful tool for data and services integration. In particular, the language includes linguistic features to control the time of service call activations. Various scenarios are captured, such as mediation, data warehousing, and distributed computation. A first prototype is described.

A dynamic warehouse for XML data of the Web

Lucie Xyleme- IEEE Transactions on Knowledge and Data Engineering 2001 - 24(2):40--47

Xyleme is a dynamic warehouse for XML data of the Web supporting query evaluation, change control and data integration. We briefly present our motivations, the general architecture and some aspects of Xyleme. The project we describe here was completed at the end of 2000. A prototype has been implemented. This prototype is now being turned into a product by a start-up company also called Xyleme.

A Model-Theoretic Approach to Regular String Relations

Luc Segoufin, Michael Benedikt, Leonid Libkin, Thomas Schwentick- IEEE Transactions on Knowledge and Data Engineering 2001

We study algebras of definable string relations-- classes of regular $n$-ary relations thatarise as the definable sets within amodel whose carrier is the set of all strings. We show that the largest such algebra -- the collectionof regular relations -- has some quite undesirablecomputational and model-theoretic properties. In contrast,we exhibit several definable relation algebras thathave much tamer behavior: for example, they admit quantifierelimination, and have finite VC dimension.We show thatthe properties of a definable relation algebra are notat all determined by the one-dimensional definable sets.We give models whose definable sets are all star-free, butwhose binary relations are quite complex, as well as models whose definable sets include all regular sets, but whichare much more restricted and tractable than the full algebraof regular relations.

A Uniform Approach for Querying Large Tree-structured Data through a Mediated Schema.

Claude Delobel, Marie-Christine Rousset- 2001

We are interested in defining and querying large and highly heterogeneous database repository for XML data and their DTDs. We consider a very simple data model with data trees and tree types for describing the data sources and a very simple query language based on tree queries with boolean conditions. The main components of the data model are a mediated schema as a view of a set of concrete tree types and a mapping expressing the connection between the mediated schema and the concrete tree types. We provide a general tree query semantics and a characterization of tree queries as composition of branch queries. We show also that we can pre-evaluate queries by storing some specific meta-information related to mapping information. We consider also the problem of query containment for tree queries and we propose some policies of relaxation for queries without answer. The research presented here was motivated by the Xyleme project at INRIA, whose objective is to develop a data warehouse for Web XML documents.

Acquisition and Maintenance of XML Data from the Web

Serge Abiteboul, Bernd Amann, Laurent Mignet, Amelie Marian, Mihai Preda- 2001

We consider the acquisition and maintenance of XML data found on the web. More precisely, we study the problem of discovering XML data on the web, i.e., in a world still dominated by HTML, and keeping it up to date with the web as best as possible, under set resources. We present a distributed architecture that is designed to scale to the billions of pages of the web. In particular, the distributed management of metadata about HTML and XML pages turns out to be an interesting issue. The scheduling of the fetching of the page is guided by the importance of pages, their expected change rate, and subscriptions/ publications of users. The importance of XML pages is defined in the standard manner based on the link structure of the web graph. It is computed by a matrix fixpoint computation. HTML pages are of interest for us only in that they lead to XML pages. Thus their importance is defined in a different manner and their computation also involves a fixpoint but on the transposed link matrix this time. The general scheduling problem is stated as an optimization problem that dispatches the resources to various tasks such as (re)reading HTML pages for discovering new XML pages or refreshing already known XML pages. This work is part of the Xyleme system. Xyleme is developed on a cluster of PCs under Linux with Corba communications. The part of the system described in this paper has been implemented. We mention first experiments.

Automatic Construction and Refinement of a Class Hierarchy over multi-valued data

Nathalie Pernelle, Marie-Christine Rousset, Veronique Ventos- IEEE Transactions on Knowledge and Data Engineering 2001 :386-398

In many applications, it becomes crucial to help users to access to a huge amount of data by clustering them in a small number of classes described at an appropriate level of abstraction. In this paper, we present an approach based on the use of two languages of description of classes for the automatic clustering of multi-valued data. The first language of classes has a high power of abstraction and guides the construction of a lattice of classes covering the whole set of the data. The second language of classes, more expressive and more precise, is the basis for the refinement of a part of the lattice that the user wants to focus on.

Automatic construction and refinement of a Class Hierarchy over semi-structured data

Nathalie Pernelle, Marie-Christine Rousset, Veronique Ventos- IEEE Transactions on Knowledge and Data Engineering 2001

In many applications, it becomes crucial to help users to access to a huge amount of data by clustering them in a small number of classes described at an appropriate level of abstraction. In this paper, we present an approach based on the use of two languages of description of classes for the automatic clustering of semistructured data. The first language of classes has a high power of abstraction and guides the construction of a lattice of classes covering the whole set of the data. The second language of classes, more expressive and more precise, is the basis for the refinement of a part of the lattice that the user wants to focus on. Our approach has been implemented and experimented on real data in the setting of the GAEL project which aims at building flexible electronic catalogs organized as a hierarchy of classes of products. Our experiments have been conducted on real data coming from the C/Net (http://www.cnet.com) electronic catalog of computer products.

Change-Centric Management of Versions in an XML Warehouse

Serge Abiteboul, Laurent Mignet, Amelie Marian, Gr Cobena- IEEE Transactions on Knowledge and Data Engineering 2001

We present a change-centric method to manage versions in a WebWareHouse of XML data. By change-centric we mean that we focus on deltas, i.e., the changesthemselves, as opposed to other approaches that might focus on snapshots orobject history. We introduce a logical representation of changes based ondeltas and unique identifiers. We present a novel way of allocating persistentidentifiers to nodes. We analyze several implementation issues and comparedifferent storage policies. Based on some requirements (certainfunctionalities that we want to support efficiently), we motivate our choice ofa specific storage policy. We store the last (most current) version and adocument containing a list of ``completed'' forward deltas with someinformation on the persistent identifiers. This work is part of the Xylemesystem. Xyleme is developed on a cluster of PCs under Linux with Corbacommunications. The part of the system described in this paper has beenimplemented. We also mention first experiments.

Combining Statistics and Semantics for Word and Document Clustering.

Marie-Christine Rousset, Alexandre Termier, Mich Sebag- IEEE Transactions on Knowledge and Data Engineering 2001

A new approach for constructing pseudo-keywords, referred to as Sense Units, is proposed. Sense Units are obtained by a word clustering process, where the underlying similarity reflects both statistical and semantic properties, respectively detected through Latent Semantic Analysis and WordNet. Sense Units are used to recode documents and are evaluated from the performance increase they permit in classification tasks. Experimental results show that accounting for semantic information in fact { decreases} the performances compared to LSI standalone. The main weaknesses of the current hybrid scheme are discussed and several tracks for improvement are sketched.

Compact labeling schemes for ancestor queries.

Serge Abiteboul, Tova Milo, Haim Kaplan- IEEE Transactions on Knowledge and Data Engineering 2001

No abstract yet.

Constraint Database Query Evaluation with Approximation

Pierangelo Veltri- 2001

We consider the problem of solving a large number of simple systems of linear constraints. This problem occurs in the context of constraint databases. The developed methodology is based on a hierarchical evaluation of the constraints which are first simplified, and replaced by approximations. We focus on systems of linear constraints over the reals, which model spatial objects, and consider both geometric and topological approximations, defined with very simple constraints. We show that these constraints can be used either to solve the initial systems, or at least to filter out unsatisfiable systems. The main contribution of the paper is the development of a set of rewriting rules, that allow the transformation of spatial object queries into equivalent ones that make use of the approximation, reducing the query evaluation cost.

Construction et affinement de hi

Nathalie Pernelle, Marie-Christine Rousset, Veronique Ventos- IEEE Transactions on Knowledge and Data Engineering 2001 - 1(1-2):81-92

In many applications, it becomes crucial to help users to access to a huge amount of data by clustering them in a small number of classes described at an appropriate level of abstraction. In this paper, we present an approach based on the use of two languages of description of classes for the automatic clustering of semistructured data. The first language of classes has a high power of abstraction and guides the construction of a basic lattice of classes covering the whole set of the data. The second language of classes, more expressive and more precise, is the basis for the refinement of a part of the lattice that the user wants to focus on.

Controle des changements de donnees semi-structurees

Laurent Mignet- 2001

Efficient consequence finding.

Laurent Simon, Alvaro del Val- IEEE Transactions on Knowledge and Data Engineering 2001 :359--365

[SdV01] . In , pages , Seattle, Washington, USA, 2001. [ bib | .ps.gz | .pdf ] We present an extensive experimental study of consequence-finding algorithms based on kernel resolution, using both a trie-based and a novel ZBDD-based implementation, which uses Zero-Suppressed Binary Decision Diagrams to concisely store and process very large clause sets. Our study considers both the full prime implicate task and applications of consequence-finding for restricted target languages in abduction, model-based diagnosis, and polynomially-bounded knowledge compilation. We show that the ZBDD implementation can push consequence-finding to a new limit, solving problems which generate over 1070 clauses.

Explicitly using defaukt knowledge in concept learning: an extended DL plus strict and default rules

Veronique Ventos, Pierre Brezellec, Henry Soldano- 2001

This work concerns the use of default knowledge in concept learning from positive and negative examples. Two connectives are added to a description logics, C-CLASSIC, previously defined for concept learning. The new connectives ($\delta$ and $\epsilon$) allow to express the idea that some properties of a given concept definition are default properties, and that some properties that should belong to the concept definition actually do not (these are excepted properties). When performing concept learning both hypotheses and examples are expressed in this new description logics but prior to learning, a saturation process using default and non default rules has to be applied to the examples in order to add default and excepted properties to their definition. As in the original C-CLASSIC, disjunctive learning is performed using a standard greedy set covering algorithm whose generalization operator is the Least Common Subsumer operator of C-CLASSIC$_{\delta \epsilon}$. We exemplify concept learning using default knowledge in this framework and show that explicitly expressing default knowledge may result in simpler concept definitions.

Ingenierie des connaissances pour les systemes d'information

Chantal Reynaud, Jean Charlet, R Teulier.- IEEE Transactions on Knowledge and Data Engineering 2001

L'Ingenierie des connaissances constitue depuis de nombreuses annees un domaine actif des recherches menees en intelligence artificielle autour de la conception et de la modelisation des systemes a base de connaissances (SBC). A l'instar de bien d'autres disciplines modelisatrices, elle consiste a concevoir des systemes dont le fonctionnement permet d'operationaliser des connaissances portant sur le traitement ou la resolution d'un probleme donne. La resolution (semi-)automatique de problemes implique deux etapes essentielles : la modelisation du probleme et d'une methode de resolution dans un cadre theorique donne l'operationalisation informatique du modele obtenu.

Mapping XML Fragments to Community Web Ontologies

Catriel Beeri, Bernd Amann, Michel Scholl, Anne-Marie Vercoustre, Irini Fundulaki- 2001

Monitoring XML data on the Web

Serge Abiteboul, Mihai Preda, Gr Cobena, Benjamin Nguyen- IEEE Transactions on Knowledge and Data Engineering 2001 July

We consider the monitoring of a flow of incoming documents. More precisely, we present here the monitoring used in a very large warehouse built from XML documents found on the web. The flow of documents consists in XML pages (that are warehoused) and HTML pages (that are not). Our contributions are the following: - a subscription language which specifies the monitoring of pages when fetched, the periodical evaluation of continuous queries and the production of XML reports. - the description of the architecture of the system we implemented that makes it possible to monitor a flow of millions of pages per day with millions of subscriptions on a single PC, and scales up by using more machines. - a new algorithm for processing alerts that can be used in a wider context. We support monitoring at the page level (e.g., discovery of a new page within a certain semantic domain) as well as at the element level (e.g., insertion of a new electronic product in a catalog). This work is part of the Xyleme system. Xyleme is developed on a cluster of PCs under Linux with Corba communications. The part of the system described in this paper has been implemented. We also mention first experiments. Keywords: XML, HTML, Change Control, Monitoring, Continuous Query, Warehouse, Web, Subscription.

Multiresolution for sat checking

Philippe Chatalic, Laurent Simon- IEEE Transactions on Knowledge and Data Engineering 2001 - 10(4):451--481

This paper presents a system based on new operators for handling sets of propositional clauses compactly represented by means of ZBDDs. The high compression power of such data structures allows efficient encodings of structured instances. A specialized operator for the distribution of sets of clauses is introduced and used for performing multiresolution on clause sets. Cut eliminations between sets of clauses of exponential size may then be performed using polynomial size data structures. The textsc{ZRes} system, a new implementation of the Davis-Putnam procedure of 1960, solves two hard problems for resolution, that are currently out of the scope of the best SAT provers.

Optimisation de Requ

Fanny Wattez- 2001

This thesis studies how to optimize queries over data whose structure is a tree. On the one hand, data are stored in the OODBMS O$_{2}$. While running a benchmark of OQL queries, we emphasize that this system does not cope well with large associative accesses and we propose solutions to improve performances. Moreover, we show the pros and cons of navigation versus joins in a real system. On the other hand, data are XML documents, i.e., semi-structured data. In this context, optimization techniques are different. In order to answer quickly structured queries over a huge number of documents, which technique can we use? We have chosen to index documents in full-text indexes. We explain how we extend this technology and how, once documents are indexed, queries are evaluated.

Pattern Tree Queries in Xyleme

Sophie Cluet, Vincent Aguilera, Fr Boiscuvier- 2001

In this paper, we focus on the implementation of a special purpose algebraic operator, called PatternScan, using the full text index technology.

Querying a Web Scale XML Repository

Sophie Cluet, Vincent Aguilera, Fanny Wattez, Pierangelo Veltri, Dan Vodislav- 2001

XML is becoming a standard to represent and exchange structured data over the Web. Undoubtedly, more and more data on the Web will be in XML form, as more and more DTDs will be available to describe it. Given this, the Xyleme project proposes to study and build a dynamic World Wide XML distributed data warehouse, capable of storing a very large number of XML data available on the Web. Among the many problems addressed by the Xyleme project, in this paper, we focus on query processing in Xyleme. In particular: (i) how to query an integrated view, (ii) how documents are indexed in the repository, and (iii) how the algebraic plans are distributed and evaluated.

Representing and Querying XML with Incomplete Information

Serge Abiteboul, Victor Vianu, Luc Segoufin- IEEE Transactions on Knowledge and Data Engineering 2001

We study the representation and querying of XML with incomplete information. We consider a simple model for XMLdata and their DTDs, a very simple query language, and a representation system for incomplete information in the spiritof the representations systems developed by Imielinski and Lipski for relational databases. In the scenario we consider, theincomplete information about an XML document is continuously enriched by successive queries to the document. Weshow that our representation system can represent partial information about the source document acquired by successivequeries, and that it can be used to intelligently answer new queries. We also consider the impact on complexity ofenriching our representation system or query language with additional features. The results suggest that our approachachieves a practically appealing balance between expressiveness and tractability. The research presented here wasmotivated by the Xyleme project at INRIA, whose objective it to develop a data warehouse for Web XML documents.

Rewriting Tree Queries into XPath Expressions

Catriel Beeri, Bernd Amann, Michel Scholl, Anne-Marie Vercoustre, Irini Fundulaki- 2001

A Community Web portal is a set of tools for a community of people who want to share information on a certain domain via the Web. The backbone of this system is an ontology which is understood by all members of the community and used as the common interface component for integrating, querying, structuring and exchanging information. This paper describes a simple but nevertheless powerful language for integrating XML documents into a Community Web system. More precisely, we describe how to add and exploit XPath enabled Web servers by mapping standard XPath location paths to conceptual paths in the system ontology. We present a query rewriting algorithm using such mappings. It transforms a user query into a set of XML queries based on XPath patterns for selecting XML fragments. Finally, we describe some evaluation techniques for reducing the size of XML data returned by the XML source for query evaluation.

SatEx : un cadre de travail dynamique pour l'exp

Philippe Chatalic, Laurent Simon- IEEE Transactions on Knowledge and Data Engineering 2001 June:259--270

SatEx est un site web orient

SATEx: a Web-based Framework for SAT Experimentation.

Philippe Chatalic, Laurent Simon- IEEE Transactions on Knowledge and Data Engineering 2001 June

SatEx is a web site devoted to SAT experimentation. It is not only a front end to a database gathering an exhaustive number of executions, but it also allows dynamic results synthesis as well as detailed explorations of experimentation results. Being dynamically generated and constantly updated and improved, this site can be considered as an almost always up-to-date SAT experimentation paper. To the current time, SatEx presents the results of more than 450 CPU days on a recent machine. In a few months, this site has been well received by the SAT community and has reached more than 20000 hits. SatEx site is available at http://www.lri.fr/~simon/satex/satex.php3.

SatEx: towards and exhaustive and up-to-date SAT experimentation.

Laurent Simon- IEEE Transactions on Knowledge and Data Engineering 2001

The SatEx web site is devoted to experimentation around SAT. It is not only a front-end to a database gathering an exhaustive number of executions, but it also allows dynamic result synthesis as well as a detailed exploitation of all experimentations. Being dynamically generated and constantly evolving, this site can be considered as an always up-to-date SAT experimentation paper. At this time, the first realease of the SatEx http://www.lri.fr/~simon/satex/satex.php3 is already used by the SAT community, with more than 15000 hits in a few months. The new release will present the results of more than 450 cpu-days on a recent machine and will incorporate some major improvements.

Semantic Integration of XML Heterogeneous Data Sources

Dan Vodislav, Chantal Reynaud, Jean-Pierre Sirot- IEEE Transactions on Knowledge and Data Engineering 2001 Juillet

With the current explosion of data, retrieving and integrating information from various sources is a critical problem. The designer has to specify a mediated schema providing a homogeneous view on the sources. In this paper, we report on an initial work toward automatically generating mappings between elements in the sources and in the mediated schema. Information sources we are interested in are XML documents in respect with a Document Type Definition (DTD). We describe the Xyleme project, which is the context of this study.We present our approach implemented in the SAMAG system to automatically find mappings on the basis of semantic and structural criteria. Finally, we report the first results of an experiment where SAMAG has been applied to XML documents in the cultural domain.

String Operations in Query Languages

Luc Segoufin, Michael Benedikt, Leonid Libkin, Thomas Schwentick- IEEE Transactions on Knowledge and Data Engineering 2001

We study relational calculi with support for string operations. Most prior proposals werebased on adding the operation of concatenation to first-order logic. Such an extension isproblematic as the relational calculus becomes computationally complete, which in turnimplies strong limits on the ability to perform optimization and static analysis ofproperties such as query safety. In contrast, we look at extensions of relational calculusthat have nice expressiveness, decidability, and safety properties, while corresponding tosets of string operations used in SQL. We start with an extension based on the stringordering and LIKE predicates. We then extend this basic model to include string lengthcomparison. While both of these share some of the attractive properties of relationalcalculus (low data complexity for generic queries, effective syntax for safe queries,correspondence with an algebra), there is a large gap between these calculi in expressivepower and complexity. The smaller `basic model' has low data complexity for all queries,but is quite limited in expressive power. The latter handles many natural stringoperations, but includes queries with high data complexity. We thus also explore thespace between these two extremes. We consider two intermediate languages: the firstextends our base language with functions that trim/add leading characters, and the otherextends it by adding the full power of regular-expression pattern matching. We showthat both these extensions inherit many of the attractive properties of the basic model:they both have corresponding algebras expressing safe queries, and low complexity ofquery evaluation.

Towards a Flexible Model for Data and Web Services Integration

Serge Abiteboul, Tova Milo, Omar Benjelloun- 2001

We present a data-oriented approach to web services integration. To this end, we introduce a new framework based on a model that embeds service calls within semistructured data. The framework captures various integration scenarios including mediation and warehousing, and provides fusion based on object identifiers. Moreover, by allowing service call parameters and responses to contain calls to other services, the framework enables distributed computation over the web. This report describes our ongoing work on the subject, mainly through examples.

Une experience de representation d'une ontologie dans le mediateur PICSEL

Chantal Reynaud, Brigitte Safar- IEEE Transactions on Knowledge and Data Engineering 2001 Juin:329-348

An information integration system provides a uniform query interface to a collection of autonomous and distributed sources, connected to each other thanks to a global mediated schema, called domain ontology.The problem addressed in the paper is how to represent such an ontology into CARIN-ALN, a formalism combining classes and rules. We focus on the choices for representing classes, properties and constraints using the characteristics of the formalism. We also propose a method in two steps for representing a domain ontology in the framework of a mediator. The first step is directed by the formalism and the functionalities of the mediator. The second step is an optimization phase guided by the way functionalities of the mediator are implemented.

Views in a Large Scale XML Repository

Sophie Cluet, Pierangelo Veltri, Dan Vodislav- 2001

We are interested in maintaining and querying views in a huge and highly heterogeneous XML repository (Web scale). In this context, views are very large and there is no apparent limitation to their size. This raises interesting problems that we address in the paper: (i) how to distribute views over several machines without having a negative impact on the query translation process; (ii) how to quickly select the relevant part of a view given a query; (iii) how to minimize the cost of communicating potentially large queries to the machines where they will be evaluated. The presented system of views is implemented in the Xyleme system.

When is the evaluation of conjunctive queries tractable ?

Luc Segoufin, Martin Grohe, Thomas Schwentick- IEEE Transactions on Knowledge and Data Engineering 2001

The evaluation of conjunctive queries is hard both with respect to itscombined complexity (NP-complete) and its parameterized complexity (W[1]-complete). It becomes tractable (PTIME for combined complexity, FPT for parameterized complexity), when conjunctive queries have a bounded tree-width structure. In this paper we show that, in a sense, this is optimal both with respect to combined and parameterized complexity : Evaluating conjunctive queries is tractable precisely for those classes of queries that have a bounded tree-width structure. A technical result of independent interest is that colored grid homomorphism problem is NP-complete and, if parameterized by the grid size, WI-complete.

Xyro: The Xyleme Robot Architecture

Laurent Mignet, Vincent Aguilera, Sebastien Ailleret, Pierangelo Veltri- 2001

In this paper we address the problem of loading data from the web. We present the architecture of Xyro, a crawler designed to fetch data from the Web, and particularly XML data. We describe our experience in designing and implementing this robot, addressing some standard problems in the crawler implementation, and presenting our way to overcome in some of them.

Managing RDF Metadata for Community Webs

Bernd Amann, Michel Scholl, Vassilis Christophides, Anne-Marie Vercoustre, Irini Fundulaki, Sofia Alexaki, Greg Karvounarakis, Karsten Tolle- 2000

A zero-one law for random sentences in description logics.

Marie-Christine Rousset, Bernard Ycart- 2000

We consider the set of sentences in a decidable fragment of first order logic, restricted to unary and binary predicates, expressed on a finite set of bjects. Probability distributions on that set are defined and studied. For large sets of objects, is is shown that under these distributions, random entences typically have a very particular form, and that all monotone symmetric properties of these random sentences have a probability close to 0 or 1.

Automatic Aggregation using Explicit Metadata

Stephane Grumbach, Leonardo Tininini- 2000

The paper presents a logical data model for statistical data with an explicit modeling of metadata, which allows to perform automatic aggregation. The data are stored in standard relations from the relational model, while the metadata, defining the semantics of the relations, are represented by numerical dependencies which specify the way the summary values are defined in terms of micro-data, as well as the interrelationships among summary values. The present model supports standard relational languages such as SQL. Relations with numerical dependencies are then seen as {it statistical views} over initial relations of micro-data. Queries can be asked either against the views or directly against the initial relations, and in this later case answered, when possible, using the views. The numerical dependencies of the views are run together with the query to compute the answer to the query. This is handled in a completely automatic manner, with no need for the user to deal with the intricacy of metadata. The mechanism has been tested by an implementation in Prolog of meaningful examples of queries and dependencies. It is shown in particular that various classical problems in the realm of statistical and multidimensional databases can be easily modeled and solved in the present framework. Finally, the proposed formalism is shown to be useful for statistical database schema design.

Benchmarking Queries over Trees: Learning the Hard Truth the Hard

Sophie Cluet, Fanny Wattez, Veronique Benzaken, Guy Ferran, Christian Fiegel- 2000

In October 1997, we started what we thought would be a comprehensive benchmarking of OQL~cite{Catt97,Clue98} queries. Two years of hard work and many blunders later, we have tested one tenth of what we were shooting for. The first part of this paper should be useful to would-be benchmarkers: it explains how we could have avoided losing time. In the second part, we show that our few results are in fact extremely informative. Notably, we understand now why O$_2$ does not cope well with large associative accesses requiring duplicate I/Os and how this could be seriously improved. Also, we have a better grasp on the pros and cons of navigation versus joins in a real system and believe we know how to efficiently evaluate queries over hierarchical structures.

Diriger la réutilisation de composants à l'aide d'ontologies

Chantal Reynaud- IEEE Transactions on Knowledge and Data Engineering 2000

Cet article porte sur la construction automatique de modeles de resolution de problemes a partir d'ontologies du domaine specifiées formellement. Il est plus particulierement centre sur les aspects reutilisation. L'objectif est de montrer que l'approche adoptee est differente des approches de reutilisation habituelles qui reposent sur la reutilisation de modeles generiques a affiner et a adapter. Au contraire, nous proposons de construire un modele de raisonnement d'une application par assemblage de composants elementaires generiques predefinis. Une organisation de bibliotheques de tels composants est decrite. Cette approche a ete implementee dans ASTREE. Elle est basee sur une mise en correspondance de connaissances specifiees dans des taches et des procedures avec des connaissances d'une ontologie du domaine. Ces mises en correspondance sont facilitees par la syntaxe precise et la semantique non ambigue du langage de modelisation de l'ontologie.

Finding Successful Queries in a Mediator Context

Brigitte Safar, Alain Bidault, Christine Froidevaux- IEEE Transactions on Knowledge and Data Engineering 2000

In this paper, we study failing queries posed to a mediator in an information integration system and expressed in the logical formalism of the information integration system PICSEL. First, we present the notion of concept generalisation in a concept hierarchy that is used to repair failing queries. Then, we address two problems arising while rewritting a query using views. The first problem concerns queries that cannot be rewritten due to a lack of sources, the second one concerns queries that have only unsatisfiable rewritings.

Hierarchical Optimization of Linear Constraint Processing

Stephane Grumbach, Philippe Rigaux, Pierangelo Veltri- 2000

We consider the problem of solving a {it large} number of {it simple} systems of constraints. This problem occurs in the context of databases with very large collections of data while the constraints are over a small number of variables. The methodology we develop is based on a hierarchical evaluation of the constraints which are first simplified, and replaced by constraints approximating the initial ones. We focus on systems of linear constraints over the reals, which model spatial objects, and consider both geometric and topological approximations, defined with very simple constraints. We show that these constraints can be used either to solve the initial systems, or at least to filter out unsatisfiable systems. More generally, we consider the manipulation of the spatial objects with first-order queries. We show how the queries can be evaluated by taking advantage of the approximations. In general, it is undecidable if a query can be completely answered on approximated data. The main contribution of the paper is the development of a set of rewriting rules, that allow the transformation of queries into equivalent ones that make use of the approximation. As usual with optimization problems, an optimal solution is out of reach, but the rules are optimal for individual operations of relational algebra. The efficiency of the technique is important for the range of applications considered, where the probability of filtering from the approximated data is very high. The technique is illustrated on a practical example.

Integrating Ontologies and Thesauri for RDF schema creation and

Bernd Amann, Michel Scholl, Irini Fundulaki- 2000

In this paper we present a new approach for building metadata schemas by integrating existing ontologies and structured vocabularies (thesauri). This integration is based on the specification of inclusion relationships between thesaurus terms and ontology concepts and results in application-specific metadata schemas incorporating the structural views of ontologies and the deep classification schemes provided by thesauri. We will also show how the result of this integration can be used for RDF schema creation and metadata querying. In our context, (metadata) queries exploit the inclusion semantics of term relationships, which introduces some recursion. We will present a fairly simple database-oriented solution for querying such metadata which avoids a (recursive) tree traversal and is based on a linear encoding of thesaurus hierarchies.

Le Web : du texte

Sophie Cluet- 2000

Le Web est une banque de donn

Manipulating Interpolated Data is Easier Than You Thought

Stephane Grumbach, Philippe Rigaux, Luc Segoufin- 2000

Data defined by interpolation is frequently found in new applications involving, for instance, geographical concepts, moving objects, and spatio-temporal data. This data leads to potentially infinite collections of items, (e.g. the elevation of any point in a map), whose definition is based on the association of a collection of samples with an interpolation function. We first argue that the manipulation of the data through direct access to the samples and interpolation functions easily leads to cumbersome or inaccurate queries. We therefore suggest hiding the samples and the interpolation function away from the logical level, and letting the system manipulate them at the physical level. We propose to model such data conceptually using infinite relations (e.g. the map with elevation yields an infinite ternary relation) which can be manipulated through standard relational query languages (e.g. SQL), with no mention of the interpolated definition. This approach is simple and establishes a clear separation between logical and physical levels. It requires standard relational query languages whose semantics is well-defined. Moreover, we show that the evaluation of queries, which includes the computation of the sampling collection and the interpolation function of the output, can be done very efficiently, and we describe the physical level algorithms. We show that this approach can be easily supported by standard database techniques, and demonstrate this with a prototype.

Multi-Resolution on compressed sets of clauses.

Philippe Chatalic, Laurent Simon- IEEE Transactions on Knowledge and Data Engineering 2000 November:2--10

This paper presents a system based on new operators for handling sets of propositional clauses represented by means of ZBDDs. The high compression power of such data structures allows efficient encodings of structured instances. A specialized operator for the distribution of sets of clauses is introduced and used for performing multi-resolution on clause sets. Cut eliminations between sets of clauses of exponential size may then be performed using polynomial size data structures. The ZRes system, a new implementation of the Davis-Putnam procedure of 1960, solves two hard problems for resolution, that are currently out of the scope of the best SAT provers.

On the Content of Materialized Aggregate Views

Stephane Grumbach, Leonardo Tininini- 2000

We consider the problem of answering queries using only materialized views. We first show that if the views subsume the query from the point of view of the information content, then the query can be answered using only the views, but the resulting query might be extremely inefficient. We then focus on aggregate views and queries over a single relation, which are fundamental in many applications such as data warehousing. We show that in this case, it is possible to guarantee that as soon as the views subsume the query, it can be completely rewritten in terms of the views in a simple query language. Our main contribution is the conception of various rewriting algorithms which run in polynomial time, and the proof of their completeness which relies on combinatorial arguments. Finally, we discuss the choice of materializing or not ratio views such as average and percentage, important for the design of materialized views. We show that it has an impact on the information content, which can be used to protect data, as well as on the maintenance of views.

On Wrapping Query Languages and Efficient XML Integration

Sophie Cluet, Vassilis Christophides, Jerome Simeon- 2000

Modern applications (portals, e-commerce, digital libraries, etc.) require integrated access to various information sources (from traditional RDBMS to semistructured Web repositories), fast deployment and low maintenance cost in a rapidly evolving environment. Because of its flexibility, there is an increasing interest in using XML as a middleware model for such applications. XML enables fast wrapping and declarative integration. However, query processing in XML-based integration systems is still penalized by the lack of an algebra with adequate optimization properties and the difficulty to understand source query capabilities. In this paper, we propose an algebraic approach to support efficient query evaluation in XML integration systems. We define a general purpose algebra suitable for semistructured or XML query languages. We show how this algebra can be used, with appropriate type information, to also wrap more structured query languages such as OQL or SQL. Finally, we develop appropriate optimization techniques.

Querying Spatial Databases via Topological Invariants

Victor Vianu, Luc Segoufin- IEEE Transactions on Knowledge and Data Engineering 2000 - 61(2):270-301

The paper investigates the use of topological annotations(called topological invariants)to answer topological queries in spatial databases.The focus is on the translation of topological queries against thespatial database into queries against the topological invariant.The languages considered are first-order on the spatial database side,and fixpoint counting, fixpoint, and first-order on the topological invariant side.In particular, it is shown that fixpoint countingexpresses precisely all the ptime queries on topological invariants;if the regions are connected, fixpoint expresses all ptime querieson topological invariants.

Querying XML Documents in Xyleme

Sophie Cluet, Vincent Aguilera, Fanny Wattez, Pierangelo Veltri, Dan Vodislav- 2000

XML is a standard to exchange structured data over the Web. It is being widely adopted. Undoubtedly, more and more data on the Web will be in XML form, as more and more DTDs will be available to describe it. Given this, the Xyleme project proposes to study and build a dynamic World Wide XML warehouse, i.e., a data warehouse capable of storing all the XML data available on the planet. Among the many problems addressed by the Xyleme team, we focus in this paper on query processing.

R

Brigitte Safar, Alain Bidault- IEEE Transactions on Knowledge and Data Engineering 2000 - 1:225-235

Un m

R

Philippe Chatalic, Laurent Simon- IEEE Transactions on Knowledge and Data Engineering 2000 june:107--118

Cet article pr

Reachability and Connectivity Queries in Constraint Databases

Luc Segoufin, Michael Benedikt, Martin Grohe, Leonid Libkin- 2000

It is known that standard query languages for constraint databases lack the power to express connectivity properties. Such properties are important in the context of geographical databases, where one naturally wishes to ask queries about connectivity (what are the connected components of a given set?) or reachability (is there a path from A to B that lies entirely in a given region?). No existing constraint query languages that allow closed form evaluation can express these properties. In the first part of the paper, we show that in principle there is no obstacle to getting closed languages that can express connectivity and reachability queries. In fact, we show that adding any topological property to standard languages like FO(<.+) and FO(<,+,*) results in a closed language. In the second part of the paper, we look for tractable closed languages for expressing reachability and connectivity queries. We introduce path logic, which allows one to state properties of paths with respect to given regions. We show that it is closed, has polynomial time data complexity for linear and polynomial constraints, and can express a large number of reachability properties beyond simple connectivity. Query evaluation in the logic involves obtaining a discrete abstraction of a continuous path, and model-checking of temporal formulae on the discrete structure.

Relational Transducers for Electronic Commerce

Serge Abiteboul, Victor Vianu, Yelena Yesha, Brad S. Fordham- 2000

Electronic commerce is emerging as one of the major Web-supported applications requiring database support. We introduce and study high-level declarative specifications of business models, using an approach in the spirit of active databases. More precisely, business models are specified as relational transducers that map sequences of input relations into sequences of output relations. The semantically meaningful trace of an input-output exchange is kept as a sequence of log relations. We consider problems motivated by electronic commerce applications, such as log validation, verifying temporal properties of transducers, and comparing two relational transducers. Positive results are obtained for a restricted class of relational transducers called Spocus transducers (for semi-positive outputs and cumulative state). We argue that despite the restrictions, these capture a wide range of practically significant business models.

Repairing Queries in a Mediator Approach

Brigitte Safar, Alain Bidault, Christine Froidevaux- IEEE Transactions on Knowledge and Data Engineering 2000

In this paper, we study unsatisfiable queries posed to a mediator in an information integration system and expressed in a logical formalism. First, we characterise conflicts as the minimal causes of the unsatisfiability of a query. Then, thanks to our cooperative query answering process, we produce its set of repairs. A repair is a query that does not generate any conflict and that has a common generalisation with the initial query and is semantically close to it.

Rewriting Conjunctive Queries using Views in Description Logics with Existential Restrictions

Marie-Christine Rousset, Francois Goasdou- 2000

In database, rewriting queries using views has received significant attention because of its relevance to several fields such as query optimization, data warehousing, and information integration. In those settings, data used to answer a query are restricted to be extensions of a set of predefined queries (views). The information integration context is typical of the need of rewriting queries using views for answering queries. Users of information integration systems do not pose queries directly to the (possibly remote) sources in which data are stored but to a set of virtual relations that have been designed to provide a uniform and homogeneous access to a domain of interest. When the contents of the sources are described as views over the (virtual) domain relations, the problem of reformulating a user query into a query that refers directly to the relevant sources becomes a problem of rewriting queries using views. In this paper, we study the problem of rewriting conjunctive queries over DL expressions into conjunctive queries using a set of views that are a set of distinguished DL expressions, for three DLs allowing existential restrictions: FLE, ALE and ALEN. For FLE, we present an algorithm that computes a representative set of all the rewritings of a query. In the full version of this paper (cf. Technical report), we show how to adapt it to deal with the negation of atomic concepts in the queries and in the views, in order to obtain a rewriting algorithm for ALE. Finally, we show that obtaining a finite set representative of all the rewritings of a query is not guaranteed in ALEN.

Satifiability and Relevance for Queries over Active Documents

Serge Abiteboul, Bogdan Marinoiu, Pierre Bourhis- IEEE Transactions on Knowledge and Data Engineering 2000

The use of Carin language and algorithms for Information Integration: the Picsel system

Marie-Christine Rousset, Francois Goasdou, V Latt- 2000

Picsel is an information integration system over sources that are distributed and possibly heteregeneous. The approach which has been chosen in Picsel is to define an information server as a knowledge-based mediator in which Carin is used as the core logical formalism to represent both the domain of application and the contents of information sources relevant to that domain. In this paper, we describe the way the expressive power of the Carin language is exploited in the Picsel information integration system, while maintaining the decidability of query answering. We illustrate it on examples coming from the tourism domain, which is the first real case that we have to consider in Picsel, in collaboration with France Telecom (CNET) and the travel agency Degriftour.

Towards cost-based optimization for data-intensive Web service computations

Ioana Manolescu, Nicolaas Ruberg, Gabriela Ruberg- IEEE Transactions on Knowledge and Data Engineering 2000

INRIA Research Report no. 5222

Towards Learning in CARIN-ALN

Veronique Ventos, Celine Rouveirol- 2000

In this paper we investigate a new language for learning, which combines two well-known representation formalisms, Description Logics and Horn Clause Logics. Our goal is to study the feasability of learning in such a hybrid description - horn clause language, namely CARIN-ALN \cite{MCAI98a}, in the presence of hybrid background knowledge, including a Horn clause and a terminological component. After setting our learning framework, we present algorithms for testing example coverage and subsumption between two hypotheses, based on the existential entailment algorithm studied in\cite{MCAI98a}. While the hybrid language is more expressive than horn clause logics alone, the complexity of these two steps for CARIN-ALN remains bounded by their respective complexity in horn clause logics.

Zres: The old Davis-Putnam procedure meets ZBDDs

Philippe Chatalic, Laurent Simon- IEEE Transactions on Knowledge and Data Engineering 2000 - 1831:449--454

ZRes is a propositional prover based on the original procedure of Davis and Putnam, as opposed to its modified version of Davis, Logeman and Loveland, on which most of the current efficient SAT provers are based. On some highly structured SAT instances, such as the well known Pigeon Hole and Urquhart problems, both proved hard for resolution, ZRes performs very well and surpasses all classical SAT provers by an order of magnitude.

Integrating Ontologies and Thesauri to build RDF Schemas

Bernd Amann, Irini Fundulaki- 1999

In this paper we present a new approach for building RDF schemas by integrating existing ontologies and structured vocabularies (thesauri). We will present a simple mechanism based on the specification of inclusion relationships between thesaurus terms and ontology concepts and show how these relationships can be exploited to create application-specific RDF schemas incorporating the structural views of ontologies and deep classification schemes provided by thesauri.

Active Views for Electronic Commerce

Serge Abiteboul, Bernd Amann, Sophie Cluet, Tova Milo, Anat Eyal, Laurent Mignet- 1999

Electronic commerce is emerging as a major Web-supported application. In this paper we argue that database technology can, and should, provide the backbone for a wide range of such applications. More precisely, we present here the ActiveViews system, which, relying on an extensive use of database features including views, active rules (triggers), and enhanced mechanisms for notification, access control and logging/tracing of users activities, provides the needed basis for electronic commerce. Based on the emerging XML standards (DOM, query languages for XML, etc.), the system offers a novel declarative view specification language, describing the relevant data and activities of all actors (e.g. vendors and clients) participating in electronic commerce activities. Then, acting as an application generator, the system generates an actual, possibly customized, Web application that allows users to perform the given set of controlled activities and to work interactively on the specified data in a standard distributed environment The ActiveView system is developed at INRIA on top of Java and ArdentSoftware's XML repository

An Algebra for Pomsets

Stephane Grumbach, Tova Milo- 1999

We study languages for manipulating partially ordered structures with duplicates (e.g. trees, lists). As a general framework, we consider the pomset (partially ordered multiset) datatype. We introduce an algebra for pomsets, which generalizes traditional algebras for (nested) sets, bags and lists. This paper is motivated by the study of the impact of different language primitives on the expressive power. We show that the use of partially ordered types increases the expressive power significantly. Surprisingly, it turns out that the algebra when restricted to both unordered (bags) and totally ordered (lists) intermediate types, yields the same expressive power as fixpoint logic with counting on relational databases. It therefore constitutes a rather robust class of relational queries. On the other hand, we obtain a characterization of PTIME queries on lists by considering only totally ordered types.

Aspects th

Bouthinon, D., H. Soldano and V. Ventos- IEEE Transactions on Knowledge and Data Engineering 1999 :145-152

Bases de donn'ees : quelques cl'es pour l'ouverture vers le r'eseau

Sophie Cluet- 1999

Apr`es avoir longtemps v'ecu en autarcie, les syst`emes d'information s'ouvrent vers le r'eseau. L''emergence du Web au d'ebut de cette derni`ere d'ecennie a certainement accentu'e l'ampleur d'un mouvement que le d'eveloppement des r'eseaux, l'accroissement des donn'ees num'eris'ees et la diversit'e des traitements rendaient in'eluctable. Aujourd'hui, les syst`emes informatiques g`erent, importent, diffusent, 'echangent, int`egrent de gros volumes de donn'ees parfois r'epliqu'ees, souvent dans des formats diff'erents. Les syst`emes de gestion de bases de donn'ees (SGBD) permettent la gestion efficace et coh'erente de gros volumes de donn'ees, interrog'ees et mises `a jour de fac{c}on concurrente par plusieurs utilisateurs. Ils sont donc bien 'evidemment au coe ur du probl`eme. Conc{c}us `a l'origine comme des syst`emes ferm'es pour des donn'ees r'eguli`eres et des applications de gestion, ils 'evoluent pour devenir la brique de base de nouvelles applications ouvertes sur le r'eseau telles que librairies num'eriques, syst`emes Intra/Internet, serveurs de Web, commerce 'electronique, etc. Ces applications sont `a l'origine d'une mine de nouveaux probl`emes pour les bases de donn'ees. Cette th`ese traite plusieurs de ces probl`emes. Elle s'articule autour d'une id'ee forte et de son corollaire~: l'utilisation de langages d'eclaratifs et leur optimisation.

Complexity of Answering Queries Using Materialized Views

Serge Abiteboul, Oliver Duschka- IEEE Transactions on Knowledge and Data Engineering 1999

We study the complexity of the problem of answering queries using materialized views. This problem has attracted a lot of attention recently because of its relevance in data integration. Previous work considered only conjunctive view definitions. We examine the consequences of allowing more expressive view definition languages. The languages we consider for view definitions and user queries are: conjunctive queries with inequality, positive queries, datalog, and first-order logic. We show that the complexity of the problem depends on whether views are assumed to store all the tuples that satisfy the view definition, or only a subset of it. Finally, we apply the results to the view consistency and view self-maintainability problems which arise in data warehousing.

Davis et putnam, 40 ans apr

Philippe Chatalic, Laurent Simon- IEEE Transactions on Knowledge and Data Engineering 1999 :9--16

Cet article pr

Du Chargement en Masse dans une Base de Donn

Sihem Amer-Yahia- 1999

Form Management in the Vicomte Workflow System

Daniel Chan- 1999

It is one of the distinguishing characteristics of workflow management systems to integrate automated processes with human processes in an application system. The way a human process is modelled in a workflow system is by capturing the significant events of the process. These events include notifying the user who is assigned to perform the process, disseminating information that is necessary for the user to carry out the process, and collecting information that is generated as the result of the user process. Such events are often presented to the user or captured from the user using forms. This paper presents a form management system that is part of the Vicomte workflow management system. The major components include a form specification language, a facility for dynamically generating HTML pages with client-side validation to be used over the internet, as well as processing modules for the generated HTML pages on the server side. The architecture of the form management system is rather general and therefore is applicable for many systems requiring an effective form processing facility.

In search of the lost schema

Stephane Grumbach, Gianni Mecca- 1999

We study the problem of rediscovering the schema of nested relations that have been encoded as strings for storage purposes. We consider various classes of encoding functions, and consider the mark-up encodings, which allow to find the schema without knowledge of the encoding function, under reasonable assumptions on the input data. Depending upon the encoding of empty sets, we propose two polynomial on-line algorithms (with different buffer size) solving the schema finding problem. We also prove that with a high probability, both algorithms find the schema after examining a fixed number of tuples, thus leading in practice to a linear time behavior with respect to the database size for wrapping the data. Finally, we show that the proposed techniques are well-suited for practical applications, such as structuring and wrapping HTML pages and Web sites.

Int

Jerome Simeon- 1999

We design novel techniques and tools to simplify the exploitation of heterogeneous data sources. So far, two main approaches have been competing. Structured integration systems take advantage of database techniques to offer efficient query evaluation over large sources. However, they lack the flexibility required in a truly heterogeneous environment. Semistructured systems rely on schema-less data models in order to provide simple and fast integration over any set of sources. Yet, the absence of a schema entails an increased complexity to interpret the data. It is also a major disadvantage during query evaluation. We introduce a type system for the semistructured data model and we define YATL, a declarative integration language. YATL is able to resolve structural conflicts between sources and features high-level primitives for the manipulation of collections and references. Type information serves as a guide during the specification of integration programs and enables verifications procedures through type-checking. Moreover, YATL allows to preserve as much flexibility as possible by exploiting genericity properties. The YAT system is a prototype which implements these tools. Its architecture is designed to minimize the work required for the integration process. It has been used successfully to build various applications. We propose an operational model suitable to describe the evaluation of YATL programs. It contains an algebra for semistructured data, whose equivalences allow us to recover all the major standard optimization techniques. It also opens some new optimization opportunities in the semistructured context. Finally, a precise description of source query capabilities within the operational model can be used to obtain an efficient distributed evaluation.

Modeling information sources for information integration

Chantal Reynaud, Francois Goasdou- IEEE Transactions on Knowledge and Data Engineering 1999 May:121--138

The aim of this paper is to present an approach and automated tools for designing knowledge bases describing the contents of information sources in PICSEL knowledge based mediators. We address two problems: the abstraction problem and the representation problem, when information sources are relational databases. In a first part, we present an architectural overview of the PICSEL mediator showing its main knowledge components. Then, we describe our approach and tools that we have implemented to indentify, by means of an abstraction process, the main relevant concepts, called semantic concepts, in an Entity Relationship model and to help representing these concepts using CARIN, a logical language combining description logics and Datalog Rules, and using specific terms in the application domain model.

Notes on Query Evaluation and Optimization

Stephane Grumbach, Philippe Rigaux, Luc Segoufin- 1999

The optimization of queries over constraint databases is a prerequisite to the use of this methodology for data intensive applications. In this paper, we present some preliminary results, together with promising directions of research to address this fundamen tal issue.

On the Orthographic Dimension of Constraint Databases

Stephane Grumbach, Philippe Rigaux, Luc Segoufin- 1999

One of the most important advantages of constraint databases is their ability to represent and to manipulate data in arbitrary dimension within a uniform framework. Although the complexity of querying such databases by standard means such as first-order queries has been shown to be tractable for reasonable constraints (e.g. polynomial), it depends badly (roughly speaking exponentially) upon the dimension of the data. A precise analysis of the trade-off between the dimension of the input data and the complexity of the queries reveals that the complexity strongly depends upon the use the input makes of its dimensions. We introduce the concept of orthographic dimension, which, for a convex object $O$, corresponds to the dimension of the (component) objects $O_1,...,O_n$, such that $O=O_1 imescdots imes O_n$. We study properties of databases with bounded orthographic dimension in a genera l setting of o-minimal structures, and provide a syntactic characterization of first-order orthographic dimension preserving queries. The main result of the paper concerns linear constraint databases. We prove that orthographic dimension preserving Boolean combination of conjunctive queries can be evaluated independently of the global dimension, with operators limited to the orthographic dimension, in parallel on the components. This results in an extremely efficient optimization mechanism, very easy to use in practical applications.

Ozone: Integrating Structured and Semistructured Data

Serge Abiteboul, J. Widom, T. Lahiri- 1999

In this paper, we extend the standard object database model ODMG and the query language OQL with the ability to handle semistructured data as found, for instance, in XML documents. In this extension, a very structured portion of the data, e.g., a relation or a standard ODMG object, may contain entrypoints to semistructured portions of the data and vice versa. The management of such ``hybrid'' data often present in modern database applications constitutes the main contribution of our work. Although our modeling of semistructured data is largely inspired by OEM and {em Lorel}, it is in some aspects richer. In particular, we consider semistructured lists and distinguish between components of semistructured data objects and references. These are essential features to properly handle XML data. Also, since semistructured data here lives in the context of structured data, we have to provide a refined control of typing: typing is strictly enforced on the regular portion of the data and, like in Lorel, coercion is used extensively on the semistructured portion. Finally, we introduce the notion of {em proxy} that allows to view as semistructured (so untyped) some data that is in reality structured in the database. We present in the paper the {em Ozone} system that is built as a wrapper on top of the ODMG compliant O$_2$ system. Ozone fully supports the extensions of the ODMG model and of OQL that we introduce.

Querying Aggregate Data

Stephane Grumbach, Maurizio Rafanelli, Leonardo Tininini- 1999

We introduce a first-order language with real polynomial arithmetic and aggregation operators (count, iterated sum and multiply), which is well suited for the definition of aggregate queries involving complex statistical functions. It offers a good trade-off between expressive power and complexity, with a tractable data complexity. Interestingly, some fundamental properties of first-order with real arithmetic are preserved in the presence of aggregates. In particular, there is an effective quantifier elimination for formulae with aggregation. We consider the problem of querying data that has already been aggregated in aggregate views, and focus on queries with an aggregation over a conjunctive query. Our main conceptual contribution is the introduction of a new equivalence relation among conjunctive queries, the isomorphism modulo a product. We prove that the equivalence of aggregate queries such as for instance averages reduces to it. Deciding if two queries are isomorphic modulo a product is shown to be NP-complete. We then show that the problem of complete rewriting of count queries using count views is also NP-complete. Finally, we introduce new rewriting techniques based on the isomorphism modulo a product to recover the values of counts by complex arithmetical computation from the views. We conclude by showing how these techniques can be used to perform automatic aggregation.

The Design and Implementation of DEDALE

Michel Scholl, Stephane Grumbach, Philippe Rigaux, Luc Segoufin- 1999

This paper introduces DEDALE, one of the first implementations of a database system based on the linear constraint model. DEDALE is a long term project started in 1996 as a cooperation between the VERSO group at INRIA and the VERTIGO group at CNAM. DEDALE is intended to demonstrate the practical relevance of this model to handle geometric applications in various areas such as GIS or spatio-temporal databases. The system represents and manipulates multi-dimensional pointsets as linear constraint relations which are nested into classical relational tuples. Although the (global) dimension is arbitrary, the orthographic dimension (orthodim)is at most~2. This restriction allows for a trade-off between efficiency of evaluation and representation power.

Using LDAP Directory Caches

Sophie Cluet, Olga Kapitskaia, Divesh Srivastava- 1999

LDAP (Lightweight Directory Access Protocol) directories have recently proliferated with the growth of the Internet, and are being used in a wide variety of network-based applications to store data such as personal profiles, address books, and network and service policies. These systems provide a means for managing heterogeneity in a way far superior to what conventional relational or object-oriented databases can offer. To achieve fast performance for declarative query answering, it is desirable to use client caching based on semantic information (instead of individual directory entries). We formally consider the problem of reusing cached LDAP directory entries for answering declarative LDAP queries. A semantic LDAP directory cache contains directory entries, which are semantically described by a set of query templates. We show that, for conjunctive queries and LDAP directory caches with positive templates, the complexity of cache-answerability is NP-complete in the size of the query. For this case, we design a sound and complete algorithm for cache-answerability based on a suite of query transformations that capture the semantics of LDAP queries. We demonstrate the practicality of this algorithm for real applications with a performance evaluation, based on sample queries from a directory enabled application at AT&T Labs. When the query templates in the cache contain negation, we show that the complexity of cache-answerability of conjunctive LDAP queries is co-NP complete in the size of the schema and query templates in the semantic description of the cache. Finally, we identify natural restrictions on the nature of the semantic descriptions for polynomial-time cache-answerability.

Views and XML

Serge Abiteboul- 1999

The notion of views is essential in databases. It allows various users to see data from different viewpoints. In the present paper, we informally present works of the author on the topic. Instead of addressing the issue of views in a classical database setting, the paper focuses on XML and looks at these various works in the context of a system of views for XML.

XML repository and Active Views Demonstration

Serge Abiteboul, Bernd Amann, Sophie Cluet, Tova Milo, Cassio Souza Dos Santos, Anne-Marie Vercoustre, Brendan Hills, Laurent Mignet, Vincent Aguilera, Sebastien Ailleret, Frederic Hubert, Jean-Claude Mamou, Amelie Marian, Bruno Tessier- 1999

The goal of this demonstration is to present the main features of (i) Axielle, an XML repository developed by Ardent Software on top of the O$_2$ object-oriented DBMS and (ii) the ActiveView system which has been built by the Verso project at INRIA on top of Axielle. The demonstration is based on a simple electronic commerce application which will be described in Section 2. Electronic commerce is emerging as a major Web-supported application. It involves handling and exchange of data (e.g. product catalogs, yellow pages, etc.) and must provide (i) database functionalities (query language, transactions, concurrency control, distribution and recovery) for the efficient management of large data volumes and hundreds of users as well as (ii) standard data storage and exchange formats (e.g. XML, SGML) for the easy integration of existing software and data. The ActiveView system combined with the Axielle XML repository enables a fast deployment of electronic commerce applications based on a new high-level declarative specification language (AVL), advanced database technology (object-oriented data model, XML query language, notifications), Web standards (HTTP, HTML) and other Internet compliant technologies (Java, RMI).

A Document-driven Approach to Database Report Generation

Daniel K.C. Chan- 1998

It can be argued that report generation is the most frequently performed task in database applications. Therefore the efficacy and efficiency of a database report generation mechanism has a significant impact on productivity. This paper introduces a document-driven approach as opposed to the traditional schema-driven approach to report generation resulting in more flexibility. In this approach, generating a report involves specifying it in terms of a user defined SGML-based report model. Contents of a report can be specified using a transformation language together with queries that retrieve data from different databases. A report is output as a SGML document which can be further edited as well as translated to other formats, for instance to HTML to be viewed using an Internet browser. This paper presents the approach using an example and discusses the features and usage of the transformation language which is a small but expressive language. Despite of the fact that we have only investigated the transformation against relational databases, we believe that this approach can unify report generation for different database models.

A Language for Publishing Virtual Documents on the Web

Anne-Marie Vercoustre, Fran Paradis- 1998

The Web is creating exciting new possibilities for direct and instantaneous publishing of information. However, the apparent ease with which one can publish documents on the Web hides more complex issues such updating and maintaining Web pages. We believe one of the crucial requirements to document delivery is the ability to extract and reuse information from other documents or sources. In this paper we present a descriptive language that allows users to write virtual documents, where dynamic information can be retrieved from various sources, transformed, and included along with static information in HTML documents. The language uses a tree-like structure for the representation of information, and defines a database-like query language for extracting and combining information without a complete knowledge of the structure or the types of information. The data structures and the syntax of the language are presented along with examples.

A Logical View of Structured Files

Serge Abiteboul, Sophie Cluet, Tova Milo- 1998

Structured data stored in files can benefit from standard database technology. In particular, we show here how such data can be queried and updated using declarative database languages. We introduce the notion of structuring schema which consists of a grammar annotated with database programs. Based on a structuring schema, a file can be viewed as a database structure, queried and updated as such. For queries, we show that almost standard database optimization techniques can be used to answer queries without having to construct the entire database. For updates, we study in depth the propagation to the file of an update specified on the database view of this file. The problem is infeasible in general and we present a number of negative results. The positive results consist of techniques that allow to propagate updates efficiently under some reasonable locality conditions on the structuring schemas.

A Specification Language for the WIDE Workflow Model

Daniel Chan, Jochem Vonk, Gabriel Sanchez, Paul Grefen, Peter Apers- 1998

The workflow paradigm is advancing the scope of process modelling by supporting complex control flows, rich process structures, and the integration of heterogeneous and legacy systems. However, studies of the paradigm have been seriously hindered by the lack of a concrete framework. This paper presents such a framework in the form of a workflow specification language. Unlike many other proposals, the specification language captures all the fundamental elements of the workflow paradigm at an abstraction level that is suitable for workflow application designers. More specifically, it provides a rich organisation model that can capture distribution using domains, an information model that also records presentation details, a sophisticated process model that supports active rules and workflow transactions, and the relationships between the three models. This paper presents the essential elements of the specification language and illustrates them with a comprehensive example. Workflow application developers should find the specification language a useful and compact means to capture and investigate design details. Workflow system developers would discover the language a good vehicle to study the interaction between different features as well as facilitate the development of more advanced features. Others would attain a better understanding of the workflow paradigm and could use the language as a basis of evaluation for the functionality of workflow systems.

A Virtual Document Interpreter for Reuse of Information

Anne-Marie Vercoustre, Fran Paradis, Brendan Hills- 1998

The importance of reuse of information is well recognised for electronic publishing. However, it is rarely achieved satisfactorily because of the complexity of the task: integrating different formats, handling updates of information, addressing document author's need for intuitiveness and simplicity, etc. An approach which addresses these problems is to dynamically generate and update documents through a descriptive definition of emph{virtual documents}. In this paper we present a document interpreter that allows gathering information from multiple sources, and combining it dynamically to produce a virtual document. Two strengths of our approach are: the generic information objects that we use, which enables access to distributed, heterogeneous data sources; and the interpreter's evaluation strategy, which permits a minimum of re-evaluation of the information objects from the data sources.

Active Views for Electronic Commerce

Serge Abiteboul, Victor Vianu, Bernd Amann, Sophie Cluet, Tova Milo- 1998

Electronic commerce is emerging as a major Web-supported application. We use object database technology as the backbone for a wide range of electronic commerce applications. More precisely, we present a system named ActiveView built on the O2 object database system. The ActiveView system uses extensively database features such as views or active rules (triggers). It provides a sophisticated access control mechanism and the means of logging certain specified events to trace the activity of an application. Also, the ActiveView system uses the Spocus framework to capture the workflow component based on some simple and limited set of rules. We use examples to show that the system thereby obtained is appropriate for electronic commerce applications involving interactions between individuals over the network. More precisely, we show how such an application can be declaratively specified, how it is run by the system, and how its activity can be controled and traced.

Answering queries using materialized views

Serge Abiteboul, O. Duschka- 1998

We study the data complexity of the problem of computing certain answers given view definitions, an instance of the views, and a query. We consider conjunctive queries, conjunctive queries with inequality, positive queries, datalog, and first order queries as query and view definition languages. We show that the choice between the assumption that views are complete and the assumption that some tuples might be missing has a considerable impact on the complexity of the problem. Our results imply that in many cases datalog programs are not expressive enough to answer queries using views in a best possible way. Finally, we examine the complexity of two related problems: view self-maintainability and view consistency.

Benchmarking Spatial Joins A La Carte

Michel Scholl, Philippe Picouet, Vincent Oria, Jean-Marc Saglio, Oliver Guenther- 1998

This paper addresses the issue of benchmarking spatial joins. For this purpose, we first present a web based benchmark generator to produce sets of rectangles. Second, using this generator and a set of statistical models, we define several tests to compare the performance of three spatial join algorithms. Our results show that the relative performance of the different techniques mainly depends on sample size and selctivity of the join predicate.

Bulk Loading Techniques for Object Databases and an Application to Relational Data

Sophie Cluet, Claude Delobel, Sihem Amer-Yahia- 1998

We present a framework for designing, in a declarative and flexible way, efficient migration programs and an undergoing implementation of a migration tool called RelOO whose targets are any ODBC compliant system on the relational side and the O2 system on the object side. The framework consists of (i) a declarative language to specify the migration process and (ii) an algebra-based program rewriting technique. The main originality of the language resides in the fact that it does not only provide the means to specify the database transformations from relations to objects, but also physical properties we want to achieve on the object side (such as clustering and sorting). The originality of the rewriting technique relies in its ability to optimize the migration processing time while taking into account physical properties and transaction decomposition. To achieve this, we introduce equivalences of a new kind that consider side-effects on the object database.

den, Germany, August

Anne-Marie Vercoustre, Craig A. Lindley- 1998

The FRAMES project is developing a system for video database search, content-based retrieval, and virtual video program synthesis. For dynamic synthesis applications, a video program is specified at a high level using a virtual video prescription. The prescription is a document expressed in a semi-formal language specifying the video structure, including formal specifications for direct reference to specific video components, components specified and retrieved by parametric query, and components generated as associative chains based upon an initial specification. This paper describes the syntax and semantics of the specification language for associative chaining. The language allows the specification of entity types for association generation, the specification of conjunctions and disjunctions of entity types, initial values, fixed constraints, and weights upon entity types. A short example is presented to illustrate the use of the language and how it may be used to generate different video sequences from a common database of video components.

Designing OQL: Allowing Objects to be Queried

Sophie Cluet- 1998

This paper tells the story of OQL, the standard query language of the Object Database Management Group (ODMG). The story starts in 1988, at INRIA in the Alta

Extracting schema from semistructured data

Serge Abiteboul, Svetlazar Nestorov, Rajeev Motwani- 1998

Of late, especially with the advent of the world-wide web, there has been an increasing interest in semistructured data in the database community. Semistructured data is characterized by the lack of any fixed and rigid schema, although typically the data has some implicit structure. While the lack of fixed schema makes extracting semistructured data fairly easy and an attractive goal, presenting and querying such data is greatly impaired. Thus, a critical problem is the discovery of the structure implicit in semistructured data and, subsequently, the recasting of the raw data in terms of this structure. In this paper, we consider a very general form of semistructured data based on labeled, directed graphs. We show that such data can be typed using the greatest fixpoint semantics of monadic datalog programs. We present an algorithm for approximate typing of semistructured data. We establish that the general problem of finding an optimal such typing is NP-hard, but present some heuristics and techniques based on clustering that allow efficient and near-optimal treatment of the problem. We also present some preliminary experimental results.

Fusion Query Optimization

Serge Abiteboul, Yannis Papakonstantinou, Hector Molina, Ramana Yerneni- 1998

Fusion queries search for information integrated from distributed, autonomous sources over the Internet. In this context, data is not cleanly fragmented as in traditional distributed databases, and the number of sources participating in a typical query is large. We investigate techniques for efficient processing of fusion queries. First, we focus on a very wide class of query plans that capture the spirit of many techniques usually considered in existing systems. We show how to efficiently find, under various realistic scenarios, good query plans within this large class. We evaluate the performance of these plans and provide additional heuristics that, by considering plans outside our target class of plans, yield further performance improvements.

Incremental maintenance for materialized views over semistructured data

Serge Abiteboul, Janet Wiener, Jason Mc Hugh, Michael Rys, Vassilis Vassalos- 1998

Semistructured data is not strictly typed like relational or object-oriented data and may be irregular or incomplete. It often arises in practice, e.g., when heterogeneous data sources are integrated or data is taken from the World Wide Web. Views over semistructured data can be used to ,lter the data and to restructure (or provide structure to) it. To achieve fast query response time, these views are often materialized. This paper studies incremental maintenance techniques for materialized views over semistructured data. We use the graph-based data model OEM and the query language Lorel, developed at Stanford, as the framework for our work. We propose a new algorithm that produces a set of queries that compute the changes to the view based upon a change to the source. We develop an analytic cost model and compare the cost of executing our incremental maintenance algorithm to that of recomputing the view. We show that for nearly all types of database updates, it is more e

Liaison: a Workflow Model for Novel Applications

Daniel Chan, Karl Leung, Ching Ying Yan, Keith Chan- 1998

The workflow paradigm is advancing the scope of process modelling by supporting rich process structures as well as resource allocation through the use of an organisation model. Its initial success has attracted a lot of interests in applying it to applications that are more complex than what can be handled by the current generation of workflow management systems. This paper introduces a new workflow model called Liaison that is designed to address the needs of these novel applications. More specifically, it provides a rich organisation model, sophisticated activity assignment constraints, an information model that also records presentation details, and a dynamic process model. An implemented prototype is described together with an examination of desirable improvements.

Query Flocks: An Extension to Association-Rule Mining

Serge Abiteboul, S. Nestorov, R. Motwani, S. Tsur, J. D. Ullman, C. Clifton, A. Rosenthal- 1998

Association-rule mining has proved a highly successful technique for extracting useful information from very large databases. This success is attributed not only to the appropriateness of the objectives, but to the fact that a number of new query-optimization ideas, such as the "a-priori" trick, make association-rule mining run much faster than might be expected. In this paper we see that the same tricks can be extended to a much more general context, allowing efficient mining of very large databases for many different kinds of patterns. The general idea, called "query flocks," is a generate-and-test model for data-mining problems. We show how the idea can be used either in a general-purpose mining system or in a next generation of conventional query optimizers.

Querying Spatial Databases via Topological Invariants

Victor Vianu, Luc Segoufin- 1998

The paper investigates the use of topological annotations (called topological invariants) to answer topological queries in spatial databases. The focus is on the translation of topological queries against the spatial database into queries against the topological invariant. The languages considered are first-order on the spatial database side, and fixpoint and first-order on the topological invariant side. In particular, it is shown that fixpoint expresses precisely the ptime queries on topological invariants.

Reflective Relational Machine

Serge Abiteboul, Victor Vianu, C.H. Papadimitriou- 1998

We propose a model of database programming with reflection (dynamic generation of queries within the host programming language), called the reflective relational machine, and characterize the power of this machine in terms of known complexity classes. In particular, the polynomial-time restriction of the reflective relational machine is shown to express PSPACE, and to correspond precisely to uniform circuits of polynomial depth and exponential size. This provides an alternative, logic-based formulation of the uniform circuit model, that may be more convenient for problems naturally formulated in logic terms, and establishes that reflection allows for more "intense" parallelism, which is not attainable otherwise (unless P = PSPACE). We also explore the power of the reflective relational machine subject to restrictions on the number of variables used, emphasizing the case of sublinear bounds.

Relational transducers for electronic commerce

Serge Abiteboul, Victor Vianu, Brad Fordham, Yelena Yesha- 1998

Electronic commerce is emerging as one of the major Websupported applications requiring database support. We introduce and study high-level declarative speci,cations of business models, using an approach in the spirit of active databases. More precisely, business models are speci,ed as relational transducers that map sequences of input relations into sequences of output relations. The semantically meaningful trace of an input-output exchange is kept as a sequence of log relations. We consider problems motivated by electronic commerce applications, such as log validation, verifying temporal properties of transducers, and comparing two relational transducers. Positive results are obtained for a restricted class of relational transducers called Spocus transducers (for semi-positive outputs and cumulative state). We argue that despite the restrictions, these capture a wide range of practically signi,cant business models.

Representing and Querying Changes in Semistructured Data

Serge Abiteboul, S. Chawathe, J. Widom- 1998

Semistructured data may be irregular and incomplete and does not necessarily conform to a fixed schema. As with structured data, it is often desirable to maintain a history of changes to data, and to query over both the data and the changes. Representing and querying changes in semistructured data is more difficult than in structured data due to the irregularity and lack of schema. We present a model for representing changes in semistructured data and a language for querying over these changes. We discuss implementation strategies for our model and query language. We also describe the design and implementation of a "query subscription service" that permits standing queries over changes in semistructured information sources.

Reuse of Linked Documents through Virtual

Anne-Marie Vercoustre, Fran Paradis- 1998

As the WWW becomes a major source of information, a lot of interest has arisen, not only for searching for information, but for reusing this information in new pages or directly from applications. Unfortunately HTML tags do not provide a significant level of structure for identifying and extracting information, since they are mostly used for presentation issues. Moreover the simple link mechanism of the Web does not support the controlled traversal of links to related pages. Particularly promising is the proposal for a new standard, XML, which could bring the power of SGML to the Web while keeping the simplicity of HTML. In this paper we present a system and a language that allows reusing information from various sources, including databases and SGML-like documents, by combining it dynamically to produce a virtual document. The language uses a tree-like structure for the representation of information objects as well as links between objects. The paper focuses on the selection and the traversal of XML links to extract information from linked pages. The strength of our approach is to be an SGML-compliant solution, which makes it ready to take full advantage of XML for reusing information from the Web as soon as it is widely used.

Spatio-Temporal Data Handling with Constraints

Stephane Grumbach, Philippe Rigaux, Luc Segoufin- IEEE Transactions on Knowledge and Data Engineering 1998

Most spatial information systems are limited to a fixeddimension (generally 2) which is not extensible. On the other hand, the emerging paradigm of constraint databasesallows the representation of data of arbitrary dimension, togetherwith abstract querylanguages. The complexity of evaluating queries though might be costlyif the dimension of the objects is really arbitrary.In this paper, we present a data model, based on linear constraints, dedicated to the representationand manipulation of multidimensional data.In order to preserve a low complexity for query evaluation,we introduce the structural dimension of an object $O$,as the dimension of the components $O_1,...,O_n$, such that$O=O_1imescdotsimes O_n$. This allows to processqueries independently on each component, therefore achieving asatisfying trade-off between design simplicity, expressive power ofthe query language and efficiency of query evaluation.We illustrate these concepts in the context of spatio-temporaldatabases where {em space} and {em time} are the natural components.This data model has been implemented in the dedale/ system and aspatio-temporal application, with structural dimension 2, is currentlyrunning, thus showing the practical relevance of the approach.

The Aquarelle resource discovery system

Michel Scholl, Vassilis Christophides, Anne-Marie Vercoustre, Alain Michard, Mike Stapleton, Dale Sutcliffe- 1998

Aquarelle is a three-year project supported by the Telematics Applications Programme of the European Union, aiming at designing a resource discovery system on the Internet, applied to cultural heritage documentation. The system relies on the Z39.50 protocol to support access to heterogeneous databases, including SGML document repositories. Its most original features are direct linking from SGML documents to database records, an advanced link management facility, and query broadcasting to dynamically selected databases.

The Dedale System for Complex Spatial Queries

Stephane Grumbach, Philippe Rigaux, Luc Segoufin- 1998

This paper presents Dedale, a spatial database system intended to overcome some limitations of current systems by providing an abstract and non-specialized data model and query language for the representation and manipulation of spatial objects. Dedale relies on a logical model based on linear constraints, which generalizes the constraint database model of [KKR90]. While in the classical constraint model, spatial data is always decomposed into its convex components, in Dedale holes are allowed to fit the need of practical applications. The logical representation of spatial data although slightly more costly in memory, has the advantage of simplifying the algorithms. Dedale relies on nested relations, in which all sorts of data (thematic, spatial, etc.) are stored in a uniform fashion. This new data model supports declarative query languages, which allow an intuitive and efficient manipulation of spatial objects. Their formal foundation constitutes a basis for practical query optimization. We describe several evaluation rules tailored for geometric data and give the specification of an optimizer module for spatial queries. Except for the latter module, the system has been fully implemented upon the O2 DBMS, thus proving the effectiveness of a constraint-based approach for the design of spatial database systems.

The dedale System for Complex Spatial Queries.

Stephane Grumbach, Philippe Rigaux, Luc Segoufin- 1998

This paper presents dedale, a spatial database system intended to overcome some limitations of current systems by providing an abstract and non-specialized data model and query language for the representation and manipulation of spatial objects. dedale relies on a logical model based on linear constraints, which generalizes the constraint database model of [KKR90]. While in the classical constraint model, spatial data is always decomposed into its convex components, in dedale holes are allowed to fit the need of practical applications. The logical representation of spatial data although slightly more costly in memory, has the advantage of simplifying the algorithms. dedale relies on nested relations, in which all sorts of data (thematic, spatial, etc.) are stored in a uniform fashion. This new data model supports declarative query languages, which allow an intuitive and efficient manipulation of spatial objects. Their formal foundation constitutes a basis for practical query optimization. We describe several evaluation rules tailored for geometric data and give the specification of an optimizer module for spatial queries. Except for the latter module, the system has been fully implemented upon the O2 DBMS, thus proving the effectiveness of a constraint-based approch for the design of spatial database systems.

Using YAT to Build a Web Server

Sophie Cluet, Jerome Simeon- 1998

Integration of heterogeneous data sources in a Web environment has become a major concern of the database community. Architectures, data models and query languages have been proposed but the complementary problem of data conversion has been less studied. The YAT system provides a means to build software components based on data conversion, such as wrappers or mediators, in a simple and declarative way. We show that the YAT system can also be used to create integrated Web views over heterogeneous data sources very easily. Only minor changes were required for YAT to provide data integration (as opposed to data conversion) in a Web environment. Finaly, we report on our experience while building the Verso Web site using YAT.

Valmont: a Language for Workflow Programming

Daniel Chan, Karl Leung- 1998

The workflow paradigm is advancing the scope of process modelling by supporting complex control flows, rich process structures, and the integration of heterogeneous and legacy systems. However, studies of the paradigm have been seriously hindered by the lack of a concrete framework. In order to improve understanding of the paradigm, enable studying of interaction between different features, and facilitate the development of more advanced features, this paper presents a framework in the form of a workflow language called Valmont. Unlike many other proposals, Valmont captures all the fundamental elements of the workflow paradigm at an abstraction level that is suitable for workflow application designers. More specifically, it provides a rich organisation model that can capture distribution using domains, an information model that also records presentation details, as well as a sophisticated process model that supports active rules and workflow transactions. This paper presents an overview of Valmont by providing a solution to the ISPW-6 Software Process Example. It also serves to provide an insight into the needs of novel application domains some of which remain to be addressed by the next generation of workflow management systems.

Virtual Document Models for Intelligent Video

Anne-Marie Vercoustre, Craig Lindley- 1998

To facilitate maximum reuse of video data, well-specified and systematic methods are required for the incorporation of digital video components into multiple delivery products. This paper presents an approach to the intelligent synthesis of digital video products by the representation of virtual videos using SGML. Virtual videos are defined using a Virtual Video Prescription to specify a generic model of a video production that includes embedded queries for the dynamic retrieval of video content from underlying databases. An achitecture for virtual video synthesis is decribed, and a detailed example is presented to demonstrate the prescription language syntax and its underlying concepts. Virtual Video Prescriptions for various video types and genres can include increasingly complex video models and composition rules, leading to a synthesis of SGML-based document technology with knowledge base systems technology.

Your Mediators Need Data Conversion!

Sophie Cluet, Claude Delobel, Jerome Simeon, Katarzyna Smaga- 1998

Due to the development of the World Wide Web, the integration of heterogeneous data sources has become a major concern of the database community. Appropriate architectures and query languages have been proposed. Yet, the problem of data conversion which is essential for the development of mediators/wrappers architectures has remained largely unexplored. In this paper, we present the YAT system for data conversion. This system provides tools for the specification and the implementation of data conversions among heterogeneous data sources. It relies on a middleware model, a declarative language, a customization mechanism and a graphical interface. The model is based on named trees with ordered and labeled nodes. Like semistructured data models, it is simple enough to facilitate the representation of any data. Its main originality is that it allows to reason at various levels of representation. The YAT conversion language (called YATL) is declarative, rule-based and features enhanced pattern matching facilities and powerful restructuring primitives. It allows to preserve or reconstruct the order of collections. The customization mechanism relies on program instantiations: an existing program may be instantiated into a more specific one, and then easily modified. We also present the architecture, implementation and practical use of the YAT prototype, currently under evaluation within the OPAL project.

A Descriptive Language for Information Object Reuse through Virtual Documents

Anne-Marie Vercoustre, F. Paradis- 1997

The importance of reuse is well recognised for electronic document writing. However, it is rarely achieved satisfactorily because of the complexity of the task: integrating different formats, handling updates of information, addressing document author's need for intuitiveness and simplicity, etc. In this paper, we present a language for information reuse that allows users to write virtual documents, where dynamic information objects can be retrieved from various sources, transformed, and included along with static information in SGML documents. The language uses a tree-like structure for the representation of information objects, and allows querying without a complete knowledge of the structure or the types of information. The data structures and the syntax of the language are presented through an example application. A major strength of our approach is to treat the document as a non-monolithic set of reusable information objects.

A Processing Framework for Object Comprehensions

Daniel Chan, Phil Trinder- 1997

In order to execute database queries efficiently, they are usually translated into an algebra: a small set of operations together with some equivalences. The operations are implemented very efficiently by the database engine, and the equivalences are used to optimise the query. Since query languages for object-oriented databases are more complex than their relational counterparts, the algebra and translation, or extit{processing framework}, is more elaborate. This paper describes the processing framework for object comprehensions, a powerful object-oriented query language. Both operations and equivalences of the algebra are defined, together with a translation into it. The framework is illustrated by translating and optimising an example object comprehension query.

Benchmarking Spatial Joins A la Carte

Michel Scholl, Philippe Picouet, Oliver Gunther, Vincent Oria, Jean-Marc Saglio- 1997

Abstract Spatial joins are join operations that involve spatial data types and operators. Spatial access methods are often used to speed up the computation of spatial joins. This paper addresses the issue of benchmarking spatial join operations. For this purpose, we first present a WWW-based tool to produce sets of rectangles. Experimentators can use a standard Web browser to specify the number of rectangles, as well as the statistical distributions of their sizes, shapes, and locations. Second, using the rectangle generator and a well-defined set of statistical models we defined several test suites to compare the performance of three spatial join algorithms: nested loop, scan-and-index, and synchronized tree traversal. We also added a real life data set, the Sequoia 2000 storage benchmark. Our results confirm that the use of spatial indices leads to performance gains of several orders of magnitude. The tests also show that highly selective join predicates enjoy greater performance gains, and vice versa. All of the statistical models and algorithms are available on the Web, which allows for easy verification and modification of our experiments.

Correspondence and Translation for Heterogeneous Data

Serge Abiteboul, Sophie Cluet, Tova Milo- 1997

A primary motivation for new database technology is to provide support for the broad spectrum of multimedia data available notably through the network. These data are stored under different formats: SQL or ODMG (in databases), SGML or LaTex (documents), DX formats (scientific data), Step (CAD/CAM data), etc. Their integration is a very active field of research and development (see for instance, for a very small sample, [10, 6, 7, 9, 8, 12, 19, 20]). In this paper, we provide a formal foundation to facilitate the integration of such heterogeneous data and the maintenance of heterogeneous replicated data.

Databases and Finite Model Theory

Victor Vianu- 1997

Databases provide one of the main concrete scenarios for finitemodel theory within computer science. This paper presents an informal overview of database theory aimed at finite-model theorists, emphasizing the specificity of the database area. It is argued that the area of databases is a rich source of questions and vitality for finite-model theory.

Dedale : A Spatial Constraint Database

Michel Scholl, Stephane Grumbach, Philippe Rigaux, Luc Segoufin- 1997

This paper presents a first prototype of a constraint database for spatial information, dedale. Implemented on top of the O2 DBMS, data is stored in an object-oriented framework, with spatial data represented using linear constraints over a dense domain. The query language is the standard OQL, with special functions for constraint solving and geometric operations. A simple geographical application from the French Institute for Geography, IGN, is running on dedale. The data initially in vector mode was loaded into the database after a translation to constraint representation. Although it is too early to speak of performance since not all operations have been optimized yet, our experience with dedale demonstrates already the advantages of the constraint approach for spatial manipulation.

Evolving Databases: An Application to Electronic Commerce

Serge Abiteboul, B. Fordham, Y. Yesha- 1997

Many complex and dynamic database applications such as product modeling and negotiation monitoring require a number of features that have been adopted in semantic models and databases such as active rules, constraints, inheritance, etc. Unfortunately, each feature has largely been considered in isolation. Furthermore, in a commercial negotiation, participants staking their financial wellbeings will never accept a system they cannot gain a precise behavioral understanding of. We attack these problems with a rich and extensible database model, evolving databases, with a clear and precise semantics based on evolving algebras [5]. We also briefly describe a prototype implementation of the model [12].

Expressiveness and Complexity of Active Databases

Victor Vianu, Philippe Picouet- 1997

The expressiveness and complexity of several active database prototypes are formally studied. First, a generic framework for the specification of active databases is developed. This factors out the common aspects of the prototypes considered, and allows studying various active database features independently of any specific prototype. Furthermore, each of the prototypes can be specified by specializing certain parameters of the framework. The prototypes considered are ARDL, HiPAC, Postgres, Starburst, and Sybase. Using their formal specifications, the prototypes are compared to each other with respect to expressive power. The results provide insight into the programming paradigm of active databases, the interplay of various features, and their impact on expressiveness and complexity.

Inferring Structure in Semistructured Data

Serge Abiteboul, S. Nestorov, R. Motwani- 1997

When dealing with semistructured data such as that available on the Web, it becomes important to infer the inherent structure, both for the user (e.g., to make querying easier) and for the system (e.g., to optimizeaccess). In this paper, we consider the problem of identifying some underlying structure in large collections of semistructured data. Since we expect the data to be fairly irregular, this structure consists of an approximate classification of objects into an hierarchical collection of types. We propose a notion of a type hierarchy for such data, an algorithm for deriving the type hierarchy, and rules for assigning types to data elements.

Lore: A Database Management System for Semistructured Data

Serge Abiteboul, J. Widom, J. McHugh, R. Goldman, D. Quass- 1997

Lore (for Lightweight Object Repository) is a DBMS designed specifically for managing semistructured information. Implementing Lore has required rethinking all aspects of a DBMS, including storage management, indexing, query processing and optimization, and user interfaces. This paper provides an overview of these aspects of the Lore system, as well as other novel features such as dynamic structural summaries and seamless access to data from external sources.

Modeling and Querying Semi-Structured Data

Sophie Cluet- 1997

Since its creation in 1989, the Web has known a tremendous success due to its capacity to deliver very inexpensively information world wide.

Object Database Support for Digital Libraries

Serge Abiteboul- 1997

In this paper, we discuss some aspects of database support for digital libraries. From a DL perspective, database systems, and in particular, object database systems provide a nice basis for future DL systems. More generally, database research provides solutions to many DL issues even if these are partial or fragmented. When possible, work should not be duplicated and good software and ideas should be reused. From a DB perspective, we want to stress that digital libraries propose beautiful applications and challenges to DBMS technology. They suggest a number of improvements to DBMSs that could be beneficial beyond DL applications.

Object Views and Database Restructuring

Claude Delobel, Zoe Lacroix, Philippe Breche- 1997

We present a formal data model for views in Object DataBase Systems (ODBS) which is a continuation of the work presented in [AK89, AB91, dSDA94, dS95a]. The main contribution of this paper is the formalization of views as a transformation mechanism for databases using a generalization of referent proposed in [dS95b]. We define an IQL-like language adapted to manipulate referents. The view-based transformation is achieved in two steps: an extension of the source base followed by a projection of the extended base. The extension and projection can be carried out using four object algebraic operators that specify both the virtual schema and its corresponding virtual instance. This simple algebra can express all the view operators proposed in [KR96, LS92, dS95b, KRR95, CTR96, KR96].

Object Views and Updates

Cassio Souza Dos Santos, Philippe Breche, Sihem Amer-Yahia- 1997

We present a mechanism for updating views in Object DataBase Systems (ODBS). Starting from an existing view data model, chronologically in line with [1, 2, 14], we extend it to provide data update ability. Updates to data views are performed in such a way that consistency with root data is preserved. The main contribution of this paper is to detail such a mechanism together with its corresponding data model. This is, to the best of our knowledge, the first implementation of such a mechanism in the ODBS framework (as opposed to the RDBS framework). Since this view mechanism is ODMG 93 compliant - views are specified with OQL queries extended by operations on attributes (attribute hiding, renaming and derivation) - it may be implemented on top of various ODBSs. We discuss issues on implementation on top of the O2 ODBS and consider related work. Various applications of updatable views conclude the article.

Object Views Constructed with an Object Algebra

Claude Delobel, Zoe Lacroix, Philippe Breche- 1997

We present a formal data model for views in Object DataBase Systems (ODBS) which is a continuation and a formalization of the work presented in [AK89, AB91, ADdS94, dS95a]. The main contribution of this paper is the formalization of the views as a transformation mechanism for databases using a generalization of referent proposed in [dS95b]. The view-based transformation is achieved in two steps: an extension of the source base followed by a projection of the extended base. The extension and projection can be carried out using five object algebraic operators which are described in detail in this paper. Equally important, this paper highlights the challenges involved supporting our view mechanism. We show how this basic model for views point to new perspectives, namely, extension of OQL, transformation of query on the view in a query on the source schema and base, and finally characterization of the class of view-transformations our algebra captures.

On Non-Determinism in Machines and Languages

Stephane Grumbach, Zoe Lacroix- 1997

Non-deterministic computation is not really computation, but the difference with real computation is blurred. We study in detail various levels of non-determinism in computations on non-deterministic Turing machines with polynomial bounds on the resources. Meanwhile, we consider numerous query languages, implicit logic, choice logic, order invariant logic, and restrictions of second-order logic, and establish correspondences between all these formalisms for both deterministic and non-deterministic queries. To the degrees of non-determinism in the computations, correspond levels of non-determinism in the logical definitions. Our main contribution is to characterize various complexity classes between Ptime and Pspace, by several logical means, thus translating open questions in complexity theory to open questions in logic related to the use of the non-determinism.

Optimizing Generalized Path Expressions Using Full Text Indexes

Sophie Cluet, Vassilis Christophides, Guido Moerkotte, Jerome Simeon- 1997

Query languages for object bases became enriched by generalized path expressions that allow for attribute and path variables. Optimizing queries containing generalized path expressions attracted some interest. However, many interesting queries require still a full scan over the whole object base. This unbearable situation can be remedied best by utilizing index structures. However, traditional database indexes fail to support generalized path expressions. We propose to use a flexible mix of database and full text indexes in order to evaluate efficiently queries containing generalized path expressions. We introduce an algebraic framework for the optimization of such queries using full text indexes, and we report on a first prototype implementation including some performance results.

Queries and Computation on the Web

Serge Abiteboul, Victor Vianu- 1997

The paper introduces a model of the Web as an infinite, semistructured set of objects. We reconsider the classical notions of genericity and computability of queries in this new context and relate them to styles of computation prevalent on the Web, based on browsing and searching. We revisit several well-known declarative query languages (first-order logic, Datalog, and Datalog with negation) and consider their computational characteristics in terms the notions introduced in this paper. In particular, we are interested in languages or fragments thereof which can be implemented by browsing, or by browsing and searching combined. Surprisingly, stratified and well-founded semantics for negation turn out to have basic shortcomings in this context, while inflationary semantics emerges as an appealing alternative.

Queries with arithmetical constraints

Stephane Grumbach, Jianwen Su- 1997

In this paper, we study the expressive power and the complexity of first-order logic with arithmetic, as a query language over relational and constraint databases. We consider constraints over various domains (N, Z, Q, and R), and with various arithmetical operations (6, +, Theta , etc.). We first consider the data complexity of first-order queries. We prove in particular that linear queries can be evaluated in AC0 over finite integer databases, and in NC1 over linear constraint databases. This improves previously known bounds. We also show that over all domains, enough arithmetic lead to arithmetical queries, therefore, showing the frontiers of constraints for database purposes. We then tackle the problem of the expressive power, with the definability of the parity and the connectivity, which are the most classical examples of queries not expressible in first-order logic over finite structures. We prove that these two queries are first-order definable in presence of (enough) arithmetic. Nevertheless, we show that they are not definable with constraints of interest for constraint databases such as linear constraints for instance. Finally, we developed reduction techniques for queries over constraint databases, that allow us to draw conclusions with respect to their undefinability in various constraint query languages.

Querying Documents in Object Databases

Serge Abiteboul, Sophie Cluet, Tova Milo, Vassilis Christophides, Guido Moerkotte, Jerome Simeon- 1997

We consider the problem of storing and accessing documents (SGML and HTML, in particular) using database technology. To specify the database image of documents, we use structuring schemas that consist in grammars annotated with database programs. To query documents, we introduce an extension of OQL, the ODMG standard query language for object databases. Our extension (named OQL-doc) allows to query documents without a precise knowledge of their structure using in particular generalized path expressions and pattern matching. This allows us to introduce in a declarative language (in the style of SQL or OQL), navigational and information retrieval styles of accessing data. Query processing in the context of documents and path expressions leads to challenging implementation issues. We extend an object algebra with new operators to deal with generalized path expressions. We then consider two essential complementary optimization techniques: 1. we show that almost standard database optimization techniques can be used to answer queries without having to load the entire document into the database. 2. we also consider the interaction of full-text indexes (e.g., inverted files) with standard database collection indexes (e.g., B-trees) that provide important speedup. The paper is an overview of a research project at INRIA-Rocquencourt. Some particular aspects are detailed in [ACM93, CACS94, ACM95, CCM96].

Querying Semi-Structured Data

Serge Abiteboul- 1997

The amount of data of all kinds available electronically has increased dramatically in recent years. The data resides in different forms, ranging from unstructured data in file systems to highly structured in relational database systems. Data is accessible through a variety of interfaces including Web browsers, database query languages, application-specific interfaces, or data exchange formats. Some of this data is raw data, e.g., images or sound. Some of it has structure even if the structure is often implicit, and not as rigid or regular as that found in standard database systems. Sometimes the structure exists but has to be extracted from the data. Sometimes also it exists but we prefer to ignore it for certain purposes such as browsing. We call here semi-structured data this data that is (from a particular viewpoint) neither raw data nor strictly typed, i.e., not table-oriented as in a relational model or sorted-graph as in object databases.

Regular Path Queries with Constraints

Serge Abiteboul, Victor Vianu- 1997

The evaluation of path expression queries on semistructured data in a distributed asynchronous environment is considered. The focus is on the use of local information expressed in the form of path constraints in the optimization of path expression queries. In particular, decidability and complexity results on the implication problem for path constraints are established.

rence (ICSC), Hong Kong, China, p.282-291

Daniel Chan, Karl Leung- 1997

It is a general consensus that automated support for software development is essential to harness the ever increasing complexity of today's software. Many software development models, tools, and environments have been introduced to address such a need; however, they are usually methodology-specific and impose a rather authoritarian policy on the way software is developed. This paper advocates the use of workflow systems to enact the process of software development. Besides being more general and flexible, the workflow paradigm supports useful features lacking in other approaches. Also, it helps to reduce development complexity by allowing both the software development process and the software themselves to be captured using the very same paradigm. This paper introduces a workflow system being developed to support the software development process by presenting a solution to the ISPW-6 Software Process Example expressed in its specification language. This paper therefore serves two purposes: (1) to introduce a new and more general approach to software process enactment, and (2) to identify new requirements for the workflow paradigm, such as event dependency, that are applicable to many other advanced applications.

Reuse of Information through Virtual Documents

Anne-Marie Vercoustre, B. Hills- 1997

This paper explores the issue of representing textual information in the form of virtual documents that include data and fragments of documents from remote sources -- especially from databases and SGML document databases. Virtual documents are dynamically generated, and therefore always present up-todate information when they are instantiated.. The benefit of this paradigm is that it allows information to be shared, reused, and adapted for various contexts. A Virtual document specification defines how to find and retrieve information objects from databases or from existing documents, and how to assemble it into another document. Virtual documents can be used to create HTML pages that contain information from one or several remote or local databases, to assembly parts of existing documents into a new one, or to define various views of the same information according to various needs. This paper focuses on the prototype implementation of virtual documents from the perspectives of document authoring and architecture. We propose an SGML syntax for Information Object that includes OQL-queries for retrieving fragments of existing documents, transformations on an intermediate tree representation, and output mapping to the virtual document structures.

Views for semistructured data

Serge Abiteboul, J. McHugh, R. Goldman, V. Vassalos, Y. Zhuge- 1997

Defining a view over a semistructured database introduces many new problems. In this paper we propose a view specification language and consider the problem of answering queries posed over views. The two main approaches, query rewriting and view materialization, are outlined with focus on the difficult problems caused by the semistructured nature of the data.

Advanced Primitives for Changing Schemas of Object Oriented Databases

Philippe Breche- 1996

Currently, existing Object Database Systems (ODBSs) perform schema changes by means of primitives closely related to their respective data model. Software Engineering (SE) applications, Object Methodologies (OM) and designers building up object database schemata require a more abstract level. This paper addresses new facilities for updating a schema filling the gap between object design and object programming. A set of advanced primitives, so{called "High Level Primitives", is presented to cope with these requirements. The semantics of these new primitives and how to maintain a schema consistent after such a schema update are the main contributions of this paper. We discuss issues on implementation on top of the commercial ODBS O2 and consider related work.

Evaluating Queries with Generalized Path Expressions

Sophie Cluet, Vassilis Christophides, Guido Moerkotte- 1996

In the past few years, query languages featuring generalized path expressions have been proposed. These languages allow the interrogation of both data and structure. They are powerful and essential for a number of applications. However, until now, their evaluation has relied on a rather naive and ine

Management of Schema Versions and Versions of Schema Instance in a Multiversion Database

Marie-Jo Bellosta, Robert Wrembel, Genevieve Jomier- 1996

An object database is defined by a schema and an instance of the schema. The schema describes the structure of the data stored in the database, including classes, types associated with classes, inheritance relationships between classes, and method interfaces. The instance of the schema contains all objects stored in the database, and method implementations. Expressiveness of object database models is the main reason of their wide success in the CAD/CAM, CSCW and CASE applications. Moreover, these applications require database versioning in order to view and manage different states of the modeled universe. Such databases are called multiversion. The main problem of multiversion databases is to recognize which versions of different multiversion entities, i.e. classes or/and objects, are consistent, i.e., go together. As shown in [CJ90], a user is quickly lost if he needs to manage explicitly consistency between all versions of both classes and objects without some efficient support from the system.

Object Fusion in Mediator Systems

Serge Abiteboul, Yannis Papakonstantinou, Hector Garcia-Molina- 1996

One of the main tasks of mediators is to fuse information from heterogeneous information sources. This may involve, for example, removing redundancies, and resolving inconsistencies in favor of the most reliable source. The problem becomes harder when the sources are unstructured/semistructured and we do not have complete knowledge of their contents and structure. In this paper we show how many common fusion operations can be specified non-procedurally and succinctly. The key to our approach is to assign semantically meaningful object ids to objects as they are "imported" into the mediator. These semantic ids can then be used to specify how various objects are combined or merged into objects "exported" by the mediator. In this paper we also discuss the implementation of a mediation system based on these principles. In particular, we present key optimization techniques that significantly reduce the processing costs associated with information fusion.

Point and Window Queries with linear spatial indices : an evaluation with O2

Michel Scholl, Gilles Grangeret, Xavier Rehse- 1996

Geographic Information Systems (GIS) utilize Database Management Systems (DBMS) for storing and retrieving geo-referenced data. Such data is described by means of attributes defined on domains such as Integer, String, etc., as well as spatial attributes that represent the data geometry and location on earth. As conventional DBMS, GIS have to efficiently answer user queries, related to the objects location (geometry) and expressed with a spatial query language. Then one crucial challenge of GIS design is spatial query optimization.

Temporal versus First-Order Logic to Query Temporal Databases

Serge Abiteboul, Laurent Herr, Jan Van den Bussche- 1996

A database history can be modeled as a (finite) sequence of instances discretely ordered by time. Similarly, the behavior of a system such as an operating system or a reactive system can be modeled by an infinite such sequence. One can view the sequence as one single database where each relation has an additional column holding the time instant of validity of each tuple. The temporal database can then be queried using standard relational calculus (first-order logic) on this "timestamp" representation. One may alternatively use an implicit access to time and express queries in temporal logic. It is known that these two approaches yield the same expressive power in the propositional case. Their comparison in the predicate/database context remained open. We prove here that there are first-order logic queries on the timestamp representation that are not expressible in (extended) temporal logic. The proof technique is novel and is based on communication complexity.

The Lorel Query Language for Semistructured Data

Serge Abiteboul, Dallan Quass, Jason McHugh, Jennifer Widom, Janet Wiener- 1996

We present the Lorel language, designed for querying semistructured data. Semistructured data is becoming more and more prevalent, e.g., in structured documents such as HTML and when performing simple integration of data from multiple sources. Traditional data models and query languages are inappropriate, since semistructured data often is irregular, some data is missing, similar concepts are represented using different types, heterogeneous sets are present, or object structure is not fully known. Lorel is a user-friendly language in the SQL/OQL style for querying such data effectively. For wide applicability, the simple object model underlying Lorel can be viewed as an extension of ODMG and the language as an extension of OQL. The main novelties of the Lorel language are: (i) extensive use of coercion to relieve the user from the strict typing of OQL, which is inappropriate for semistructured data; and (ii) powerful path expressions, which permit a flexible form of declarative navigational access and are particularly suitable when the details of the structure are not known to the user. Lorel also includes a declarative update language. Lorel is implemented as the query language of the Lore prototype database management system at Stanford (see http://www-db.stanford.edu/lore). In addition to presenting the Lorel language in full, this paper briefly describes the Lore system and query processor. We also discuss how Lorel could be implemented on top of a conventional object-oriented database management system.

A Database Interface for File Updates

Serge Abiteboul, Sophie Cluet, Tova Milo- 1995

Database systems are concerned with structured data. Unfortunately, data is still often available in an unstructured manner (e.g., in files) even when it does have a strong internal structure (e.g., electronic documents or programs). In a previous paper [2], we focussed on the use of high-level query languages to access such files and developed optimization techniques to do so. In this paper, we consider how structured data stored in files can be updated using database update languages.

A probabilistic view of datalog parallelization

Victor Vianu, Sergio Lifschitz- 1995

We explore an approach to developing Datalog parallelization strategies that aims at good expected rather than worst-case performance. To illustrate, we consider a very simple parallelization strategy that applies to all Datalog programs. We prove that this has very good expected performance under equal distribution of inputs. This is done using an extension of 0-1 laws adapted to this context. The analysis is confirmed by experimental results on randomly generated data.

Compression et comprehension des sequences de nucleotides

Stephane Grumbach, Fariza Tahi- 1995

DNA sequences contain a high percentage of redundancies, such as "palindromic" repetitions, that universal compression algorithms fail to detect. We present the algorithm "Biocompress-2" to compress nucleotide sequences, which, to the best of our knowledge, is the most efficient algorithm on our corpus of DNA sequences. The rates are not satisfactory, but this technique leads to the discovery of relevant regularities, and presents a connection with the techniques of construction of phylogenetic trees and prediction of secondary structures of RNAs.

Computing Queries on Linear Constraint Databases

Stephane Grumbach, Zoe Lacroix- 1995

We consider rational linear constraint databases, and study the problem of evaluating efficiently first-order queries with linear constraints. We show that although queries admit low data complexity (NC), their naive evaluation is rather inefficient. The computation paradigm is based on relational algebra and constraint solving. We focus on the former, and show that the query processing should differ drastically from classical relational databases. In particular, optimization strategies used for relational queries may be inadequate. We describe the problem and propose some preliminary optimization principles.

Deep equality revisited

Serge Abiteboul, Jan Van den Bussche- 1995

We revisit the notion of deep equality among objects in an object database from a formal point of view. We present three natural formalizations of deep equality: one based on the infinite value-trees associated with objects, one based on the greatest fixpoint of an operator on equivalence relations among objects, and one based on indistinguishability of objects using observations of atomic values reachable from the objects. These three definitions are then shown to be equivalent. The characterization in terms of greatest fixpoints also yields a polynomial-time algorithm for checking deep equality. We also study the expressibility of deep equality in deductive database languages.

Design and Implementation of Object-Oriented Views

Cassio Souza Dos Santos- 1995

We discuss the design and implementation of the O2Views object-oriented database view mechanism, which allows the redefinition of both the structure and the behavior of objects stored in a database. The data model extended with views is given and the functionalities of the prototype implementing it are presented. The paper focuses on the requirements for the implementation of an object-oriented view mechanism, ranging from the conception of a view definition language to optimization strategies for querying and updating through views.

Efficient Evaluation of Aggregates on Bulk Types

Sophie Cluet, Guido Moerkotte- 1995

A new method for efficiently evaluating queries with aggregate functions is presented. More specifi- cally, we introduce a class of aggregate queries where traditional query evaluation strategies in general require O(n2) time and space in the size of the (at most two) input relations. For this class of aggregate queries our approach needs at most O(n log n) time and linear space. Further, our approach deals not only with relations but with general bulk types like sets, bags, and lists.

Evaluation of spatial indices implemented with the DBMS O2

Michel Scholl, J. P. Peloux, G. Reynal de St Michel- 1995

We report a measurement experiment for evaluating the access performance of point queries with several spatial data structures implemented with O2C, the development language of the object-oriented database management system (DBMS) O2 [O291]. Four data structures were implemented and tested: the first one is a quadtree structure [Sam90], the second one is an extension which turned out to be very close to Hierarchical Excell [Tam83]. The third and the fourth data structures are the well known tree structures called R*tree [BKSS90] and R+tree [SRF87]. The objective we had in mind was twofold: (i) contribute to the choice of spatial access methods to be implemented in geographic DBMS in order to accelerate spatial queries among which point and window queries are the most common; (ii) understand where the performance bottlenecks are when implementing such structures with commercial object-oriented DBMS. Our goal was not to find the best data structure but to understand the behavior of each data structure with an actual implementation.

Generalized Implicit Definitions on Finite Structures

Stephane Grumbach, Zoe Lacroix, Steven Lindell- 1995

We propose a natural generalization of the concept of implicit definitions over finite structures, allowing non-determinism at an intermediate level of a (deterministic) definition. These generalized implicit definitions offer more expressive power than classical implicit de,- nitions. Moreover, their expressive power can be characterized over unordered finite structures in terms of the complexity class NP " co-NP. Finally, we investigate a subclass of these where the non-determinism is restricted to the choice of a unique relation with respect to an implicit linear order, and prove that it captures UP " co-UP also over the class of all finite structures. These results shed some light on the expressive power of non-deterministic primitives.

How to Remove a Class in an Object Database System

Philippe Breche, M. Worner- 1995

In schema design for object database systems (ODBS) a designer requires advanced facilities to change a schema through its life cycle. Removing information embedded in a complex class hierarchy is one of these advanced schema update primitives. Consequently to such a schema change, databases might need updates, guided by designers' requests. Most of the ODBSs disallow class deletions or restrict it to leaf classes without objects or lose objects after the schema update so far. The main contribution of this paper is to present a declarative primitive, Remove, which copes with these requirements.

IQL(2) : A Model with Ubiquitous Objects

Serge Abiteboul, Cassio Souza Dos Santos- 1995

Object-oriented databases have brought major improvements in data modeling by introducing notions such as inheritance or methods. Extensions in many directions are now considered with introductions of many concepts such as versions, views or roles. These features bring the risk of creating monster data models with a number of incompatible appendixes. We do not propose here any new extension or any novel concept. We show more modestly that many of these features can be formally and (we believe) cleanly combined in a coherent manner.

Le langage de requete temporel FTL

Laurent Herr- 1995

Les requ

Multi-Scale Partitions: Application to Spatial and Statistical Databases

Michel Scholl, Philippe Rigaux- 1995

We study the impact of scale on data representation from both the modelling and querying points of view. While our starting point was geographical applications, statistical databases also address this problem of data representation at various levels of abstraction. From these requirements, we propose a model which allows: (i) database querying without exact knowledge of the data abstraction level, (ii) the computation of multiple representations of data, one per abstraction level, and (iii) its application to the computation of statistical summaries. The model has been partially implemented with the DBMS O2 by means of tree-structured domains: we give some examples which illustrate the above features.

Object Views of Relations

Claude Delobel, Cassio Souza Dos Santos, Didier Tallot- 1995

This paper investigates the problem of integration between relational and objectoriented databases. We discuss an approach based on class and attribute mappings and show how OQL, the ODMG query language, can be embedded in a set of primitives of a mapping language to serve as a basis for object-relational data integration. We then describe the technique used in the implementation of a prototype built on top of an object-oriented view mechanism, which allows the construction of an object view of a relational database.

On the Complexity of Generating Optimal Left-Deep Processing Trees with Cross Products

Sophie Cluet, Guido Moerkotte- 1995

Producing optimal left-deep trees is known to be NP-complete for general join graphs and a quite complex cost function counting disk accesses for a special block-wise nested-loop join [2]. Independent of any cost function is the dynamic programming approach to join ordering. The number of alternatives this approach generates is known as well [5]. Further, it is known that for some cost functions | those fulfulling the ASI property [4] | the problem can be solved in polynomial time for acyclic query graph, i.e., tree queries [2, 3]. Unfortunately, some cost functions like sort merge could not be treated so far. We do so by a slight detour showing that this cost function (and others too) are optimized if and only if the sum of the intermediate result sizes is minimized. This validates the database folklore that minimizing intermediate result sizes is a good heuristic. Then we show that summarizing the intermediate result sizes has the ASI property. It further motivates us to restrict the subsequent investigations to this cost function called Cout for which we show that the problem remains NP-complete in the general case. Then, we concentrate on the main topic of the paper: the complexity of producing left-deep processing trees possibly containing cross products. Considering cross products is known to possibly result in cheaper plans [5]. More specifically, we show that the problem (LD-X-Star) of generating optimal left-deep processing trees possibly containing cross products is NP-complete for star queries. Further, we give an algorithm for treating star queries which is more e

On the Expressive Power of Counting

Stephane Grumbach, Christophe Tollu- 1995

We investigate the expressive power of various extensions of first-order, inductive, and infinitary logic with counting quantifiers. We consider in particular a LOGSPACE extension of first-order logic, and a PTIME extension of fixpoint logic with counters. Counting is a fundamental tool of algorithms. It is essential in the case of unordered structures. Our aim is to understand the expressive power gained with a limited counting ability. We consider two problems: (i) unnested counters, and (ii) counters with no free variables. We prove a hierarchy result based on the arity of the counters under the first restriction. The proof is based on a game technique that is introduced in the paper. We also establish results on the asymptotic probabilities of sentences with counters under the second restriction. In particular, we show that first-order logic with equality of the cardinalities of relations has a 0/1 law.

On the Power of Languages for the Manipulation of Complex Values

Serge Abiteboul, Catriel Beeri- 1995

Various models and languages for describing and manipulating hierarchically structured data have been proposed. Algebraic, calculus-based and logic-programming oriented languages have all been considered. This paper presents a general model for complex values (i.e., values with hierarchical structures), and languages for it based on the three paradigms. The algebraic language generalizes those presented in the literature; it is shown to be related to the functional style of programming advocated by Backus. The notion of domain independence familiar from relational databases is defined, and syntactic restrictions (referred to as safety conditions) on calculus queries are formulated, that guarantee domain independence. The main results are: The domain-independent calculus, the safe calculus, the algebra, and the logic-programming oriented language have equivalent expressive power. In particular, recursive queries, such as the transitive closure, can be expressed in each of the languages. For this result, the algebra needs the powerset operation. A more restricted version of safety is presented, such that the restricted safe calculus is equivalent to the algebra without the powerset. The results are extended to the case where arbitrary functions and predicates are used in the languages.

Rule-base languages

Victor Vianu- 1995

The paper presents a survey of the main formal rule-based languages and semantics. Both procedural (fixpoint) and declarative (model-theoretic) semantics are defined and discussed, including inflationary and noninflationary fixpoint semantics, and the semipositive, stratified and well-founded semantics. The relative expressive power and complexity of the various languages are provided. Nondeterministic rule-based languages are also discussed, and it is shown how nondeterminism can circumvent some difficulties concerning the expressive power of the deterministic languages. Finally, languages with value invention (in the spirit of object-creation in oodbs) are presented and issues of expressive power specific to such languages are discussed.

Schema-Based authoring and querying of large hypertextes

Bernd Amann, Michel Scholl, Antoine Rizk- 1995

Modern hypertext applications require new system support for hypertext authoring and user navigation through large sets of documents connected by links. This system support must be based on advanced, typed data models for describing the information structure in different application domains. Schema based structuring through strongly typed documents and links has already been proposed and put to practical use in a multitude of hypertext applications. Systems such as Multicard/O2 and MORE have moreover exploited conceptual schemas for querying the resulting hyperdocuments in a more structured way. In this paper, we show how hypertext schemas and query languages can be utilized for designing hypertext authoring and browsing environments for large hypertexts. We illustrate our mechanisms using the Gram data model and describe their implementation on top of the Multicard hypermedia system connected to the O2 object-oriented database management system.

Semantics and Expressiveness Issues in Active Databases

Victor Vianu, Philippe Picouet- 1995

A formal framework is introduced for studying the semantics and expressiveness of active databases. The power of various abstract trigger languages is characterized, and related to several major active database prototypes such as ARDL, HiPAC, Postgres, Starburst, and Sybase. This allows to formally compare the expressiveness of the prototypes. The results provide insight into the programming paradigm of active databases, the interplay of various features, and their impact on expressiveness and complexity.

Simulation of Schema Change using Views

Philippe Breche, Fabrizio Ferrandina, Martin Kuklok- 1995

In this paper we concentrate on the description of change management functionalities that have been added to the commercially available O2 ODBMS to yield a platform suited for applications such as SDEs. A description of declarative high level schema update primitives is presented as well as mechanisms to propagate schema changes to the database. Particular emphasis is given to a mechanism that allows designers first to simulate schema and database modifications by means of object views, and second to materialize them to be propagated onto the real schema and base.

Space Usage in Functional Database Query Languages

Serge Abiteboul, Gerd Hillebrand- 1995

We consider evaluation strategies for database queries expressed in three functional query languages: the complex value algebra, the simply typed lambda calculus, and method schemas. Each of these query languages derives its expressive power from a different primitive: the complex value algebra from the powerset operator, the simply typed lambda calculus from list iteration, and method schemas from recursion. We show that "natural" evaluation strategies for these primitives may lead to very inefficient space usage, but that with some simple optimizations many queries can be evaluated with little or no space overhead. In particular, we show: (1) In the complex value algebra, all expressions with set nesting depth at most 2 can be evaluated in pspace, and this set of expressions is sufficient to express all queries in the polynomial hierarchy; (2) In the simply typed lambda calculus with equality and constants, all query terms of order at most 5 (where "query term" is a syntactic condition on types) can be evaluated in pspace, and this set of terms expresses exactly the pspace queries; (3) There exists a set of second-order method schemas (with no simple syntactic characterization) that can be evaluated in pspace, and this set of schemas is sufficient to express all pspace queries.

Temporal connectives versus explicit timestamps in temporal query languages

Serge Abiteboul, Laurent Herr, Jan Van den Bussche- 1995

Some temporal query languages work directly on a timestamp representation of the temporal database, while others provide a more implicit access to the flow of time by means of temporal connectives. We study the differences in expressive power between these two approaches. We first consider first-order logic (i.e., the relational calculus). We show that first-order future temporal logic is strictly less powerful than first-order past-future temporal logic, and that the latter is strictly less powerful than the relational calculus with explicit timestamps. We also consider extensions of the relational calculus with iteration constructs such as least fixpoints or while-loops. We again compare augmentations of these languages with temporal left and right moves on the one hand, and with explicit timestamps on the other hand. For example, we show that a version of fixpoint logic with left and right moves lies between the explicit timestamp versions of first-order and fixpoint logic, respectively.

A New Challenge for Compression Algorithms: Genetic Sequences

Stephane Grumbach, Fariza Tahi- 1994

Universal data compression algorithms fail to compress genetic sequences. It is due to the specificity of this particular kind of "text". We analyze in some details the properties of the sequences, which cause the failure of classical algorithms. We then present a lossless algorithm, biocompress-2, to compress the information contained in DNA and RNA sequences, based on the detection of regularities, such as the presence of palindromes. The algorithm combines substitutional and statistical methods, and to the best of our knowledge, lead to the highest compression of DNA. The results, although not satisfactory, gives insight to the necessary correlation between compression and comprehension of genetic sequences.

Classification and Optimization of nested Queries in Object Bases

Sophie Cluet, Guido Moerkotte- 1994

Many declarative query languages for object-oriented (oo) databases allow nested subqueries. This paper contains a complete classification of oo nested queries and appropriate unnesting optimization strategies based on algebraic rewriting. We adapt some known relational techniques and introduce new ones that use and are concerned with features specific to object-oriented queries. In particular, we introduce two new and powerful grouping operators which will form the basis for our unnesting techniques.

Computing With First-Order Logic

Serge Abiteboul, Victor Vianu- 1994

We study two important extensions of first-order logic (F O) with iteration, the fixpoint and while queries. The main result of the paper concerns the open problem of the relationship between fixpoint and while: they are the same iff ptime = pspace. These and other expressibility results are obtained using a powerful normal form for while which shows that each while computation over an unordered domain can be reduced to a while computation over an ordered domain via a fixpoint query. The fixpoint query computes an equivalence relation on tuples which is a congruence with respect to the rest of the computation. The same technique is used to show that equivalence of tuples and structures with respect to F O formulas with bounded number of variables is definable in fixpoint. Generalizing fixpoint and while, we consider more powerful languages which model arbitrary computation interacting with a database using a finite set of F O queries. Such computation is modeled by a relational machine called Loosely Coupled Generic Machine, GMloose. GMloose consists of a Turing Machine augmented with a finite set of fixed-arity relations forming a relational store. The connection with while is emphasized by a result showing that GMloose is equivalent to while augmented with integer variables and arithmetic. The normal form for while extends to these languages. We study the expressive power of GMloose and its ptime and pspace restrictions. We argue that complexity measures based on the size of the input may not be best suited for database computation. Instead, we suggest an alternative measure based on the discerning power of the machine, i.e. its ability to distinguish between tuples given its

Design and Implementation of an Object-Oriented View Mechanism

Cassio Souza Dos Santos- 1994

We discuss the design and implementation of the O2Views object-oriented database view mechanism, which allows the redefinition of both the structure and the behavior of objects stored in a database. The data model extended with views is first given and then the functionalities of the prototype implementing it are presented. The paper focuses on the requirements for the implementation of an object-oriented view mechanism, ranging from the conception of a view definition language to optimization strategies for querying and updating through a view such as view materialization and consistency maintenance.

Finitely Representable Databases

Stephane Grumbach, Jianwen Su- 1994

We study classes of infinite but finitely representable databases based on constraints, motivated by new database applications such as geographical databases. The mathematical framework is based on classical decidable first-order theories. We investigate the theory of finitely representable models and prove that it differs strongly from both classical model theory and finite model theory. In particular, we show that most of the well known theorems of either one fail (compactness, completeness, locality, 0/1 laws, etc.). An immediate consequence is the lack of tools to consider the definability of queries in the relational calculus over finitely representable databases. We illustrate this very challenging problem through some classical examples.

From Structured Documents to Novel Query Facilities

Serge Abiteboul, Michel Scholl, Sophie Cluet, Vassilis Christophides- 1994

Structured documents (e.g., SGML) can benefit a lot from database support and more specifically from object-oriented database (OODB) management systems. This paper describes a natural mapping from SGML documents into OODB's and a formal extension of two OODB query languages (one SQL-like and the other calculus) in order to deal with SGML document retrieval. Although motivated by structured documents, the extensions of query languages that we present are general and useful for a variety of other OODB applications. A key element is the introduction of paths as first class citizens. The new features allow to query data (and to some extent schema) without exact knowledge of the schema in a simple and homogeneous fashion.

Querying Structured Documents with Hypertext Links using OODBMS

Vassilis Christophides, Antoine Rizk- 1994

Hierarchical logical structure and hypertext links are complementary and can be combined to build more powerful document management systems [28, 25, 24, 13]. Previous work exploits this complementarity for building better document processors, browsers and editing tools, but not for building sophisticated querying mechanisms. Querying in hypertext has been a requirement since [19] and has already been elaborated in many hypertext systems [11, 7, 4, 21], but has not yet been used for hypertext systems superimposed on an underlying hierarchical logical structure. In this paper we use the model and the SQL-like query language of [10] in order to manage structured documents with hypertext links. The model represents a structured document with typed links as a complex object, and uses paths through the document structure, as first class citizens in formulating queries. Several examples of queries illustrate, from a practical point of view, the expressive power of the language to retrieve documents, even without exact knowledge of their structure in a simple and homogeneous fashion. It must be stressed that the proposed model and language implement the equivalent HyTime [1] Location Address Module. In fact, the language is more powerful than the corresponding HyQ query facilities. The implementation and the description throughout the paper use the SGML standard [2] to represent the document structure and the object-oriented DBMS O2 [12] to implement the query language and the storage module.

Querying Typed Hypertexts in Multicard/O2

Bernd Amann, Michel Scholl, Antoine Rizk- 1994

Due to the growing complexity of modern hypertext applications, current hypertext systems require new mechanisms to support authoring and user navigation through large sets of documents connected by links. A general solution is to extend hypertext systems to cater for semantics of application domains. This requires new hypertext models providing strongly typed documents and links. Such models have been proposed and put to use in systems such as HDM and MacWeb to facilitate authoring of large hypertexts. In addition, Gram and MORE use typing and graph-based hypertext schemas for querying hyperdocuments. In this paper, we will show how query languages could be further exploited for designing sophisticated general query-based navigation mechanisms. We illustrate our examples using the Gram model and describe an implementation with the hypermedia system Multicard connected to the objectoriented database management system O2.

The Power of Reflective Relational Machines

Serge Abiteboul, Victor Vianu, Christos H. Papadimitriou- 1994

A model of database programming with reflection, called reflective relational machine is introduced and studied. The reflection consists here of dynamic generation of queries in a host programming language. The main results characterize the power of the machine in terms of known complexity classes. In particular, the PTIME restriction of the machine is shown to express PSPACE, and to correspond precisely to uniform circuits of polynomial depth and exponential size. This provides an alternative, logicbased formulation of the uniform circuit model, more convenient for problems naturally formulated in logic terms. Since time in the polynomially-bounded machine coincides with time in the uniform circuit model, this also shows that reflection allows for more "intense" parallelism, not attainable otherwise. Other results concern the power of the reflective relational machine subject to restrictions on the number of variables used.

Virtual Schemas and Bases

Serge Abiteboul, Claude Delobel, Cassio Souza Dos Santos- 1994

We propose the notions of virtual schemas and virtual bases as a coherent way of integrating various features in OODB views. A virtual schema is defined based on some existing (real) schema. A virtual base is obtained when a (real) base is attached to a virtual schema. We study the consequences of this simple assumption. In particular, we observe the differences between a real schema and a virtual one. We also consider an extension (that we call generic schemas) where it is necessary to specify several real bases to attach data to a virtual schema. We show how the flexibility provided by virtual schemas can be used to cope with various dynamic features of database systems.

Asymptotic Probabilities of Languages with Generalized Quantifiers

Stephane Grumbach, Christophe Tollu, Guy Fayolle- 1993

We study the impact of adding certain families of generalized quantifiers to first-order logic (FO) on the asymptotic behavior of sentences. All our results are stated and proved for languages disallowing free variables in the scope of generalized quantifiers. For a class K of finite structures closed under isomorphism, the quantifier QK is said to be strongly monotonic, sm, if membership in the class is preserved under a loose form of extensions. Our first theorem (0/1 law for FO with any set of sm quantifiers) subsumes a previous criterion for proving that almost no graphs satisfy a given property [BH79]. We also establish a 0/1 law for FO with H-artig quantifiers (equicardinality quantifiers) and a limit law for a fragment of FO with Rescher quantifiers (expressing inequalities of cardinalities). The proofs of these last two results combine standard combinatorial enumerations with more sophisticated techniques from complex analysis. We also prove that the 0/1 law fails for the extension of FO with H-artig quantifiers if the above syntactic restriction is relaxed. We therefore get the best upper bound for the existence of a 0/1 law for FO with H-artig quantifiers.

Compression of DNA sequences

Stephane Grumbach, Fariza Tahi- 1993

We propose a lossless algorithm to compress the information contained in DNA sequences. None of the available universal algorithms compress such data. This is due to the specificity of genetic information. Our method is based on regularities, such as the presence of palindromes, in the DNA. The results we obtain, although not satisfactory, are far beyond classical algorithms.

Computing on Structures

Serge Abiteboul, Victor Vianu- 1993

One of the exciting developments in complexity theory is the discovery of a very intimate connection between computational complexity and logic. This intimate connection was first discovered by Fagin, who showed that the complexity class NP coincides with the class of properties expressible in existential 2nd-order logic [Fag74] (cf. [JS74]). Another demonstration of this connection was shown by Immerman and Vardi, who discovered tight relationships between the complexity class p and inflationary fixpoint logic [Imm86, Var82] and between the class pspace and noninflationary fixpoint logic. This initiated, during the 1980s, a new branch of complexity theory, which focuses on the descriptive complexity of problems, i.e. the complexity of describing problems in some logical formalism [Imm87b].

Database Query Navigation

Bernd Amann, Michel Scholl- 1993

In [AS92] a database model was defined which uses regular expressions over data types to formulate queries. We discuss its expressive power and utilize it as an underlying data model for designing general database navigation mechanisms. The application we have in mind is hypertext navigation and querying. However we believe that these mechanisms are also applicable on top of other data models such as the relational or object-oriented database models.

HyperPATH/O2: Integrating Hypermedia Systems with Object-Oriented Database Systems

Bernd Amann, Michel Scholl, Vassilis Christophides- 1993

We describe an integration of the hypermedia system HyperPATH with the object-oriented DBMS (OODBMS) O2. Providing persistence to a hypertext system was the first motivation of this work. More generally, we were interested in a better understanding of the connection of hypertext systems with OODBMS. One of our goals was to define an abstract interface between HyperPATH and a variety of OODBMS's. The solution adopted shows that opening HyperPATH to different OODBMS's is not possible without major rewriting of existing C++ code.

Nested Queries in Object Bases

Sophie Cluet, Guido Moerkotte- 1993

Many declarative query languages for object-oriented databases allow nested subqueries. This paper contains the first (to our knowledge) proposal to optimize them. A two-phase approach is used to optimize nested queries in the object-oriented context. The first phase--called dependency-based optimization--transforms queries at the query language level in order to treat common subexpressions and independent subqueries more efficiently. The transformed queries are translated to nested algebraic expressions. These entail nested loop evaluation which may be very inefficient. Hence, the second phase unnests nested algebraic expressions to allow for more efficient evaluation.

Querying and Updating the File

Serge Abiteboul, Sophie Cluet, Tova Milo- 1993

We show how structured data stored in files can benefit from standard database technology and in particular be queried and updated using database languages. We introduce the notion of structuring schema which consists of a grammar annotated with database programs and of a database schema. We study the translation from structured strings to databases, and the converse. We adapt optimization techniques from relational databases to our context.

Towards Tractable Algebras for Bags

Stephane Grumbach, Tova Milo- 1993

Bags, i.e. sets with duplicates, are often used to implement relations in database systems. In this paper, we study the expressive power of algebras for manipulating bags. The algebra we present is a simple extension of the nested relation algebra. Our aim is to investigate how the use of bags in the language extends its expressive power, and increases its complexity. We consider two main issues, namely (i) the impact of the depth of bag nesting on the expressive power, and (ii) the complexity and the expressive power induced by the algebraic operations. We show that the bag algebra is more expressive than the nested relation algebra (at all levels of nesting), and that the difference may be subtle. We establish a hierarchy based on the structure of algebra expressions. This hierarchy is shown to be highly related to the properties of the powerset operator.

A General Framework for the Optimization of Object-Oriented Queries

Sophie Cluet, Claude Delobel- 1992

The goal of this work is to integrate in a general framework the different query optimization techniques that have been proposed in the object-oriented context. As a first step, we focus essentially on the logical aspect of query optimization. In this paper, we propose a formalism (i) that unifies different rewriting formalisms, (ii) that allows easy and exhaustive factorization of duplicated subqueries, and (iii) that supports heuristics in order to reduce the optimization rewriting phase.

Application of a Graph Model to Hypertext Querying

Bernd Amann, Michel Scholl- 1992

We present a model for data organized as graphs with typed nodes and edges. Regular expressions over the types of the node and edge labels are used to select walks in the hypertext graph. An outline of the application of this model and its query language to the implementation of hypertext documents is given by using an extended example of a travel agency application.

Computing with infinitary Logic

Serge Abiteboul, Victor Vianu, Moshe Y. Vardi- 1992

Most recursive extensions of relational calculus converge around two central classes of queries: fixpoint and while. Infinitary logic (with finitely many variables) is a very powerful extension of these languages which provides an elegant unifying formalism for a wide variety of query languages. However, neither the syntax nor the semantics of infinitary logic are effective, and its connection to practical query languages has been largely unexplored. We relate infinitary logic to another powerful extension of fixpoint and while, called relational machine, which highlights the computational style of these languages. Relational machines capture the kind of computation occurring when a query language is embedded in a host programming language, as in C+SQL. The main result of this paper is that relational machines correspond to the natural effective fragment of infinitary logic. Other well-known query languages are related to infinitary logic using syntactic restrictions formulated in language-theoretic terms. For example, it is shown that while corresponds to infinitary logic formulas which can be described by a regular language. As a side effect to these results, we obtain interesting normal forms for infinitary logic formulas.

Deductive and Object-Oriented Databases

Serge Abiteboul- 1992

These notes were written for a tutorial in JICSLP (Washington, 1992) and are not intended to be a comprehensive survey of the field. We are concerned with dooD, i.e. deductive object-oriented databases (systems). We briefly review ooD systems (object-oriented database management systems) and deductive database systems (the deductive ones). A main purpose is to try to clarify these concepts. As we shall see, although there is already a conference entitled dooD and a large body of literature on the topic, the meaning of dooD is not yet clear. The issues are not precise and we are still looking forward to the first real implementation of a dooD system. We will try to present the motivations, the main achievements so far, and the open questions.

Fixpoint Logics, Relational Machines, and Computational Complexity

Serge Abiteboul, Victor Vianu, Moshe Y. Vardi- 1992

We establish a general connection between fixpoint logic and complexity. On one side, we have fixpoint logic, parameterized by the choices of 1st-order operators (inflationary or noninflationary) and iteration constructs (deterministic, nondeterministic, or alternating). On the other side, we have the complexity classes between P and EXPTIME. Our parameterized flxpoint logics capture the complexity classes P, NP, PSPACE, and EXPTIME, but equality is achieved only over ordered structures. There is, however, an inherent mismatch between complexity and logic { while computational devices work on encodings of problems, logic is applied directly to the underlying mathematical structures. To overcome this mismatch, we develop a theory of relational complexity, which bridges tha gap between standard complexity and fixpoint logic. On one hand, we show that questions about containments among standard complexity classes can be translated to questions about containments among relational complexity classes. On the other hand, the expressive power of fixpoint logic can be precisely characterized in terms of relational complexity classes. This tight three-way relationship among fixpoint logics, relational complexity and standard complexity yields in a uniform way logical analogs to all containments among the complexity classes P, NP, PSPACE, and EXPTIME. The logical formulation shows that some of the most tantalizing questions in complexity theory boil down to a single question: the relative power of inflationary vs. noninflationary 1st-order operators.

GRAM: A Graph Data Model and Query Language

Bernd Amann, Michel Scholl- 1992

We present a model for data organized as graphs. Regular expressions over the types of the node and edge labels are used to qualify connected subgraphs. An algebraic language based on these regular expressions and supporting a restricted form of recursion is introduced. A natural application of this model and its query language is hypertext querying.

Optimizing Active Databases using the Split Technique

Serge Abiteboul, Allen Van Gelder- 1992

A method to perform nonmonotonic relational rule computations is presented, called the split technique, The goal is to avoid redundant computations with rules that can insert and delete sets of tuples specified by the rule body. The method is independent of the control strategy that governs rule firing. Updatable relations are partitioned, as the computation progresses, into blocks of tuples such that tuples within a block are indiscernible from each other based on the computation so far. Results of previous rule firings are remembered as "relational equations" so that a new rule firing does not recompute parts of the result that can be determined from the existing equations. Seminaive evaluation falls out as a special case when all rules specify inserts. The method is amenable to parallelization.

Queries are easier than you thought (probably)

Serge Abiteboul, Victor Vianu, Kevin Compton- 1992

The optimization of a large class of queries is explored, using a powerful normal form recently proven. The queries include the fixpoint and while queries, and an extension of while with arithmetic. The optimization method is evaluated using a probabilistic analysis. In particular, the average complexity of fixpoint and while is considered and some surprising results are obtained. They suggest that the worst-case complexity is sometimes overly pessimistic for such queries, whose average complexity is often much more reasonable than the provably rare worst case. Some computational properties of queries are also investigated. A probabilistic notion of boundedness is defined, and it is shown that all programs in the class considered are bounded almost everywhere. An effective way of using this fact is provided.

Expressive Power of Query Languages

Serge Abiteboul, Victor Vianu- 1991

Recent research on query languages and their expressive power is discussed. Several query languages are described, emphasizing recursive extensions of the first-order queries based on three paradigms: logic, algebraic, and logic programming. Many of these languages converge around two central classes of queries, the fixpoint queries and the while queries. The relative expressive power of these languages is examined, as well as their connection to complexity classes of queries. The focus is on the inability of languages to express low complexity classes of queries, ptime and below. We consider several ways to circumvent this difficulty. The first is to introduce an ordering of the domain which leads to a trade-off between complexity and the data independence principle. The cost of computing without order is formalized by defining non-standard complexity classes based on a model of database computation called Generic Machine. As an alternative to order, it is shown that the difficulty of expressing low complexity classes can be circumvented by introducing non-deterministic constructs in query languages. Expressiveness above np is also discussed.

Fundamental properties of deterministic and nondeterministic extensions of Datalog

Serge Abiteboul, Eric Simon- 1991

Fundamental properties of deterministic and nondeterministic extensions of Datalog from [AV88] are studied. The extensions involve the use of negative literals both in bodies and heads of rules. Negative literals in heads are interpreted as deletions. A deterministic semantics is obtained by firing in parallel all applicable rules. The nondeterministic semantics results from firing (nondeterministically) one rule at a time. In the nondeterministic case, programs do not describe functions but relations between database states. In both cases, the result is an increase in expressive power over Datalog. The price for it is that programs do not always terminate. We study when a program (i) is such that on a given input, all its successful computations reach a unique fixpoint, (ii) yields at least one output on every input and (iii) has only loop-free computations. We also show how to simulate programs containing loops by loop-free programs.

Generic Computation and Its Complexity

Serge Abiteboul, Victor Vianu- 1991

There is a fundamental mismatch between the hardness of database queries and their Turing complexity. For example, the even query on a set has low Turing complexity but is by all accounts a hard query. The mismatch is due to the abstract, generic nature of database computation: data items which are indistinguishable by logical properties are treated uniformly. The issues specific to generic computation are obscured in the Turing model. Two models of generic computation are proposed. They are extensions of Turing Machines with a relational store. The machines differ in the interaction between the Turing component and the relational store. The first, simpler machine, allows limited communication with the relational store. However, it is not complete. Nonetheless, it subsumes many query languages which have emerged as central, including the fixpoint and while queries. We prove a normal form for the machine, which provides a key technical tool. The normal form specialized to the while queries allows resolving the open problem of the relation of fixpoint and while: they are equivalent i, ptime = pspace. The second machine allows extended communication between the relational store and the Turing component, and is complete. The machine involves parallelism. Based on it, we define complexity measures for generic mappings. We focus on notions of polynomial time and space complexity for generic mappings, and show their robustness. The results point to a trade-o, between complexity and computing with an abstract interface.

Objects and Views

Serge Abiteboul, Anthony Bonner- 1991

Object-oriented databases have been introduced primarily to ease the development of database applications. However, the difficulties encountered when, for instance, trying to restructure data or integrate databases demonstrate that the models being used still lack flexibility. We claim that the natural way to overcome these shortcomings is to introduce a sophisticated view mechanism. This paper presents such a mechanism, one which allows a programmer to restructure the class hierarchy and modify the behavior and structure of objects. The mechanism allows a programmer to specify attribute values implicitly, rather than storing them. It also allows him to introduce new classes into the class hierarchy. These virtual classes are populated by selecting existing objects from other classes and by creating new objects. Fixing the identify of new objects during database updates introduces subtle issues into view design. Our presentation, mostly informal, leans on a number of illustrative examples meant to emphasize the simplicity of our mechanism.

On the Representation and Querying of Sets of Possible Worlds

Serge Abiteboul, Paris Kanellakis, Gosta Grahne- 1991

We represent a set of possible worlds using an incomplete information database. The representation techniques that we study range from the very simple Codd-table (a relation over constants and uniquely occurring variables called nulls) to much more complex mechanisms involving views of conditioned-tables (programs applied to Coddtables augmented by equality and inequality conditions). (1) We provide matching upper and lower bounds on the data-complexity of testing containment, membership, and uniqueness for sets of possible worlds. We fully classify these problems with respect to our representations. (2) We investigate the data-complexity of querying incomplete information databases for both possible and certain facts. For each fixed positive existential query on conditioned-tables we present a polynomial time algorithm solving the possible fact problem. We match this upper bound by two NP-completeness lower bounds, when the fixed query contains either negation or recursion and is applied to Codd-tables. Finally, we show that the certain fact problem is coNP-complete, even for a fixed first order query applied to a Codd-table.

Towards a deductive object-oriented database language

Serge Abiteboul- 1990

A language for databases with sets, tuples, lists, object identity and structural inheritance is proposed. The core language is logic-based with a fixpoint semantics. Methods with overloading and methods evaluated externally providing extensibility of the language are considered. Other important issues such as updates and the introduction of explicit control are discussed.

On the power of languages for the manipulation of complex objects

Serge Abiteboul, Catriel Beeri- 1988

Various models and languages for describing and manipulating hierarchically structured data have been proposed. Algebraic, calculus-based and logic-programming oriented languages have all been considered. This paper presents a general model for complex objects, and languages for it based on the three paradigms. The algebraic language generalizes those presented in the literature; it is shown to be related to the functional style of programming advocated by Backus. The notion of domain independence familiar from relational databases is defined, and syntactic restrictions (referred to as safety conditions) on calculus queries are formulated, that guarantee domain independence. The main results are: The domain-independent calculus, the safe calculus, the algebra, and the logic-programming oriented language have equivalent expressive power. In particular, recursive queries, such as the transitive closure, can be expressed in each of the languages. For this result, the algebra needs the powerset operation. A more restricted version of safety is presented, such that the restricted safe calculus is equivalent to the algebra without the powerset. The results are extended to the case where arbitrary functions and predicates are used in the languages.

Non-determinism in logic-based languages

Serge Abiteboul, Victor Vianu- 1991

The use of non-determinism in logic-based languages is motivated using pragmatic and theoretical considerations. Non-deterministic database queries and updates occur naturally, and there exist non-deterministic implementations of various languages. It is shown that non-determinism resolves some difficulties concerning the expressive power of deterministic languages: there are non-deterministic languages expressing low complexity classes of queries/updates, whereas no such deterministic languages are known. Various mechanisms yielding non-determinism are reviewed. The focus is on two closely related families of non-deterministic languages. The first consists of extensions of Datalog with negations in bodies and/or heads of rules, with non-deterministic fixpoint semantics. The second consists of non-deterministic extensions of first-order logic and fixpoint logics, using thewitness operator. The expressive power of the languages is characterized. In particular, languages expressing exactly the (deterministic and non-deterministic) queries/updates computable in polynomial time are exhibited, whereas it is conjectured that no analogous deterministic language exists. The connection between non-deterministic languages and determinism is also explored. Several problems of practical interest are examined, such as checking (statically or dynamically) if a given program is deterministic, detecting coincidence of deterministic and non-deterministic semantics, and verifying termination for non-deterministic programs.