Data Sciences: From First-Order Logic to the Web

Serge Abiteboul

Table of contents

Course slides
Seminar slides
Course summaries
Seminar abstracts, French version
Seminar abstracts, English version
For videos in French and English, see the Collège website

Inaugural lecture

Text in French (pdf and epub on request)
Text in English
PowerPoint in French (PDF)
PowerPoint in English (PDF)

Synopsis of the lecture

Information that is produced, stored, processed and exchanged lies at the heart of the activity of living beings, of the objects of the world, and of human associations. Computer systems help us keep this information in digital form, like a nearly unlimited backup of our personal memory. They help us process and exchange this information so that we can communicate with one another. The amount of stored information is on the order of a zettabyte: 10^21 bytes! The annual information traffic on the Internet even exceeds this accumulated quantity. Faced with these dizzying figures, two problems stand out: where do we find the right information in this mass? How do we decide what we want to keep?

Through the combined efforts of dynamic academic research, of landmark pioneers such as IBM, of young giants such as Google and of highly creative startups, the data sciences have flourished. Yet the field still resembles virgin forest once we reach distributed data management and the Web. Drawing up the state of the art is complicated; teaching it is not simple; predicting which trends are here to stay is not obvious. This is the jungle we will try to enter.

Relational database management systems are complex computer systems, the result of decades of research and development. They are among the greatest software successes of the last century, with widespread commercial products such as Oracle servers and heavily used free systems such as MySQL. They result from the combination of solid mathematical foundations (such as first-order logic), highly sophisticated algorithms, and complex engineering.

We find these same three ingredients at the foundation of Web search engines. The Web, that is, the World Wide Web, relies on hypermedia documents. A search engine lets us escape tedious navigation over the graph of pages and the world of hypertext, and dive into a universal digital library. While the Web surely does not hold the answer to every question a user may have, the answer to a precise question may well lie in the truly extraordinary masses of information available. Like children, we marvel at the tens of billions of documents on the Web. But a child learns, from the earliest age, to evaluate, classify and filter the considerable volume of information encountered. And what about us? If the search engine did not help us focus on a small number of pages, what would we do? The technical feat is to retrieve in an instant, thanks to an index, the Web pages that contain the few words of a query. The magic, explained by a few equations and algorithms, is to find, among the tens or even hundreds of millions of pages that contain the requested words, the few pages that will satisfy the user.

Writing allowed us to materialize and partly externalize our memory. Printing allowed us to transmit this external memory widely. Much has been made of the fact that the Web considerably lowers the cost of transmitting memory. We are now discovering that its true revolution is to let everyone make a personal contribution to the collective heritage (with caveats such as the digital divide). The Web is thus a juxtaposition of billions of individuals and all their networks. After networks of machines and networks of content, we are reaching networks of users.

Web systems such as Facebook let users communicate with one another. This would not be truly new if these new communication tools did not lead to other ways of thinking and other forms of relationships. Above all, and this is a truly fascinating phenomenon, these systems make collective knowledge emerge automatically from the depths of the networks. Several kinds of approaches make it possible to build such knowledge: rating, for example of products or companies by users, as on eBay; evaluating the expertise of users, as on Mechanical Turk; recommendation, for example of products, as on Netflix; collaboration between users to collectively carry out a task that is beyond each of them individually, as on Wikipedia; and finally crowdsourcing, which puts humans at the service of computer systems, as with Foldit. The automatic emergence of such knowledge raises a whole range of questions, both philosophical and scientific.

By observing the evolution of the Web and of the data sciences, we can imagine what the Web of tomorrow may be: a Web of knowledge, with millions or even billions of interconnected machines reasoning collectively. Today's fascinating Web of documents is founded on people's pleasure in writing, reading, speaking and listening to text in their natural languages. Machines prefer to exchange knowledge that is more formatted and more rigorous. With the move from the Web of text to a Web of knowledge, they will be able to take fuller charge of managing our information. This seems an indispensable step if humanity is to survive in the ever more cataclysmic floods of information it generates.

The Web takes many forms, and it has become almost impossible to live without it. It is at once the most beautiful lacework, the weft of all human knowledge, and the breeding ground of the most horrible fantasies and of every kind of violence. Giving it up is neither possible nor desirable, just as it was not possible to refuse writing or printing. And despite all the pitfalls, we want to keep believing that the Web will help bring about a better future. As for the more technical aspects, we will venture to claim that the next step in data sciences has already begun: the construction of the Web of knowledge. From data, to information, to knowledge, the path is a logical one.

Slides of Serge Abiteboul's lectures

The relational model (French and English versions)

Beyond the relational model (French and English versions)

The Semantic Web (French and English versions; ppt)

Active documents and AXML (French and English versions; ppt)

Web search engine (French and English versions; pptx)

Datalog Revival (French and English versions; pptx)

Distributed data management (French and English versions; pdf)

Distributed datalog and Webdamlog (French and English versions; pdf)

Seminar slides

Moshe Vardi, Rice University, Database Queries – Logic and Complexity

Anastasia Ailamaki, E.P.F. Lausanne, Scientific Data Management: Not your everyday transaction

François Bancilhon, Data Publica, Open Data

Julien Masanès, Internet Memory Foundation, Web archiving

Victor Vianu, U.C. San Diego, Static Analysis and Verification

Tova Milo, Tel Aviv University, Data crowdsourcing

Georg Gottlob, Oxford University, Extracting data from the Web

Gerhard Weikum, Max-Planck-Institut, Gathering knowledge on the Web

Marie-Christine Rousset, U. Grenoble, Reasoning on Web Data Semantics

Pierre Senellart, Télécom ParisTech, Social networks

Course summaries

The relational model

We discuss the relational model, which underlies database management systems. This model considerably simplifies data management by acting as a mediator between humans and machines.
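
As a minimal sketch of this mediator role, the example below uses Python's built-in sqlite3 module. The tables film and screening, their columns and their rows are hypothetical, chosen only for illustration. The user states what is wanted, in a form close to the first-order formula { d | there exists t such that film(t, d) and screening(t, city) }, and the system decides how to compute it.

```python
import sqlite3

def directors_showing_in(city):
    """Directors of films screened in `city` (hypothetical toy schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE film(title TEXT, director TEXT);
        CREATE TABLE screening(title TEXT, city TEXT);
        INSERT INTO film VALUES ('Alien', 'Scott'), ('Brazil', 'Gilliam');
        INSERT INTO screening VALUES
            ('Alien', 'Paris'), ('Brazil', 'Lyon'), ('Brazil', 'Paris');
    """)
    # Declarative query: a join on title plus a selection on city.
    # The engine, not the user, chooses the evaluation strategy.
    rows = conn.execute(
        "SELECT DISTINCT f.director "
        "FROM film f JOIN screening s ON f.title = s.title "
        "WHERE s.city = ? ORDER BY f.director",
        (city,),
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]
```

Calling directors_showing_in("Paris") on this toy data returns the directors of both films, in alphabetical order.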

Beyond the relational model

We look at models that try to go further than the relational model, or to do better. Interestingly, while the main predecessors of the relational model were based on trees and graphs, its most celebrated successors are also based on trees and graphs.

The Semantic Web

The goal of the Semantic Web is to ease access to information and knowledge. The aim is to improve the precision of search results and to ease the integration of distinct sources. The idea is to publish knowledge that machines can understand, rather than text better suited to humans.

Active documents and AXML

We look at collaboration between autonomous, heterogeneous servers to manage data. To this end, we use trees that include calls to functions (Web services). These functions capture the notion of views (intensional data located elsewhere) and make it possible to specify distributed computation.
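
The idea of a tree that embeds function calls can be sketched as follows. This is only an illustration and not the actual AXML syntax: the document is a nested Python structure, a node of the form {"call": name, "args": [...]} stands for a call to a service, and the services dictionary is a hypothetical stand-in for remote Web services.

```python
def materialize(node, services):
    """Return a copy of the tree in which every call node is replaced
    by the result of the corresponding service."""
    if isinstance(node, dict) and "call" in node:
        # Intensional data: the content lives elsewhere and is fetched on demand.
        result = services[node["call"]](*node.get("args", []))
        # A result may itself contain further calls, so recurse on it.
        return materialize(result, services)
    if isinstance(node, dict):
        return {key: materialize(value, services) for key, value in node.items()}
    if isinstance(node, list):
        return [materialize(value, services) for value in node]
    return node  # plain (extensional) value


# Hypothetical service standing in for a remote Web service.
services = {"forecast": lambda city: {"city": city, "temp": 21}}

document = {
    "newspaper": {
        "title": "Daily",
        "weather": {"call": "forecast", "args": ["Paris"]},
    }
}
```

Here materialize(document, services) replaces the weather node by the data returned by the forecast service, while the rest of the tree is copied unchanged. Leaving a call unevaluated keeps the data intensional, which is what makes the call behave like a view.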

Web search engines

We explain how, starting from a query of a few keywords, the search engine retrieves the pages that seem most relevant, relying mainly on an index of the Web and on an algorithm that ranks pages by popularity.
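
The two ingredients can be sketched on a toy scale. The summary does not name the ranking algorithm; the code below assumes a PageRank-style power iteration as one well-known popularity measure. The function names, the damping factor of 0.85 and the iteration count are illustrative assumptions.

```python
from collections import defaultdict

def build_index(pages):
    """Inverted index: word -> set of URLs whose text contains the word."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def popularity(links, damping=0.85, iterations=50):
    """PageRank-style scores; `links` maps each page to the pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for page, outgoing in links.items():
            # A page without outgoing links spreads its score over all pages.
            targets = outgoing if outgoing else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

def search(query, index, rank):
    """Pages containing every query word, most popular first."""
    words = query.lower().split()
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits, key=lambda url: (-rank[url], url))
```

The index answers "which pages contain these words" without scanning the pages at query time, and the popularity scores are computed once from the link graph, independently of any query.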

Datalog: the revival

Datalog, proposed in the 1970s, introduces recursion into the positive fragment of relational queries. We describe the main features of the language and discuss its revival in recent years.
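
A classic example of what recursion adds is reachability in a graph, which the two Datalog rules path(x, y) :- edge(x, y) and path(x, z) :- path(x, y), edge(y, z) express. The sketch below evaluates these two rules in Python by naive bottom-up iteration until a fixpoint is reached. It is an illustration of the semantics, not an implementation of a Datalog engine.

```python
def transitive_closure(edges):
    """Naive bottom-up evaluation of:
         path(x, y) :- edge(x, y).
         path(x, z) :- path(x, y), edge(y, z).
    `edges` is a set of (source, target) pairs."""
    path = set(edges)  # first rule: every edge is a path
    while True:
        # Second rule: join the current path facts with edge on the middle node.
        derived = {(x, z) for (x, y) in path for (y2, z) in edges if y == y2}
        if derived <= path:  # fixpoint: no new fact can be derived
            return path
        path |= derived
```

Because the rules only add facts and the set of possible pairs is finite, the loop always terminates, including on cyclic graphs.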

Distributed data management

Distribution makes it possible to improve the performance of data management systems. Distribution also arises in many applications where the data is naturally spread across several systems. With the Web, distributed data management has taken on considerable importance.

Distributed Datalog and Webdamlog

We present the WebdamLog language and the system of the same name. This is our recent work on a distributed Datalog.

Seminar abstracts, French version

Database Queries – Logic and Complexity, Moshe Vardi, Rice University

Scientific Data Management, Anastasia Ailamaki, E.P.F. Lausanne

Static Analysis and Verification, Victor Vianu, U.C. San Diego

Data crowdsourcing, Tova Milo, Tel Aviv University

Extracting data from the Web, Georg Gottlob, Oxford University

Gathering knowledge on the Web, Gerhard Weikum, Max-Planck-Institut

Open Data, François Bancilhon

Opening up public data (open data) means making the data collected, managed and used by public authorities available for access and reuse by citizens and by organizations, public or private. In most democracies, more and more data is being made public through this movement, launched in the United States in 2009 by the data.gov initiative. This flow of new data presents both a major opportunity and significant technological challenges. The opportunity lies in the new applications and new uses that can be made of this data, and in the new understanding gained by the citizens who access it. The challenges are many: the data is often poorly structured and formatted, it is sometimes of mediocre quality, and it is fragmented into thousands or millions of files containing complementary or duplicated information. Several approaches can be used to exploit this fragmented, loosely structured data of variable quality. One can leave the data as is and move the intelligence into the application that uses it, often through a search engine. One can use a Semantic Web approach, converting the data to RDF and establishing links between identified entities. Finally, one can structure the data as databases, some of them aligned on common attributes (for example space and time).

A graduate of the École des Mines de Paris, with a PhD from the University of Michigan and a Thèse d'État from the University of Paris XI, François Bancilhon has had a dual career: first in academic research (INRIA, MCC and the University of Paris XI), then in industry, where as an entrepreneur he co-founded and/or managed several companies (O2 Technology, Arioso, Xylème, Ucopia, Mandriva, Data Publica). He has divided his professional life between France and the United States. He is currently executive director of the Mobile Services Initiative for INRIA, a group that coordinates the mobile ecosystem, and CEO of Data Publica, a company that runs a "Data Store" and develops custom data sets for its customers.

Web archiving, Julien Masanès

The Web is the largest source of open information ever produced in history. Exceeding the sphere of print by several orders of magnitude, it also has features unprecedented in the media that came before it, such as collective publishing in which a large fraction of humanity takes part, even marginally, complex temporal dynamics, and the paradoxical character of the traces it creates, at once ubiquitous and fragile.

These unique features are also what make it a major source of information, analysis and study, and make preserving its memory an important issue for the future. They nevertheless require rebuilding the centuries-old methods and practices of preserving cultural artefacts.

In this talk we analyze the properties of the Web from the particular angle offered by the problem of its preservation, and present some thoughts on how its memory can be built to serve science in the future.

Julien Masanès, a library curator, is the co-founder and director of the Internet Memory Foundation for the preservation of, and access to, the memory of the Internet. He was previously in charge of the Web Archiving project at the Bibliothèque nationale de France. He took an active part in creating the International Internet Preservation Consortium (IIPC), which brings together 27 national libraries and the Internet Archive, and he coordinated the consortium during its first two years. He also initiated and chairs the International Web Archiving Workshop (IWAW), the main conference in the field of Internet preservation. He edited the first book on the subject (Web Archiving, Springer 2006).

Julien Masanès studied philosophy (DEA, Sorbonne, 1992), cognitive science (DEA, EHESS, 1994) and library science (DCB, ENSSIB, 2000).

Reasoning on Web Data Semantics, Marie-Christine Rousset

Taking into account the semantics of data extracted from the Web is fundamental to building reliable applications that integrate this data. Associating a formal semantics with the multi-source, multi-form data of the Web is a challenge, but also a key to solving, robustly and generically through automated reasoning techniques, hard problems such as interoperability between distributed and heterogeneous resources, or the verification of formally specified security or quality-of-service properties. Taking semantics into account is also essential in information retrieval and in query evaluation on the Web. Much work from the Semantic Web community has been devoted to describing the semantics of applications by building ontologies. However, the reasoning problems considered on ontologies are too often disconnected from the data whose semantics they describe. Only a few works consider ontologies as query interfaces between users or applications and data. These works show, through complexity arguments, the limits that must be placed on the expressive power of ontologies if query evaluation algorithms are to scale.

In this talk, we show that description logics are a good model for describing the semantics of Web data, but that restrictions are needed to obtain reasoning algorithms that scale, both for detecting inconsistencies or logical correlations between data or data sources, and for computing the set of answers to conjunctive queries. We show that the DL-Lite family has good properties for combining reasoning and data management at large scale, which make it a good candidate as a data model for the Semantic Web.

Social networks, Pierre Senellart

Pierre Senellart is an associate professor (maître de conférences) in the DBWeb team at Télécom ParisTech, the leading French engineering school specializing in information technology. An alumnus of the École normale supérieure, he obtained his Master's degree (2003) and PhD (2007) in computer science from Université Paris-Sud, under the supervision of Serge Abiteboul. Pierre Senellart has published articles in internationally renowned conferences and journals (PODS, AAAI, VLDB Journal, Journal of the ACM, etc.). He has served on the program committees of, and helped organize, various international conferences and workshops (including WWW, CIKM, ICDE, VLDB, SIGMOD, ICDT). He is also the information director of the Journal of the ACM. His research interests focus on theoretical aspects of database management systems and on the World Wide Web, more precisely on intensional indexing of the hidden Web, probabilistic XML databases and graph mining. He also has an interest in natural language processing and has collaborated with SYSTRAN, the leading machine translation company.

Seminar abstracts, English version

Database Queries – Logic and Complexity, Moshe Y. Vardi

Mathematical logic emerged during the early part of the 20th century, out of a foundational investigation of mathematics, as the basic language of mathematics. In 1970 Codd proposed the relational database model, based on mathematical logic: logical structures offer a way to model data, while logical formulas offer a way to express database queries. This proposal gave rise to a multi-billion dollar relational database industry as well as a rich theory of logical query languages.

This talk will offer an overview of how mathematical logic came to provide foundations for one of today's most important technologies, and show how the theory of logical queries offers deep insights into the computational complexity of evaluating relational queries.

Moshe Y. Vardi is the George Distinguished Service Professor in Computational Engineering and Director of the Ken Kennedy Institute for Information Technology at Rice University. He is the co-recipient of three IBM Outstanding Innovation Awards, the ACM SIGACT Gödel Prize, the ACM Kanellakis Award, the ACM SIGMOD Codd Award, the Blaise Pascal Medal, and the IEEE Computer Society Goode Award. He is the author and co-author of over 400 papers, as well as two books: Reasoning about Knowledge and Finite Model Theory and Its Applications. He is a Fellow of the Association for Computing Machinery, the American Association for Artificial Intelligence, the American Association for the Advancement of Science, and the Institute of Electrical and Electronics Engineers. He is a member of the US National Academy of Engineering, the American Academy of Arts and Sciences, the European Academy of Science, and Academia Europaea. He holds honorary doctorates from Saarland University in Germany and Orleans University in France. He is the Editor-in-Chief of the Communications of the ACM.

Scientific Data Management: Not your everyday transaction, Anastasia Ailamaki

Today's scientific processes heavily depend on fast and accurate analysis of experimental data. Scientists are routinely overwhelmed by the effort needed to manage the volumes of data produced either by observing phenomena or by sophisticated simulations. As database systems have proven inefficient, inadequate, or insufficient to meet the needs of scientific applications, the scientific community typically uses special-purpose legacy software. When compared to a general-purpose data management system, however, application-specific systems require more resources to maintain, and in order to achieve acceptable performance they often sacrifice data independence and hinder the reuse of knowledge. With the exponential growth of dataset sizes, data management technologies are no longer a luxury; they are the sole solution for scientific applications.

I will discuss some of the work from teams around the world and the requirements of their applications, as well as how these translate to challenges for the data management community. As an example I will describe a challenging application on brain simulation data, and its needs; I will then present how we were able to simulate a meaningful percentage of the human brain as well as access arbitrary brain regions fast, independently of increasing data size or density. Finally I will present some of the data management challenges that lie ahead in domain sciences.

Anastasia Ailamaki is a Professor of Computer Sciences at the Ecole Polytechnique Federale de Lausanne (EPFL) in Switzerland. Her research interests are in database systems and applications, and in particular (a) in strengthening the interaction between the database software and emerging hardware and I/O devices, and (b) in automating database management to support computationally demanding and data-intensive scientific applications. She has received a Finmeccanica endowed chair from the Computer Science Department at Carnegie Mellon (2007), a European Young Investigator Award from the European Science Foundation (2007), an Alfred P. Sloan Research Fellowship (2005), seven best-paper awards at top conferences (2001-2011), and an NSF CAREER award (2002). She earned her Ph.D. in Computer Science from the University of Wisconsin-Madison in 2000. She is a member of IEEE and ACM, and has also been a CRA-W mentor.

Open Data, François Bancilhon

Open Data consists in making PSI (public sector information) available to the general public and to private and public organizations for access and reuse. More and more open data is becoming available in most democratic countries following the launch of the data.gov initiative in the US in 2009. The availability of this new information brings a number of opportunities and raises a number of challenges. The opportunities are the new applications that companies and organisations can build using this data and the new understanding given to the people who access it. The challenges are the following: most of this data is in a poor format (poorly structured xls tables or in some cases even pdf), it is often of poor quality, and it is fragmented into thousands or millions of files with duplicate and/or complementary information. To use these fragmented, poorly structured and poor quality files, several approaches can be used, not necessarily mutually exclusive. One is to move the intelligence from the data into the application and to develop search-based applications which directly manage the data as is. Another is to bring some order to the data using a semantic web approach: converting the data to RDF, identifying entities and linking them from one data set to the other. A final one is to structure the data by aligning data sets on common attributes and structure, to get closer to a uniform database schema.

François is currently CEO of Data Publica, a key actor of the Open Data space in France and CEO of the Mobile Services Initiative for INRIA. He has co-founded and/or managed several software startups in France and in the US (Data Publica, Mandriva, Arioso, Xyleme, Ucopia, O2 Technology). Before becoming an entrepreneur, François was a researcher and a university professor, in France and the US, specializing in database technology. François holds an engineering degree from the École des Mines de Paris, a PhD from the University of Michigan and a Doctorate from the University of Paris XI.

Web archiving, Julien Masanès

The Web represents the largest source of open information ever produced in history. Larger than the printed sphere by several orders of magnitude, it also exhibits specific characteristics compared to traditional media, such as its collaborative editing, in which a large fraction of humanity participates, even marginally, its complex dynamics and the paradoxical nature of the traces it conveys, both ubiquitous and fragile at the same time.

These unique features have also led the web to become a major source for modern information, analysis and study, and have made the capacity to preserve its memory an important issue for the future. But these features also require laying new methodological and practical foundations in the well-established field of cultural artefact preservation.

This presentation will outline the salient properties of the Web viewed from the somewhat different angle of its preservation and offer some insight into how its memory can be built to serve science in the future.

Julien Masanès is Director of the Internet Memory Foundation, a non-profit foundation for web preservation and digital cultural access. Before this he directed the Web Archiving Project at the Bibliothèque Nationale de France from 2000. He also actively participated in the creation of the International Internet Preservation Consortium (IIPC), which he coordinated during its first two years. He contributes to various national and international initiatives and advises the European Commission as an expert in the domain of digital preservation and web archiving. He also launched and presently chairs the International Web Archiving Workshop (IWAW) series, the main international rendezvous in this field.

Julien Masanès studied Philosophy and Cognitive Science, gaining his MS in Philosophy from the Sorbonne in 1992 and his MS in Cognitive Science from the Ecole des Hautes Etudes en Sciences Sociales (EHESS) in 1994. In 2000 he gained an MS in librarianship at the Ecole Nationale Supérieure des Sciences de l'information et des Bibliothèques (ENSSIB).

Static Analysis and Verification, Victor Vianu

Correctness and good performance are essential desiderata for database systems and the many applications relying on databases. Indeed, bugs and performance problems are commonly encountered in such systems and can range from annoying to catastrophic. Static analysis and verification provide tools for automatic reasoning about queries and applications in order to guarantee desirable behavior. Unfortunately, such reasoning, carried out by programs that take as input other programs, quickly runs against fundamental limitations of computing. In the cases when it is feasible, it often requires a sophisticated mix of techniques from logic and automata theory. This talk will discuss some of the challenges and intrinsic limitations of static analysis and verification and identify situations where it can be very effective.

Victor Vianu is a Professor of Computer Science at the University of California, San Diego. He received his PhD in Computer Science from the University of Southern California in 1983. He has spent sabbaticals at INRIA, Ecole Normale Superieure (Cachan and Ulm) and Telecom Paris. Vianu's interests include database theory, computational logic, and Web data. His most recent research focuses on static analysis of XML-based systems, and specification and verification of data-driven Web services and workflows. Vianu's publications include over 100 research articles and a graduate textbook on database theory. He received the PODS Alberto Mendelzon Test-of-Time Award in 2010 and has given numerous invited talks including keynotes at PODS, ICDT, STACS, the Annual Meeting of the Association for Symbolic Logic, and the Federated Logic Conference. Vianu has served as General Chair of SIGMOD and PODS, and Program Chair of the PODS and ICDT conferences. He is currently Editor-in-Chief of the Journal of the ACM and Area Editor of ACM Transactions on Computational Logic. He was elected Fellow of the ACM in 2006.

Data crowdsourcing, Tova Milo

Crowd-based data sourcing is a new and powerful data procurement paradigm that engages Web users to collectively contribute data, analyze information and share opinions. Crowd-based data sourcing democratizes data collection, cutting companies' and researchers' reliance on stagnant, overused datasets, and bears great potential for revolutionizing our information world. Yet, triumph has so far been limited to only a handful of successful projects such as Wikipedia or IMDb. This comes notably from the difficulty of managing huge volumes of data and users of questionable quality and reliability. Every single initiative had to battle, almost from scratch, the same non-trivial challenges. The ad hoc solutions, even when successful, are application specific and rarely sharable. In this talk we consider the development of solid scientific foundations for Web-scale data sourcing. We believe that such a principled approach is essential to obtain knowledge of superior quality, to realize the task more effectively and automatically, to be able to reuse solutions, and thereby to accelerate the pace of practical adoption of this new technology that is revolutionizing our life. We will consider the logical, algorithmic, and methodological foundations for the management of large-scale crowd-sourced data as well as the development of applications over such information.

Tova Milo received her Ph.D. degree in Computer Science from the Hebrew University, Jerusalem, in 1992. After graduating she worked at the INRIA research institute in Paris and at the University of Toronto and returned to Israel in 1995, joining the School of Computer Science at Tel Aviv University where she is now a full Professor and Department head. Her research focuses on advanced database applications such as data integration, XML and semi-structured information, Web-based applications and Business Processes, studying both theoretical and practical aspects. Tova served as the Program Chair of several international conferences, including PODS, ICDT, VLDB, XSym, and WebDB. She is a member of the VLDB Endowment and the ICDT executive board and is an editor of TODS, the VLDB Journal and the Logical Methods in Computer Science Journal. She has received grants from the Israel Science Foundation, the US-Israel Binational Science Foundation, the Israeli and French Ministry of Science and the European Union. She is a recipient of the 2010 ACM PODS Alberto O. Mendelzon Test-of-Time Award and of the prestigious EU ERC Advanced Investigators grant.

Extracting Data from the Web, Georg Gottlob

This talk deals with the problem of semi-automatically and fully automatically extracting data from the Web. Data on the web are usually presented to meet the eye, and are not structured. To use these data in business data processing applications, they need to be extracted and structured. In the first part of this seminar, the need of web data extraction is illustrated using examples from the business intelligence area. In the second part, a theory of web data extraction based on monadic second-order logic and monadic datalog is presented, and some complexity results are discussed. The third part of this talk briefly illustrates the Lixto tool for semi-automatic data extraction. This datalog-based tool has been used for a variety of commercial applications. Finally, in the fourth part of the talk we discuss the problem of fully automated data extractions from domain-specific web pages and present first results of the DIADEM project, which is funded by an ERC advanced grant at Oxford University.

Georg Gottlob is a Professor of Informatics at Oxford University, a Fellow of St John's College, Oxford, and an Adjunct Professor at TU Wien. His interests include data extraction, database theory, graph decomposition techniques, AI, knowledge representation, logic and complexity. Gottlob has received the Wittgenstein Award from the Austrian National Science Fund, is an ACM Fellow, an ECCAI Fellow, a Fellow of the Royal Society, and a member of the Austrian Academy of Sciences, the German National Academy of Sciences, and the Academia Europaea. He chaired the Program Committees of IJCAI 2003 and ACM PODS 2000, was the Editor in Chief of the journal Artificial Intelligence Communications, and is currently a member of the editorial boards of journals such as CACM and JCSS. He is the main founder of Lixto (www.lixto.com), a company that provides tools and services for web data extraction. Gottlob was recently awarded an ERC Advanced Investigator's Grant for the project "DIADEM: Domain-centric Intelligent Automated Data Extraction Methodology" (see also http://diadem.cs.ox.ac.uk/). More information on Georg Gottlob can be found on his Web page: http://www.cs.ox.ac.uk/people/georg.gottlob/

Knowledge Harvesting from the Web, Gerhard Weikum

The proliferation of knowledge-sharing communities such as Wikipedia and the progress in scalable information extraction from Web and text sources have enabled the automatic construction of very large knowledge bases. Recent endeavors of this kind include academic research projects such as DBpedia, KnowItAll, ReadTheWeb, and YAGO-NAGA, as well as industrial ones such as Freebase and Trueknowledge. These projects provide automatically constructed knowledge bases of facts about named entities, their semantic classes, and their mutual relationships. Such world knowledge in turn enables cognitive applications and knowledge-centric services like disambiguating natural-language text, deep question answering, and semantic search for entities and relations in Web and enterprise data.

This talk discusses recent advances, research opportunities, and open challenges along this avenue of knowledge harvesting and its applications.

Gerhard Weikum is a Scientific Director at the Max Planck Institute for Informatics in Saarbruecken, Germany, where he is leading the department on databases and information systems. He is also an Adjunct Professor at Saarland University, and a principal investigator of the DFG Cluster of Excellence on Multimodal Computing and Interaction. Earlier he held positions at Saarland University in Saarbruecken, Germany, at ETH Zurich, Switzerland, at MCC in Austin, Texas, and he was a visiting senior researcher at Microsoft Research in Redmond, Washington. He graduated from the University of Darmstadt, Germany.

Gerhard Weikum's research spans transactional and distributed systems, self-tuning database systems, DB&IR integration, and the automatic construction of knowledge bases from Web and text sources. He co-authored a comprehensive textbook on transactional systems, received the VLDB 10-Year Award for his work on automatic DB tuning, and is one of the creators of the Yago knowledge base. Gerhard Weikum is an ACM Fellow, a Fellow of the German Computer Society, and a member of the German Academy of Science and Engineering. He has served on various editorial boards, including Communications of the ACM, and as program committee chair of conferences like ACM SIGMOD, Data Engineering, and CIDR. From 2003 through 2009 he was president of the VLDB Endowment. He received the ACM SIGMOD Contributions Award in 2011.

Reasoning on Web Data Semantics, Marie-Christine Rousset

Providing efficient and high-level services for integrating, querying and managing Web data raises many difficult challenges, because data are becoming ubiquitous, multi-form, multi-source and multi-scale. Data semantics is probably one of the keys for attacking those challenges in a principled way. A lot of effort has been devoted in the Semantic Web community to describing the semantics of information through ontologies. In this talk, we will show that description logics provide a good model for specifying ontologies over Web data (described in RDF), but that restrictions are necessary in order to obtain scalable algorithms for checking data consistency and answering conjunctive queries. We will show that the DL-Lite family has good properties for combining ontological reasoning and data management at large scale, and is thus a good candidate for being a Semantic Web data model.
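
To give a flavor of how ontological reasoning and data management can be combined, the sketch below answers a single-concept query over RDF-like triples by first rewriting the query with the ontology and then evaluating the rewriting directly on the data. The triples, the class names and the `("exists", p)` encoding are assumptions made for this example. It handles only a small fragment in the spirit of DL-Lite (inclusions with an atomic concept on the right-hand side); inverse roles, existential right-hand sides, negative inclusions and general conjunctive queries are left out.

```python
# Data (ABox) as RDF-like triples; all names are hypothetical
TRIPLES = {
    ("alice", "rdf:type", "Professor"),
    ("bob", "teaches", "db101"),
    ("carol", "rdf:type", "Student"),
}

# Ontology (TBox): pairs (sub, sup) meaning "sub is included in sup".
# A left-hand side ("exists", p) stands for "has some p-successor".
TBOX = [
    ("Professor", "Teacher"),
    (("exists", "teaches"), "Teacher"),
    ("Teacher", "Person"),
    ("Student", "Person"),
]

def rewrite(concept, tbox):
    """Collect every left-hand side whose instances are entailed to belong to `concept`."""
    result, frontier = {concept}, [concept]
    while frontier:
        c = frontier.pop()
        for sub, sup in tbox:
            if sup == c and sub not in result:
                result.add(sub)
                frontier.append(sub)
    return result

def answer(concept, triples, tbox):
    """Evaluate the rewritten query (a union of simple lookups) on the raw data."""
    answers = set()
    for c in rewrite(concept, tbox):
        if isinstance(c, tuple):  # ("exists", p): any subject of property p
            answers |= {s for (s, p, _) in triples if p == c[1]}
        else:  # atomic concept: subjects explicitly typed with it
            answers |= {s for (s, p, o) in triples if p == "rdf:type" and o == c}
    return answers

print(sorted(answer("Teacher", TRIPLES, TBOX)))
print(sorted(answer("Person", TRIPLES, TBOX)))
```

The point of the design is that the ontology is used only at rewriting time, so the data itself can stay in an ordinary store and be queried without materializing inferred facts.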

Marie-Christine Rousset is a Professor of Computer Science at the University of Grenoble. She is an alumna of the École normale supérieure (Fontenay-aux-Roses), from which she graduated in Mathematics (1980). She obtained a PhD (1983) and a Thèse d'État (1988) in Computer Science from Université Paris-Sud. Her areas of research are Knowledge Representation, Information Integration, Pattern Mining and the Semantic Web. She has published over 90 refereed international journal articles and conference papers, and participated in several cooperative industry-university projects. She received a best paper award from AAAI in 1996 and was named an ECCAI Fellow in 2005. She has served on many program committees of international conferences and workshops and on the editorial boards of several journals. She was a junior member of the IUF (Institut Universitaire de France) from 1997 to 2001, and was appointed a senior member of the IUF in 2011 to develop a five-year research project on Artificial Intelligence and the Web.

Social networks, Pierre Senellart

Social networking services on the Web are a tremendously popular way to connect with friends, publish content, and share information. We will talk about some of the research challenges they present: 1) How to crawl, index, and query social networks? 2) How to explain the particular small-world characteristics of social networking graphs? 3) How to use social connections to improve the quality of Web search or recommendations?
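
The small-world characteristics mentioned in the second question are usually summarized by two measurements: a high local clustering coefficient and a short average path length. The sketch below computes both on a hypothetical six-node graph (two triangles joined by one edge); the graph and the function names are assumptions made for this example, and real social graphs would call for a graph library and sampling.

```python
from collections import deque
from itertools import combinations

# Hypothetical undirected graph: two triangles (a, b, c) and (d, e, f) joined by c-d
EDGES = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"),
         ("d", "e"), ("e", "f"), ("d", "f")]

def adjacency(edges):
    """Build a symmetric adjacency map: node -> set of neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def clustering(adj, v):
    """Fraction of pairs of neighbors of v that are themselves connected."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))

def avg_path_length(adj):
    """Mean shortest-path length over all ordered pairs of connected nodes (BFS)."""
    total = pairs = 0
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

ADJ = adjacency(EDGES)
print(clustering(ADJ, "a"), clustering(ADJ, "c"), avg_path_length(ADJ))
```

Node `a` sits entirely inside a triangle, so all of its neighbor pairs are connected, while node `c` also touches the bridge and therefore has a lower coefficient.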

Dr. Pierre Senellart is an Associate Professor in the DBWeb team at Télécom ParisTech, the leading French engineering school specialized in information technology. He is an alumnus of the École normale supérieure and obtained his M.Sc. (2003) and his Ph.D. (2007) in computer science from Université Paris-Sud, studying under the supervision of Serge Abiteboul. Pierre Senellart has published articles in internationally renowned conferences and journals (PODS, AAAI, VLDB Journal, Journal of the ACM, etc.). He has been a member of the program committee and participated in the organization of various international conferences and workshops (including WWW, CIKM, ICDE, VLDB, SIGMOD, ICDT). He is also the Information Director of the Journal of the ACM. His research interests focus on theoretical aspects of database management systems and the World Wide Web, and more specifically on the intensional indexing of the deep Web, probabilistic XML databases, and graph mining. He also has an interest in natural language processing, and has been collaborating with SYSTRAN, the leading machine translation company.