Outils logiciels pour les cours Paris II
Cours Paris II
Stages/ Thèses/ Séminaires |
SeminairesDATABASES and DATASCIENCE 2021-2022
On montrera le lien entre les relations probabilistes et les plongements de petite dimension, qui permettent de définir des modèles d'incertitude qui garantissent une qualité des réponses avec grande probabilité. Ces approches éclairent la difficulté de définir la qualité des prédictions et des techniques d'I.A. "vérifiable" en général. Le sujet s'inspire d'un article de G. van den Broeck.
On décrit les plongements pour les mots (Word2vec) et les phrases (SBERT) qui sont récemment apparus en NLP. On montrera comment utiliser ces outils pour réaliser des prédictions.
A l'heure actuelle, les logiciels de traitement automatique du langage sont rarement suffisamment performants dans les domaines de spécialité comme le biomédical pour apporter le gain de productivité espéré. Parallèlement, il existe des ontologies ou bases de connaissances dans ces domaines qui répertorient de nombreux concepts d'intérêt. Nous cherchons à exploiter ces ressources dans le domaine médical et intégrer ces connaissances dans un modèle de langue Transformer contextualisé basé sur BERT de manière à combler l'écart de performances entre domaines général et biomédical.
Les circuits probabilistes, introduits par G. van den Broeck, essayent de caractériser les distributions calculables à partir d'autres distributions. Cette notion théorique est importante pour représenter les dépendances statistiques.
We study the query answering process in tables with nulls and functional dependencies. Given such a table, we use a variant of the chase procedure to `augment' the table (a) by adding to it all values that can be inferred from the functional dependencies and (b) by identifying all pairs of tuples that are in conflict with one or more dependencies. This process allows to associate each tuple in the `augmented' table with a `status', namely consistent or inconsistent. Based on the augmented table we propose (a) a method for consistent query answering and (b) quality measures for the data in the table, as well as for the data in consistent query answers.
The objective is to present some tools for the design of probabilistic approximation algorithms. Two types of general bounds on the tail of the distribution of the sum of independent random variables are useful: Chernoff's and Hoeffding's inequalities. For applications to random variables which are not independent, martingales are a powerful tool for bounding the divergence of a random variable from its expected value. The Azuma's inequality generalizes the preceding bounds in this case.
Given a table with nulls and functional dependencies, there are several chase procedures to qualify the data as true, inconsistent or unknown. Let $\delta$ be the associated distribution over these 3 possibilities, for example 60% true, 10% inconsistent, 30% unknown. We first show how to estimate this distribution by sampling uniformly the data, and applying the chase on the samples only. We then modify the chase, into a "conservative chase", where we replace the inconsistent values by unknown. The distribution could then be 70% true, 30% unknown. We then study how to consider two different sources on the same schema: S1 is 70% true, 30% unknown, S2 is 50% true, 50% unknown. How should we exchange these data? The goal is to show that the proportion of true should increase for each source.
2019-2020
2018-2019
Abstract. When big data sets are stored in databases and data warehouses data exploration usually involves ad hoc querying and data visualization to identify potential relationships or insights that may be hidden in the data. The objective of this work is to provide support for these activities in the context of HIFUN, a high level functional language of analytic queries proposed recently by the authors. Our contributions are: (a) we show that HIFUN queries can be partially ordered and this allows the analyst to drill down or roll up from a given query during data exploration, and (b) we introduce a visualization algebra that allows the analyst to create desirable visualizations of query results.
In this work, we consider the standard scenario whereby several data sources are to be integrated into a single database. In this scenario, all data sources are seen as consistent sets of facts meant to be either true or false. Integrating such data sources means for us combining their content so as to build a consistent database over which queries are answered using rules a la Datalog. In this scenario four cases are possible for a candidate fact ϕ: (1) all data sources that contain ϕ state that ϕ is true, (2) all data sources that contain ϕ state that ϕ is false, (3) some data sources containing ϕ state that ϕ is true and the other data sources containing ϕ state that ϕ is false, (4) no data source contains ϕ. The most intuitive and well-known way to take this situation into account is the four-value logics introduced by Belnap where the four truth values are t (expressing truth), f (expressing falsity), b (expressing contradiction) and n (expressing absence of knowledge). In our approach, the result of an integration for a fact ϕ is a pair (ϕ, v) where v is one of the truth values listed above, thus reflecting the OWA in the sense that the absence of information does not mean falsity. The integrated database D is then a pair (E, R) where E is a set of pairs such that the truth value is different from n and where R is a set of generalized DatalogNeg rules, that is rules whose body is a set of positive or negative atoms and whose head is a positive or negative atom. Since these rules allow to derive true facts and false facts, contradicting derivations are possible, and taken into account in our semantics. In this setting, we show that every database D = (E, R) has non unique minimal models (with respect to set inclusion) and we define an operator (inspired from the immediate consequence operator of Datalog) for computing one of these minimal models, which we call the semantics of D. Our choice is based on the intuition that, for an implication to hold, whenever possible, truth (resp. contradiction) should imply truth (resp. contradiction).
Analytic Query Results":In recent work we introduced HIFUN, a functional query language for studying analytic queries in the abstract and provided algorithms for mapping HIFUN queries as queries in lower level evaluation mechanisms (SQL, SPARQL, MapReduce Platforms, etc.). In our current work we show that the HIFUN queries form a complete lattice and we use this lattice to propose a formal framework that allows a user to perform three operations: - visualize : focus on a query Q, evaluate it and visualize its result or part thereof - drill-down : move to a query more informative (finer) than Q - roll-up : move to a query less informative (coarser) than Q A sequence of the above operations is called a visual exploration and the set of all visual explorations is the DROLL language.
Abstract: Let A be a large (m,n) matrix specifying that customer i likes product j. Assume A is close to a low-rank matrix and that we only know 1% of the values of A, the others are "null". Can we recover A? More precisely, can we sample an unknown value of of a row A(i)? This is precisely what Amazon et Netflix do! In DataScience, even if we only know a tiny fraction of the data, it is enough, contrary to Databases. 2017-2018 DATABASES and DATASCIENCE
Considering that panel data on one hand and multi-dimensional and multi-level sequential patterns on the other hand, deal with data with similar features, I’ll present previous work on sequential pattern mining. The question is then to see how these frequent patterns could help in achieving better predictions in the context of panel data.
I'll review basic techniques to make predictions in the context of panel data. I will then argue for the integration of panel data with external information. Twitter can be viewed as a global filter which aggregates the outside world and provides such external information. The challenge is to show how to achieve better predictions. A case study are Time series for the cryptocurrencies such as Bitcoins or Ripples correlated with #Bitcoin, #Ripple, #xrp on Twitter. 2016-2017 DATABASES versus DATASCIENCE
We use the HiFun language presented in our previous meeting as a common formal framework for studying rewriting of analytic queries, whether they are expressed as map-reduce jobs or as SQL Group-by queries. To do this, we first show that there are 1-1 mappings between Hifun and Map-Reduce Jobs on the one hand and between HiFun and SQL Group-by queries on the other. We then derive a rewriting sheme that works as follows: 1. We abstract the map-reduce job or the group-by query as a HiFun query. 2. We rewrite the HiFun abstraction (using the rewriting rules of HiFun). 3. We encode the rewritten HiFun abstraction in map-reduce or in SQL.
TGD satisfaction implies in many cases to allow null values in the database instance. Whereas query answering in this framework has been the subject of significant research work (for instance in the domain of graph databases with blank nodes or in the domain of data transfer), the problem of updates is still open. In this presentation, we first motivate our work with examples and then we outline our approach for inserting or deleting sets of tuples in this framework.
For a stream of graph edges from a Social Network, we approximate the communities as the large connected components of the edges in a reservoir sampling, without storing the entire edges of the graph. We show that for a model of random graphs which follows a power law degree distribution, the community detection algorithm is a good approximation. Given two streams of graph edges from two Sources, we define the Community Correlation as the fraction of the nodes in communities in both streams. Although we do not store the edges of the streams, we can approximate the Community Correlation and define the Integration of two streams. We illustrate this approach with Twitter streams. We then study the extension to spanning trees.
I will present HiFun, a high level functional query language for data analytics and will show how one can use this language to give the formal definition of a recursive analytic query, as well as of result visualization and visual result exploration. This is common work conducted jointly with Tsuyoshi Sugibuchi of Custoemer Matrix.
Parmi les nouveaux modèles de bases de données connus sous le nom de NoSQL, RDF(S) se situe parmi les plus répandus. Ce modèle fait ainsi l’objet de nombreuses recherches notamment en ce qui concerne sa sémantique et les langages de requêtes associés. Toutefois, la problématique des mises à jour des bases de données RDF(S) pose de nouveaux défis liés au modèle. L’exposé sera consacré à nos travaux actuels sur les mises à jour dans le contexte des bases de données déductives qui généralise celui des bases RDF(S). Après avoir introduit présenté les principales caractéristiques de notre approche, nous montrerons comment celle-ci s’applique au cas de RDF(S) puis nous discuterons certaines extensions possibles de notre travail.
Uncertainty can be treated in many different ways. I'll review the recent results of D. Suciu (VLDB 2012) on probabilistic databases and contrast the approach with approximate randomized algorithms which guarantee the quality of the approximation. |