Outils logiciels pour les cours Paris II

Cours Paris II

Stages/ Thèses/ Séminaires

Laboratoire

edit SideBar

Seminaires

DATABASES and DATASCIENCE

2021-2022

  • Mercredi 8 Juin 14h-17h
    • 14h Michel de Rougemont: Embeddings and probabilistic relations

On montrera le lien entre les relations probabilistes et les plongements de petite dimension, qui permettent de définir des modèles d'incertitude qui garantissent une qualité des réponses avec grande probabilité. Ces approches éclairent la difficulté de définir la qualité des prédictions et des techniques d'I.A. "vérifiable" en général. Le sujet s'inspire d'un article de G. van den Broeck.

  • 14h30 Achraf Lassoued: Embeddings et Prédiction en langage naturel

On décrit les plongements pour les mots (Word2vec) et les phrases (SBERT) qui sont récemment apparus en NLP. On montrera comment utiliser ces outils pour réaliser des prédictions.

  • 15H Guilhem Piat: Enrichissement de plongements contextualisés en connaissances biomédicales: KnowBert-UMLS

A l'heure actuelle, les logiciels de traitement automatique du langage sont rarement suffisamment performants dans les domaines de spécialité comme le biomédical pour apporter le gain de productivité espéré. Parallèlement, il existe des ontologies ou bases de connaissances dans ces domaines qui répertorient de nombreux concepts d'intérêt. Nous cherchons à exploiter ces ressources dans le domaine médical et intégrer ces connaissances dans un modèle de langue Transformer contextualisé basé sur BERT de manière à combler l'écart de performances entre domaines général et biomédical.

  • 15H30 Richard Lassaigne: Probabilistic Circuits

Les circuits probabilistes, introduits par G. van den Broeck, essayent de caractériser les distributions calculables à partir d'autres distributions. Cette notion théorique est importante pour représenter les dépendances statistiques.

  • Jeudi 17 Mars 2022, 14h-17h
    • 14h-. Dominique Laurent, Nicolas Spyratos: Consistent and Inconsistent Data

We study the query answering process in tables with nulls and functional dependencies. Given such a table, we use a variant of the chase procedure to `augment' the table (a) by adding to it all values that can be inferred from the functional dependencies and (b) by identifying all pairs of tuples that are in conflict with one or more dependencies.

This process allows to associate each tuple in the `augmented' table with a `status', namely consistent or inconsistent. Based on the augmented table we propose (a) a method for consistent query answering and (b) quality measures for the data in the table, as well as for the data in consistent query answers.

  • R. Lassaigne: Classical bounds for probabilistic algorithms
 The objective is to present some tools for the design of probabilistic approximation algorithms. 

Two types of general bounds on the tail of the distribution of the sum of independent random variables are useful: Chernoff's and Hoeffding's inequalities. For applications to random variables which are not independent, martingales are a powerful tool for bounding the divergence of a random variable from its expected value. The Azuma's inequality generalizes the preceding bounds in this case.

  • Michel de Rougemont: Estimation of the quality of Data in Data-Exchange

Given a table with nulls and functional dependencies, there are several chase procedures to qualify the data as true, inconsistent or unknown. Let $\delta$ be the associated distribution over these 3 possibilities, for example 60% true, 10% inconsistent, 30% unknown.

We first show how to estimate this distribution by sampling uniformly the data, and applying the chase on the samples only.

We then modify the chase, into a "conservative chase", where we replace the inconsistent values by unknown. The distribution could then be 70% true, 30% unknown. We then study how to consider two different sources on the same schema: S1 is 70% true, 30% unknown, S2 is 50% true, 50% unknown. How should we exchange these data? The goal is to show that the proportion of true should increase for each source.

  • Jeudi 14 Octobre 2021, 14h-17h
    • 14h-15h. Dominique Laurent: Handling Inconsistencies in Tables with Nulls and Functional Dependencies. In this work we address the problem of handling inconsistencies in tables with missing values (or nulls) and functional dependencies. Although the traditional view is that table instances must respect all functional dependencies imposed on them, it is nevertheless relevant to develop theories about how to handle instances that violate some dependencies. The usual approach to alleviate the impact of inconsistent data on the answers to a query is to introduce the notion of repair: a repair is a minimally different consistent instance and an answer is consistent if it is present in every repair. Our approach is fundamentally different: we use set theoretic semantics for tuples and functional dependencies that allow us to associate each tuple with a truth value among the following: true, false, inconsistent or unknown. The users of the table can then query the set of true tuples as usual. Regarding missing values, we make no assumptions on their existence: a missing value exists only if it is inferred from the functional dependencies of the table. In our presentation, we first introduce our approach to handle inconsistencies in a table with nulls and functional dependencies, and we outline algorithms for computing all true, inconsistent and false tuples.
    • 15h-16h. Michel de Rougemont. Data Exchange as Equilibria. Given different inconsistent Sources of Data, which include Null values, we define a center (equilibrium) which minimizes some potential function using natural distances between the Sources. We imagine a population protocol where a Source S encounters random sources S_1...S_k and may decide to update S using the center of the samples. The goal is to show that if all Sources follow this protocol, they will converge to the Equilibrium.

2019-2020

  • Mercredi 24 Juin 2020, 10h-12h, Plateforme BBB
    • 10h-1Oh30. Achraf Lassoued: Preferences as distributions on streaming data. Consider a stream of tuples for a Datawarehouse with several measures M1,...Mk. A preference is a distribution on the measures, i.e. a new measure which is a linear combination of M1,...Mk. How can be answer OLAP queries on this new measure? We show that weighted Reservoir sampling allows efficient answers even if we know the distribution after we take the samples. The same technique applies to text, with several Word2Vec systems. A preference is a distribution on these systems and we can analyse the Text with similar samples, after the distribution is known.
    • 10h30-11h. Michel de Rougemont: Null values and Statistics. Assume your database satisfies some statistical laws but has Null values. How could you use the Statistics to approximate Queries by guessing the Null values and reduce the errors in the answers? We will consider examples with random graphs which may follow a power law degree distribution. We show the connection with differentiable privacy and learning from statistical queries.
  • Mercredi 18 décembre 2019 CRED, 9h-12h30
    • 9h30-10h-15. Nicolas Spyratos: Visual Data Exploration in the Hifun Language
    • 10h15-11h. Dominique Laurent 4-valued semantics for databases
    • 11h-11h30. Achraf Lassoued: Topics in a streaming text
    • 11h30-12h. Michel de Rougemont: Null values , certain answers, good answers in databases

2018-2019

  • Mardi 21 Mai 2019 CRED, 14h-17h
    • 14h, N. Sypratos, University Paris South. Data Exploration in the HIFUN Language (ongoing work)

Abstract. When big data sets are stored in databases and data warehouses data exploration usually involves ad hoc querying and data visualization to identify potential relationships or insights that may be hidden in the data. The objective of this work is to provide support for these activities in the context of HIFUN, a high level functional language of analytic queries proposed recently by the authors. Our contributions are: (a) we show that HIFUN queries can be partially ordered and this allows the analyst to drill down or roll up from a given query during data exploration, and (b) we introduce a visualization algebra that allows the analyst to create desirable visualizations of query results.

  • 14h45, D. Laurent, Université de Cergy-Pontoise. Four-Valued Semantics for Data Integration under the OWA: a Deductive Database Approach

In this work, we consider the standard scenario whereby several data sources are to be integrated into a single database. In this scenario, all data sources are seen as consistent sets of facts meant to be either true or false. Integrating such data sources means for us combining their content so as to build a consistent database over which queries are answered using rules a la Datalog. In this scenario four cases are possible for a candidate fact ϕ: (1) all data sources that contain ϕ state that ϕ is true, (2) all data sources that contain ϕ state that ϕ is false, (3) some data sources containing ϕ state that ϕ is true and the other data sources containing ϕ state that ϕ is false, (4) no data source contains ϕ.

The most intuitive and well-known way to take this situation into account is the four-value logics introduced by Belnap where the four truth values are t (expressing truth), f (expressing falsity), b (expressing contradiction) and n (expressing absence of knowledge).

In our approach, the result of an integration for a fact ϕ is a pair (ϕ, v) where v is one of the truth values listed above, thus reflecting the OWA in the sense that the absence of information does not mean falsity. The integrated database D is then a pair (E, R) where E is a set of pairs such that the truth value is different from n and where R is a set of generalized DatalogNeg rules, that is rules whose body is a set of positive or negative atoms and whose head is a positive or negative atom. Since these rules allow to derive true facts and false facts, contradicting derivations are possible, and taken into account in our semantics.

In this setting, we show that every database D = (E, R) has non unique minimal models (with respect to set inclusion) and we define an operator (inspired from the immediate consequence operator of Datalog) for computing one of these minimal models, which we call the semantics of D. Our choice is based on the intuition that, for an implication to hold, whenever possible, truth (resp. contradiction) should imply truth (resp. contradiction).

  • 15h30. Achraf Lassoued, Univ. Paris II. Topic modelling with LDA and social networks.
  • 16h15. Michel de Rougemont, Univ. Paris II and IRIF. Some useful decompositions.
  • Mardi 27 Novembre 2018: CRED, 14h-17h
    • 14h, Nicolas Spyratos, "DROLL: A Formal Framework for Visualizing and Exploring

Analytic Query Results":In recent work we introduced HIFUN, a functional query language for studying analytic queries in the abstract and provided algorithms for mapping HIFUN queries as queries in lower level evaluation mechanisms (SQL, SPARQL, MapReduce Platforms, etc.). In our current work we show that the HIFUN queries form a complete lattice and we use this lattice to propose a formal framework that allows a user to perform three operations: - visualize : focus on a query Q, evaluate it and visualize its result or part thereof - drill-down : move to a query more informative (finer) than Q - roll-up : move to a query less informative (coarser) than Q

 A sequence of the above operations is called a visual exploration and the set of all visual explorations is the DROLL language. 
  • 14h45-15h. Guillaume Vimont (PhD candidate): The content Correlation of streaming edges, (15 mins)
  • 15h15-16h. Michel de Rougemont: Recommendation Systems.

Abstract: Let A be a large (m,n) matrix specifying that customer i likes product j. Assume A is close to a low-rank matrix and that we only know 1% of the values of A, the others are "null". Can we recover A? More precisely, can we sample an unknown value of of a row A(i)? This is precisely what Amazon et Netflix do! In DataScience, even if we only know a tiny fraction of the data, it is enough, contrary to Databases.

2017-2018

DATABASES and DATASCIENCE

  • Lundi 18 décembre 2017: CRED, 14h-17h
    • 14h, Nicolas Spyratos, Query rewriting for MapReduce jobs and Group-by queries.
    • 15h, Dominique Laurent, Mining Multi-Dimensional and Multi-Level Sequential Patterns.

Considering that panel data on one hand and multi-dimensional and multi-level sequential patterns on the other hand, deal with data with similar features, I’ll present previous work on sequential pattern mining. The question is then to see how these frequent patterns could help in achieving better predictions in the context of panel data.

  • 16H, Michel de Rougemont, Prediction & Integration of Time Series.

I'll review basic techniques to make predictions in the context of panel data. I will then argue for the integration of panel data with external information. Twitter can be viewed as a global filter which aggregates the outside world and provides such external information. The challenge is to show how to achieve better predictions. A case study are Time series for the cryptocurrencies such as Bitcoins or Ripples correlated with #Bitcoin, #Ripple, #xrp on Twitter.

2016-2017

DATABASES versus DATASCIENCE

  • Mercredi 31 Mai 2017: salle 06 à Assas, 13h15-15h30
    • 13h15-14h, N. Spyratos. Formal Approach to Rewriting Map-Reduce Jobs and SQL Group-by Queries.

We use the HiFun language presented in our previous meeting as a common formal framework for studying rewriting of analytic queries, whether they are expressed as map-reduce jobs or as SQL Group-by queries. To do this, we first show that there are 1-1 mappings between Hifun and Map-Reduce Jobs on the one hand and between HiFun and SQL Group-by queries on the other. We then derive a rewriting sheme that works as follows: 1. We abstract the map-reduce job or the group-by query as a HiFun query. 2. We rewrite the HiFun abstraction (using the rewriting rules of HiFun). 3. We encode the rewritten HiFun abstraction in map-reduce or in SQL.

  • 14h-14h45, D. Laurent. Update Semantics for Databases satisfying Tuple Generating Dependencies

TGD satisfaction implies in many cases to allow null values in the database instance. Whereas query answering in this framework has been the subject of significant research work (for instance in the domain of graph databases with blank nodes or in the domain of data transfer), the problem of updates is still open. In this presentation, we first motivate our work with examples and then we outline our approach for inserting or deleting sets of tuples in this framework.

  • 14h-45-15h30, M. de Rougemont. Approximate integration of streaming graph edges

For a stream of graph edges from a Social Network, we approximate the communities as the large connected components of the edges in a reservoir sampling, without storing the entire edges of the graph. We show that for a model of random graphs which follows a power law degree distribution, the community detection algorithm is a good approximation. Given two streams of graph edges from two Sources, we define the Community Correlation as the fraction of the nodes in communities in both streams. Although we do not store the edges of the streams, we can approximate the Community Correlation and define the Integration of two streams. We illustrate this approach with Twitter streams. We then study the extension to spanning trees.

  • Lundi 27 Février: 14h-17 au CRED: Centre de Recherche en Economie du Droit, http://cred.u-paris2.fr/ 21, Rue Valette (next to the Panthéon), 75005 PARIS
    • 14h-14h40, Nicolas Spyratos (University Paris South): Big Data Analytics - From SQL to Hadoop and beyond.

I will present HiFun, a high level functional query language for data analytics and will show how one can use this language to give the formal definition of a recursive analytic query, as well as of result visualization and visual result exploration. This is common work conducted jointly with Tsuyoshi Sugibuchi of Custoemer Matrix.

  • 14h40-15h20, T. S. (Buchi): TBD
  • 15h20-16h, Dominique Laurent (University Cergy-Pontoise, joint work with Mirian Halfeld Ferrari Alves (LIFO)): Sur les mises à jour des bases de données RDF(S)

Parmi les nouveaux modèles de bases de données connus sous le nom de NoSQL, RDF(S) se situe parmi les plus répandus. Ce modèle fait ainsi l’objet de nombreuses recherches notamment en ce qui concerne sa sémantique et les langages de requêtes associés. Toutefois, la problématique des mises à jour des bases de données RDF(S) pose de nouveaux défis liés au modèle. L’exposé sera consacré à nos travaux actuels sur les mises à jour dans le contexte des bases de données déductives qui généralise celui des bases RDF(S). Après avoir introduit présenté les principales caractéristiques de notre approche, nous montrerons comment celle-ci s’applique au cas de RDF(S) puis nous discuterons certaines extensions possibles de notre travail.

  • 16h-16h40, Michel de Rougemont (University Paris II and IRIF): Uncertainty in Data: probabilistic databases versus probabilistic algorithms.

Uncertainty can be treated in many different ways. I'll review the recent results of D. Suciu (VLDB 2012) on probabilistic databases and contrast the approach with approximate randomized algorithms which guarantee the quality of the approximation.

UP2