Affordable Access

Domain knowledge and functions in data science, application to hydroelectricity production

Authors
  • Faure-Giovagnoli, Pierre
Publication Date
Nov 24, 2023
Source
HAL-Descartes
Keywords
Language
English
License
Unknown
External links

Abstract

In this dissertation, we investigate the link between domain knowledge in the form of functions and data science. Consider the following scenario. Let D(y,z1,...,zn) be a dataset, Alice a data scientist, Bob a domain expert and y=f(z1,...,zn) a function known to Bob from his background knowledge. In this dissertation, we are interested in the following simple yet crucial questions for Alice. How to define the satisfaction of f in D? How to measure that satisfaction efficiently? How does this satisfaction relate to the supervised learning task of learning f from D? It turns out that these problems are related to the study of counterexamples through the use of functional dependencies (FDs) and, in particular, FD measures used to quantify their satisfaction in a dataset such as the g3 indicator. More specifically, we consider the case where the equality is replaced by more flexible predicates, a relaxation that is now common in the literature. First, we examine the complexity of computing g3. It is known that g3 can be computed in polynomial time when using equality, while it becomes NP-hard when using general predicates. Our goal is to refine this dichotomy by studying the impact of the following common properties: reflexivity, transitivity, symmetry, and antisymmetry. We show that symmetry and transitivity together are sufficient to guarantee that the g3 can be computed in polynomial time. However, removing one of them makes the problem NP-hard. Second, we study the computation of g3 in the polynomial and NP-hard cases identified previously. We propose different exact and approximate solutions for the computation of g3 in both cases. We compare these solutions in a detailed experimental study of time performance and approximation accuracy. All the algorithms are also made available via fastg3, a fast open-source Python library with an underlying C++ implementation. Third, we link counterexamples and g3 to supervised learning with a web application called ADESIT. ADESIT is intended to be part of an iterative data refinement process right after data selection and just before the machine learning process itself. It provides a way to evaluate the ability of a dataset to perform well for a given supervised learning problem through statistical and visual exploration. Finally, we validate our approach by applying it to the industrial problem of air gap monitoring in compact hydro-generators, and develop a solution for automatically processing the recorded data.

Report this publication

Statistics

Seen <100 times