
The Origin of Data: Enabling the Determination of Provenance in Multi-institutional Scientific Systems through the Documentation of Processes

  • Computer Science
  • Design
  • Engineering
  • Mathematics
  • Medicine


The Oxford English Dictionary defines provenance as (i) the fact of coming from some particular source or quarter; origin, derivation. (ii) the history or pedigree of a work of art, manuscript, rare book, etc.; concr., a record of the ultimate derivation and passage of an item through its various owners. In art, knowing the provenance of an artwork lends weight and authority to it while providing a context for curators and the public to understand and appreciate the work’s value. Without such a documented history, the work may be misunderstood, unappreciated, or undervalued. In computer systems, knowing the provenance of digital objects would provide them with greater weight, authority, and context just as it does for works of art. Specifically, if the provenance of digital objects could be determined, then users could understand how documents were produced, how simulation results were generated, and why decisions were made.

Provenance is of particular importance in science, where experimental results are reused, reproduced, and verified. However, science is increasingly being done through large-scale collaborations that span multiple institutions, which makes the problem of determining the provenance of scientific results significantly harder. Current approaches to this problem are not designed specifically for multi-institutional scientific systems and their evolution towards greater dynamic and peer-to-peer topologies. Therefore, this thesis advocates a new approach, namely, that through the autonomous creation, scalable recording, and principled organisation of documentation of systems’ processes, the determination of the provenance of results produced by complex multi-institutional scientific systems is enabled.

The dissertation makes four contributions to the state of the art. First is the idea that provenance is a query performed over documentation of a system’s past process. Thus, the problem is one of how to collect and collate documentation from multiple distributed sources and organise it in a manner that enables the provenance of a digital object to be determined. Second is an open, generic, shared, principled data model for documentation of processes, which enables its collation so that it provides high-quality evidence that a system’s processes occurred. Once documentation has been created, it is recorded into specialised repositories called provenance stores using a formally specified protocol, which ensures documentation has high-quality characteristics. Furthermore, patterns and techniques are given to permit the distributed deployment of provenance stores. The protocol and patterns are the third contribution. The fourth contribution is a characterisation of the use of documentation of process to answer questions related to the provenance of digital objects and the impact recording has on application performance. Specifically, in the context of a bioinformatics case study, it is shown that six different provenance use cases are answered given an overhead of 13% on experiment runtime. Beyond the case study, the solution has been applied to other applications including fault tolerance in service-oriented systems, aerospace engineering, and organ transplant management.
