Our modern lives are increasingly governed by ubiquitous intelligent systems and an abundance of digital data. More and more products and services provide us with better tools and recommendations for our professional, personal, and entertainment activities. A typical example is how we rely on data from search engine result pages and “knowledge panels” drawn from knowledge graphs to satisfy our information needs. Given the clear impact that such systems have on our decision-making processes, we increasingly ask ourselves: how much can we trust a system’s recommendation? Where does the data on which the system based its decision come from? How did these data travel, and under which transformations, from their source to the user?
To answer these questions, various systems and models for data provenance representation, capture, and analysis have been proposed [1,2,3]. In general, these models provide structured representations and ontologies to express fine-grained semantic relationships between data entities, activities, and agents involved in data workflows. Data provenance tracking systems can then use these models to record large-scale data provenance traces.
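To make this concrete, here is a minimal sketch (not taken from any specific system; all file and agent names are illustrative) of the kind of fine-grained provenance statements that a model such as W3C PROV expresses: relations between entities (data items), activities (processes), and agents (people or software).

```python
# Illustrative provenance trace as (subject, relation, object) triples.
# Relation names follow the PROV vocabulary; the entities, activities,
# and agents are hypothetical examples.
provenance_trace = [
    ("report.pdf", "wasGeneratedBy", "render-activity"),
    ("render-activity", "used", "cleaned.csv"),
    ("cleaned.csv", "wasDerivedFrom", "raw.csv"),
    ("render-activity", "wasAssociatedWith", "pipeline-bot"),
]

def derivation_chain(trace, entity):
    """Follow wasDerivedFrom links back to the original source entity."""
    lookup = {s: o for s, r, o in trace if r == "wasDerivedFrom"}
    chain = [entity]
    while chain[-1] in lookup:
        chain.append(lookup[chain[-1]])
    return chain

# Answering "where did this data come from?" becomes a graph traversal:
print(derivation_chain(provenance_trace, "cleaned.csv"))
```

Queries such as the derivation chain above are exactly the questions that structured provenance models are designed to answer at scale.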
However, recording provenance traces at such a fine-grained level often comes with two severe limitations: (i) real-time provenance tracking typically incurs a high performance cost due to its volume and verbosity; and (ii) provenance information that was not recorded at execution time is inherently lost. How can we design provenance models that address these limitations, and that can learn statistical patterns from past provenance traces to recreate possible future workflows?
This PhD project combines provenance representation with machine-learning architectures that have recently been deployed with great success in fields such as natural language processing, where they underpin large language models, adapting them to learn sub-symbolic models that represent and predict provenance. In particular, it proposes methods to learn provenance from large amounts of past structured provenance data [3]: representing provenance information through knowledge graph embeddings [4] that are suitable for GPU processing, and predicting and reconstructing provenance with Transformer-based architectures [5]. Within the project, these methods can further be applied to better analyse and understand provenance workflows, to predict and complete partial provenance statements in real-time systems, and to reconstruct missing provenance information.
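As a toy illustration of the embedding idea (a sketch under simplifying assumptions, not the project’s actual method), provenance triples can be scored with a TransE-style knowledge graph embedding, where a triple (head, relation, tail) is deemed plausible when vector(head) + vector(relation) ≈ vector(tail). The 3-dimensional vectors below are hand-picked for illustration; in practice they would be learned from large provenance corpora.

```python
# Hand-crafted toy embeddings for a few hypothetical provenance entities.
entity_vec = {
    "raw.csv":     [0.0, 0.0, 0.0],
    "cleaned.csv": [1.0, 0.0, 0.0],
    "report.pdf":  [2.0, 0.0, 0.0],
}
relation_vec = {
    # TransE intuition: head + relation should land near tail.
    "wasDerivedFrom": [-1.0, 0.0, 0.0],
}

def transe_score(head, relation, tail):
    """Negative L2 distance between head+relation and tail.
    Higher (closer to 0) means the triple is more plausible."""
    d = sum((h + r - t) ** 2
            for h, r, t in zip(entity_vec[head],
                               relation_vec[relation],
                               entity_vec[tail]))
    return -(d ** 0.5)

# A triple consistent with the embeddings scores higher than a
# corrupted one, which is the basis for predicting missing provenance:
true_score = transe_score("cleaned.csv", "wasDerivedFrom", "raw.csv")
corrupt_score = transe_score("report.pdf", "wasDerivedFrom", "raw.csv")
```

Once provenance is embedded in this vector form, sequence models such as Transformers can, in principle, be trained over embedded traces to predict or reconstruct missing provenance statements, which is the direction this project explores.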
Application details:
To apply: To be considered for the position, candidates must apply via King’s Apply online application system (https://apply.kcl.ac.uk/). Further details are available at https://www.kcl.ac.uk/informatics/postgraduate/research-degrees
Please indicate Professor Luc Moreau and Dr Albert Meroño-Peñuela as the supervisors and quote Project: “Learning and deploying predictive models of data provenance” in your application and all correspondence.
The selection process will involve a pre-selection based on the submitted documents; shortlisted candidates will then be invited to an interview. Candidates who are successful at the interview will receive an offer in due course.