
Learning and deploying predictive models of data provenance

   Department of Informatics

This project is no longer listed and may not be available.

Prof L Moreau, Dr Albert Meroño-Peñuela
Applications accepted all year round
Funded PhD Project (UK Students Only)

About the Project

Our modern lives are increasingly governed by ubiquitous intelligent systems and an abundance of digital data. More and more products and services provide us with better tools and recommendations for our professional, personal, and entertainment activities. A typical example is how we rely on search engine result pages and “knowledge panels” drawn from knowledge graphs to satisfy our information needs. Given the clear impact that such systems have on our decision-making processes, we ask ourselves more and more frequently: how much can we trust a system’s recommendation? Where does the data on which the system based its decision come from? How did these data travel, and under which transformations, from their source to the user?

To answer these questions, various systems and models for data provenance representation, capture, and analysis have been proposed [1,2,3]. In general, these models provide structured representations and ontologies to express fine-grained semantic relationships between data entities, activities, and agents involved in data workflows. Data provenance tracking systems can then use these models to record large-scale data provenance traces.
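As a rough illustration of what such a structured representation looks like, the sketch below stores a tiny provenance trace as subject-relation-object statements and walks it to answer "where did this data come from?". The relation names follow the W3C PROV vocabulary [2]; the `ex:` identifiers, the `Statement` class, and the `sources_of` helper are hypothetical names chosen for this example only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    """One provenance statement: subject --relation--> obj."""
    subject: str
    relation: str
    obj: str


# A hypothetical trace: raw data is cleaned by an activity run by an agent,
# and a report is later derived from the cleaned data.
trace = [
    Statement("ex:cleaning_run", "prov:used", "ex:raw_data"),
    Statement("ex:clean_data", "prov:wasGeneratedBy", "ex:cleaning_run"),
    Statement("ex:cleaning_run", "prov:wasAssociatedWith", "ex:alice"),
    Statement("ex:clean_data", "prov:wasDerivedFrom", "ex:raw_data"),
    Statement("ex:report", "prov:wasDerivedFrom", "ex:clean_data"),
]


def sources_of(entity, statements):
    """Return every entity that `entity` was directly or transitively derived from."""
    found = set()
    frontier = [entity]
    while frontier:
        current = frontier.pop()
        for s in statements:
            if (
                s.relation == "prov:wasDerivedFrom"
                and s.subject == current
                and s.obj not in found
            ):
                found.add(s.obj)
                frontier.append(s.obj)
    return found


print(sorted(sources_of("ex:report", trace)))
```

Real provenance stores hold far more statements and relation types than this; the point is only that a trace is a graph of entities, activities, and agents that can be queried for lineage.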

However, recording provenance traces at such a fine-grained level often comes with two severe limitations: (i) real-time provenance tracking typically has a high performance cost because of the volume and verbosity of the traces; and (ii) provenance information that was not recorded at execution time is inherently lost. How can we design provenance models that address these limitations, and that are able to learn and use statistical patterns from past provenance traces to recreate possible future workflows?

This PhD project combines provenance representation with machine-learning architectures that have recently been deployed successfully in fields such as natural language processing, where they are used to construct large language models. The project will adapt these architectures to learn models that represent and predict provenance sub-symbolically. In particular, it proposes methods to learn provenance from large amounts of past structured provenance data [3] by representing provenance information through knowledge graph embeddings [4] that are suitable for GPU processing, and by predicting and reconstructing provenance with Transformer-based architectures [5]. Within the project, these methods can be further applied to better analyse and understand provenance workflows, to predict and complete partial provenance statements in real-time systems, and to reconstruct missing provenance information.
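To make the idea of completing a partial provenance statement with embeddings more concrete, here is a minimal sketch under strong assumptions. It uses a simple translation-based score (in the style of TransE, where a true statement should satisfy head + relation ≈ tail), which is a simplification swapped in for illustration; it is not the model of [4], the Transformer-based approach of [5], or the project's actual method. The vectors are hand-picked rather than learned, and all identifiers and function names are hypothetical.

```python
import math

# Hand-picked 2-D vectors standing in for learned embeddings (assumption:
# a real system would learn these from past provenance traces).
entity_vec = {
    "ex:raw_data": [0.0, 0.0],
    "ex:clean_data": [1.0, 0.0],
    "ex:report": [2.0, 0.1],
    "ex:alice": [0.0, 3.0],
}
relation_vec = {
    "prov:wasDerivedFrom": [-1.0, 0.0],
}


def distance(head, relation, tail):
    """Euclidean norm of (head + relation - tail); smaller means more plausible."""
    h = entity_vec[head]
    r = relation_vec[relation]
    t = entity_vec[tail]
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def rank_tails(head, relation):
    """Rank all other entities as candidate objects for the partial statement."""
    candidates = [e for e in entity_vec if e != head]
    return sorted(candidates, key=lambda tail: distance(head, relation, tail))


# Partial statement: ex:report prov:wasDerivedFrom ?
print(rank_tails("ex:report", "prov:wasDerivedFrom"))
```

With these toy vectors the closest candidate for the missing object is `ex:clean_data`. Because scoring reduces to vector arithmetic over many candidates at once, this kind of representation lends itself to the batched GPU processing mentioned above.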

Application details:

To apply: To be considered for the position, candidates must apply via the King’s Apply online application system. Further details are available at

Please indicate Professor Luc Moreau and Dr Albert Meroño-Peñuela as the supervisors and quote Project: “Learning and deploying predictive models of data provenance” in your application and all correspondence.

The selection process will involve a pre-selection based on the application documents. Candidates who pass this stage will be invited to an interview, and those who are successful at interview will receive an offer in due course.


[1] Groth, P., Jiang, S., Miles, S., Munroe, S., Tan, V., Tsasakou, S. and Moreau, L., 2006. An architecture for provenance systems.
[2] Missier, P., Belhajjame, K. and Cheney, J., 2013, March. The W3C PROV family of specifications for modelling provenance metadata. In Proceedings of the 16th International Conference on Extending Database Technology (pp. 773-776).
[3] Kuhn, T., Meroño-Peñuela, A., Malic, A., Poelen, J.H., Hurlbert, A.H., Ortiz, E.C., Furlong, L.I., Queralt-Rosinach, N., Chichester, C., Banda, J.M. and Willighagen, E., 2018, October. Nanopublications: a growing resource of provenance-centric scientific linked data. In 2018 IEEE 14th International Conference on e-Science (e-Science) (pp. 83-92). IEEE.
[4] Lin, Y., Liu, Z., Sun, M., Liu, Y. and Zhu, X., 2015, February. Learning entity and relation embeddings for knowledge graph completion. In Twenty-ninth AAAI conference on artificial intelligence.
[5] Devlin, J., Chang, M.W., Lee, K. and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.