About the Project
Data is just as important as algorithms for using AI effectively and responsibly. Many AI and other data-intensive technologies rely heavily on microtask crowdsourcing to collect, curate, and enrich the data they need. With greater use has come greater awareness that microtask crowdsourcing has various methodological limitations, including a range of sociotechnical biases, which the scientific community is only slowly starting to explore [1]. At the same time, short of discarding it altogether as a means of collecting labelled data for AI, there are no real alternatives at the same scale, particularly when the resources available to invest in data work are limited.
Some crowdsourced datasets have been studied extensively to uncover hidden biases and fix them. However, for the majority of datasets and tasks used in AI system benchmarks, in the scientific community and beyond, extensive studies, workflows, and tools to detect and mitigate such biases are missing. In addition, crowdsourcing as a method to undertake data work is hardly reproducible or replicable, with very few studies attempting to re-create existing datasets and crowdsourcing pipelines for image labelling or named entity recognition. While some progress has been made on topics such as fair payments, empowering crowd workers, and reproducing results, there is much less work on replicability, in particular on methodologies, on building an evidence base of replicated studies, and on platform portability.
This project aims to change this systematically. First, the project will identify a set of datasets and tasks representative of both the work currently outsourced to online crowdsourcing platforms and the state of the art in AI research (e.g. by looking at machine learning challenges on Kaggle or co-located with scientific conferences). Then, it will run experiments on multiple platforms with a diverse range of functionalities (including Mechanical Turk, Prolific, SageMaker GroundTruth, and Zooniverse) on a core set of datasets and tasks, replicating and reproducing them. This will result in a set of workflows for different platforms, insights into biases in the datasets, best practices for AI researchers and practitioners, and recommendations for platform extensions and data documentation guidance.
The studentship is primarily targeted at UK students who qualify for home fees. The ideal candidate will have a strong computer science background, excellent programming skills, and some knowledge of machine learning techniques. Excellent communication skills, both verbal and written, are essential. Ideally, the candidate would also have practical experience of working with online crowdsourcing platforms.
How to apply
Candidates must apply via the King’s Apply online application system. Details are available at How to apply - King's College London (kcl.ac.uk).
Please indicate Professor Elena Simperl as the supervisor and quote the project title “Responsible Crowdsourced Data Labelling for AI” within your application and in all correspondence.
The selection process will involve a pre-selection based on the submitted documents; shortlisted candidates will then be invited to an interview. Candidates who are successful at the interview will receive an offer in due course.
Queries
Please direct all queries regarding this project to Prof. Elena Simperl, elena.simperl@kcl.ac.uk.
(For applications, please read the 'How to apply' section above and submit via King's Apply.)