Data is just as important as algorithms to using AI effectively and responsibly. Many AI and other data-intensive technologies rely extensively on microtask crowdsourcing to collect, curate, and enrich the data they need. With greater use comes greater awareness that microtask crowdsourcing has various methodological limitations, including a range of sociotechnical biases, which the scientific community is only slowly starting to explore. At the same time, short of discarding it altogether as a means of collecting labelled data for AI, there are no real alternatives that operate at the same scale, particularly when the resources available to invest in data work are limited.
Some crowdsourced datasets have been studied extensively to uncover and fix hidden biases. However, for the majority of datasets and tasks used in AI system benchmarks, in the scientific community and beyond, extensive studies or workflows and tools to detect and mitigate such biases are missing. In addition, crowdsourcing as a method of undertaking data work is hardly reproducible or replicable, with very few studies attempting to re-create existing datasets and crowdsourcing pipelines for tasks such as image labelling or named entity recognition. While some progress has been made on topics such as fair payments, empowering crowd workers, and reproducing results, there is much less work on replicability, in terms of methodologies, creating an evidence base of replicated studies, and platform portability.
This project aims to systematically change this. First, the project will identify a set of datasets and tasks representative of the work currently outsourced to online crowdsourcing platforms and of the state of the art in AI research (e.g. by looking at machine learning challenges on Kaggle or those co-located with scientific conferences). Then, it will run experiments replicating and reproducing a core set of datasets and tasks on multiple platforms with a diverse range of functionalities (including Mechanical Turk, Prolific, SageMaker GroundTruth, and Zooniverse). This will result in a set of workflows for different platforms, insights into biases in the datasets, best practices for AI researchers and practitioners, and recommendations for platform extensions and data documentation guidance.
The studentship is primarily targeted at UK students who qualify for home fees. The ideal candidate must have a strong computer science background, excellent programming skills, and some knowledge of machine learning techniques. Excellent communication skills, both verbal and written, are essential. Ideally, the candidate would also have practical experience of working with online crowdsourcing platforms.
How to apply
Candidates must apply via King’s Apply online application system. Details are available at How to apply - King's College London (kcl.ac.uk).
Please indicate Professor Elena Simperl as the supervisor and quote the project title “Responsible Crowdsourced Data Labelling for AI” in your application and in all correspondence.
The selection process will involve a pre-selection based on the submitted documents; shortlisted candidates will be invited to an interview. Successful candidates will receive an offer in due course.
Please direct all queries regarding this project to Prof. Elena Simperl, email@example.com.