Machine learning and Natural Language Processing for authorship identification

   School of Arts, Languages and Cultures

  Dr Andrea Nini  Friday, June 30, 2023  Self-Funded PhD Students Only

About the Project

Authorship identification, sometimes also called authorship attribution or authorship analysis, is the task of determining who wrote a document based on the author’s use of language. This problem is often encountered in situations in which a document is evidence in a forensic case, for example when the authorship of a text like a threatening letter is crucial for investigations. More specifically, this is a task that is often needed by practitioners of forensic linguistics, a relatively new field that concerns the application of linguistics to forensic problems (Coulthard et al. 2017).

Authorship identification has been extensively studied from a computational perspective (e.g. Juola 2008, Stamatatos 2009). However, the standard methods used in computer science can be further improved with insights from linguistics.

This project will experiment with more linguistically plausible applications of machine learning and natural language processing techniques to the problem of authorship identification. The applicant is expected to have a degree in Computer Science, preferably with experience of working with large data sets.

Coulthard, M., Johnson, A. and Wright, D. (2017) An Introduction to Forensic Linguistics, London: Routledge.

Juola, P. (2008) Authorship Attribution, Foundations and Trends in Information Retrieval, 1(3), pp. 233–334.

Stamatatos, E. (2009) A survey of modern authorship attribution methods, Journal of the American Society for Information Science and Technology, 60(3), pp. 538–556.

