According to various studies [1,2,3], upper gastrointestinal (UGI) cancers are frequently missed at endoscopy. In addition, the proportion of such cancers diagnosed at an early stage is relatively low, at around 12%, and has remained unchanged for a number of years. The study hypothesis is that demographic and clinical variables extracted from routinely collected data in written endoscopy and histology reports, at the time of a "cancer-negative" endoscopy, serve as risk factors for interval cancers of the UGI tract.
While the combination of state-of-the-art machine learning approaches and routinely collected health data for the development and validation of risk prediction models is established in a number of diseases, including colorectal cancer, evidence of its application to the prediction of interval UGI cancers from such sources is lacking. This study aims to establish the feasibility of applying advanced text mining to routinely collected endoscopy reports to predict missed UGI tract cancers.
Text mining methods have advanced rapidly in recent years, and deep learning in particular offers much promise. Deep learning algorithms such as Recurrent Neural Networks or Convolutional Neural Networks may discover new features associated with missed cancers in medical reports. We will compare different feature extraction methods (e.g. n-grams, word embeddings, etc.) and different algorithms (classical text mining algorithms, deep learning, etc.) in order to find the most efficient approach. This interdisciplinary project will be carried out in collaboration with Norwich Medical School and the Department of Gastroenterology, Norfolk and Norwich University Hospital, who will provide access to the data needed for the study. We will offer the student training in machine learning and algorithm development, as well as in the aspects of the medical domain necessary to analyse the data successfully.
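
To give a flavour of the simplest feature extraction method mentioned above, the following is a minimal sketch of word n-gram counting in plain Python. It is an illustration only, not the project's pipeline: the function names `tokenize` and `ngram_counts`, the choice of unigrams plus bigrams, and the two report snippets are all invented for this example and do not come from the study data.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngram_counts(text, n_max=2):
    """Count all word n-grams of length 1..n_max in a piece of text."""
    tokens = tokenize(text)
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Invented toy snippets standing in for free-text endoscopy reports.
reports = [
    "Mild gastritis seen in the antrum. No ulcer.",
    "Small ulcer in the antrum, biopsies taken.",
]

# Shared vocabulary across all reports, in a fixed (sorted) order.
per_report = [ngram_counts(r) for r in reports]
vocabulary = sorted(set().union(*per_report))

# One row per report, one column per n-gram: a bag-of-n-grams matrix.
matrix = [[counts[term] for term in vocabulary] for counts in per_report]
```

Each row of `matrix` could then be passed to a classical classifier, while word embeddings or a neural network would replace this hand-built representation with a learned one. Note that this simple tokenizer lets bigrams span sentence boundaries and does not handle negation ("no ulcer" versus "ulcer"), which is one reason richer representations are worth comparing.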