
  The analysis of large transcriptomic data sets using distributed computing


   Dept of Computer Science

This project is no longer listed on FindAPhD.com and may not be available.

Dr H Shanahan
Applications accepted all year round
Competition Funded PhD Project (Students Worldwide)

About the Project

While genomic data give us important insights into the molecular biology of a cell, transcriptomic data (which tell us which sections of the genome are being actively transcribed) give us a first dynamic picture of the cell. It has been hoped that such data sets will yield large-scale gene networks and hence provide the framework for many of the goals of Systems Biology.

A large, publicly available body of transcriptomic data generated with micro-array technology already exists. As the cost of sequencing has collapsed (see figure 1), RNA-seq data, which can potentially provide much more sensitive results, are likely to become commonplace and to be applied in a variety of areas such as biomedical research, clinical applications and agrotechnology.

There are two significant challenges to the successful application of these data. In the first instance, present transcriptomic data sets have not been as successful as hoped in inferring gene interactions. One cause is bias in the data arising from anomalous hybridisations (shown in figure 2). It is likely that similar biases will exist for RNA-seq data. Removing such biased data, and understanding other such biases, could substantially improve the quality of predictions. A second challenge is that RNA-seq data sets are inherently much larger than micro-array data sets, and global studies spanning many different experiments will require hundreds of terabytes (if not petabytes) of storage. Such data sets cannot easily be moved via the Internet. It is therefore likely that such data will be stored in data centres co-located with the facilities that generate them. An exemplar is BGI, a private company based in Shenzhen, China, which is the major provider of next-generation sequencing facilities in the world and which also offers a cloud-computing service for analysing its data.
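To give a rough sense of why data sets of this size are hard to move, the following minimal sketch estimates the transfer time for 100 terabytes over a sustained network link. The link speeds and the function name `transfer_days` are illustrative assumptions, not figures from the project description, and the calculation ignores protocol overhead and contention.

```python
def transfer_days(size_terabytes: float, link_megabits_per_s: float) -> float:
    """Estimate the days needed to move a data set over a sustained link.

    Assumes decimal units (1 TB = 1e12 bytes, 1 Mbit = 1e6 bits) and a link
    that runs at full speed for the whole transfer, which is optimistic.
    """
    size_bits = size_terabytes * 1e12 * 8
    bits_per_second = link_megabits_per_s * 1e6
    seconds = size_bits / bits_per_second
    return seconds / 86_400  # seconds in a day


if __name__ == "__main__":
    # Hypothetical link speeds, chosen only for illustration.
    for link in (100, 1_000, 10_000):
        days = transfer_days(100, link)
        print(f"100 TB over {link} Mbit/s: about {days:.1f} days")
```

Under these assumptions, 100 terabytes takes roughly nine days even on a dedicated 1 Gbit/s link, which is consistent with the argument for analysing the data where they are stored.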

The project is composed of three parts. In the first instance, the student will examine a variety of cloud computing and distributed computing platforms, ranging from purely commercial solutions (EC2, Azure, Google Cloud) to open-source solutions (OpenNebula), and different service paradigms (PaaS to SaaS), to determine the optimal configuration for the analysis of such data.

In the second instance, the student will scale up a pilot analysis carried out on one type of micro-array to a wide variety of micro-arrays (GeneChips, SNPChips and tiling arrays), using the cloud platform previously determined to be optimal for these types of problems. The emphasis will be on providing summary measures that can be used reliably for quality control purposes.

Finally, the student will extend the analysis to RNA-seq data to see whether these data are also susceptible to the kind of sequence biases that occur in micro-arrays.
