
  Towards GC-MS: Adapting SIRIUS and CSI:FingerID for Electron Ionization fragmentation


   International Max Planck Research School


  Prof S Böcker, Prof Georg Pohnert
  No more applications being accepted
  Competition Funded PhD Project (Students Worldwide)

About the Project

Background: Mass spectrometry (MS) is the analytical platform of choice for high-throughput screening of small molecules. MS is typically used in combination with a chromatographic separation technology, namely liquid chromatography (LC-MS) or gas chromatography (GC-MS). Of these, the latter is more prevalent and arguably still the best separation tool in common use for compounds amenable to the technique.

The most common ionization technique in GC-MS is 70 eV electron (impact) ionization (EI), which simultaneously ionizes and fragments the molecule. The resulting spectra are fragment-rich but often show a low-intensity or missing molecular ion peak; as a result, the mass of the compound is often unknown. Another problem particular to GC-MS is that compounds not separated by the GC column are fragmented simultaneously, so that a spectrum may contain signals that belong to different compounds. Recently, technically mature GC-MS instruments with high mass accuracy have become available, making de novo interpretation of EI fragmentation data possible.

The Böcker group develops the computational tools SIRIUS and CSI:FingerID, which are widely used for the identification and structural elucidation of small molecules from LC-MS and tandem MS data; in the last 10 months, data from more than 500,000 compounds have been uploaded to our web service for identification. SIRIUS annotates fragmentation spectra using fragmentation trees, so that every fragment peak is annotated with the molecular formula of the corresponding fragment, and it does not require any database to do so.
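To illustrate what annotating a peak with a molecular formula involves, the following minimal Python sketch enumerates all formulas over C, H, N and O whose monoisotopic mass matches a measured mass within a ppm tolerance. It is a brute-force illustration only, not the algorithm used by SIRIUS; the function names `formula_mass` and `decompose`, the restriction to four elements, and the use of neutral masses (ignoring the electron mass of the ion) are assumptions made for this example.

```python
# Monoisotopic masses in Da (standard values, rounded to 8 decimals).
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.00307401, "O": 15.99491462}


def formula_mass(counts):
    """Monoisotopic mass of a formula given as {element: count}."""
    return sum(MONO[element] * n for element, n in counts.items())


def decompose(mass, ppm=5.0):
    """Return all CHNO formulas whose mass lies within `ppm` of `mass`.

    Hypothetical helper for illustration: it loops over the heavy
    elements and solves for the hydrogen count directly.
    """
    tol = mass * ppm * 1e-6  # absolute tolerance in Da
    hits = []
    for c in range(int((mass + tol) // MONO["C"]) + 1):
        rest_c = mass - c * MONO["C"]
        for n in range(int((rest_c + tol) // MONO["N"]) + 1):
            rest_n = rest_c - n * MONO["N"]
            for o in range(int((rest_n + tol) // MONO["O"]) + 1):
                rest = rest_n - o * MONO["O"]
                h = round(rest / MONO["H"])  # closest hydrogen count
                if h >= 0 and abs(rest - h * MONO["H"]) <= tol:
                    hits.append({"C": c, "H": h, "N": n, "O": o})
    return hits


if __name__ == "__main__":
    # Candidate formulas for a peak at 78.04695 Da (benzene, C6H6, is among them).
    for candidate in decompose(78.04695, ppm=5.0):
        print(candidate, round(formula_mass(candidate), 5))
```

The number of candidates grows quickly with mass and with looser tolerances, which is one reason why high mass accuracy matters for de novo interpretation.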

CSI:FingerID searches for the unknown compound in a structure database such as HMDB or PubChem, and achieved by far the most correct identifications in the CASMI 2016 challenge (Schymanski et al., J Cheminf 2017).

Project Description: With the advent of GC-MS instrumentation with high mass accuracy, it becomes possible to adapt our computational tools SIRIUS and CSI:FingerID to GC-MS data. GC-MS with EI fragmentation differs in many details from LC-MS with tandem MS, so several subproblems have to be addressed:

1. The maximum a posteriori estimator from Böcker and Dührkop (J Cheminf, 2016) has to be adapted to and retrained on EI fragmentation data.
2. EI mass spectra are often missing the molecular ion peak, and the mass and/or molecular formula of the compound has to be reconstructed from the fragments (Hufsky et al., Anal Chim Acta 2012).
3. EI mass spectra contain isotope patterns, which can be used to improve fragmentation tree quality. Unfortunately, EI fragmentation frequently results in the radical losses H and H3, which can interfere with the interpretation of the isotope pattern.
4. EI mass spectra contain significantly more peaks than tandem mass spectra, so we have to speed up computations for the underlying problem.
5. Available reference data for high mass accuracy GC-MS is insufficient to train Machine Learning methods such as CSI:FingerID. To work around this, we want to develop methods that compute fragmentation trees from mass spectra of reference compounds measured with lower mass accuracy (“lifting”).
6. On the other hand, significant expert knowledge is available for EI fragmentation, much more than for tandem MS fragmentation; we will try to integrate this expert knowledge into our computational methods.
7. We will modify CSI:FingerID so that we can use EI fragmentation spectra for searching in a molecular structure database.
8. Finally, we will apply the developed methods to biological data; this will not be carried out at the end of the project, but rather interwoven with methods development.
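
Point 3 in the list above can be made concrete with a little arithmetic. One plausible reading of the interference is that a fragment and the same fragment after loss of an H radical lie about 1.0078 Da apart, while the first isotope peak of the lighter fragment (dominated by 13C for organic compounds) lies about 1.0034 Da above it, so the two signals nearly coincide. The Python sketch below computes this gap and a rough resolving power needed to separate them; the function names are hypothetical, and taking m/Δm as the required resolving power is a simplification.

```python
H_MASS = 1.00782503     # monoisotopic mass of hydrogen in Da
C13_SHIFT = 1.00335484  # mass difference 13C - 12C in Da


def interference_gap():
    """Mass gap (Da) between a fragment F and the 13C isotope peak of F minus H."""
    return H_MASS - C13_SHIFT


def resolving_power_needed(mz):
    """Rough resolving power m/delta_m required to separate both signals at `mz`.

    Simplifying assumption: peaks count as separated once they are one
    gap apart; real peak shapes and intensity ratios are ignored.
    """
    return mz / interference_gap()


if __name__ == "__main__":
    print(f"gap: {interference_gap():.5f} Da")
    for mz in (100.0, 200.0, 400.0):
        print(f"m/z {mz:.0f}: resolving power of about {resolving_power_needed(mz):.0f}")
```

At unit mass resolution the two signals are indistinguishable, so intensities assigned to an isotope pattern may in fact stem from a fragment that differs by one hydrogen.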

The proposed work items are beyond the reach of a single PhD student; we plan to apply for funding for an additional PhD student from an alternative source (DFG), which will also allow us to measure additional reference data for estimating SIRIUS parameters and training CSI:FingerID.

The project will be conducted in close collaboration with experimentalist groups, in particular those of Georg Pohnert and Ales Svatos; we also want to start a collaboration on the subject with Joachim Kopka (Max Planck Institute of Molecular Plant Physiology, Potsdam) and his group, who have shown much interest.

Candidate profile:
• M.Sc. in bioinformatics, cheminformatics, computer science, or mathematics
• Expertise and interest in algorithmics and bioinformatics methods development
• Experience in biochemistry is highly desirable
• Expertise in Machine Learning is desirable
• Experience in software development (Git, Artifactory)
• Programming skills in Java and Python
• Ability to interact with scientists in the group
• Ability to interact with collaboration partners and software users
• Interest in learning novel methods and developing new skills, both on the computational and the applied side
• Scientific thinking and attitude
