
HPC workload generation and benchmarking


   School of Informatics

This project is no longer listed on FindAPhD.com and may not be available.

Supervisors: Dr M Weiland, Dr Andy Turner. No more applications being accepted. Funded PhD Project (European/UK Students Only).

About the Project

Artificial intelligence and data analytics applications have driven rapid changes in the world of computing in recent years, and these changes are now showing an impact on distributed and high-performance computing (HPC). HPC systems have begun to move away from being used exclusively for (often monolithic) scientific codes towards a broader range of applications and application workflows - a phenomenon that has been observed in particular on smaller systems (i.e. regional Tier-2 systems and lower). Machine learning and data analytics techniques are also increasingly being incorporated into "traditional" numerical applications, e.g. to find a better initial guess for the start of a computation. The impact of these changing workloads (which now often include machine learning codes, containerised applications, or codes written using interpreted/JIT languages) on overall system performance and throughput is unclear. In fact, it is also unclear to what degree the workloads are actually changing, and how this change manifests itself (if at all) across the different tiers of HPC.

This project will investigate the historical workload of the UK National HPC service ARCHER, which was operational for over 7 years, the current workload of its successor ARCHER2, and the workload on the Tier-2 system Cirrus. The historical workload data for ARCHER (which has now been decommissioned) is particularly valuable, as it presents a complete view of how the workload on such a large-scale HPC system changed over the lifetime of the system; this workload data is now available for analysis and it is unique in its breadth of scientific applications, level of detail and time span. For the current systems ARCHER2 and Cirrus, data can be extracted and linked to current system states. The project will assess how (and if) the application mix that is present on the systems has changed (and is continuing to change) over time and analyse the impact the workload has on the systems overall. This will be achieved by linking the application mix to system monitoring data in order to understand the impact of the workload on shared resources, the network and I/O subsystem in particular. This study has the opportunity to deliver a number of insights - examples are:

  • What application mix has a negative impact on system throughput?
  • Is the scheduler configured efficiently?
  • Are there particular applications that are "bad citizens" by negatively impacting other applications on the system?
  • Is the I/O subsystem provisioning configured optimally?

The project will also develop a workload generator that can, using job and monitoring data from the real workloads and systems, create a range of artificial workloads that are representative of real-life system load. The aim of the workload generator, and the resulting benchmarks, is for it to be a tool that can be used to test the overall impact of system changes (such as file system or job scheduler changes) on a test bed, ahead of making those changes live on a production system. The workload generator can also be used as a tool for procurement, to assess whether a certain system design is likely to deliver the desired performance given the expected workload. Finally, the workload generator can be used to verify and validate any assumptions derived from the data analysis of the workloads (e.g. "applications X and Y cause network contention if run at the same time").
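The sampling idea behind such a workload generator can be sketched in a few lines. The snippet below is a minimal, hypothetical illustration (the application names, job fields and `generate_workload` function are assumptions for illustration, not the project's actual design): it draws synthetic jobs so that the application mix matches the empirical frequencies in a set of historical job records, which is one simple way to make an artificial workload "representative" of real-life system load.

```python
import random
from collections import Counter

# Hypothetical historical job records: (application, node_count, runtime_hours).
# In practice these would come from a scheduler accounting database.
history = [
    ("cp2k", 128, 6.0), ("vasp", 64, 12.0), ("vasp", 64, 8.0),
    ("lammps", 256, 4.0), ("pytorch", 8, 2.0), ("cp2k", 128, 6.5),
]

def generate_workload(history, n_jobs, seed=None):
    """Sample a synthetic workload whose application mix matches the
    empirical frequencies observed in the historical job records."""
    rng = random.Random(seed)
    # Frequency of each application in the history.
    weights = Counter(app for app, _, _ in history)
    # Observed (nodes, runtime) configurations per application.
    by_app = {}
    for app, nodes, hours in history:
        by_app.setdefault(app, []).append((nodes, hours))
    workload = []
    for _ in range(n_jobs):
        app = rng.choices(list(weights), weights=list(weights.values()))[0]
        nodes, hours = rng.choice(by_app[app])
        workload.append((app, nodes, hours))
    return workload

jobs = generate_workload(history, n_jobs=100, seed=42)
```

A real generator would of course model far more (inter-arrival times, I/O and communication patterns, scheduler interaction), but even this frequency-matching core is enough to replay a plausible application mix on a test bed.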

The student will have access to a wide range of systems at EPCC to support the data processing and analysis, and software development and benchmarking efforts in this project: from large shared-memory to machine learning specific systems that are part of the Edinburgh International Data Facility to a variety of HPC systems, from small clusters to national facilities.

Overview of research area:

This project spans a number of research areas:

  • HPC system benchmarking: understanding the impact of a collection of applications on the HPC system performance, in particular on shared resources such as the network and the parallel file system;
  • Data acquisition and data analysis: gathering, linking and analysing workload data from multiple and diverse sources (e.g. resource utilisation and system monitoring tools), to derive insight into how workload and system performance/throughput relate;
  • HPC workload generation: creating a benchmark generator that can produce workloads representative of a diverse set of usage scenarios and systems.
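The data acquisition and linking step above amounts to an interval join: scheduler accounting says which job occupied which node over which time span, and the monitoring system emits per-node, time-stamped samples. The sketch below shows one minimal way to attach samples to jobs; all field names (`job_id`, `node`, `net_mb_s`, etc.) and the `link_samples_to_jobs` function are hypothetical, chosen only to illustrate the join.

```python
from datetime import datetime

# Hypothetical records: accounting data gives per-job node/time spans,
# monitoring gives per-node time-stamped samples (e.g. network bandwidth).
jobs = [
    {"job_id": 1, "app": "cp2k", "node": "nid001",
     "start": datetime(2022, 1, 1, 9, 0), "end": datetime(2022, 1, 1, 11, 0)},
]
samples = [
    {"node": "nid001", "time": datetime(2022, 1, 1, 10, 0), "net_mb_s": 850.0},
    {"node": "nid001", "time": datetime(2022, 1, 1, 12, 0), "net_mb_s": 20.0},
]

def link_samples_to_jobs(jobs, samples):
    """Attach each monitoring sample to the job that occupied the same
    node at the sample's timestamp (an interval join on node + time)."""
    linked = {job["job_id"]: [] for job in jobs}
    for s in samples:
        for job in jobs:
            if job["node"] == s["node"] and job["start"] <= s["time"] <= job["end"]:
                linked[job["job_id"]].append(s)
    return linked
```

At the scale of a national service this naive nested loop would be replaced by sorted/indexed joins, but the linkage itself - node identity plus time overlap - is the key that lets application mix be correlated with pressure on shared resources.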

Potential research question(s):

  • Are the workloads on HPC systems changing? If so, what is the impact on performance and throughput?
  • Are workloads changing in the same way on different tiers of HPC systems, or is the change more pronounced on the lower tiers?
  • Assuming the workload is changing:
      • How is this change manifesting itself?
      • Do system designs/setups have to adapt to deal with the change in workload?
  • How can we benchmark system performance for a workload, as opposed to a single application?
  • How can we derive insight from system monitoring data that comes from a wide range of sources (e.g. resource manager, profiling tools, system health check tools, etc.)?

Student Requirements:

Note that these are the minimum requirements to be considered for admission.

A UK 2:1 honours degree, or its international equivalent, in a relevant subject such as computer science and informatics, physics, mathematics, engineering, biology, chemistry or geosciences.

You must be a competent programmer in at least one of C, C++, Python, Fortran, or Java and should be familiar with mathematical concepts such as algebra, linear algebra and probability and statistics.

English language requirements as set by the University of Edinburgh.

Recommended/Desirable Skills and Experience

  • Experience of HPC system benchmarking and/or monitoring
  • Understanding of HPC system architectures

Funding Notes

EPCC currently holds the following funding opportunities across its PhD projects; this project is one of many eligible (i.e. funding is competitive):

For entry during academic year 2022-23:
1 EPSRC studentship with standard EPSRC eligibility: https://www.findaphd.com/funding/guides/epsrc-funding.aspx
We also welcome applications for these projects from students with their own source(s) of funding.

References

[1] A. Turner, D. Sloan-Murphy, K. Sivalingam, H. Richardson and J. M. Kunkel, "Analysis of parallel I/O use on the UK national supercomputing service, ARCHER, using Cray LASSi and EPCC SAFE," arXiv:1906.03891, 2019.
[2] S. Snyder, P. Carns, R. Latham, M. Mubarak, R. Ross, C. Carothers, B. Behzad, H. V. T. Luu, S. Byna and Prabhat, "Techniques for modeling large-scale HPC I/O workloads," in Proceedings of the 6th International Workshop on Performance Modeling, Benchmarking, and Simulation of High Performance Computing Systems (PMBS '15), ACM, 2015, Article 5, pp. 1-11. https://doi.org/10.1145/2832087.2832091
[3] R. McLay and M. R. Fahey, "Understanding the Software Needs of High End Computer Users with XALT," Texas Advanced Computing Center, 2015. http://dx.doi.org/10.15781/T2PP4P
[4] ECMWF Kronos HPC workload simulator, https://github.com/ecmwf/kronos
[5] V. Ahlgren et al., "Large-Scale System Monitoring Experiences and Recommendations," 2018 IEEE International Conference on Cluster Computing (CLUSTER), 2018, pp. 532-542. doi: 10.1109/CLUSTER.2018.00069
[6] A. Netti, D. Tafani, M. Ott and M. Schulz, "Correlation-wise Smoothing: Lightweight Knowledge Extraction for HPC Monitoring Data," 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2021, pp. 2-12. doi: 10.1109/IPDPS49936.2021.00010
[7] E. A. Huerta, A. Khan, E. Davis et al., "Convergence of artificial intelligence and high performance computing on NSF-supported cyberinfrastructure," Journal of Big Data 7, 88 (2020). https://doi.org/10.1186/s40537-020-00361-2
