Machbarkeitsstudie für die Erkennung von verschiedenen Job Abschnitten mittels eines System Monitoring Daemons

  • Feasibility study for detecting different job stages using a System Monitoring Daemon

Roigk, Julia; Müller, Matthias S. (Thesis advisor); Schulz, Martin (Thesis advisor); Schürhoff, Daniel (Consultant)

Aachen : RWTH Aachen University (2022)
Master Thesis

Master's thesis, RWTH Aachen University, 2022


High-performance computing (HPC) clusters are powerful but expensive, which makes efficient resource usage vital. On the CLAIX systems at RWTH Aachen University, this is ensured through non-invasive background monitoring using a system monitoring daemon. While this data is currently analyzed by statistical means, two aspects of the hardware performance monitoring have not yet been explored: the information inherent in the raw time series data, and the impact of the monitoring resolution. Vital information on the make-up and resource usage of jobs on the cluster may thus be left out. It is unclear to what extent the raw time series data can supplement the existing analysis process, and equally unclear whether the monitoring resolution is sufficient to detect interesting patterns. This thesis therefore investigates the feasibility of utilizing this data on the basis of time intervals exhibiting distinct usage profiles, called "job stages". We start by examining the cost of different monitoring frequencies. For this, we measure the run time and evaluate performance reports for a set of established parallel benchmarks. In a second step, we explore how job stages are expressed in the performance monitoring data at different monitoring frequencies. We implement an application exhibiting known and distinct resource usage patterns and compare the output of the hardware performance monitoring between the different monitoring frequencies. This knowledge is then combined into a rule set to detect stages and periodically repeating patterns in the monitoring data. We implement a filter based on this rule set and apply it to the performance monitoring data collected between January and October of 2021 to find jobs exhibiting stages. Additionally, an adaptive approach to the monitoring resolution is described. This could enable targeted, closer monitoring without incurring the significant slowdowns that would result from monitoring the entire cluster more frequently.
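The abstract does not state the rule set itself, so the following is only an illustrative sketch of what threshold-based stage detection on a monitoring time series could look like. It is not the thesis's method. The function name `detect_stages`, the parameters `threshold` and `min_length`, and the sample values are all hypothetical; the series is assumed to be a normalized metric (for example, a utilization between 0 and 1) sampled at a fixed interval.

```python
from statistics import mean


def detect_stages(samples, threshold=0.2, min_length=3):
    """Split a metric series into stages (hypothetical rule, for illustration).

    A new stage starts when a sample deviates from the mean of the current
    stage by more than `threshold`. Runs shorter than `min_length` samples
    are merged into the preceding stage.
    Returns a list of (start, end_exclusive, stage_mean) tuples.
    """
    # Pass 1: cut the series wherever a sample leaves the current level.
    runs = []
    start = 0
    for i in range(1, len(samples) + 1):
        at_end = i == len(samples)
        if at_end or abs(samples[i] - mean(samples[start:i])) > threshold:
            runs.append((start, i))
            start = i

    # Pass 2: merge runs that are too short into the run before them.
    merged = []
    for s, e in runs:
        if merged and (e - s) < min_length:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))

    return [(s, e, mean(samples[s:e])) for s, e in merged]


# Made-up utilization samples: low, high, low.
series = [0.1, 0.1, 0.12, 0.1, 0.9, 0.92, 0.9, 0.88, 0.1, 0.1, 0.11, 0.1]
boundaries = [(s, e) for s, e, _ in detect_stages(series)]
print(boundaries)  # [(0, 4), (4, 8), (8, 12)]
```

With a coarser sampling interval, each stage would contribute fewer samples and could fall below `min_length`, which is one way the monitoring resolution can affect whether a stage is detectable at all.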