Power Monitoring

Learn how to measure, monitor, and optimize energy consumption in High-Performance Computing systems. This module covers the fundamental concepts of power monitoring, explores the technical implementations of modern monitoring systems (RAPL, NVML, HDEEM, and more), and provides practical techniques for power-aware resource management.

Prerequisites

  • Basic understanding of HPC architecture and supercomputing concepts

  • Familiarity with Linux command-line tools and system administration

  • Knowledge of basic Python for data analysis

  • Understanding of CPU and GPU fundamentals

Description

This module provides a comprehensive introduction to power monitoring in HPC systems. It is organized into two main episodes:

Episode 0: Power Monitoring Introduction covers the foundational concepts needed to understand power measurement in HPC. Topics include the mathematical relationship between energy and power, the hierarchical structure of power monitoring infrastructure (component level, node level, facility level), a survey of monitoring system types, and the trade-offs between in-band monitoring (real-time but partial coverage) and out-of-band monitoring (complete coverage but higher latency). The episode concludes with techniques for establishing accurate power baselines, which are essential for attributing energy to hardware components that are not measured directly.
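The relationship between energy and power can be illustrated with a short calculation. The following is a minimal sketch, assuming power has been sampled at known timestamps; the function names and the sample values are hypothetical and chosen only for illustration. It integrates power over time with the trapezoidal rule and converts joules to watt-hours.

```python
def energy_joules(timestamps_s, power_w):
    """Integrate sampled power (W) over time (s) using the trapezoidal rule."""
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have equal length")
    total = 0.0
    for i in range(1, len(power_w)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        # Average of two neighbouring samples times the interval length
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def joules_to_wh(joules):
    """Convert joules to watt-hours (1 Wh = 3600 J)."""
    return joules / 3600.0


# Hypothetical samples: a constant 200 W load observed for 10 s
timestamps = [0.0, 5.0, 10.0]
power = [200.0, 200.0, 200.0]
energy = energy_joules(timestamps, power)
print(energy, "J =", joules_to_wh(energy), "Wh")
```

For constant power the result reduces to E = P × Δt; the trapezoidal rule only matters when the power varies between samples.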

Episode 1: Power Monitoring Systems provides technical deep-dives into specific monitoring implementations deployed on modern HPC systems. It covers eight monitoring technologies: HDEEM (out-of-band), Intel RAPL and AMD RAPL (in-band, x86), Fujitsu A64FX (custom Arm processor), NVIDIA GPU NVML, AMD GPU SMI, NVIDIA Grace CPU (sysfs), and the heterogeneous NVIDIA Grace Hopper architecture. The episode includes real-world case studies from operational supercomputers (LUMI, Karolina) with practical power baseline models and energy accounting techniques.
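As a taste of in-band monitoring, the sketch below takes a single snapshot of the RAPL energy counters exposed by the Linux powercap framework. It assumes a system where the `intel-rapl:*` zones exist under `/sys/class/powercap`; whether they are present, which domains they cover, and whether `energy_uj` is readable without elevated privileges all depend on the hardware and kernel. The function name `read_rapl_zones` is hypothetical.

```python
from pathlib import Path


def read_rapl_zones(root="/sys/class/powercap"):
    """Return {zone directory: (domain name, energy in microjoules)}.

    Zones that are missing or unreadable (for example due to
    insufficient permissions) are skipped silently.
    """
    zones = {}
    for zone in sorted(Path(root).glob("intel-rapl:*")):
        try:
            name = (zone / "name").read_text().strip()
            energy_uj = int((zone / "energy_uj").read_text())
        except (OSError, ValueError):
            continue
        zones[zone.name] = (name, energy_uj)
    return zones


# Usage (output depends entirely on the machine):
# for zone, (name, energy_uj) in read_rapl_zones().items():
#     print(zone, name, energy_uj)
```

A single snapshot is only a cumulative counter value; power is obtained by reading the counter twice and dividing the energy difference by the elapsed time.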

Prerequisites

  • Linux command-line proficiency

  • Basic Python for data analysis (NumPy, SciPy, Matplotlib optional)

  • Access to HPC system or local machine with power monitoring capabilities (or use pre-recorded data)

Course Topics

  • Energy and power fundamentals: $$E = P \times \Delta t$$, joules (J), watt-hours (Wh), and practical calculations

  • Power monitoring hierarchy: components, nodes, racks, facility, and building-level

  • In-band vs Out-of-band monitoring trade-offs: accuracy, latency, overhead, accessibility

  • Intel RAPL: MSR registers, domains, wraparound handling, accuracy characteristics

  • AMD RAPL: compatibility, read-only scope, per-core granularity

  • HDEEM and out-of-band systems: high-resolution measurement, deployment challenges

  • GPU power monitoring: NVIDIA NVML API, AMD SMI, domain definitions

  • Heterogeneous systems: NVIDIA Grace Hopper CPU+GPU power attribution

  • Power baselines: calibration methods, system-specific factors, unmeasured component estimation

  • Real-world case studies: LUMI and Karolina supercomputers with baseline models

  • Energy accounting and workload characterization for HPC jobs
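To make the power baseline topic concrete, here is a minimal sketch of one possible calibration approach. It assumes a simple linear model, P_node ≈ a · P_inband + b, fitted by ordinary least squares to paired in-band and node-level readings. The model form, the function name `fit_baseline`, and the sample numbers are illustrative assumptions, not the models used in the case studies.

```python
def fit_baseline(p_inband_w, p_node_w):
    """Least-squares fit of p_node = slope * p_inband + intercept.

    The intercept approximates the power of components that the
    in-band counters do not see (assumed constant in this sketch).
    """
    n = len(p_inband_w)
    if n < 2 or n != len(p_node_w):
        raise ValueError("need at least two paired samples")
    mean_x = sum(p_inband_w) / n
    mean_y = sum(p_node_w) / n
    sxx = sum((x - mean_x) ** 2 for x in p_inband_w)
    if sxx == 0:
        raise ValueError("in-band samples must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(p_inband_w, p_node_w))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical calibration points (watts): in-band sum vs. node-level reading
inband = [100.0, 200.0, 300.0]
node = [190.0, 300.0, 410.0]
slope, intercept = fit_baseline(inband, node)
print(f"slope={slope:.3f}, baseline={intercept:.1f} W")
```

With a fitted model, node-level power for a job can be estimated from in-band readings alone, which is useful when out-of-band data is not accessible to users.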

Target Audience

Level: Intermediate to Advanced

Prerequisites: Comfortable with Linux systems, Python basics, and HPC fundamentals. Experience with HPC job submission and system administration is beneficial but not required.

Language: English

Technical Requirements

  • Python and its dependencies

Instructors

Ondrej Vysocky is a senior researcher at the Infrastructure Research Laboratory within the IT4Innovations National Supercomputing Center. His work focuses on reducing the energy consumption of supercomputers to lower operating costs, achieving annual savings in the millions of crowns and a significant reduction in the carbon footprint of computations.

Learning outcomes

This module prepares HPC practitioners, system administrators, researchers, and performance engineers to measure and optimize power consumption in supercomputing systems.

By the end of this module, learners should be able to:

  • Understand power/energy fundamentals: Distinguish power (W) from energy (J), perform accurate energy calculations (E = P × Δt), and understand why power budgets matter in HPC

  • Navigate the monitoring hierarchy: Explain component-level, node-level, rack-level, and facility-level power accounting, and describe appropriate use cases for each level

  • Compare monitoring approaches: Evaluate trade-offs between in-band (RAPL, NVML) and out-of-band (HDEEM, PDU) monitoring in terms of accuracy, latency, overhead, and deployment complexity

  • Interpret RAPL measurements: Read Intel/AMD RAPL data via sysfs/MSR, understand domain definitions, handle 32-bit counter wraparound, and account for measurement accuracy (±5-10%)

  • Access GPU power data: Use NVIDIA NVML and AMD SMI APIs to measure GPU power consumption and understand per-domain power distribution

  • Attribute power on heterogeneous systems: Calculate CPU, GPU, and interconnect power on systems like NVIDIA Grace Hopper using dual-interface approaches (HWMON + NVML)

  • Establish power baselines: Calibrate system-specific power models to quantify unmeasured hardware components and enable accurate energy attribution

  • Apply knowledge in practice: Deploy power monitoring in real HPC environments, validate against facility measurements, and use power data for workload characterization and optimization decisions
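The 32-bit counter wraparound mentioned in the RAPL outcome above can be handled with modular arithmetic. The sketch below assumes raw counter values read from the energy status register, an energy unit of 2^-ESU joules where ESU comes from the RAPL power unit register, and a sampling interval short enough that the counter wraps at most once between reads. The function names and the ESU value of 14 are illustrative assumptions; the actual unit must be read from the hardware.

```python
COUNTER_BITS = 32  # width of the raw RAPL energy counter


def counter_delta(prev_raw, curr_raw, bits=COUNTER_BITS):
    """Difference between two raw readings, tolerating a single wraparound."""
    return (curr_raw - prev_raw) % (1 << bits)


def accumulate_energy_j(raw_samples, esu):
    """Total energy in joules over a sequence of raw counter readings.

    esu: energy status unit exponent; one counter tick is 2**-esu joules.
    """
    ticks = 0
    for prev, curr in zip(raw_samples, raw_samples[1:]):
        ticks += counter_delta(prev, curr)
    return ticks / (1 << esu)


# Hypothetical readings: the counter wraps between the second and third sample
samples = [0xFFFF8000, 0xFFFFC000, 0x00000000]
print(accumulate_energy_j(samples, esu=14), "J")
```

If the sampling interval is too long, multiple wraps go undetected and the energy is underestimated, so the interval should be chosen well below the wrap time at peak power.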

See also

Credit

Don’t forget to check out additional course materials from the IT4Innovations National Supercomputing Center. Please contact us if you want to reuse these course materials in your teaching. You can also join us on LinkedIn to share your experience and get help from the community.

License

Note

To module authors: For code you may use any OSI-approved license as mentioned in https://spdx.org/licenses/, such as Apache License 2.0, GNU GPLv3, MIT. Please make sure to update the deed above and LICENSE.code file accordingly.