Power Monitoring

Learn how to measure, monitor, and optimize energy consumption in high-performance computing (HPC) systems. This module covers the fundamental concepts of power monitoring, explores the technical implementations of modern monitoring systems (RAPL, NVML, HDEEM, and more), and provides practical techniques for power-aware resource management.

Prerequisites

Basic understanding of HPC architecture and supercomputing concepts
Familiarity with Linux command-line tools and system administration
Knowledge of basic Python for data analysis
Understanding of CPU and GPU fundamentals

Episodes

Assessment

Quiz: HPC Power Monitoring (Episodes 0 & 1)

Reference

Description

This module provides a comprehensive introduction to power monitoring in HPC systems. It is organized into two main episodes:

Episode 0: Power Monitoring Introduction covers the foundational concepts needed to understand power measurement in HPC. Topics include the mathematical relationship between energy and power, the hierarchical structure of power monitoring infrastructure (component level, node level, facility level), a survey of monitoring system types, and the trade-offs between in-band (real-time but partial) and out-of-band (complete but latent) monitoring. The episode concludes with techniques for establishing accurate power baselines, which are essential for attributing energy to hardware components that are not directly measured.
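The relationship between energy and power can be illustrated with a short sketch. It assumes power samples taken at known timestamps and integrates them with the trapezoidal rule. The function names and sample values are invented for illustration and are not taken from the course material.

```python
def energy_joules(timestamps_s, power_w):
    """Integrate sampled power (W) over time (s) with the trapezoidal rule."""
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have equal length")
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        # Average of two neighbouring samples times the interval between them
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def joules_to_wh(joules):
    """Convert joules to watt-hours (1 Wh = 3600 J)."""
    return joules / 3600.0


# Illustrative samples only: 200 W, a ramp up, then 400 W
timestamps = [0.0, 10.0, 20.0, 30.0]
power = [200.0, 200.0, 400.0, 400.0]

energy = energy_joules(timestamps, power)
print(f"{energy} J = {joules_to_wh(energy)} Wh")
```

For constant power the sum reduces to E = P × Δt. The trapezoidal rule is only one possible choice, and its error grows when the sampling interval is long relative to how quickly the load changes.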

Episode 1: Power Monitoring Systems provides technical deep-dives into specific monitoring implementations deployed on modern HPC systems. It covers eight distinct monitoring technologies: HDEEM (out-of-band), Intel RAPL and AMD RAPL (in-band x86), Fujitsu A64FX (custom ARM processor), NVIDIA GPU NVML, AMD GPU SMI, NVIDIA Grace CPU (sysfs), and the heterogeneous NVIDIA Grace Hopper architecture. The episode includes case studies from operational supercomputers (LUMI, Karolina) with practical power baseline models and energy accounting techniques.
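As a rough illustration of what a power baseline model can look like, the sketch below fits a straight line between the power reported by in-band counters and the node power measured out-of-band. The linear form, the function name, and the sample data are assumptions made for this example. They are not the models used in the LUMI or Karolina case studies.

```python
def fit_baseline(measured_w, node_w):
    """Least-squares fit of node_w ~ slope * measured_w + offset.

    measured_w: power seen by in-band counters (e.g. CPU + GPU), in watts
    node_w:     whole-node power from an out-of-band meter, in watts
    """
    n = len(measured_w)
    if n != len(node_w) or n < 2:
        raise ValueError("need at least two paired samples")
    mean_x = sum(measured_w) / n
    mean_y = sum(node_w) / n
    sxx = sum((x - mean_x) ** 2 for x in measured_w)
    if sxx == 0:
        raise ValueError("measured power must vary across samples")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(measured_w, node_w))
    slope = sxy / sxx
    offset = mean_y - slope * mean_x
    return slope, offset


# Hypothetical calibration data, not real measurements
measured = [100.0, 200.0, 300.0, 400.0]
node = [190.0, 300.0, 410.0, 520.0]

slope, offset = fit_baseline(measured, node)
print(f"node_power ~ {slope:.2f} * measured + {offset:.1f} W")
```

In this reading, the offset approximates the roughly constant draw of components without their own counters, and the slope captures losses that scale with load. A real system may need a richer model.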

Prerequisites

Linux command-line proficiency
Basic Python for data analysis (NumPy, SciPy, Matplotlib optional)
Access to an HPC system or a local machine with power monitoring capabilities (or use of pre-recorded data)

Course Topics

Energy and power fundamentals: $E = P \times \Delta t$, joules, watt-hours, and practical calculations
Power monitoring hierarchy: components, nodes, racks, facility, and building-level
In-band vs Out-of-band monitoring trade-offs: accuracy, latency, overhead, accessibility
Intel RAPL: MSR registers, domains, wraparound handling, accuracy characteristics
AMD RAPL: compatibility, read-only scope, per-core granularity
HDEEM and out-of-band systems: high-resolution measurement, deployment challenges
GPU power monitoring: NVIDIA NVML API, AMD SMI, domain definitions
Heterogeneous systems: NVIDIA Grace Hopper CPU+GPU power attribution
Power baselines: calibration methods, system-specific factors, unmeasured component estimation
Real-world case studies: LUMI and Karolina supercomputers with baseline models
Energy accounting and workload characterization for HPC jobs
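The RAPL wraparound handling listed above can be sketched as follows. The example assumes the Linux powercap sysfs interface, where a zone directory such as `/sys/class/powercap/intel-rapl:0` exposes `energy_uj` and `max_energy_range_uj`. The zone path, the function names, and the sampling interval are illustrative choices, and reading `energy_uj` may require elevated privileges on some systems.

```python
import time
from pathlib import Path

# Assumed zone path: package 0 on many Linux systems, but not guaranteed
RAPL_ZONE = Path("/sys/class/powercap/intel-rapl:0")


def read_uj(path):
    """Read an integer microjoule value from a powercap sysfs file."""
    return int(Path(path).read_text().strip())


def energy_delta_uj(previous_uj, current_uj, max_range_uj):
    """Energy between two counter reads, tolerating at most one wraparound."""
    if current_uj >= previous_uj:
        return current_uj - previous_uj
    # The counter wrapped: add the part before the wrap and the part after it
    return (max_range_uj - previous_uj) + current_uj


def sample_package_power_w(zone=RAPL_ZONE, interval_s=1.0):
    """Average package power over one interval, in watts."""
    max_range = read_uj(zone / "max_energy_range_uj")
    e0, t0 = read_uj(zone / "energy_uj"), time.monotonic()
    time.sleep(interval_s)
    e1, t1 = read_uj(zone / "energy_uj"), time.monotonic()
    return energy_delta_uj(e0, e1, max_range) / 1e6 / (t1 - t0)
```

Calling `sample_package_power_w()` on a machine that exposes the zone returns the average power over the interval. The correction is only valid if the counter wraps at most once between reads, so the sampling interval has to stay shorter than the wrap period at the highest expected power.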

Target Audience

Level: Intermediate to Advanced

Prerequisites: Comfortable with Linux systems, Python basics, and HPC fundamentals. Experience with HPC job submission and system administration is beneficial but not required.

Language: English

Technical Requirements

Python and the data-analysis libraries named in the prerequisites (NumPy, SciPy, Matplotlib optional)

Instructors

Ondrej Vysocky is a senior researcher at the Infrastructure Research Laboratory of the IT4Innovations National Supercomputing Center. His work focuses on reducing the energy consumption of supercomputers, which lowers operating costs by millions of Czech crowns annually and significantly reduces the carbon footprint of computations.

Learning outcomes

This module prepares HPC practitioners, system administrators, researchers, and performance engineers to measure and optimize power consumption in supercomputing systems.

By the end of this module, learners should be able to:

Understand power/energy fundamentals: Distinguish power (W) from energy (J), perform accurate energy calculations (E = P × Δt), and understand why power budgets matter in HPC
Navigate the monitoring hierarchy: Explain component-level, node-level, rack-level, and facility-level power accounting, and describe appropriate use cases for each level
Compare monitoring approaches: Evaluate trade-offs between in-band (RAPL, NVML) and out-of-band (HDEEM, PDU) monitoring in terms of accuracy, latency, overhead, and deployment complexity
Interpret RAPL measurements: Read Intel/AMD RAPL data via sysfs/MSR, understand domain definitions, handle 32-bit counter wraparound, and account for measurement accuracy (±5–10%)
Access GPU power data: Use NVIDIA NVML and AMD SMI APIs to measure GPU power consumption and understand per-domain power distribution
Attribute power on heterogeneous systems: Calculate CPU, GPU, and interconnect power on systems like NVIDIA Grace Hopper using dual-interface approaches (HWMON + NVML)
Establish power baselines: Calibrate system-specific power models to quantify unmeasured hardware components and enable accurate energy attribution
Apply knowledge in practice: Deploy power monitoring in real HPC environments, validate against facility measurements, and use power data for workload characterization and optimization decisions
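For the GPU outcome above, a minimal sketch of sampling power through NVML might look like the following. It assumes the third-party `pynvml` bindings (distributed as `nvidia-ml-py`) and an NVIDIA GPU with a working driver. The helper names, sample count, and interval are illustrative. NVML reports power in milliwatts, so the values are converted to watts before use.

```python
import time


def average_power_w(samples_mw):
    """Mean of power samples given in milliwatts, returned in watts."""
    if not samples_mw:
        raise ValueError("no samples provided")
    return sum(samples_mw) / len(samples_mw) / 1000.0


def sample_gpu_power_mw(device_index=0, n_samples=10, interval_s=0.1):
    """Collect instantaneous GPU power readings (milliwatts) via NVML."""
    import pynvml  # imported lazily so the helper above works without a GPU

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        samples = []
        for _ in range(n_samples):
            samples.append(pynvml.nvmlDeviceGetPowerUsage(handle))
            time.sleep(interval_s)
        return samples
    finally:
        # Always release NVML, even if a query fails
        pynvml.nvmlShutdown()
```

On a GPU node, `average_power_w(sample_gpu_power_mw())` gives a short-window average. The reading covers the GPU board only, so host-side components still have to come from other interfaces or from a baseline model.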
