Power Monitoring
Learn how to measure, monitor, and optimize energy consumption in High-Performance Computing systems. This module covers the fundamental concepts of power monitoring, explores the technical implementations of modern monitoring systems (RAPL, NVML, HDEEM, and more), and provides practical techniques for power-aware resource management.
Prerequisites
Basic understanding of HPC architecture and supercomputing concepts
Familiarity with Linux command-line tools and system administration
Knowledge of basic Python for data analysis
Understanding of CPU and GPU fundamentals
Description
This module provides a comprehensive introduction to power monitoring in HPC systems. It is organized into two main episodes:
Episode 0: Power Monitoring Introduction covers the foundational concepts necessary for understanding power measurement in HPC. Topics include the mathematical relationship between energy and power, the hierarchical structure of power monitoring infrastructure (component-level, node-level, facility-level), a survey of different monitoring system types, and the critical trade-offs between in-band (real-time but partial) and out-of-band (complete but latent) monitoring approaches. The episode concludes with techniques for establishing accurate power baselines, which are essential for attributing energy to hardware components that are not measured directly.
Episode 1: Power Monitoring Systems provides technical deep dives into specific monitoring implementations deployed on modern HPC systems. It covers eight monitoring technologies: HDEEM (out-of-band), Intel RAPL and AMD RAPL (in-band x86), Fujitsu A64FX (custom ARM processor), NVIDIA GPU NVML, AMD GPU SMI, NVIDIA Grace CPU (sysfs), and the heterogeneous NVIDIA Grace Hopper architecture. The episode includes real-world case studies from operational supercomputers (LUMI, Karolina) with practical power baseline models and energy accounting techniques.
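As a small illustration of the energy and power relationship covered in Episode 0, the sketch below integrates sampled power over time with the trapezoidal rule and converts the result from joules to watt-hours. The function names and the sample values are hypothetical and are not taken from the course material; this is one possible approach, assuming power is sampled at known timestamps.

```python
from typing import Sequence

JOULES_PER_WATT_HOUR = 3600.0  # 1 Wh is 1 W sustained for 3600 s


def energy_joules(timestamps_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Integrate sampled power (W) over time (s) using the trapezoidal rule."""
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have equal length")
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        if dt < 0:
            raise ValueError("timestamps must be non-decreasing")
        # Average power over the interval multiplied by its duration (E = P x dt)
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def joules_to_watt_hours(energy_j: float) -> float:
    """Convert energy from joules to watt-hours."""
    return energy_j / JOULES_PER_WATT_HOUR


# Hypothetical samples: a node drawing a constant 200 W for 60 s
times = [0.0, 20.0, 40.0, 60.0]
power = [200.0, 200.0, 200.0, 200.0]
energy = energy_joules(times, power)
print(f"{energy:.1f} J = {joules_to_watt_hours(energy):.3f} Wh")
```

With constant power the result reduces to E = P × Δt. With varying power, the accuracy of the integral depends on how densely the samples cover the run.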
Prerequisites
Linux command-line proficiency
Basic Python for data analysis (NumPy, SciPy, Matplotlib optional)
Access to HPC system or local machine with power monitoring capabilities (or use pre-recorded data)
Course Topics
Energy and power fundamentals: $E = P \times \Delta t$, joules (J), watt-hours (Wh), and practical calculations
Power monitoring hierarchy: components, nodes, racks, facility, and building-level
In-band vs Out-of-band monitoring trade-offs: accuracy, latency, overhead, accessibility
Intel RAPL: MSR registers, domains, wraparound handling, accuracy characteristics
AMD RAPL: compatibility, read-only scope, per-core granularity
HDEEM and out-of-band systems: high-resolution measurement, deployment challenges
GPU power monitoring: NVIDIA NVML API, AMD SMI, domain definitions
Heterogeneous systems: NVIDIA Grace Hopper CPU+GPU power attribution
Power baselines: calibration methods, system-specific factors, unmeasured component estimation
Real-world case studies: LUMI and Karolina supercomputers with baseline models
Energy accounting and workload characterization for HPC jobs
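The wraparound handling listed among the RAPL topics can be sketched as follows. This is a minimal illustration, assuming a 32-bit cumulative energy counter (as stated in the learning outcomes) that wraps at most once between two consecutive reads. The function names are hypothetical, and the energy unit of 2^-14 J per count is only an example value; real hardware reports its own unit, which should be read from the platform rather than assumed.

```python
RAW_COUNTER_MODULUS = 2 ** 32  # assumption: 32-bit cumulative energy counter


def counter_delta(previous_raw: int, current_raw: int,
                  modulus: int = RAW_COUNTER_MODULUS) -> int:
    """Counts accumulated between two reads, tolerating a single wraparound.

    Modular subtraction yields the correct difference whether or not the
    counter wrapped, as long as it wrapped no more than once.
    """
    return (current_raw - previous_raw) % modulus


def raw_to_joules(raw_delta: int, energy_unit_j: float) -> float:
    """Scale a raw counter difference by the platform's energy unit."""
    return raw_delta * energy_unit_j


# Hypothetical readings taken shortly before and after a wraparound
previous_reading = 4_294_967_000
current_reading = 704
delta = counter_delta(previous_reading, current_reading)

EXAMPLE_ENERGY_UNIT_J = 2 ** -14  # example value only, not a universal constant
print(delta, raw_to_joules(delta, EXAMPLE_ENERGY_UNIT_J))
```

The sampling interval has to be short enough that the counter cannot wrap twice between reads; otherwise the lost energy cannot be recovered from the two values alone.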
Target Audience
Level: Intermediate to Advanced
Prerequisites: Comfortable with Linux systems, Python basics, and HPC fundamentals. Experience with HPC job submission and system administration is beneficial but not required.
Language: English
Technical Requirements
Python and its dependencies
Instructors
Ondrej Vysocky is a senior researcher at the Infrastructure Research Laboratory within the IT4Innovations National Supercomputing Center. His work focuses primarily on reducing the energy consumption of supercomputers to lower operating costs, achieving annual savings in the millions of Czech crowns and a significant reduction in the carbon footprint of computations.
Learning outcomes
This module prepares HPC practitioners, system administrators, researchers, and performance engineers to measure and optimize power consumption in supercomputing systems.
By the end of this module, learners should be able to:
Understand power/energy fundamentals: Distinguish power (W) from energy (J), perform accurate energy calculations (E = P × Δt), and understand why power budgets matter in HPC
Navigate the monitoring hierarchy: Explain component-level, node-level, rack-level, and facility-level power accounting, and describe appropriate use cases for each level
Compare monitoring approaches: Evaluate trade-offs between in-band (RAPL, NVML) and out-of-band (HDEEM, PDU) monitoring in terms of accuracy, latency, overhead, and deployment complexity
Interpret RAPL measurements: Read Intel/AMD RAPL data via sysfs/MSR, understand domain definitions, handle 32-bit counter wraparound, and account for measurement accuracy (±5-10%)
Access GPU power data: Use NVIDIA NVML and AMD SMI APIs to measure GPU power consumption and understand per-domain power distribution
Attribute power on heterogeneous systems: Calculate CPU, GPU, and interconnect power on systems like NVIDIA Grace Hopper using dual-interface approaches (HWMON + NVML)
Establish power baselines: Calibrate system-specific power models to quantify unmeasured hardware components and enable accurate energy attribution
Apply knowledge in practice: Deploy power monitoring in real HPC environments, validate against facility measurements, and use power data for workload characterization and optimization decisions
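To make the baseline outcome more concrete, the sketch below fits a simple linear model that maps in-band measured power to a reference node power, so that the offset and slope capture components the in-band interface does not see. The linear form, the function names, and all data values are assumptions made for illustration; this is not the baseline model used for LUMI or Karolina, which the episodes describe separately.

```python
from typing import Sequence, Tuple


def fit_linear_baseline(measured_w: Sequence[float],
                        reference_w: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of reference = slope * measured + offset."""
    n = len(measured_w)
    if n != len(reference_w) or n < 2:
        raise ValueError("need at least two paired samples")
    mean_x = sum(measured_w) / n
    mean_y = sum(reference_w) / n
    sxx = sum((x - mean_x) ** 2 for x in measured_w)
    if sxx == 0:
        raise ValueError("measured power must vary across samples")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(measured_w, reference_w))
    slope = sxy / sxx
    offset = mean_y - slope * mean_x
    return slope, offset


def estimate_node_power(measured_w: float, slope: float, offset: float) -> float:
    """Estimate whole-node power from an in-band reading using the fitted model."""
    return slope * measured_w + offset


# Hypothetical calibration data: in-band readings vs. an external reference (W)
in_band = [100.0, 200.0, 300.0, 400.0]
reference = [190.0, 300.0, 410.0, 520.0]

slope, offset = fit_linear_baseline(in_band, reference)
print(f"slope={slope:.3f}, offset={offset:.1f} W")
print(f"estimated node power at 250 W in-band: "
      f"{estimate_node_power(250.0, slope, offset):.1f} W")
```

In this toy data set the offset plays the role of the unmeasured static draw, while the slope accounts for losses that scale with load. A real calibration would need many more samples across idle and loaded states.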
See also
Credit
Don’t forget to check out the additional course materials from IT4Innovations National Supercomputing Center. Please contact us if you want to reuse these course materials in your teaching. You can also join us on LinkedIn to share your experience and get more help from the community.
License
CC BY-SA for media and pedagogical material
Copyright © 2026 IT4Innovations National Supercomputing Center. This material is released by IT4Innovations National Supercomputing Center under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.
Canonical URL: https://creativecommons.org/licenses/by-sa/4.0/
You are free to
Share — copy and redistribute the material in any medium or format for any purpose, even commercially.
Adapt — remix, transform, and build upon the material for any purpose, even commercially.
The licensor cannot revoke these freedoms as long as you follow the license terms.
Under the following terms
Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.
Notices
You do not have to comply with the license for elements of the material in the public domain or where your use is permitted by an applicable exception or limitation.
No warranties are given. The license may not give you all of the permissions necessary for your intended use. For example, other rights such as publicity, privacy, or moral rights may limit how you use the material.
This deed highlights only some of the key features and terms of the actual license. It is not a license and has no legal value. You should carefully review all of the terms and conditions of the actual license before using the licensed material.
MIT for source code and code snippets
MIT License
Copyright (c) 2026, EVITA project, Ondrej Vysocky
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Note
To module authors: For code you may use any OSI-approved license as mentioned in https://spdx.org/licenses/, such as Apache License 2.0, GNU GPLv3, MIT. Please make sure to update the deed above and the LICENSE.code file accordingly.