HPC HW POWER MONITORING: INTRODUCTION

ENERGY

Understanding Power vs Energy

Energy consumption is one of the most critical metrics in HPC systems, but it’s important to distinguish between power and energy:

  • Power (measured in Watts [W]) is the instantaneous rate at which energy is consumed

  • Energy (measured in Joules [J] or Watt-hours [Wh]) is the total amount consumed (or work performed) over a period of time

The relationship between them is captured by this fundamental formula:

$$Energy = Power \times Time$$

$$E [J] = P [W] \times \Delta t [s]$$
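The product form assumes that power stays constant over the interval. As a general statement (standard physics, not something specific to this document), energy is the time integral of power, and with discrete samples it is approximated by a sum. The symbols $t_0$, $t_1$, $P_i$ and $\Delta t_i$ are introduced here only for illustration:

$$E = \int_{t_0}^{t_1} P(t)\,dt \approx \sum_{i} P_i \,\Delta t_i$$

For constant $P$, this reduces to $E = P \times \Delta t$ with $\Delta t = t_1 - t_0$.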

Practical Examples

To make this concrete, consider these conversions:

$$1 \text{ Watt} \times 1 \text{ second} = 1 \text{ Joule}$$

$$1 \text{ Watt} \times 1 \text{ hour} = 1 \text{ Watt-hour (Wh)} = 3600 \text{ Joules}$$

Why does this matter for HPC? A system drawing 100 kW for 10 minutes consumes about 16.7 kWh, roughly a third of the 50 kWh consumed by a system drawing 50 kW continuously for 1 hour, even though the second system has the lower peak power. In HPC centers and data halls, understanding this relationship is essential for:

  • Estimating cooling requirements

  • Budgeting operational costs

  • Optimizing job scheduling and workload distribution

  • Reducing carbon footprint
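The comparison of the two systems earlier in this section can be checked with a few lines of arithmetic. This is a minimal sketch; the function names and the constant are chosen here for illustration and are not part of any monitoring tool.

```python
# Worked check of the two-system comparison: energy = power x time.
JOULES_PER_KWH = 3.6e6  # 1 kWh = 1000 W * 3600 s


def energy_joules(power_w: float, duration_s: float) -> float:
    """Energy in joules for a constant power draw over a duration."""
    return power_w * duration_s


def joules_to_kwh(energy_j: float) -> float:
    """Convert joules to kilowatt-hours."""
    return energy_j / JOULES_PER_KWH


# System A: 100 kW for 10 minutes. System B: 50 kW for 1 hour.
energy_a_j = energy_joules(100_000, 10 * 60)
energy_b_j = energy_joules(50_000, 60 * 60)

print(f"A: {joules_to_kwh(energy_a_j):.2f} kWh")  # A: 16.67 kWh
print(f"B: {joules_to_kwh(energy_b_j):.2f} kWh")  # B: 50.00 kWh
```

System A has twice the peak power but uses only a third of the energy, which is why power and energy have to be budgeted separately.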

Measurement Approaches

Energy can be measured through different sampling methods, each with trade-offs:

../../_images/power2energy.svg

The visualization below illustrates three common approaches to estimate total energy from power samples:

[Figure panels: TBD; TBD; Blade (HDEEM) samples [W]]

../../_images/power_monitoring.png
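The approaches shown in the figure are not spelled out in the text, so the following is only a sketch of three common numerical ways to turn timestamped power samples into an energy estimate. They may or may not match the figure. The function names and the sample values are made up for illustration.

```python
from typing import Sequence


def energy_left_rectangle(t: Sequence[float], p: Sequence[float]) -> float:
    """Hold each sample constant until the next one (left Riemann sum)."""
    return sum(p[i] * (t[i + 1] - t[i]) for i in range(len(t) - 1))


def energy_trapezoid(t: Sequence[float], p: Sequence[float]) -> float:
    """Interpolate linearly between neighbouring samples (trapezoidal rule)."""
    return sum(
        0.5 * (p[i] + p[i + 1]) * (t[i + 1] - t[i]) for i in range(len(t) - 1)
    )


def energy_mean_power(t: Sequence[float], p: Sequence[float]) -> float:
    """Mean of all samples times elapsed time (assumes even sample spacing)."""
    return (sum(p) / len(p)) * (t[-1] - t[0])


# Hypothetical samples: time in seconds, power in watts.
times_s = [0.0, 1.0, 2.0, 3.0, 4.0]
power_w = [100.0, 200.0, 300.0, 300.0, 200.0]

print(energy_left_rectangle(times_s, power_w))  # 900.0 J
print(energy_trapezoid(times_s, power_w))       # 950.0 J
print(energy_mean_power(times_s, power_w))      # 880.0 J
```

The three estimates differ by several percent on only five samples. The spread shrinks as the sampling rate rises relative to how fast the power changes.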

POWER MONITORING

Power monitoring in HPC systems occurs at multiple hierarchical levels, each serving different purposes in understanding where energy is consumed and how to optimize it.

Monitoring Hierarchy

The monitoring stack extends from individual hardware components up to entire data centers:

On-Node Component Level (most granular):

  • CPU - Processor cores and their power states

  • ACC - Accelerators (GPUs, specialized processors)

  • Memory - DRAM and cache subsystems

  • NIC - Network interface cards

This level provides the highest granularity but requires access to hardware performance counters or specialized monitoring interfaces.

System-Level Monitoring (broader scope):

  • Node - Entire compute node aggregated power

  • Chassis - Multiple nodes in an enclosure (e.g., blade servers)

  • PDU - Power Distribution Unit monitoring at infrastructure level

Data Center and Infrastructure (facility perspective):

  • Rack - Power consumption of an entire rack

  • System - The complete HPC system

  • Data Hall - A room housing multiple racks and systems within the data center

  • Building - Facility-wide energy accounting
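One way to picture the hierarchy is as a tree in which each level either has its own meter or is estimated by summing its children. The sketch below is illustrative only: the `Level` class, its fields, and all wattages are hypothetical and do not describe any particular monitoring product.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Level:
    """One level of the monitoring hierarchy (component, node, rack, ...)."""
    name: str
    measured_w: Optional[float] = None  # direct reading at this level, if any
    children: List["Level"] = field(default_factory=list)

    def power_w(self) -> float:
        """Prefer the direct measurement; otherwise sum the children."""
        if self.measured_w is not None:
            return self.measured_w
        return sum(child.power_w() for child in self.children)

    def unaccounted_w(self) -> Optional[float]:
        """Measured power not explained by the children, if both exist."""
        if self.measured_w is None or not self.children:
            return None
        return self.measured_w - sum(child.power_w() for child in self.children)


# Hypothetical readings in watts.
node0 = Level("node0", 450.0, [Level("CPU", 200.0), Level("ACC", 150.0),
                               Level("Memory", 40.0)])
node1 = Level("node1", 400.0)
rack = Level("rack0", children=[node0, node1])

print(rack.power_w())         # 850.0 (sum of the node-level readings)
print(node0.unaccounted_w())  # 60.0 (node power not covered by components)
```

The `unaccounted_w` value shows why several levels are useful together: a node-level meter sees power that component-level sensors miss.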

HPC System Examples

Different HPC centers implement power monitoring across their infrastructure to meet various needs:

HPE Cray

OLCF Frontier


Design Considerations

Why monitor at multiple levels?

  1. Accountability - Charge-back models and resource allocation require node-level or job-level energy metrics

  2. Optimization - Component-level data reveals which parts consume most power (e.g., memory vs. compute)

  3. Reliability - Infrastructure monitoring helps detect cooling problems and power delivery issues

  4. Predictive Maintenance - Facility-level trends can indicate aging equipment or degradation

  5. Workload Characterization - Understanding which jobs/users consume most energy enables scheduling optimization

Each monitoring level introduces different trade-offs between precision, overhead, cost, and actionability.

POWER MONITORING SYSTEMS FOR HPC

Modern HPC power monitoring systems vary widely in their implementation approach, each with distinct advantages and limitations suited to different deployment scenarios and requirements.

Key Dimensions of Variation

Power monitoring systems differ across several critical dimensions:

  • Data Source - Whether measurements come from hardware sensors, performance counters, PDU devices, or infrastructure meters

  • Accessibility - Whether data is accessible directly to users (in-band) or only to system administrators (out-of-band)

  • Temporal Resolution - Sampling frequency from milliseconds to seconds to minutes

  • Spatial Granularity - Component-level vs. node-level vs. system-level aggregation

  • Computational Overhead - The cost of monitoring itself and impact on application performance

  • Accuracy - Measurement precision and reliability across different workloads

  • Integration Complexity - Whether the system requires custom hardware modifications or uses existing infrastructure

  • Cost - Capital investment and operational expenses

Comparative Analysis

The following visualization compares leading power monitoring systems used in HPC centers:

images/8-systemscomparison.png

System Selection Criteria

When choosing a power monitoring system for an HPC facility, consider:

  1. Use Case - Are you tracking individual jobs, optimizing node efficiency, or managing facility-wide power budgets?

  2. Scale - Does your system need to monitor 10 nodes or 10,000?

  3. Accessibility Requirements - Do users need real-time access, or is post-execution reporting sufficient?

  4. Integration with Existing Infrastructure - Can you leverage vendor-provided monitoring, or do you need custom solutions?

  5. Accuracy Tolerance - Is ±5% accuracy acceptable, or do you need ±1%?

  6. Time-to-Deployment - Can you wait for development, or do you need immediate deployment?

Different HPC centers make different choices based on their specific operational constraints and research goals.

IN-BAND AND OUT-OF-BAND POWER MONITORING

One of the most fundamental architectural decisions in power monitoring systems is whether measurements should be accessible to end users (in-band) or restricted to system administrators (out-of-band). This choice has profound implications for system design, overhead, and usability.

In-Band Monitoring

Definition: Power data is accessible directly to users and applications during execution, typically through programming APIs or command-line interfaces.

Characteristics:

  • Vendor Specific - Usually implemented by hardware manufacturers (Intel, AMD, NVIDIA) and tightly integrated with their architectures

  • HW Performance Counters - Leverages built-in CPU/GPU counters and registers designed for performance analysis

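  • Read from User-Space - Applications can often query power metrics directly without privileged access, although some interfaces are restricted to privileged users by default for security reasons (for example, the Linux powercap RAPL energy counters on recent kernels)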

  • Real-time Access - Users can monitor their jobs while they run and make adaptive decisions

Advantages:

  • Enables adaptive power management - applications can adjust behavior based on power consumption

  • Low overhead - measurements are typically fast register reads

  • Easy integration - can be embedded directly in application code

  • Immediate feedback - researchers can experiment with optimization strategies in real-time

Disadvantages:

  • Limited to vendor implementations - not all architectures are equally well-supported

  • May not capture complete system power (e.g., memory subsystem, interconnects)

  • Potential security concerns - users can observe power consumption patterns of other applications

  • Accuracy varies - some interfaces report model-based estimates rather than direct hardware measurements

Out-of-Band Monitoring

Definition: Power measurements are collected independently of application execution and accessible only to system administrators or through post-execution reporting.

Characteristics:

  • No Overhead to Applications - Monitoring infrastructure operates independently, adding zero computational overhead

  • High Overhead of Exposing to User-Space - Significant effort required to collect, store, and expose data to users

  • Custom Sensors - Often uses specialized hardware devices (power meters, PDU monitoring, thermal sensors)

  • Post-Execution Reporting - Users typically receive energy metrics after jobs complete

Advantages:

  • Zero application impact - no interference with computational work

  • Complete observability - can measure entire node/chassis power including all components

  • Security isolation - power data doesn’t leak between users or applications

  • Highly accurate - direct hardware measurements rather than estimates

  • Infrastructure-centric - enables facility-level power budgeting and management

Disadvantages:

  • No real-time feedback - users cannot adapt during execution

  • Higher capital cost - requires specialized hardware infrastructure

  • Deployment complexity - integration challenges with existing systems

  • Limited granularity - typically node-level rather than component-level insights
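As an illustration of the out-of-band path, many baseboard management controllers (BMCs) expose node power through the DMTF Redfish API. The sketch below only parses a response body and does not contact a BMC. It assumes the older but widely deployed Redfish `Power` resource, where `PowerControl[].PowerConsumedWatts` holds the reading. The sample payload and the function name are hypothetical, and real BMCs may differ.

```python
import json
from typing import Optional

# Hypothetical, heavily trimmed response body of a Redfish Power resource.
SAMPLE_PAYLOAD = """
{
  "PowerControl": [
    {"Name": "System Power Control", "PowerConsumedWatts": 412}
  ]
}
"""


def node_power_watts(payload: str) -> Optional[float]:
    """Return the first available PowerConsumedWatts reading, or None."""
    document = json.loads(payload)
    for control in document.get("PowerControl", []):
        watts = control.get("PowerConsumedWatts")
        if watts is not None:  # the property may be null on some BMCs
            return float(watts)
    return None


print(node_power_watts(SAMPLE_PAYLOAD))  # 412.0
```

The request runs on the management network, so collecting the reading costs the application nothing. That matches the zero-overhead property listed above.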

Comparative Analysis: RAPL vs DiG

Two contrasting approaches in modern HPC are illustrated below:

RAPL

DiG

RAPL (Running Average Power Limit) - Intel’s in-band interface, which exposes energy counters for processor domains (such as package and DRAM) through model-specific registers (MSRs) on modern processors. Applications can sample these counters at run time with minimal overhead and derive power from the difference between successive readings.

DiG (Dwarf in a Giant) - An out-of-band approach using dedicated monitoring hardware to capture complete system power without impacting application execution.
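For the RAPL case, a minimal sketch of an in-band reading on Linux is shown below. It assumes the powercap sysfs tree is present and that `intel-rapl:0` is the package-0 domain. The path may differ or be absent on a given machine, and reading `energy_uj` can require root. The helper names are illustrative.

```python
import time
from pathlib import Path

# Assumed location of the package-0 RAPL domain in the Linux powercap tree.
RAPL_DOMAIN = Path("/sys/class/powercap/intel-rapl:0")


def read_microjoules(path: Path) -> int:
    """Read one integer counter value (microjoules) from a sysfs file."""
    return int(path.read_text().strip())


def average_power_w(start_uj: int, end_uj: int, max_range_uj: int,
                    interval_s: float) -> float:
    """Average power between two counter readings, allowing one wraparound."""
    delta_uj = end_uj - start_uj
    if delta_uj < 0:  # the counter wrapped past max_energy_range_uj
        delta_uj += max_range_uj
    return delta_uj / 1e6 / interval_s


if __name__ == "__main__":
    try:
        max_range = read_microjoules(RAPL_DOMAIN / "max_energy_range_uj")
        start = read_microjoules(RAPL_DOMAIN / "energy_uj")
        time.sleep(0.5)
        end = read_microjoules(RAPL_DOMAIN / "energy_uj")
        print(f"package power: {average_power_w(start, end, max_range, 0.5):.1f} W")
    except OSError as error:  # file missing or not readable by this user
        print(f"RAPL counters not available: {error}")
```

The counter reports accumulated energy, not power, so power is always an average over the chosen interval. A shorter interval gives finer resolution but noisier values.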

Design Trade-Off Matrix

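| Dimension                | In-Band  | Out-of-Band              |
|--------------------------|----------|--------------------------|
| User Access              | Direct   | Post-execution reporting |
| Application Overhead     | Low      | Zero                     |
| Measurement Completeness | Partial  | Complete                 |
| Real-time Adaptation     | Yes      | No                       |
| Deployment Complexity    | Low      | High                     |
| Cost                     | Low      | High                     |
| Accuracy                 | Moderate | High                     |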

Practical Guidance

Choose In-Band monitoring when:

  • Your research focuses on power-aware algorithms and adaptive optimization

  • You need real-time feedback for interactive exploration

  • Hardware budget is limited

  • You need a component-level energy breakdown (CPU, accelerators)

Choose Out-of-Band monitoring when:

  • You need accurate facility-wide power accounting

  • Application performance impact must be eliminated and complete node-level coverage is critical

  • Security and isolation are primary concerns

  • You’re building infrastructure for long-term operational management

NODE POWER BASELINE

The Energy Attribution Challenge

A fundamental problem in HPC power monitoring is the measurement gap: hardware components have heterogeneous instrumentation coverage. While some parts (like CPU cores) provide high-frequency power measurements through performance counters, many critical components lack direct monitoring.

The Measurement Hierarchy Problem

Consider a typical compute node:

  • High Frequency Energy Measurement of Some Components (millisecond resolution)

    • CPU cores (via RAPL, MSR counters, or similar)

    • GPUs (board power via nvidia-smi on supported hardware)

    • Individual accelerators (where vendor support exists)

  • Missing Energy Consumption of the Remaining Parts (no direct measurement)

    • Memory controllers and their power management logic

    • System interconnects (PCIe, InfiniBand, Ethernet)

    • Power conversion and delivery circuitry

    • Cooling and thermal management subsystems

    • Motherboard chipset and I/O controllers

  • Low Frequency Power Monitoring of the Whole Node (second to minute resolution)

    • PDU-level measurements

    • Baseboard Management Controller (BMC) readings

    • Out-of-band monitoring systems

The Estimation Gap

This creates a critical measurement problem:

$$\text{Unmeasured Components} = \text{Total Node Power} - \sum \text{Measured Components}$$

For short to medium-length jobs (seconds to minutes):

  • Component-level measurements show what the CPU/GPU consumed

  • Whole-node measurements show total consumption

  • The difference is highly variable and hard to estimate reliably, because the whole-node readings arrive at second-to-minute resolution and a short job spans only a few of them

Without additional calibration, energy figures for short and medium-length jobs are therefore unreliable.
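The gap formula can be written as a small helper, together with a rough indication of how much a coarse node-level sampling period matters for a short job. Both functions are illustrative sketches. The timing bound is a simplifying assumption introduced here (the job window can be misaligned by up to one sample period), not a result from the measurements discussed in this section.

```python
from typing import Dict


def unmeasured_power_w(total_node_w: float,
                       measured_components_w: Dict[str, float]) -> float:
    """Total node power minus the sum of the directly measured components."""
    return total_node_w - sum(measured_components_w.values())


def timing_uncertainty_fraction(sample_period_s: float,
                                job_duration_s: float) -> float:
    """Rough share of a job's duration that one sample period represents.

    Assumption: the job window may be off by up to one node-level sample.
    """
    return min(1.0, sample_period_s / job_duration_s)


# Hypothetical readings in watts.
gap_w = unmeasured_power_w(450.0, {"cpu": 200.0, "gpu": 150.0})
print(gap_w)  # 100.0

# A 10 s job versus a 1 h job with one node-level sample per second.
print(timing_uncertainty_fraction(1.0, 10.0))    # 0.1
print(timing_uncertainty_fraction(1.0, 3600.0))  # about 0.0003
```

With these assumed numbers, a ten-second job carries a timing uncertainty of about 10%, while for an hour-long job it is negligible.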

Node Power Baseline Estimation Method

To estimate power consumption of non-monitored on-node components, researchers use an empirical calibration approach:

Procedure:

  1. Idle Node Measurement - Measure total node power when fully idle (all cores parked, no computation)

  2. Controlled Load Experiments - Load the node with a uniform, reproducible workload

    • Run synthetic benchmarks that stress all components uniformly

    • Common choices: LINPACK (HPL), STREAM, or specialized benchmarks

    • Measure at multiple power levels to create a calibration curve

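  3. Model Development - Construct a simple power model: $$P_{\text{node}} = P_{\text{baseline}} + P_{\text{cpu}} + P_{\text{unmonitored}}$$

    Here $P_{\text{node}}$ is the measured whole-node power, $P_{\text{baseline}}$ is the load-independent power of the non-monitored components (idle node power from step 1 minus the monitored components' idle power), $P_{\text{cpu}}$ stands for the directly measured components, and $P_{\text{unmonitored}}$ is the load-dependent remainder, estimated as $P_{\text{node}} - P_{\text{baseline}} - P_{\text{cpu}}$ during the controlled load experiments.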

  4. Validation - Test the model against diverse real workloads to verify accuracy
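The procedure above can be sketched as a small calibration fit. The key assumption, which is not stated in the text, is that the load-dependent unmonitored power grows linearly with the measured CPU power, i.e. $P_{\text{unmonitored}} = k \cdot P_{\text{cpu}}$. Under that assumption, $P_{\text{node}} - P_{\text{cpu}} = P_{\text{baseline}} + k \cdot P_{\text{cpu}}$ is a straight line that ordinary least squares can fit. The function names and calibration points are hypothetical.

```python
from typing import Sequence, Tuple


def fit_baseline(cpu_w: Sequence[float],
                 node_w: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of (P_node - P_cpu) = P_baseline + k * P_cpu.

    Returns (P_baseline, k). Assumes P_unmonitored is linear in P_cpu.
    """
    n = len(cpu_w)
    residual = [pn - pc for pn, pc in zip(node_w, cpu_w)]
    mean_x = sum(cpu_w) / n
    mean_y = sum(residual) / n
    sxx = sum((x - mean_x) ** 2 for x in cpu_w)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(cpu_w, residual))
    k = sxy / sxx
    return mean_y - k * mean_x, k


def predict_node_w(cpu_w: float, baseline_w: float, k: float) -> float:
    """P_node = P_baseline + P_cpu + P_unmonitored, with P_unmonitored = k * P_cpu."""
    return baseline_w + cpu_w + k * cpu_w


# Hypothetical calibration points (watts): idle first, then rising load.
cpu_samples = [40.0, 100.0, 160.0, 220.0]
node_samples = [130.0, 205.0, 280.0, 355.0]

baseline_w, k = fit_baseline(cpu_samples, node_samples)
print(baseline_w, k)                          # 80.0 0.25
print(predict_node_w(200.0, baseline_w, k))   # 330.0
```

A linear term is only the simplest choice. If validation in step 4 shows systematic errors, more calibration points or additional predictors (for example memory bandwidth) may be needed.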

System-Specific Nature

Critical insight: The power baseline is system-specific and must be evaluated for each system individually because:

  • Hardware variation - Different manufacturers, component vendors, and firmware versions have different overhead profiles

  • BIOS/firmware settings - Power management policies, thermal thresholds, and voltage regulators vary significantly

  • Environmental factors - Temperature, voltage supply variations, and cooling efficiency affect measurements

  • Workload interaction effects - Unmonitored components consume power based on unmeasured metrics (memory bandwidth, I/O traffic, thermal management activity)

Measurement Instruments and Examples

The following visualizations show real-world power baseline measurement data and instrumentation:

../../_images/10-chart.png
../../_images/10-HDEEM.png
../../_images/10-HDEEM-photo.png

HDEEM (High Definition Energy Efficiency Monitoring) - A specialized out-of-band monitoring system that captures detailed power metrics at high frequency and enables accurate baseline determination through controlled experiments.

Practical Implications

When using energy measurements in your research:

  1. Understand the coverage - Know what percentage of your node’s power is directly measured vs. estimated

  2. Design for accuracy - Run jobs long enough that baseline variations and sampling effects become negligible relative to the job's total energy

  3. Validate your assumptions - Compare model predictions against actual measurements for your workloads

  4. Document your methodology - Report which components are measured, which are estimated, and how the baseline was determined

  5. Use baselines consistently - Once established for a system, reuse the same baseline across experiments for reproducibility