Quiz: HPC Power Monitoring (Episodes 0 & 1)

Episode 0: Power Monitoring Introduction

Multiple choice, single answer

Energy vs Power Fundamentals:

What is the fundamental relationship between energy and power?

  • A) Energy is the rate at which power is consumed

  • B) Power is the rate at which energy is consumed

  • C) They are the same thing measured differently

  • D) Power is always zero if energy is constant

If a system consumes an average power of 50 kW for 2 hours, how much energy is consumed?

  • A) 25 kWh

  • B) 50 kWh

  • C) 100 kWh

  • D) 200 kWh

Measurement Approaches:

Why do component-level measurements (like RAPL) become unreliable for short jobs?

  • A) The CPU is too slow at short durations

  • B) They only capture partial node power, missing unmonitored components

  • C) Short jobs generate no heat

  • D) Linux cannot track short-running processes

Power Monitoring Hierarchy:

At which level of the monitoring hierarchy would you find CPU, memory, and NIC measurements?

  • A) Facility level

  • B) Rack level

  • C) Node-component level

  • D) PDU level

What is the main benefit of hierarchical power monitoring?

  • A) It costs less money

  • B) It provides different insights at different granularities for optimization

  • C) It eliminates the need for precise measurements

  • D) It requires no specialized hardware

In-Band vs Out-of-Band Monitoring:

Which monitoring approach provides real-time power data accessible to applications?

  • A) Out-of-band monitoring

  • B) In-band monitoring

  • C) PDU-based monitoring

  • D) Post-execution reporting

What is a key disadvantage of out-of-band monitoring?

  • A) High overhead on application execution

  • B) Cannot measure CPU power

  • C) No real-time feedback during job execution

  • D) Works only on Intel processors

Power Baseline Challenges:

What is the “measurement gap” in power monitoring?

  • A) The difference between wall power and utility-reported power

  • B) The gap between measured component power and actual total node power

  • C) The time delay in reading RAPL counters

  • D) The thermal margin on CPUs

Why must power baselines be system-specific?

  • A) It’s a vendor conspiracy to lock in customers

  • B) Different hardware, firmware, and environmental factors affect unmeasured component power consumption

  • C) Each data center is at a different altitude

  • D) BIOS updates always change power consumption

Conceptual questions

Energy Attribution Problem: RAPL reports that your CPU consumed 50 Joules during a computation, but node-level PDU measurements show 200 Joules. Where did the other 150 Joules go? List at least 3 possible components.

Measurement Strategy: You need to monitor power consumption of jobs on an HPC system for energy-aware scheduling. Would you choose in-band or out-of-band monitoring? Justify your choice considering overhead, accuracy, and real-time requirements.

Monitoring Hierarchy Design: Design a three-level power monitoring hierarchy for an HPC data center with 1000 nodes. Specify what gets measured at each level and why.

Episode 1: Power Monitoring Systems

Multiple choice, single answer

HDEEM (Bull/Atos):

What is the sampling frequency of HDEEM blade-level monitoring?

  • A) 100 Hz

  • B) 1 kHz

  • C) 10 Hz

  • D) 100 kHz

Which domains does HDEEM monitor at 100 Hz?

  • A) Only CPU power

  • B) CPU, DRAMs, NIC, VAUX

  • C) Only memory power

  • D) System-level power aggregates

What is HDEEM’s measurement accuracy uncertainty?

  • A) ±10%

  • B) ±5%

  • C) ±2%

  • D) ±0.5%

Intel RAPL:

Which RAPL domain is specific to server architectures and disabled by default?

  • A) Package

  • B) PP0/Core

  • C) DRAM

  • D) PSys/Platform

What are the two time windows in the Intel RAPL Package domain?

  • A) 1 ms and 10 ms

  • B) 100 ms and 1 second

  • C) 1.2× TDP (~ms) and 1× TDP (~second)

  • D) 1 second and 10 seconds

Which Intel architecture introduced the PSys/Platform domain?

  • A) Haswell

  • B) Broadwell

  • C) Skylake

  • D) Kaby Lake

AMD RAPL:

How does AMD RAPL differ from Intel RAPL in terms of power capping?

  • A) AMD supports more fine-grained capping

  • B) AMD provides read-only energy reporting, not power capping

  • C) AMD RAPL is faster

  • D) AMD RAPL works on client CPUs only

What is unique about AMD’s Core power domain?

  • A) It doesn’t exist

  • B) It provides per-core granularity

  • C) It’s not accessible to users

  • D) It only works with special firmware

Fujitsu A64FX:

What is exceptional about the A64FX’s sampling frequency?

  • A) 1 kHz like most systems

  • B) 100 kHz

  • C) Every cycle of the domain (cycle-accurate)

  • D) 10 Hz

Which cache level is separately monitored on A64FX?

  • A) L1 only

  • B) L2/LLC

  • C) L3 only

  • D) All cache levels combined

NVIDIA GPU Monitoring:

What does nvmlDeviceGetTotalEnergyConsumption() return?

  • A) Instantaneous power in Watts

  • B) Cumulative energy in millijoules since the driver was last reloaded

  • C) Estimated energy based on clock frequency

  • D) Memory bandwidth in GB/s

Which power domain represents memory subsystem power in NVIDIA GPUs?

  • A) GPU_POWER

  • B) MEMORY_POWER

  • C) MODULE_POWER

  • D) NVML_POWER

AMD GPU Monitoring:

Which groups must a user belong to in order to access GPU power data with AMD SMI?

  • A) root only

  • B) Any user

  • C) video and render groups

  • D) Admin and power-users

How does AMD SMI’s energy counter precision compare to NVIDIA?

  • A) More precise

  • B) Similar (±5-10%)

  • C) Much less precise

  • D) Cannot be compared

NVIDIA Grace CPU:

What interface does the NVIDIA Grace CPU use for power monitoring?

  • A) RAPL via MSR

  • B) NVIDIA-proprietary API

  • C) Linux HWMON via sysfs

  • D) Custom kernel module

What is the sampling window for HWMON power measurements on Grace?

  • A) 1-10 ms

  • B) 50-1000 ms

  • C) 1-10 seconds

  • D) 10-100 seconds

NVIDIA Grace Hopper:

How is the GPU power domain calculated in Grace Hopper?

  • A) Direct hardware measurement like RAPL

  • B) Derived by subtraction: Module - Grace

  • C) Not measured at all

  • D) Estimated from frequency scaling

What does the Module domain encompass in Grace Hopper?

  • A) CPU and GPU separately

  • B) CPU, GPU, and all interconnect power

  • C) GPU only

  • D) System IO only

Power Baseline Theory:

What is a characteristic of unmeasured component power in HPC nodes?

  • A) It’s always negligible

  • B) It varies significantly with workload type and intensity

  • C) It’s the same for all systems

  • D) It can be calculated from TDP alone

For jobs lasting only a few seconds, why are energy measurements problematic?

  • A) CPUs don’t consume power for short jobs

  • B) Short jobs are always energy-efficient

  • C) The unmeasured component baseline becomes significant relative to job energy

  • D) Measurement systems don’t work with short jobs

Coding and analysis questions

RAPL Energy Calculation: Assume you read MSR_PKG_ENERGY_STATUS at time t₀ = 0 s and get E₀ = 0x12345678. At time t₁ = 60 s, you read E₁ = 0x87654321. The MSR_RAPL_POWER_UNIT register shows energy_unit_multiplier = 61e-6 (joules per unit).

  • a) Calculate the energy consumed (accounting for 32-bit wraparound if needed)

  • b) Compute average power during this interval

  • c) What happens if wraparound occurs? Show the calculation
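
As a starting point, the wraparound-safe delta can be sketched in Python. The function names are hypothetical, the energy unit is passed in as a parameter rather than fixed, and the sketch assumes the 32-bit counter wraps at most once between the two reads.

```python
COUNTER_BITS = 32  # the energy status field is treated as a 32-bit counter


def rapl_energy_joules(raw_start, raw_end, joules_per_unit):
    """Energy between two raw counter reads, assuming at most one wraparound."""
    # Modular subtraction gives the right delta whether or not the counter wrapped.
    delta_units = (raw_end - raw_start) % (1 << COUNTER_BITS)
    return delta_units * joules_per_unit


def average_power_watts(energy_joules, seconds):
    """Average power over the interval in which the energy was consumed."""
    return energy_joules / seconds


# Values from the question: the end value is larger, so no wraparound occurred.
units = (0x87654321 - 0x12345678) % (1 << COUNTER_BITS)

# Illustrative wraparound case: the counter overflowed once between the reads.
wrapped_units = (0x00000010 - 0xFFFFFFF0) % (1 << COUNTER_BITS)
```

Multiplying `units` by the energy unit and dividing by the 60 s interval gives the answers to parts (a) and (b).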

Power Baseline Determination: You measure a node under baseline (idle): 40W for CPU, 20W for memory, 100W total node power. After calibration with synthetic load: 150W for CPU, 80W for memory, 350W total node power.

  • a) Calculate unmonitored component power in both states

  • b) Develop a simple linear model: P_unmonitored = f(P_cpu, P_mem)

  • c) Predict total node power for CPU=120W, Mem=60W
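
One possible way to approach this is sketched below. Two calibration points cannot separate the CPU and memory contributions, so the sketch makes a simplifying assumption: unmonitored power is modeled as a linear function of the sum P_cpu + P_mem. All names are illustrative.

```python
# Calibration points taken from the question (all values in watts).
idle = {"cpu": 40.0, "mem": 20.0, "node": 100.0}
load = {"cpu": 150.0, "mem": 80.0, "node": 350.0}


def measured(state):
    """Power seen by the component-level sensors."""
    return state["cpu"] + state["mem"]


def unmonitored(state):
    """Node power not accounted for by the CPU and memory sensors."""
    return state["node"] - measured(state)


# Assumption: P_unmonitored = intercept + slope * (P_cpu + P_mem),
# with the line passing through the two calibration points.
slope = (unmonitored(load) - unmonitored(idle)) / (measured(load) - measured(idle))
intercept = unmonitored(idle) - slope * measured(idle)


def predict_total(cpu_w, mem_w):
    """Predicted node power: measured part plus modeled unmonitored part."""
    m = cpu_w + mem_w
    return m + intercept + slope * m


prediction = predict_total(120.0, 60.0)
```

With more calibration points, separate coefficients for CPU and memory could be fitted instead of a single slope.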

GPU vs CPU Power Comparison: Write pseudocode to:

  • a) Query GPU power using NVIDIA NVML: nvmlDeviceGetPowerUsage()

  • b) Query CPU power using Intel RAPL from sysfs

  • c) Compare and report which component is consuming more power

  • d) Handle errors (missing sensors, permission issues)
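
A minimal Python sketch of the structure such a solution might take follows. It assumes the Linux powercap interface exposes package 0 at `/sys/class/powercap/intel-rapl:0/energy_uj` and that the optional `pynvml` bindings are installed; the helper names are hypothetical. Every failure (missing sensor, missing library, permission denied) is reduced to `None`, so the comparison step can report it.

```python
import time

# Assumed location of the package-0 energy counter (microjoules).
RAPL_ENERGY_FILE = "/sys/class/powercap/intel-rapl:0/energy_uj"


def read_cpu_power_watts(path=RAPL_ENERGY_FILE, interval_s=1.0):
    """Average package power from two energy reads; None if unavailable."""
    try:
        with open(path) as f:
            start_uj = int(f.read())
        time.sleep(interval_s)
        with open(path) as f:
            end_uj = int(f.read())
    except (OSError, ValueError):  # missing file, permission denied, bad content
        return None
    if end_uj < start_uj or interval_s <= 0:
        return None  # counter wrapped or invalid interval; caller may retry
    return (end_uj - start_uj) / 1e6 / interval_s


def read_gpu_power_watts(index=0):
    """Current GPU power draw via NVML; None if NVML or the GPU is unavailable."""
    try:
        import pynvml  # optional dependency, imported lazily
        pynvml.nvmlInit()
        try:
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            return pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W
        finally:
            pynvml.nvmlShutdown()
    except Exception:  # ImportError or any NVML error
        return None


def compare(cpu_w, gpu_w):
    """Report which component draws more power, or which reading is missing."""
    if cpu_w is None or gpu_w is None:
        missing = [n for n, v in (("CPU", cpu_w), ("GPU", gpu_w)) if v is None]
        return "cannot compare, no reading for: " + ", ".join(missing)
    larger = "GPU" if gpu_w > cpu_w else "CPU"
    return f"{larger} draws more power (CPU {cpu_w:.1f} W, GPU {gpu_w:.1f} W)"
```

Calling `compare(read_cpu_power_watts(), read_gpu_power_watts())` on a node ties the pieces together; on systems without either sensor it returns the "cannot compare" message instead of raising.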

Grace Hopper Domain Analysis: Given Grace Hopper power readings:

  • Module: 300W

  • Grace: 200W (CPU + SysIO + DRAM)

  • CPU cores: 80W

  • SysIO: 40W

  • DRAM: 80W

  • a) Verify the Grace domain equation

  • b) Calculate GPU power

  • c) Compute efficiency ratio: (CPU + GPU) / (Total Module)

  • d) Identify which component is wasting power as infrastructure overhead
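
The arithmetic for parts (a) to (c) can be checked with a few lines of Python. The variable names are illustrative; the values are the readings listed in the question.

```python
# Readings from the question (watts).
module_w = 300.0
grace_w = 200.0
cpu_w, sysio_w, dram_w = 80.0, 40.0, 80.0

# (a) The Grace domain should equal the sum of its sub-domains.
grace_sum = cpu_w + sysio_w + dram_w
grace_consistent = abs(grace_sum - grace_w) < 1e-9

# (b) GPU power is derived by subtraction: Module - Grace.
gpu_w = module_w - grace_w

# (c) Share of module power going to the compute components.
efficiency_ratio = (cpu_w + gpu_w) / module_w

print(grace_consistent, gpu_w, efficiency_ratio)
```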

Baseline Comparison: Compare baselines from two systems: LUMI (European supercomputer) and Karolina (Czech supercomputer) using the visualization data in Episode 1.

  • a) Identify why baseline power differs between accelerated (ACN) and compute (CN) nodes

  • b) Estimate the cost difference of running identical jobs on ACN vs CN nodes

  • c) Propose an optimization strategy based on baseline analysis

RAPL Counter Management: Design a monitoring daemon that:

  • a) Reads RAPL counters every 10 seconds

  • b) Detects and handles counter wraparound (32-bit overflow)

  • c) Computes average power over each sampling interval from energy deltas

  • d) Logs results to a database for post-execution analysis

Pseudocode is sufficient.
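
One way such a daemon could be structured is sketched below in Python, with SQLite standing in for the database. The function name, table layout, and the injectable `read_counter`, `sleep`, and `clock` parameters are assumptions made for illustration; the loop runs for a fixed number of samples so that it terminates.

```python
import sqlite3
import time

WRAP = 1 << 32  # the counter is assumed to be 32 bits wide


def run_sampler(read_counter, joules_per_unit, db, period_s=10.0,
                samples=3, sleep=time.sleep, clock=time.monotonic):
    """Sample a raw energy counter periodically and log energy and power."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS power_log (t REAL, energy_j REAL, power_w REAL)"
    )
    prev_raw, prev_t = read_counter(), clock()
    for _ in range(samples):
        sleep(period_s)
        raw, now = read_counter(), clock()
        # Modular delta handles a single wraparound between two samples.
        energy_j = ((raw - prev_raw) % WRAP) * joules_per_unit
        # Power here is the average over the sampling interval.
        power_w = energy_j / (now - prev_t)
        db.execute("INSERT INTO power_log VALUES (?, ?, ?)",
                   (now, energy_j, power_w))
        db.commit()
        prev_raw, prev_t = raw, now
```

Passing the counter reader as a function keeps the loop independent of how the counter is accessed (MSR, sysfs, or a test stub). The sampling period must stay shorter than the time the counter needs to wrap, otherwise multiple wraparounds go undetected.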

System-Specific Baseline Model: Given the following empirical data for a compute node:

| Workload Type | Measured Power (CPU+GPU) [W] | Total Node Power [W] | Unmeasured [W] |
|---|---|---|---|
| Idle | 20 | 80 | 60 |
| Synthetic (50% load) | 100 | 220 | 120 |
| Synthetic (100% load) | 180 | 400 | 220 |
| Real HPC app (50%) | 95 | 210 | 115 |
| Real HPC app (100%) | 175 | 390 | 215 |

  • a) Is the unmeasured power linear or non-linear with measured power?

  • b) Fit a model: P_unmeasured = a + b × P_measured

  • c) Predict unmeasured power for measured=120W

  • d) What sources could cause non-linearity?
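
For part (b), an ordinary least-squares fit is one reasonable choice; the sketch below uses the closed-form formulas in plain Python so that it needs no external libraries. The data points are the five rows given in the question, and the function name is illustrative.

```python
# (measured, unmeasured) pairs in watts, taken from the question's data.
data = [(20, 60), (100, 120), (180, 220), (95, 115), (175, 215)]


def fit_line(points):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


a, b = fit_line(data)
predicted_unmeasured = a + b * 120

# Residuals show how far each point lies from the fitted line.
residuals = [y - (a + b * x) for x, y in data]
print(f"a = {a:.2f} W, b = {b:.3f}, prediction at 120 W = {predicted_unmeasured:.1f} W")
```

Inspecting the sign pattern of `residuals` is a quick way to judge part (a): systematic curvature in the residuals would suggest a non-linear relationship.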

Coding questions

Generate a 1D NumPy array of 1 million random floats. Compute the square root of each element using:

  • a) a Python for loop

  • b) NumPy’s vectorized np.sqrt
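
A possible solution sketch follows. It additionally times both variants, which the question does not explicitly ask for but which makes the difference visible; the seed and variable names are arbitrary.

```python
import math
import time

import numpy as np

rng = np.random.default_rng(0)
values = rng.random(1_000_000)  # uniform floats in [0, 1)

# (a) Explicit Python loop, one element at a time.
start = time.perf_counter()
loop_result = np.empty_like(values)
for i, v in enumerate(values):
    loop_result[i] = math.sqrt(v)
loop_seconds = time.perf_counter() - start

# (b) Vectorized NumPy call over the whole array.
start = time.perf_counter()
vector_result = np.sqrt(values)
vector_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.3f} s, vectorized: {vector_seconds:.4f} s")
```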

Load a CSV file of weather data (e.g., temperature, humidity, wind).

  • a) filter rows where temperature > 30°C

  • b) compute the average humidity for each month using groupby
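
A sketch using pandas is shown below. Since no data file is specified, it reads a small made-up CSV from a string; the column names `date`, `temperature`, `humidity`, and `wind` are assumptions, as are the sample values.

```python
import io

import pandas as pd

# Made-up sample data standing in for a real weather CSV file.
csv_text = """date,temperature,humidity,wind
2024-06-01,28.5,60,3.2
2024-06-15,32.0,40,5.1
2024-07-03,35.5,30,2.4
2024-07-20,29.0,50,4.0
"""

weather = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# (a) Rows where the temperature exceeds 30 °C.
hot = weather[weather["temperature"] > 30]

# (b) Average humidity per calendar month.
monthly_humidity = weather.groupby(weather["date"].dt.month)["humidity"].mean()

print(hot)
print(monthly_humidity)
```

For a real file, `pd.read_csv("path/to/file.csv", parse_dates=[...])` would replace the in-memory string.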

Create a random 100×100 matrix A and a vector b.

  • a) use scipy.linalg.solve to solve the system $Ax = b$

  • b) verify the solution by checking the residual norm
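
One possible solution is sketched here; the seed is arbitrary and the acceptable residual threshold depends on the conditioning of the random matrix.

```python
import numpy as np
from scipy.linalg import solve

rng = np.random.default_rng(0)
A = rng.random((100, 100))
b = rng.random(100)

# (a) Solve the dense linear system A x = b.
x = solve(A, b)

# (b) The residual norm should be close to machine precision.
residual_norm = np.linalg.norm(A @ x - b)
print(f"||Ax - b|| = {residual_norm:.2e}")
```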

Simulate a DataFrame with missing values in numerical columns.

  • a) fill missing values with the column mean (using NumPy)

  • b) compute basic statistics before and after imputation
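
A sketch of one approach follows. The column names, the 10% missing rate, and the seed are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])

# Knock out roughly 10% of the entries to simulate missing values.
missing = rng.random(df.shape) < 0.1
df = df.mask(missing)

before = df.describe()

# (a) Column means computed with NumPy, ignoring NaN entries.
col_means = np.nanmean(df.to_numpy(), axis=0)
filled = df.fillna(pd.Series(col_means, index=df.columns))

# (b) Statistics after imputation, for comparison with `before`.
after = filled.describe()
print(before.loc[["count", "mean", "std"]])
print(after.loc[["count", "mean", "std"]])
```

Mean imputation leaves each column mean unchanged but shrinks the standard deviation, which the two summaries make visible.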

Generate noisy data for a quadratic function $y = ax^2 + bx + c$.

  • a) use scipy.optimize.curve_fit to fit the data and recover the original parameters

  • b) plot the original vs fitted curve
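
The fitting step in part (a) might look like the sketch below; plotting is left out to keep it self-contained. The true parameters, noise level, and seed are arbitrary illustrative values.

```python
import numpy as np
from scipy.optimize import curve_fit


def quadratic(x, a, b, c):
    """Model function y = a*x^2 + b*x + c."""
    return a * x**2 + b * x + c


true_params = (2.0, -3.0, 1.0)  # arbitrary values chosen for illustration
rng = np.random.default_rng(0)
x = np.linspace(-5, 5, 200)
y = quadratic(x, *true_params) + rng.normal(scale=1.0, size=x.size)

# Fit the noisy data; curve_fit returns the estimates and their covariance.
fitted_params, covariance = curve_fit(quadratic, x, y)
fitted_curve = quadratic(x, *fitted_params)

print("true:  ", true_params)
print("fitted:", fitted_params)
```

For part (b), `x`, `y`, and `fitted_curve` can be passed directly to a plotting library such as Matplotlib.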