Instructor guide


Why we teach this lesson

Power monitoring is critical for modern HPC systems due to:

Operating cost reduction: Energy is typically 20-30% of HPC operating budgets. Understanding power enables workload optimization, job scheduling, and facility planning that directly reduce costs.
Environmental responsibility: HPC data centers consume megawatts of power with significant carbon footprint. Power-aware computing aligns with sustainability goals and regulatory requirements (EU Green Deal, carbon neutrality targets).
Performance optimization: Power is not just an operational concern—it constrains performance. Understanding power budgets, thermal limits, and power delivery constraints allows better performance/power trade-offs.
System reliability: Power monitoring enables predictive maintenance, thermal management, and prevents equipment damage from overheating or power delivery failures.
Workload characterization: Different applications have vastly different power signatures. Power data enables accurate energy accounting, fair resource billing, and identifying inefficient code patterns.
Emerging architectures: Heterogeneous systems (CPU+GPU, like Grace Hopper) require sophisticated power attribution. Students must understand these technologies to work with modern supercomputers.

This lesson transforms students from passive consumers of HPC resources into informed practitioners who can measure, analyze, and optimize power consumption.

Timing

Episode 0: Power Monitoring Introduction

Core lecture content: 60-90 minutes
- Energy vs Power fundamentals: 15 min (with examples)
- Power monitoring hierarchy: 15 min (system architecture discussion)
- Monitoring systems overview: 15 min (feature comparison)
- In-band vs Out-of-band: 15-20 min (detailed trade-offs)
- Power baseline concepts: 15-20 min (calibration methods)
Break: 10 minutes
Total: ~70-100 minutes

Episode 1: Power Monitoring Systems

Technical deep-dives: 120-150 minutes
- HDEEM system: 10-15 min
- Intel RAPL (Part 1 & 2): 25-30 min (architecture + MSR details)
- AMD RAPL: 10-15 min (compatibility focus)
- Fujitsu A64FX: 10-15 min (custom processor specifics)
- NVIDIA GPU (NVML): 15-20 min (API + practical usage)
- AMD GPU: 10-15 min (comparison with NVIDIA)
- NVIDIA GRACE CPU: 10-15 min (sysfs interface)
- NVIDIA GRACE HOPPER: 15-20 min (heterogeneous complexity)
- Case studies (LUMI, Karolina): 20-25 min (real data + models)
Break: 10 minutes
Total: ~130-160 minutes

Quiz Assessment

Episode 0 Quiz: 30-40 minutes (18 MCQ + 3 conceptual)
Episode 1 Quiz: 60-90 minutes (22 MCQ + 6 analysis questions)

Recommended structure: Teach Episode 0 on Day 1, Episode 1 on Day 2, Quiz assessment as homework or final session.

Hardware requirements

Minimum Setup for Demonstrations:

Linux workstation or access to HPC compute node with:
- Intel or AMD processor (2nd Gen Intel Xeon Scalable or newer, or AMD EPYC)
- MSR access available (msr kernel module loaded; Secure Boot or kernel lockdown may restrict access)
- 4+ GB RAM
- 10 GB disk space

Optional but Recommended:

NVIDIA GPU (any modern generation) with CUDA toolkit and NVML library for GPU power monitoring demos
AMD GPU (MI series) with AMD ROCm for GPU power demos
Access to actual HPC system (e.g., LUMI, Karolina, Frontier) for real-world data examples
PDU or facility-level power monitoring infrastructure (for showing out-of-band measurements)

Software Stack:

Linux kernel 5.0+ with perf tools (linux-tools package)
Python 3.7+ with NumPy, SciPy, Matplotlib
intel_rapl_msr kernel module (exposes RAPL through the powercap sysfs interface) or an equivalent RAPL reader
NVIDIA drivers + NVML library (if GPU demos intended)
amd-smi CLI tool for AMD GPU (if available)

Virtual Environment Compromise: If no physical hardware available, use recorded data and Python notebooks to demonstrate:

RAPL counter parsing and wraparound handling
Baseline model fitting with real supercomputer data (Karolina, LUMI provided in case studies)
Grace Hopper domain calculations with example power traces
Energy attribution algorithms

Important Note: Stress-testing systems during demo may require administrative approval and scheduling on production HPC systems. Plan ahead with system administrators.

Preparing exercises

Day Before (or 1 week before)

Verify RAPL access:

# Test Intel RAPL readability
cat /sys/devices/virtual/powercap/intel-rapl/intel-rapl:0/energy_uj

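# If permission denied: on recent kernels (5.10+) energy_uj is readable by root only.
# Either read it with sudo, or ask an administrator to grant read access.
sudo cat /sys/devices/virtual/powercap/intel-rapl/intel-rapl:0/energy_uj
# Adding the user to a perf_users group helps only if your site has set up
# that group with access to the counters.
sudo usermod -a -G perf_users $USER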
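
Once the counter is readable, it is worth confirming that it advances and yields a plausible power figure. The following is a minimal sketch, assuming the powercap sysfs layout used in the check above. The names `read_uj`, `energy_delta_uj`, and `sample_power_watts` are hypothetical helpers introduced for illustration, not part of the lesson code.

```python
import time
from pathlib import Path

# Assumed location of the first RAPL package zone (same path as the check above).
ZONE = Path("/sys/devices/virtual/powercap/intel-rapl/intel-rapl:0")


def read_uj(path):
    """Read an integer microjoule value from a powercap sysfs file."""
    return int(Path(path).read_text().strip())


def energy_delta_uj(before, after, max_range_uj):
    """Energy between two readings, allowing for at most one counter wrap."""
    delta = after - before
    if delta < 0:
        # The counter wrapped; the wrap point is (to within one energy unit)
        # the value reported in max_energy_range_uj.
        delta += max_range_uj
    return delta


def sample_power_watts(zone=ZONE, interval_s=1.0):
    """Average power in watts over interval_s seconds for one RAPL zone."""
    zone = Path(zone)
    max_range = read_uj(zone / "max_energy_range_uj")
    start = read_uj(zone / "energy_uj")
    t0 = time.monotonic()
    time.sleep(interval_s)
    end = read_uj(zone / "energy_uj")
    elapsed = time.monotonic() - t0
    return energy_delta_uj(start, end, max_range) / 1e6 / elapsed
```

Calling `sample_power_watts()` on an idle demo machine gives a quick sanity value to compare against the idle baseline recorded on the day of teaching. It needs the same read permission as the `cat` command.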

Verify GPU drivers (if using NVIDIA/AMD):

nvidia-smi  # Should show GPU list
amd-smi list  # For AMD GPUs

Set up Python environment:

python3 -m venv power-monitoring-env
source power-monitoring-env/bin/activate
pip install numpy scipy matplotlib pandas

# Optional NVIDIA NVML Python bindings
pip install pynvml
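
To confirm that the NVML bindings work before class, a short Python check is enough. This is a sketch assuming the `pynvml` package installed above and at least one NVIDIA GPU. The helper names `milliwatts_to_watts` and `read_gpu_power_watts` are hypothetical; the `nvml*` calls come from the bindings.

```python
def milliwatts_to_watts(milliwatts):
    """NVML reports power in milliwatts; convert to watts."""
    return milliwatts / 1000.0


def read_gpu_power_watts(index=0):
    """Return the current power draw of one NVIDIA GPU in watts."""
    import pynvml  # imported lazily so this file loads on machines without NVML

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        return milliwatts_to_watts(pynvml.nvmlDeviceGetPowerUsage(handle))
    finally:
        pynvml.nvmlShutdown()
```

On a GPU machine, `print(read_gpu_power_watts())` should report a value close to the power draw shown by `nvidia-smi`.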

Prepare demo data:
- Generate synthetic RAPL traces for wraparound demo
- Download actual baseline data from Karolina/LUMI if available
- Create sample power traces for Grace Hopper heterogeneous examples
Test all scripts in advance:
- Run any example code in episodes/ directory
- Verify sysfs paths exist and are readable
- Ensure visualizations render correctly

Day Of

System warmup: Boot systems 30 minutes early to allow thermal stabilization
Baseline establishment: Record idle power on all demo systems before teaching
Network setup: Ensure all student machines have Python environment ready
Backup plans: Have pre-recorded power traces ready in case live demos fail

Common Setup Tasks:

Create /tmp/power-demo/ directory with sample data files
Ensure demo notebooks can be run sequentially without manual intervention
Have instructor notes with expected output values for verification
Set up screen sharing or projector to show live power monitoring output

Other practical aspects

Classroom Dynamics:

This material is technical but not programming-heavy. Emphasis on understanding concepts over coding.
Expect diverse backgrounds: some students familiar with HPC architectures, others new to it.
Use real-world examples (Frontier, LUMI, Karolina) to ground abstract concepts in concrete systems.
Interactive polling: Ask students to predict power measurements before showing results (e.g., “What’s the CPU power at idle?”).

Engagement Strategies:

Demonstrate live power monitoring on the teaching system (power.draw changing in real-time)
Show Grace Hopper domain puzzle (how to separate CPU from GPU power)—students often find this intellectually satisfying
Discuss local HPC facility’s energy costs and how monitoring reduces bills
Connect to student research: “How much does your simulation cost in energy?”

Visual Aids:

Power hierarchy diagrams (component → node → rack → facility)
MSR register breakdowns for RAPL explanation
Time-series plots comparing in-band vs out-of-band measurements
Grace Hopper block diagram showing dual interfaces (HWMON + NVML)
Case study baseline graphs from LUMI/Karolina (dramatic differences between ACN and CN nodes)

Inclusivity Notes:

Some systems may not have RAPL (older CPUs, ARM-based, supercomputers with custom monitoring)—acknowledge this early
Not all GPU types supported; discuss compatibility upfront
Power measurement is often restricted on shared systems; explain why and offer virtualization alternatives
Provide pre-recorded data for students without hardware access

Assessment Strategy:

Quiz split into conceptual (Episode 0) and technical (Episode 1) components
Emphasize that Grace Hopper domain calculation is a skill, not memorization: students should be able to reason about it
For coding questions, accept pseudocode; the logic matters more than syntax
Real baseline analysis (final question) tests synthesis of multiple concepts

Interesting questions you might get

Q: “Why can’t we just use the facility PDU power and divide by node count?” A: Good instinct! But: (1) Facility overhead (cooling, power delivery) adds 30-40%, (2) Not all nodes draw equal power, (3) Power varies with workload, (4) Want real-time per-job attribution. Component-level monitoring solves these but at cost of measurement gaps. This is the trade-off hierarchy.

Q: “If RAPL is inaccurate (±5-10%), why use it at all?” A: It’s not perfectly accurate, but it’s: (1) Real-time with very low overhead, (2) Per-component granularity, (3) Accessible without infrastructure, (4) Sufficient for most optimization decisions. Out-of-band measurement is usually more accurate and complete but slower to access. Different tools for different purposes.

Q: “Can we cap power like we cap memory?” A: Excellent question! Intel RAPL has power capping (via time windows), but it’s coarse. AMD RAPL is read-only. Real power capping requires hardware frequency scaling + voltage adjustment. Grace Hopper and newer systems have finer control. This is an active research area.

Q: “Why does unmeasured power increase with workload?” A: Multiple components scale with load: (1) Memory controller busier = more interconnect power, (2) Higher CPU frequency = worse power delivery efficiency, (3) More thermal management activity, (4) Chipset/NIC load increases. Not linear because these effects compound.

Q: “What if two runs have different power but same workload?” A: Environmental factors: (1) Room temperature affects thermal leakage, (2) Voltage fluctuations on power rails, (3) Firmware/BIOS version differences, (4) Background system services, (5) CPU frequency governor variability. This is why power baselines must be empirically calibrated, not theoretical.

Q: “Can we use power for security (side-channel attacks)?” A: Yes, and it cuts both ways. Power consumption leaks information about computation, and software-readable counters such as RAPL have been used in side-channel attacks (for example PLATYPUS, 2020). This is why out-of-band monitoring is more secure on shared systems: applications cannot read the measurements, so they cannot infer other jobs’ data from them. It is also why in-band counters are often restricted to privileged users.

Q: “Which monitoring system should we use?” A: Depends on trade-offs: (1) Need real-time during job? Use in-band (RAPL/NVML). (2) Need complete accounting? Use out-of-band (HDEEM/PDU). (3) Have Fujitsu A64FX? Custom cycle-accurate monitoring. (4) Grace Hopper? Use HWMON + NVML dual approach. No one-size-fits-all answer.

Q: “Why is Grace Hopper’s GPU power calculated by subtraction?” A: Because there’s no direct GPU power sensor! Module = CPU+GPU+interconnect, Grace = CPU+SysIO+DRAM, so GPU = Module - Grace. Noisy but works. Future designs should have separate sensors. This is a systems design trade-off (cost vs precision).

Q: “Can we predict power from code?” A: Partially! Cycle-accurate simulators (gem5) can model power, but real systems have too many unknowns. Machine learning (train on labeled power traces) shows promise but needs careful validation. Hybrid approach: predict trends, use measurement to calibrate. Active research area.

Typical pitfalls

Misconception 1: “Watts = Joules”

The Problem: Students confuse power (W) with energy (J). They might say “my job uses 50 watts.”
Reality: Power is rate (watts), energy is total work (joules). Same job could be 50W for 10 minutes OR 100W for 5 minutes—both consume 30 kJ.
How to catch it: Ask “is 50 watts a lot?” without context—it’s unanswerable. Need to know duration.
Teaching tip: Always use the equation $$E = P \times \Delta t$$ and calculate numerical examples in both Joules and Wh.
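
The two scenarios from the "Reality" line can be worked through numerically. This is a small sketch; the function names `energy_joules` and `joules_to_wh` are illustrative.

```python
def energy_joules(power_watts, duration_seconds):
    """E = P * delta_t, with power in watts and time in seconds."""
    return power_watts * duration_seconds


def joules_to_wh(joules):
    """Convert joules to watt-hours (1 Wh = 3600 J)."""
    return joules / 3600.0


# The two runs from the text: same energy, different power.
for power_w, minutes in [(50, 10), (100, 5)]:
    e = energy_joules(power_w, minutes * 60)
    print(f"{power_w} W for {minutes} min = {e / 1000:.0f} kJ = {joules_to_wh(e):.2f} Wh")
```

Both lines report 30 kJ (8.33 Wh), which makes the point that a power figure without a duration says nothing about energy.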

Misconception 2: “RAPL measures total node power”

The Problem: Students trust RAPL as gospel and don’t realize unmeasured components (memory controllers, NIC, power delivery losses) add 30-50%.
Reality: RAPL covers only the Package, DRAM, and PP0/PP1 domains; it misses the system interconnect, chipset, and cooling (fans).
How to catch it: Compare RAPL sum to facility PDU power—always lower.
Teaching tip: Show the measurement gap explicitly. Use baseline calibration to quantify unmeasured power.
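
One way to make the gap concrete in class is to fit a simple model of node power against summed RAPL power. The sketch below assumes a linear relationship, which may not be the model used in the case studies, and the calibration points are made up for illustration. `fit_line` and `unmeasured_watts` are hypothetical helper names.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up calibration points for illustration only (not measured data):
# summed RAPL power and externally measured node power, both in watts.
rapl_w = [100.0, 200.0, 300.0, 400.0]
node_w = [180.0, 310.0, 440.0, 570.0]

slope, intercept = fit_line(rapl_w, node_w)


def unmeasured_watts(rapl_power_w):
    """Estimated node power that RAPL does not see at a given RAPL reading."""
    return slope * rapl_power_w + intercept - rapl_power_w


print(f"node_power ~= {slope:.2f} * rapl_power + {intercept:.1f} W")
```

With these invented points the fit gives a slope of 1.3 and an intercept of 50 W. The intercept represents load-independent unmeasured power, and a slope above 1 represents the part that grows with load. With NumPy available, `numpy.polyfit(rapl_w, node_w, 1)` performs the same fit.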

Misconception 3: “Power scales linearly with frequency”

The Problem: Students assume linear relationship: 2× frequency = 2× power.
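Reality: Dynamic power ∝ V² × f. Higher frequencies usually require higher voltage, so power grows faster than linearly with frequency, and not all components scale with the core frequency.
How to catch it: Show empirical RAPL data at different CPU frequencies. The relationship is superlinear: doubling the frequency typically more than doubles dynamic power, and the exact factor depends on the voltage at each operating point.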
Teaching tip: Discuss DVFS (Dynamic Voltage and Frequency Scaling) and why both frequency AND voltage must be considered.
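
A short calculation shows why voltage matters. This sketch applies the V² × f proportionality to two hypothetical operating points; the frequencies and voltages are invented for illustration and do not describe any specific CPU.

```python
def relative_dynamic_power(freq_ratio, voltage_ratio):
    """Dynamic power relative to a reference point, using P proportional to V^2 * f."""
    return voltage_ratio ** 2 * freq_ratio


# Hypothetical operating points: (frequency in GHz, core voltage in V).
low = (1.0, 0.80)
high = (2.0, 1.00)

ratio = relative_dynamic_power(high[0] / low[0], high[1] / low[1])
constant_v = relative_dynamic_power(2.0, 1.0)

print(f"2x frequency with the voltage increase: {ratio:.1f}x dynamic power")
print(f"2x frequency at constant voltage: {constant_v:.1f}x dynamic power")
```

Under these assumptions, doubling the frequency costs roughly three times the dynamic power, compared with exactly two times if the voltage could stay constant. Static (leakage) power is not modeled here.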

Misconception 4: “Baseline power is fixed”

The Problem: Students measure idle power once and assume it’s constant forever.
Reality: Baseline varies with temperature, voltage supply quality, firmware version, thermal management state.
How to catch it: Show baseline measurements from same system over time—they drift.
Teaching tip: Emphasize that baselines are empirically calibrated, system-specific, and should be validated periodically.

Misconception 5: “Grace Hopper GPU power can be read directly”

The Problem: Students expect a GPU power sensor, get confused when it doesn’t exist.
Reality: Module sensor exists, but GPU is computed as Module - Grace.
How to catch it: Ask “how do you measure GPU power on Grace Hopper?” If they don’t mention subtraction, redirect.
Teaching tip: Make this a puzzle. Have students derive it: given Module, Grace, what’s GPU?
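
For the answer key, the subtraction can be written out as below. This is a sketch with invented sample values. It assumes the module and Grace readings are time-aligned, and flooring negative results at zero is a design choice made here to absorb sensor noise, not something prescribed by the lesson.

```python
def gpu_power_watts(module_w, grace_w):
    """Estimate GPU power as module power minus Grace power, floored at zero."""
    return max(module_w - grace_w, 0.0)


# Hypothetical, time-aligned samples in watts (not measured data).
module_trace = [420.0, 610.0, 655.0]
grace_trace = [95.0, 130.0, 140.0]

gpu_trace = [gpu_power_watts(m, g) for m, g in zip(module_trace, grace_trace)]
print(gpu_trace)  # [325.0, 480.0, 515.0]
```

Because the result is a difference of two sensors, the noise of both ends up in the estimate, which is worth pointing out when students compare it with a directly measured GPU on other systems.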

Misconception 6: “AMD GPU and NVIDIA GPU power APIs are identical”

The Problem: Students write generic code expecting it to work on both.
Reality: Different function names (nvmlDeviceGetPowerUsage in NVML vs amdsmi_get_power_info in AMD SMI), different units (NVML reports milliwatts), different permission requirements (video/render groups for AMD).
How to catch it: Have them write pseudocode for both—forces confrontation with differences.
Teaching tip: Show side-by-side API comparison table. Emphasize vendor-specificity.

Misconception 7: “RAPL counter wraparound is rare”

The Problem: Students ignore wraparound handling, lose data on long-running measurements.
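Reality: The raw energy counter is 32 bits wide and wraps after 2^32 energy units. With a 15.3 µJ unit (2^-16 J, a common value) that is about 65.5 kJ, or roughly 11 minutes at 100 W; the exact interval depends on the platform's energy unit and the power draw. Must handle in real code.
How to catch it: Ask during assessment: “What happens if you don’t check for wraparound?” Watch them realize the bug.
Teaching tip: Demonstrate wraparound with synthetic data before the real problem. Show the fix for raw counter values: if E_delta < 0: E_delta += 2**32. When reading energy_uj from sysfs, add the zone's max_energy_range_uj instead.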
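
A synthetic trace is enough to show the bug and the fix side by side. This sketch assumes raw 32-bit counter values and at most one wrap between consecutive samples; `total_counts` is a hypothetical helper name and the readings are invented.

```python
COUNTER_MODULUS = 2 ** 32  # a raw 32-bit counter wraps back to 0 after 2**32 - 1


def total_counts(readings):
    """Sum the counter increments over a trace, correcting single wraps."""
    total = 0
    for prev, curr in zip(readings, readings[1:]):
        delta = curr - prev
        if delta < 0:  # the counter wrapped between the two samples
            delta += COUNTER_MODULUS
        total += delta
    return total


# Synthetic readings that cross the wrap point between the 2nd and 3rd sample.
readings = [4294967000, 4294967200, 104, 304]

naive = readings[-1] - readings[0]
corrected = total_counts(readings)
print(f"naive: {naive}, corrected: {corrected}")
```

The naive difference is a large negative number, while the corrected total is 600 counts. If the sampling interval is longer than the wrap period, more than one wrap can occur between samples and no correction can recover the lost energy, so the sampling rate matters as well.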

Misconception 8: “Out-of-band is always better”

The Problem: Students assume higher accuracy = better, don’t consider latency/overhead.
Reality: HDEEM is accurate but adds complexity. RAPL is faster but incomplete. No “best”—depends on use case.
How to catch it: Ask “which would you use for a real-time power cap?” or “for post-hoc analysis?”
Teaching tip: Present trade-off matrix. Some decisions require both in-band and out-of-band.

Misconception 9: “All x86 CPUs have RAPL”

The Problem: Students assume RAPL is universal; it’s not.
Reality: Only Intel Sandy Bridge and newer, and AMD Zen (Family 17h) and newer. Older systems, ARM, custom processors don’t have it.
How to catch it: Have them check their own system: does /sys/devices/virtual/powercap/intel-rapl/ exist?
Teaching tip: Show compatibility chart early. Emphasize that monitoring landscape is heterogeneous, not standardized.
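
Students can run the same availability check from Python. The sketch below assumes the standard Linux powercap location `/sys/class/powercap`; `list_powercap_zones` is a hypothetical helper name.

```python
from pathlib import Path


def list_powercap_zones(base="/sys/class/powercap"):
    """Map each powercap zone directory to its domain name; empty if unavailable."""
    base = Path(base)
    if not base.is_dir():
        return {}
    zones = {}
    for zone in sorted(base.iterdir()):
        name_file = zone / "name"
        if name_file.is_file():  # control-type entries have no name file
            zones[zone.name] = name_file.read_text().strip()
    return zones


zones = list_powercap_zones()
if zones:
    for zone, domain in zones.items():
        print(f"{zone}: {domain}")
else:
    print("No powercap zones found: RAPL is not exposed on this system.")
```

An empty result does not always mean that the hardware lacks RAPL. The kernel module may simply not be loaded, or the system may be a virtual machine that hides the counters.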