Reference for Learners
Course Materials
All course materials are organized in the content/ directory:
Episodes: Markdown episodes with hands-on examples
Images: Diagrams and visualizations in episodes/images/
Quiz: Example test questions in episodes/quiz/
Common Issues & Solutions
RAPL MSR Registers Not Readable (Intel/AMD)
Issue: Permission denied when reading /sys/devices/virtual/powercap/intel-rapl/
Solution: Ensure you have appropriate permissions (run with sudo or add the user to the perf_users group). On some systems, disable Secure Boot or enable Model-Specific Register (MSR) access.
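Before changing permissions, it can help to see which counters are actually unreadable. The following is a minimal sketch, assuming the usual powercap layout in which each domain is a subdirectory named intel-rapl:N containing an energy_uj file; that layout and the helper names check_rapl_access and read_energy_uj are illustrative assumptions, not part of the course materials.

```python
import glob
import os

# Base directory from the issue above. The per-domain layout
# (intel-rapl:N/energy_uj) is an assumption and may differ on your system.
RAPL_BASE = "/sys/devices/virtual/powercap/intel-rapl"


def check_rapl_access(base=RAPL_BASE):
    """Map each energy counter file under base to True/False (readable or not)."""
    pattern = os.path.join(base, "intel-rapl:*", "energy_uj")
    return {path: os.access(path, os.R_OK) for path in sorted(glob.glob(pattern))}


def read_energy_uj(path):
    """Return the cumulative energy in microjoules, or None if access is denied."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except PermissionError:
        return None
```

An empty result from check_rapl_access() suggests the powercap driver is not exposing any domains at that path, which is a different problem from a permissions error.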
NVIDIA NVML Library Not Found
Issue: ImportError: libcuda.so.1 not found, or NVML initialization fails
Solution: Install NVIDIA GPU drivers and CUDA toolkit. Verify with nvidia-smi. Ensure LD_LIBRARY_PATH includes the CUDA library path.
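The two checks in the solution can be scripted. This is a rough sketch, not a full diagnostic: the helper names find_library_dirs and nvidia_smi_available are hypothetical, and the library file name you pass in (for example libcuda.so.1 from the error message) is up to you.

```python
import os
import shutil


def find_library_dirs(lib_name, ld_library_path=None):
    """List the LD_LIBRARY_PATH entries that contain a file named lib_name."""
    if ld_library_path is None:
        ld_library_path = os.environ.get("LD_LIBRARY_PATH", "")
    # Skip empty entries produced by leading, trailing, or doubled separators.
    dirs = [d for d in ld_library_path.split(os.pathsep) if d]
    return [d for d in dirs if os.path.exists(os.path.join(d, lib_name))]


def nvidia_smi_available():
    """True if the nvidia-smi executable can be found on PATH."""
    return shutil.which("nvidia-smi") is not None
```

An empty list from find_library_dirs() does not prove the library is missing, because the dynamic loader also searches system directories that are not listed in LD_LIBRARY_PATH.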
GPU Power Domain Missing (Grace Hopper)
Issue: Cannot differentiate CPU and GPU power consumption on Grace Hopper module
Solution: Use a dual-interface approach: read Grace CPU power via HWMON (/sys/class/hwmon/*/device/power1_average) and GPU via NVML GetTotalEnergyConsumption(), then compute GPU = Module - Grace.
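The subtraction can be sketched as below. This assumes, as the formula GPU = Module - Grace implies, that the NVML energy counter reflects the whole module on this platform. It also assumes the usual units: millijoules for the NVML total-energy counter and microwatts for hwmon power1_average. The function names are illustrative, and the readings are passed in as plain numbers so the arithmetic can be checked without hardware.

```python
def average_power_w(energy_start_mj, energy_end_mj, interval_s):
    """Average power in watts from two cumulative energy readings in millijoules."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return (energy_end_mj - energy_start_mj) / 1000.0 / interval_s


def gpu_power_w(module_power_w, grace_power_uw):
    """GPU = Module - Grace, with the hwmon reading converted from microwatts."""
    grace_power_w = grace_power_uw / 1_000_000.0
    return module_power_w - grace_power_w
```

For example, a module that consumed 600,000 mJ over one second averages 600 W. If hwmon reports 150,000,000 µW for the Grace CPU over the same window, the GPU share is 450 W. Both readings should cover the same time window for the difference to be meaningful.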
RAPL Energy Counter Wraparound
Issue: Energy counter decreases instead of increasing (appears to go backward)
Solution: Implement wraparound handling for 32-bit counters. If E_new < E_old, add 2^32 to get true delta:
E_delta = (E_new - E_old + 2**32) % 2**32
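As a small illustration, the formula above can be wrapped in a helper. The name energy_delta and the modulus parameter are hypothetical; the default follows the 32-bit counter width stated above.

```python
COUNTER_MODULUS = 2**32  # 32-bit counter, as stated above


def energy_delta(e_old, e_new, modulus=COUNTER_MODULUS):
    """Difference between two raw counter readings, tolerating one wraparound."""
    return (e_new - e_old + modulus) % modulus
```

With no wraparound, energy_delta(100, 250) gives 150. With a wraparound, energy_delta(2**32 - 10, 5) gives 15. The correction only works if the counter wraps at most once between samples, so the sampling interval has to be shorter than the time the counter takes to overflow.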
Power Baseline Inconsistencies Across Node Types
Issue: Measured baseline differs significantly between identical hardware
Solution: Baseline is system-specific; account for firmware versions, BIOS settings, power delivery efficiency, and thermal management. Use node-type-specific models (e.g., separate baselines for CPU-only vs GPU-accelerated nodes).
Inaccurate Power Measurements Under Variable Load
Issue: Power readings don’t match facility-level measurements or seem to miss significant energy
Solution: Remember that in-band RAPL/NVML captures only partial node power. Unmeasured components (motherboard, interconnects, NIC) can add 30-50% to measured power. Use hierarchical monitoring: component-level + node PDU + facility PDU for complete accounting.
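To see how large the gap is on a given node, compare the sum of the component readings with the node PDU reading taken over the same interval. The sketch below is only the arithmetic; the function name unmeasured_power and the component labels are hypothetical.

```python
def unmeasured_power(node_pdu_w, component_w):
    """Return (gap in watts, gap as a fraction of the measured component power).

    node_pdu_w  -- node-level power from the PDU, in watts
    component_w -- dict of in-band component readings (e.g. RAPL, NVML), in watts
    """
    measured = sum(component_w.values())
    if measured <= 0:
        raise ValueError("component readings must sum to a positive value")
    gap = node_pdu_w - measured
    return gap, gap / measured
```

For instance, unmeasured_power(650.0, {"cpu": 200.0, "gpu": 300.0}) reports a 150 W gap, which is 30% on top of the 500 W measured in-band.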
AMD GPU Power Access Denied
Issue: Permission denied when reading AMD GPU power with amd-smi metric --power
Solution: The user must be a member of the video and render groups. Run: sudo usermod -a -G video,render $USER, then restart your shell session (log out and back in) so the new group membership takes effect.
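A quick way to confirm the group membership from the running process is sketched below, using only the Python standard library on Linux. The helper names are illustrative, not part of any AMD tooling.

```python
import grp
import os

REQUIRED_GROUPS = ("video", "render")  # groups named in the solution above


def current_group_names():
    """Names of the groups the current process belongs to."""
    names = set()
    for gid in os.getgroups():
        try:
            names.add(grp.getgrgid(gid).gr_name)
        except KeyError:
            # A GID with no entry in the group database; skip it.
            pass
    return names


def missing_groups(required=REQUIRED_GROUPS, member_of=None):
    """Return the required groups that the process is not a member of."""
    if member_of is None:
        member_of = current_group_names()
    return [g for g in required if g not in member_of]
```

If missing_groups() still lists a group after running usermod, the current session was most likely started before the change and has not picked up the new membership.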
External Resources
Documentation
Intel RAPL Documentation - MSR register reference for Intel power measurement
AMD RAPL Specification - AMD Running Average Power Limit implementation
NVIDIA NVML API Reference - NVIDIA Management Library for GPU power monitoring
AMD SMI Documentation - AMD System Management Interface for GPU power
Linux hwmon Sysfs Interface - Standard power monitoring interface (Grace CPU)
HDEEM Power Monitoring - Open-source high-resolution power measurement for HPC
Episode 0: Power Monitoring Introduction - Fundamental concepts of energy, power hierarchies, and measurement approaches
Episode 1: Power Monitoring Systems - Technical deep-dive into RAPL, NVML, HDEEM, and heterogeneous systems (Grace Hopper)
Training Materials
Getting Help
Consult the instructor-guide.md for teaching notes
Review solution notebooks in the solution/ folder
Check error messages carefully - they often indicate the exact problem
Search Python documentation and community forums for specific issues
Course Feedback
Your feedback helps improve this course. Please share:
What topics were most helpful?
What could be improved?
What additional topics would you like to see?
Last Updated: May 2026
License: CC BY-SA 4.0