Reference for Learners

Course Materials

All course materials are organized in the content/ directory:

  • Episodes: Markdown episodes with hands-on examples

  • Images: Diagrams and visualizations in episodes/images/

  • Quiz: Example test questions in episodes/quiz/

Common Issues & Solutions

RAPL MSR Registers Not Readable (Intel/AMD)

  • Issue: Permission denied when reading /sys/devices/virtual/powercap/intel-rapl/

  • Solution: Ensure you have appropriate permissions (run with sudo, or add your user to the perf_users group). On some systems you may also need to disable Secure Boot or enable Model-Specific Register (MSR) access.
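
A quick way to confirm whether the permission fix worked is to try reading the counter directly. The sketch below is only an illustration: the domain path intel-rapl:0 and the helper name read_energy_uj are assumptions, and your system may expose different or additional domains.

```python
from pathlib import Path

# Assumed location of the first RAPL package domain; adjust for your system.
RAPL_DOMAIN = Path("/sys/devices/virtual/powercap/intel-rapl/intel-rapl:0")


def read_energy_uj(domain=RAPL_DOMAIN):
    """Return the cumulative energy counter in microjoules, or None if unreadable."""
    counter = Path(domain) / "energy_uj"
    try:
        return int(counter.read_text().strip())
    except PermissionError:
        # The file exists but the current user is not allowed to read it.
        print(f"Permission denied reading {counter}")
    except FileNotFoundError:
        # The powercap interface or this domain is not present on this machine.
        print(f"{counter} not found")
    return None
```

Calling read_energy_uj() returns an integer once access is granted, and None (with a short message) otherwise.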

NVIDIA NVML Library Not Found

  • Issue: ImportError: libcuda.so.1 not found or NVML initialization fails

  • Solution: Install the NVIDIA GPU driver (which provides the NVML library, libnvidia-ml.so) and, if needed, the CUDA toolkit. Verify with nvidia-smi. Ensure LD_LIBRARY_PATH includes the directory containing the NVIDIA libraries.
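
Before reinstalling anything, it can help to check what the system can actually locate. The following is a minimal diagnostic sketch using only the Python standard library; the function name nvml_diagnostics is hypothetical and not part of any NVIDIA package.

```python
import ctypes.util
import shutil


def nvml_diagnostics():
    """Report whether the NVML shared library and nvidia-smi can be located."""
    return {
        # Library name if the system's library lookup finds libnvidia-ml, else None.
        "libnvidia_ml": ctypes.util.find_library("nvidia-ml"),
        # Full path of nvidia-smi if it is on PATH, else None.
        "nvidia_smi": shutil.which("nvidia-smi"),
    }
```

If either value is None, the driver is probably missing or its libraries are not on the search path.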

GPU Power Domain Missing (Grace Hopper)

  • Issue: Cannot differentiate CPU and GPU power consumption on Grace Hopper module

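  • Solution: Use a dual-interface approach: read Grace CPU power via HWMON (/sys/class/hwmon/*/device/power1_average) and the NVML reading from nvmlDeviceGetTotalEnergyConsumption(), which this approach treats as module-level, then compute GPU = Module - Grace. HWMON reports power in microwatts and NVML reports cumulative energy in millijoules, so convert both to the same quantity and units before subtracting.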
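
The subtraction step can be sketched as follows. This assumes you have already obtained a module-level power value in watts and a Grace CPU reading in microwatts (the unit used by hwmon power*_average files); the function name and the sample numbers are hypothetical.

```python
def gpu_power_watts(module_power_w, grace_power_uw):
    """Estimate GPU power as module power minus Grace CPU power, in watts."""
    # hwmon power*_average values are reported in microwatts.
    grace_power_w = grace_power_uw / 1_000_000
    # Clamp at zero: readings taken at slightly different times can
    # otherwise produce a small negative difference.
    return max(module_power_w - grace_power_w, 0.0)


# Illustrative values only: 500 W module, 120 W Grace CPU.
print(gpu_power_watts(500.0, 120_000_000))
```

Sampling both interfaces as close together in time as possible keeps the difference meaningful.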

RAPL Energy Counter Wraparound

  • Issue: Energy counter decreases instead of increasing (appears to go backward)

  • Solution: Implement wraparound handling for 32-bit counters. If E_new < E_old, add 2^32 to get true delta: E_delta = (E_new - E_old + 2**32) % 2**32
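
The formula above translates directly into a small helper. This is a sketch that assumes at most one wraparound occurs between two readings, so the counter must be sampled often enough; the function name energy_delta is hypothetical. The modulus is a parameter because the counter range can differ from 2**32 on some interfaces (the powercap sysfs interface reports its range in max_energy_range_uj).

```python
def energy_delta(e_old, e_new, modulus=2**32):
    """Return the energy consumed between two readings, allowing one wraparound."""
    # Python's % always returns a non-negative result for a positive modulus,
    # so this equals (e_new - e_old + modulus) % modulus.
    return (e_new - e_old) % modulus


print(energy_delta(100, 250))            # no wraparound
print(energy_delta(4_294_967_000, 200))  # counter wrapped once
```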

Power Baseline Inconsistencies Across Node Types

  • Issue: Measured baseline power differs significantly between nodes, even with nominally identical hardware

  • Solution: Baseline is system-specific; account for firmware versions, BIOS settings, power delivery efficiency, and thermal management. Use node-type-specific models (e.g., separate baselines for CPU-only vs GPU-accelerated nodes).

Inaccurate Power Measurements Under Variable Load

  • Issue: Power readings don’t match facility-level measurements or seem to miss significant energy

  • Solution: Remember that in-band RAPL/NVML measurements capture only part of node power. Unmeasured components (motherboard, interconnects, NIC) can add 30-50% to measured power. Use hierarchical monitoring: component-level + node PDU + facility PDU for complete accounting.
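
To see how large the gap is on a given node, compare a node-level PDU reading with the sum of the in-band component readings. The sketch below uses made-up numbers and a hypothetical function name; it only illustrates the arithmetic.

```python
def unmeasured_power(node_pdu_w, component_w):
    """Return (unmeasured watts, unmeasured power as a fraction of measured power)."""
    measured = sum(component_w.values())
    gap = node_pdu_w - measured
    return gap, gap / measured


# Illustrative values only: 700 W at the node PDU, 500 W seen by RAPL + NVML.
gap_w, fraction = unmeasured_power(700.0, {"cpu": 200.0, "gpu": 300.0})
print(f"Unmeasured: {gap_w:.0f} W ({fraction:.0%} of measured power)")
```

With these sample values the unmeasured share is 40% of the measured power, which falls inside the 30-50% range mentioned above.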

AMD GPU Power Access Denied

  • Issue: Permission denied when reading AMD GPU power with amd-smi metric --power

  • Solution: The user must be a member of the video and render groups. Run: sudo usermod -a -G video,render $USER, then log out and back in so the new group membership takes effect.
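
To check whether the current session already has the required groups, something like the following sketch can be used. It relies on the standard grp and os modules (Unix only); the function name missing_groups is hypothetical.

```python
import grp
import os


def missing_groups(required=("video", "render")):
    """Return the required groups that the current process does not belong to."""
    current = set()
    for gid in set(os.getgroups()) | {os.getegid()}:
        try:
            current.add(grp.getgrgid(gid).gr_name)
        except KeyError:
            # Some environments (e.g. containers) have gids without a name entry.
            pass
    return [name for name in required if name not in current]
```

An empty list means the session has both groups. If a group is still listed after running usermod, the session was probably started before the change.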

External Resources

Documentation

Training Materials

Getting Help

  • Consult the instructor-guide.md for teaching notes

  • Review solution notebooks in solution/ folder

  • Check error messages carefully; they often indicate the exact problem

  • Search Python documentation and community forums for specific issues

Course Feedback

Your feedback helps improve this course. Please share:

  • What topics were most helpful?

  • What could be improved?

  • What additional topics would you like to see?


Last Updated: May 2026

License: CC BY-SA 4.0