FloatGuard: Efficient Whole-Program Detection of Floating-Point Exceptions in AMD GPUs

Published in The 34th ACM International Symposium on High-Performance Parallel and Distributed Computing, 2025

Porting scientific applications across different GPU architectures introduces floating-point arithmetic variations that can affect reproducibility, making efficient detection and mitigation of exceptions like NaNs and infinities crucial. While NVIDIA has dominated the GPU market, AMD GPUs are increasingly used in HPC systems as well, yet existing floating-point exception detection frameworks focus on NVIDIA, leaving a gap for AMD GPUs. We present FloatGuard, the first framework for efficiently detecting floating-point exceptions in HIP programs running on AMD GPUs. FloatGuard leverages AMD GPU hardware registers to detect floating-point exceptions, overcoming the limitations of AMD’s built-in trapping mechanisms through a novel algorithm that combines assembly- and source-level instrumentation with debugger-guided execution. We evaluate FloatGuard on 565 HIP programs, detecting floating-point exceptions in 507 cases with a slowdown ratio that increases at most linearly with the number of exceptions discovered. Furthermore, we analyze the impact of compiler optimizations on exceptions trapped, and compare FloatGuard with the state-of-the-art tool for detecting floating-point exceptions in CUDA programs, which further reveals key differences between AMD and NVIDIA’s floating-point exception behaviors.
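For readers unfamiliar with the exceptional values mentioned in the abstract, the following is a minimal illustrative sketch of how an infinity and a NaN can arise silently in ordinary arithmetic and be found afterwards by scanning results. It is not FloatGuard's method, which works through hardware registers, instrumentation, and a debugger on AMD GPUs; this is a plain CPU-side value check, and the names `classify` and `scan` are hypothetical.

```python
import math


def classify(value: float) -> str:
    """Label a floating-point result as 'nan', 'inf', or 'ok'."""
    if math.isnan(value):
        return "nan"
    if math.isinf(value):
        return "inf"
    return "ok"


def scan(values):
    """Return (index, label) for every exceptional value in the sequence."""
    labeled = ((i, classify(v)) for i, v in enumerate(values))
    return [(i, label) for i, label in labeled if label != "ok"]


big = 1e308
overflowed = big * 10.0               # exceeds the double range, becomes inf
undefined = overflowed - overflowed   # inf - inf has no defined value, becomes nan

results = [1.5, overflowed, undefined, 0.0]
print(scan(results))  # [(1, 'inf'), (2, 'nan')]
```

Neither operation raises an error on its own, which is why such values can propagate unnoticed through a long computation unless something checks for them.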

See the code on GitHub: github.com/LLNL/FloatGuard

Recommended citation: Dolores Miao, Ignacio Laguna, and Cindy Rubio-González. 2025. FloatGuard: Efficient Whole-Program Detection of Floating-Point Exceptions in AMD GPUs. In Proceedings of the 34th ACM International Symposium on High-Performance Parallel and Distributed Computing.