CUDA Fast Math

As noted in Fastmath, some classes of floating-point applications do not require strict IEEE 754 conformance, and for this subset of applications relaxing conformance can yield performance speedups.

The CUDA target implements Fastmath behavior with two differences from the behavior described there.

  • First, the fastmath argument to the @cuda.jit decorator is limited to the values True and False; the CUDA target does not accept a set of individual fastmath flags. When True, the following optimizations are enabled:

    • Flushing of denormals to zero.

    • Use of a fast approximation to the square root function.

    • Use of a fast approximation to the division operation.

    • Contraction of multiply and add operations into single fused multiply-add operations.

    See the documentation for nvvmCompileProgram for more details of these optimizations.

  • Second, calls to a subset of math module functions on float32 operands are implemented using fast approximate implementations from the libdevice library.