TensorFloat-32


TensorFloat-32 or TF32 is a numeric floating-point format designed for the Tensor Cores found in certain Nvidia GPUs.

Format

The binary format is:

  • 1 sign bit
  • 8 exponent bits
  • 10 fraction bits (also called mantissa, or precision bits)

The total of 19 bits fits within a double word (32 bits). While the format has less precision than a standard 32-bit IEEE 754 floating-point number, it allows much faster computation, up to 8 times the speed of a V100.[1]
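The bit layout above can be illustrated with a minimal sketch that reduces an ordinary binary32 value to TF32 precision. TF32 keeps the binary32 sign and exponent fields and the top 10 of the 23 fraction bits; this example simply truncates the low 13 fraction bits (actual hardware conversion may round rather than truncate, so this is an illustration of the format, not a model of the GPU):

```python
import struct

def to_tf32(x: float) -> float:
    """Reduce a binary32 value to TF32 precision (illustrative sketch).

    TF32 retains the sign bit, the 8 exponent bits, and the top 10
    fraction bits of binary32. Here the low 13 fraction bits are
    simply zeroed; real hardware may round to nearest instead.
    """
    # Reinterpret the float as its 32-bit binary32 pattern
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Clear the low 13 of the 23 fraction bits
    bits &= 0xFFFFE000
    # Reinterpret the truncated pattern as a float again
    return struct.unpack("<f", struct.pack("<I", bits))[0]
```

For example, `to_tf32(1.0 + 2**-10)` is exact because the fraction fits in 10 bits, whereas `to_tf32(1.0 + 2**-23)` loses the low-order bit and truncates back to `1.0`.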

References

  1. ^ "NVIDIA TF32", https://deeprec.readthedocs.io/en/latest/NVIDIA-TF32.html. Retrieved 23 May 2024.
