IMPORTANT: To view this page as Markdown, append `.md` to the URL (e.g. /docs/manual/basics.md). For the complete Mojo documentation index, see llms.txt.
Version: Nightly

mma_nvidia

NVIDIA Tensor Core implementation of matrix multiply-accumulate (MMA) operations.

This module provides MMA implementations for NVIDIA GPUs with Tensor Cores, covering architectures from SM70 (Volta) through SM90 (Hopper).

Supported operations:

  • FP16 accumulation (SM70+)
  • FP32 accumulation with FP16/BF16 inputs (SM80+)
  • TF32 operations (SM80+)
  • FP8 operations (SM89+)
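
The minimum-architecture requirements listed above can be read as a simple lookup. The following is a minimal Python sketch of that mapping, not part of this module's API: the names `MIN_SM`, `is_supported`, and `supported_ops`, the operation keys, and the convention of passing the SM version as an integer (for example `80` for SM80) are all assumptions made for illustration.

```python
# Hypothetical lookup table: the keys are illustrative labels, not identifiers
# from mma_nvidia. The minimum SM versions are copied from the list above.
MIN_SM = {
    "fp16_accum": 70,            # FP16 accumulation (SM70+)
    "fp32_accum_fp16_bf16": 80,  # FP32 accumulation with FP16/BF16 inputs (SM80+)
    "tf32": 80,                  # TF32 operations (SM80+)
    "fp8": 89,                   # FP8 operations (SM89+)
}


def is_supported(op, sm):
    """Return True if `op` meets its minimum SM requirement on architecture `sm`."""
    return sm >= MIN_SM[op]


def supported_ops(sm):
    """Return the operation labels available on architecture `sm`, in table order."""
    return [op for op, min_sm in MIN_SM.items() if sm >= min_sm]
```

For example, under these assumptions `supported_ops(80)` yields the FP16-accumulation, FP32-accumulation, and TF32 entries but not FP8, which first appears at SM89.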

Reference: NVIDIA Parallel Thread Execution (PTX) ISA documentation, https://docs.nvidia.com/cuda/parallel-thread-execution/