Version: 1.0

st_matrix

st_matrix[dtype: DType, //, simd_width: Int, *, transpose: Bool = False](ptr: UnsafePointer[Scalar[dtype], address_space=AddressSpace.SHARED], d: SIMD[DType.float32, simd_width])

Performs a warp-synchronized copy from registers to shared memory.

This function stores data from registers to shared memory in a format that can be directly used by tensor core Matrix Multiply-Accumulate (MMA) instructions. It uses the NVIDIA stmatrix instruction to perform an efficient warp-synchronized store.

Note: This is a warp-synchronized operation: all threads in the warp must execute this instruction, or the warp will deadlock.

Constraints:

  • Must be used with shared memory pointers.
  • Number of registers must be 1, 2, or 4.
  • Data must be properly aligned for matrix operations.
  • All threads in warp must participate.
  • Only supported on NVIDIA GPUs with tensor core capabilities.

Parameters:

  • dtype (DType): Data type of elements to store.
  • simd_width (Int): Width of the SIMD vector.
  • transpose (Bool): If True, transposes the matrix during store.

Args:

  • ptr (UnsafePointer[Scalar[dtype], address_space=AddressSpace.SHARED]): Pointer to the destination location in shared memory.
  • d (SIMD[DType.float32, simd_width]): SIMD vector of register data to store.
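
As a rough illustration of how this might be called from a kernel, the sketch below stores a 4-register fragment from each lane of a warp into shared memory. The import path, the surrounding function scaffolding, and the tile size are assumptions for illustration only, not verified against the Mojo standard library:

```mojo
# Hypothetical sketch (import path and scaffolding are assumptions).
from gpu.memory import st_matrix  # assumed module location
from memory import UnsafePointer
from gpu.memory import AddressSpace


fn store_mma_fragment(
    smem: UnsafePointer[
        Scalar[DType.bfloat16], address_space = AddressSpace.SHARED
    ],
    frag: SIMD[DType.float32, 4],
):
    # Every lane in the warp must reach this call with its own
    # 4-register fragment; the stmatrix instruction then performs
    # one cooperative, warp-wide store into shared memory.
    st_matrix[simd_width=4](smem, frag)
```

Because the operation is warp-synchronized, this call must not sit behind divergent control flow (e.g. an `if` taken by only some lanes); all 32 threads of the warp have to participate.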