ld_matrix
```mojo
ld_matrix[dtype: DType, //, simd_width: Int, *, transpose: Bool = False](ptr: UnsafePointer[Scalar[dtype], address_space=ptr.address_space]) -> SIMD[dtype, simd_width]
```
Loads a matrix from shared memory into registers in a format suitable for tensor core operations.
This function performs a warp-synchronized load from shared memory to registers, formatting the data to be directly usable by tensor core Matrix Multiply-Accumulate (MMA) instructions.
Note:
- All threads in a warp must execute this operation together.
- For transposed loads, only half precision (float16) is supported.
- The register width is fixed at 4 bytes (32 bits).
- Supported configurations:
- x1: One 32-bit register per thread.
- x2: Two 32-bit registers per thread.
- x4: Four 32-bit registers per thread.
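The x1/x2/x4 configurations follow directly from the fixed 4-byte register width: the number of registers per thread is `simd_width * sizeof(dtype) / 4`. A minimal sketch of the arithmetic, reusing the import paths from the example below (the buffer size here is illustrative):

```mojo
from std.gpu.compute.mma import ld_matrix
from std.memory import UnsafePointer, alloc

# Registers per thread = simd_width * sizeof(dtype) / 4 bytes.
# float16 is 2 bytes per element, so:
#   simd_width = 2 -> 1 register  (x1)
#   simd_width = 4 -> 2 registers (x2)
#   simd_width = 8 -> 4 registers (x4)
var ptr = alloc[Scalar[DType.float16]](64)  # illustrative backing buffer
var x2_frag = ld_matrix[simd_width=4](ptr)  # x2: two 32-bit registers per thread
ptr.free()
```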
Example:

```mojo
from std.gpu.compute.mma import ld_matrix
from std.memory import UnsafePointer, alloc

# Allocate an 8x8 matrix of float16 values (64 elements).
var ptr = alloc[Scalar[DType.float16]](64)

# Load the matrix into registers (8 elements per thread).
var data = ld_matrix[simd_width=8](ptr)

# Load the same matrix transposed (half precision only).
var transposed = ld_matrix[simd_width=8, transpose=True](ptr)

ptr.free()
```
Parameters:
- **dtype** (`DType`): The data type of the matrix elements (e.g. float16, float32).
- **simd_width** (`Int`): The width of the SIMD vector to load.
- **transpose** (`Bool`): Whether to transpose the matrix during the load (only supported for half precision).
Args:
- **ptr** (`UnsafePointer[Scalar[dtype], address_space=ptr.address_space]`): Pointer to shared memory containing the source matrix data.
Returns:

`SIMD[dtype, simd_width]`: SIMD vector containing the loaded matrix data, properly formatted for MMA operations.
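The returned fragment is shaped for direct consumption by a tensor core MMA operation. A hedged sketch of that handoff, assuming an `mma` accumulate function lives alongside `ld_matrix` in the same module (the `mma` name, its signature, and the `a_ptr`/`b_ptr` shared-memory pointers are illustrative assumptions, not confirmed by this page):

```mojo
from std.gpu.compute.mma import ld_matrix, mma  # `mma` assumed; see lead-in
from std.memory import UnsafePointer, alloc

# Illustrative shared-memory tiles for the A and B operands.
var a_ptr = alloc[Scalar[DType.float16]](64)
var b_ptr = alloc[Scalar[DType.float16]](64)

# Load operand fragments, then accumulate into a float32 fragment.
var a_frag = ld_matrix[simd_width=4](a_ptr)
var b_frag = ld_matrix[simd_width=4, transpose=True](b_ptr)
var c_frag = SIMD[DType.float32, 4](0)
mma(c_frag, a_frag, b_frag, c_frag)  # d = a * b + c

a_ptr.free()
b_ptr.free()
```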