IMPORTANT: To view this page as Markdown, append `.md` to the URL (e.g. /docs/manual/basics.md). For the complete Mojo documentation index, see llms.txt.
Version: Nightly

load_b_tr

load_b_tr[mma_shape: IndexList[3], swizzle: Optional[Swizzle] = Optional()](tile: LayoutTensor[address_space=AddressSpace.SHARED, element_layout=tile.element_layout, layout_int_type=tile.layout_int_type, linear_idx_type=tile.linear_idx_type, masked=tile.masked, alignment=tile.alignment]) -> SIMD[tile.dtype, 8]

Loads the b operand tile for AMD tensor core MFMA instructions using transposed memory access.

This function supports double-rate MFMA shapes (32x32x16, 16x16x32) with bfloat16 input. The input tile (shape = (mma_shape[2], mma_shape[1])) is split along the K dimension into two halves of shape (MMA_K//2, MMA_N). Each half is loaded using _load_tr16_b64_warp, which performs a transposed (column-major) load from shared memory. The resulting two 4-element SIMD vectors are concatenated into a single SIMD[tile.dtype, 8] vector.
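The split-and-concatenate behavior described above can be modeled with a small NumPy sketch. This is not the Mojo implementation: the per-lane element distribution of `_load_tr16_b64_warp` is simplified to a hypothetical `load_tr_4` helper that reads 4 consecutive K elements from one column, and float32 stands in for bfloat16 (NumPy has no bfloat16).

```python
import numpy as np

# Shapes follow mma_shape = (32, 32, 16): MMA_N = 32, MMA_K = 16.
MMA_N, MMA_K = 32, 16

# The b tile in shared memory has shape (MMA_K, MMA_N).
tile = np.arange(MMA_K * MMA_N, dtype=np.float32).reshape(MMA_K, MMA_N)

# Split along the K dimension into two halves of shape (MMA_K // 2, MMA_N).
lower, upper = tile[: MMA_K // 2, :], tile[MMA_K // 2 :, :]

def load_tr_4(half: np.ndarray, col: int, row: int) -> np.ndarray:
    """Hypothetical stand-in for one lane's transposed (column-major)
    4-element load: read 4 consecutive K elements from one column."""
    return half[row : row + 4, col]

# One lane reading column 0 from both halves, then concatenating the
# two 4-element vectors into the single 8-element result.
result = np.concatenate([load_tr_4(lower, 0, 0), load_tr_4(upper, 0, 0)])
assert result.shape == (8,)
```

Each half is loaded column-major, so a lane's 4 elements are contiguous along K within its column; the final 8-wide vector is simply the lower-half load followed by the upper-half load.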

Parameters:

  • mma_shape (IndexList[3]): The MMA instruction tile shape (only 32x32x16 or 16x16x32 supported).
  • swizzle (Optional[Swizzle]): Optional swizzle pattern for bank-conflict-free LDS access.

Args:

  • tile (LayoutTensor): The b operand tile in shared memory, with shape (mma_shape[2], mma_shape[1]).
Returns:

SIMD[tile.dtype, 8]: Concatenated transposed SIMD loads from both halves of the tile.