AMDBufferResource
struct AMDBufferResource
128-bit descriptor for a buffer resource on AMD GPUs.
Used for buffer_load/buffer_store instructions.
Fields
- desc (SIMD[DType.uint32, 4]): The 128-bit buffer descriptor encoded as four 32-bit values.
Implemented traits
AnyType,
Copyable,
ImplicitlyCopyable,
ImplicitlyDestructible,
Movable,
RegisterPassable,
TrivialRegisterPassable
Methods
__init__
__init__[dtype: DType](gds_ptr: UnsafePointer[Scalar[dtype], address_space=gds_ptr.address_space], num_records: Int = Int[UInt32](UInt32.MAX)) -> Self
Constructs an AMD buffer resource descriptor.
Parameters:
- dtype (DType): Data type of the buffer elements.
Args:
- gds_ptr (UnsafePointer[Scalar[dtype], address_space=gds_ptr.address_space]): Pointer to the buffer in global memory.
- num_records (Int): Number of records in the buffer.
__init__() -> Self
Constructs a zeroed AMD buffer resource descriptor.
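As a minimal sketch of constructing a descriptor and reading back its base address (the import paths and the kernel context are assumptions, not part of this reference):

```mojo
# Sketch only: the module path for AMDBufferResource is assumed here.
from gpu.intrinsics import AMDBufferResource
from memory import UnsafePointer

fn example_kernel(global_ptr: UnsafePointer[Float32]):
    # Wrap 1024 elements of global memory in a 128-bit buffer descriptor.
    var rsrc = AMDBufferResource(global_ptr, num_records=1024)
    # The base address encoded in the descriptor can be read back.
    var base: Int = rsrc.get_base_ptr()
```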
get_base_ptr
get_base_ptr(self) -> Int
Gets the base pointer address from the buffer resource descriptor.
Returns:
Int: The base pointer address as an integer.
load
load[dtype: DType, width: Int, *, cache_policy: CacheOperation = CacheOperation.ALWAYS](self, vector_offset: Int32, *, scalar_offset: Int32 = Int32(0)) -> SIMD[dtype, width]
Loads data from the buffer using AMD buffer load intrinsic.
Parameters:
- dtype (DType): Data type to load.
- width (Int): Number of elements to load.
- cache_policy (CacheOperation): Cache operation policy.
Args:
- vector_offset (Int32): Offset in elements from the base pointer.
- scalar_offset (Int32): Additional scalar offset in elements.
Returns:
SIMD[dtype, width]: SIMD vector containing the loaded data.
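A hedged usage sketch, assuming `rsrc` is an already-constructed AMDBufferResource and `tid` is the thread's linear index (both names are illustrative):

```mojo
# Each thread loads 4 contiguous float32 values starting at its own offset.
# Offsets past num_records are bounded by the buffer descriptor itself,
# which is a common reason to prefer buffer loads over raw pointer loads.
var vec = rsrc.load[DType.float32, 4](vector_offset=Int32(tid * 4))
```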
load_to_lds
load_to_lds[dtype: DType, *, width: Int = 1, cache_policy: CacheOperation = CacheOperation.ALWAYS, async_copies: Bool = False](self, vector_offset: Int32, shared_ptr: UnsafePointer[Scalar[dtype], address_space=AddressSpace.SHARED], *, scalar_offset: Int32 = Int32(0))
Loads data from global memory and stores to shared memory.
Copies from global memory to shared memory (also known as LDS), bypassing intermediate registers.
Uses the .ptr. form (descriptor as ptr addrspace(8)) over the legacy <4 x i32> form. Both lower to the same MUBUF buffer_load_*_lds instruction on gfx9x/CDNA, but the .ptr. form exposes the descriptor as a typed pointer so ScopedNoAliasAA / SIInsertWaitcnts can reason about it. The legacy form produced a 0.76-abs MLA decode regression at output[0,0,0,0] when used from attention DMAs; the .ptr. form is MLA-safe.
Parameters:
- dtype (DType): The dtype of the data to be loaded.
- width (Int): The SIMD vector width.
- cache_policy (CacheOperation): Cache operation policy controlling cache behavior at all levels.
- async_copies (Bool): If True, attach the amdgpu.AsyncCopies alias scope to the load. This unlocks ScopedNoAliasAA-driven optimizations (LICM, CSE, reordering) and, on AMDGPU, the vmcnt relaxation in SIInsertWaitcnts (LLVM PR #74537): a later ds_read tagged noalias against the same scope can skip s_waitcnt vmcnt(0). Only set True when (a) every LDS read of this data carries the matching noalias tag, AND (b) the kernel maintains an explicit runtime fence (s_waitcnt vmcnt(0) + s_barrier); scheduling hints like s_sched_group_barrier do NOT qualify. Defaults to False (safe for all callers). Future extension: the backend can bucket up to 8 distinct scopes independently (LDSDMAStores slots), so more scope variants could be added here if a kernel wants multiple independent DMA streams.
Args:
- vector_offset (Int32): Vector memory offset in elements (per thread).
- shared_ptr (UnsafePointer[Scalar[dtype], address_space=AddressSpace.SHARED]): Shared memory address.
- scalar_offset (Int32): Scalar memory offset in elements (shared across wave).
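A sketch of a direct global-to-LDS tile copy under the default (safe) `async_copies=False` path. The names `rsrc` (a descriptor over the source buffer) and `lds_tile` (a shared-memory pointer) are assumptions, as is the `gpu` import path for `barrier` and `thread_idx`:

```mojo
from gpu import barrier, thread_idx  # import path assumed

# Each thread DMAs 4 float32 elements straight into shared memory,
# never staging them through registers.
rsrc.load_to_lds[DType.float32, width=4](
    vector_offset=Int32(thread_idx.x * 4),
    shared_ptr=lds_tile,
)
# The copy is still a memory operation in flight: fence before any
# thread reads the tile back out of shared memory.
barrier()
```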
store
store[dtype: DType, width: Int, *, cache_policy: CacheOperation = CacheOperation.ALWAYS](self, vector_offset: Int32, val: SIMD[dtype, width], *, scalar_offset: Int32 = Int32(0))
Stores a register variable to global memory with cache operation control.
Writes to global memory from a register with high-level cache control.
Note:
- Only supported on AMD GPUs.
- Provides high-level cache control via CacheOperation enum values.
- Maps directly to llvm.amdgcn.raw.buffer.store intrinsics.
- Cache control bits:
  - SC[1:0] controls coherency scope: 0=wave, 1=group, 2=device, 3=system.
  - nt=True: Use streaming-optimized cache policies (recommended for streaming data).
Parameters:
- dtype (DType): The data type.
- width (Int): The SIMD vector width.
- cache_policy (CacheOperation): Cache operation policy controlling cache behavior at all levels.
Args:
- vector_offset (Int32): Vector memory offset in elements (per thread).
- val (SIMD[dtype, width]): Value to write.
- scalar_offset (Int32): Scalar memory offset in elements (shared across wave).
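A hedged sketch of writing a per-thread vector back with an explicit cache policy. `rsrc`, `vec`, and `thread_idx` are assumed from surrounding kernel context, and the `CacheOperation` import path and `STREAMING` member are assumptions worth checking against the gpu.memory reference:

```mojo
from gpu.memory import CacheOperation  # import path assumed

# Write 4 float32 values per thread with a streaming policy, hinting
# that the stored data will not be re-read soon (nt-style behavior).
rsrc.store[DType.float32, 4, cache_policy=CacheOperation.STREAMING](
    vector_offset=Int32(thread_idx.x * 4),
    val=vec,
)
```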