IMPORTANT: To view this page as Markdown, append `.md` to the URL (e.g. /docs/manual/basics.md). For the complete Mojo documentation index, see llms.txt.
Version: Nightly

AMDBufferResource

struct AMDBufferResource

128-bit descriptor for a buffer resource on AMD GPUs.

Used for buffer_load/buffer_store instructions.

Fields

  • desc (SIMD[DType.uint32, 4]): The 128-bit buffer descriptor encoded as four 32-bit values.

Implemented traits

AnyType, Copyable, ImplicitlyCopyable, ImplicitlyDestructible, Movable, RegisterPassable, TrivialRegisterPassable

Methods

__init__

__init__[dtype: DType](gds_ptr: UnsafePointer[Scalar[dtype], address_space=gds_ptr.address_space], num_records: Int = Int(UInt32.MAX)) -> Self

Constructs an AMD buffer resource descriptor.

Parameters:

  • dtype (DType): Data type of the buffer elements.

Args:

  • gds_ptr (UnsafePointer[Scalar[dtype]]): Pointer to the buffer's base in global memory.
  • num_records (Int): Number of records in the buffer, used for hardware bounds checking. Defaults to UInt32.MAX.

__init__() -> Self

Constructs a zeroed AMD buffer resource descriptor.

get_base_ptr

get_base_ptr(self) -> Int

Gets the base pointer address from the buffer resource descriptor.

Returns:

Int: The base pointer address as an integer.

load

load[dtype: DType, width: Int, *, cache_policy: CacheOperation = CacheOperation.ALWAYS](self, vector_offset: Int32, *, scalar_offset: Int32 = Int32(0)) -> SIMD[dtype, width]

Loads data from the buffer using AMD buffer load intrinsic.

Parameters:

  • dtype (DType): Data type to load.
  • width (Int): Number of elements to load.
  • cache_policy (CacheOperation): Cache operation policy.

Args:

  • vector_offset (Int32): Offset in elements from the base pointer.
  • scalar_offset (Int32): Additional scalar offset in elements.

Returns:

SIMD[dtype, width]: SIMD vector containing the loaded data.
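A minimal usage sketch, assuming `ptr` is an `UnsafePointer[Scalar[DType.float32]]` into a global buffer of at least 1024 elements and `tid` is each thread's element offset (both hypothetical names):

```mojo
# Build a descriptor over the first 1024 elements of the buffer.
# Out-of-bounds lanes are bounded by the hardware's num_records check.
var resource = AMDBufferResource(ptr, num_records=1024)

# Each thread loads 4 contiguous float32 values starting at its offset,
# with the default cache policy (CacheOperation.ALWAYS).
var vals = resource.load[DType.float32, 4](vector_offset=tid)
```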

load_to_lds

load_to_lds[dtype: DType, *, width: Int = 1, cache_policy: CacheOperation = CacheOperation.ALWAYS, async_copies: Bool = False](self, vector_offset: Int32, shared_ptr: UnsafePointer[Scalar[dtype], address_space=AddressSpace.SHARED], *, scalar_offset: Int32 = Int32(0))

Loads data from global memory and stores to shared memory.

Copies data from global memory directly into shared memory (LDS), bypassing the register file.

Uses the .ptr. form (descriptor as ptr addrspace(8)) over the legacy <4 x i32> form. Both lower to the same MUBUF buffer_load_*_lds instruction on gfx9x/CDNA, but the .ptr. form exposes the descriptor as a typed pointer so ScopedNoAliasAA / SIInsertWaitcnts can reason about it. The legacy form produced a 0.76-abs MLA decode regression at output[0,0,0,0] when used from attention DMAs — the .ptr. form is MLA-safe.

Parameters:

  • dtype (DType): The dtype of the data to be loaded.
  • width (Int): The SIMD vector width.
  • cache_policy (CacheOperation): Cache operation policy controlling cache behavior at all levels.
  • async_copies (Bool): If True, attach the amdgpu.AsyncCopies alias scope to the load — unlocks ScopedNoAliasAA-driven optimizations (LICM, CSE, reordering) and, on AMDGPU, the vmcnt relaxation in SIInsertWaitcnts (LLVM PR #74537): a later ds_read tagged noalias against the same scope can skip s_waitcnt vmcnt(0). Only set True when (a) every LDS read of this data carries the matching noalias tag, AND (b) the kernel maintains an explicit runtime fence (s_waitcnt vmcnt(0) + s_barrier) — scheduling hints like s_sched_group_barrier do NOT qualify. Defaults to False (safe for all callers). Future extension: the backend can bucket up to 8 distinct scopes independently (LDSDMAStores slots), so more scope variants could be added here if a kernel wants multiple independent DMA streams.

Args:

  • vector_offset (Int32): Offset in elements from the base pointer (per thread).
  • shared_ptr (UnsafePointer[Scalar[dtype], address_space=AddressSpace.SHARED]): Destination pointer in shared memory (LDS).
  • scalar_offset (Int32): Additional scalar offset in elements.

store

store[dtype: DType, width: Int, *, cache_policy: CacheOperation = CacheOperation.ALWAYS](self, vector_offset: Int32, val: SIMD[dtype, width], *, scalar_offset: Int32 = Int32(0))

Stores a register value to global memory with cache operation control.

Writes to global memory from a register with high-level cache control.

Note:

  • Only supported on AMD GPUs.
  • Provides high-level cache control via CacheOperation enum values.
  • Maps directly to llvm.amdgcn.raw.buffer.store intrinsics.
  • Cache control bits:
      • SC[1:0] controls coherency scope: 0=wave, 1=group, 2=device, 3=system.
      • nt=True: Use streaming-optimized cache policies (recommended for streaming data).

Parameters:

  • dtype (DType): The data type.
  • width (Int): The SIMD vector width.
  • cache_policy (CacheOperation): Cache operation policy controlling cache behavior at all levels.

Args:

  • vector_offset (Int32): Vector memory offset in elements (per thread).
  • val (SIMD[dtype, width]): Value to write.
  • scalar_offset (Int32): Scalar memory offset in elements (shared across wave).
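
A short sketch of the store path, assuming `out_ptr` points at a global output buffer of `n` elements and `tid` is each thread's element offset (hypothetical names):

```mojo
var resource = AMDBufferResource(out_ptr, num_records=n)

# Broadcast a 4-wide float32 vector and write it back with the default
# cache policy (CacheOperation.ALWAYS).
var v = SIMD[DType.float32, 4](1.0)
resource.store[DType.float32, 4](vector_offset=tid, val=v)
```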