IMPORTANT: To view this page as Markdown, append `.md` to the URL (e.g. /docs/manual/basics.md). For the complete Mojo documentation index, see llms.txt.
Version: Nightly

AMDBufferResource

struct AMDBufferResource

128-bit descriptor for a buffer resource on AMD GPUs.

Used for buffer_load/buffer_store instructions.

Fields

  • desc (SIMD[DType.uint32, 4]): The 128-bit buffer descriptor encoded as four 32-bit values.

Implemented traits

AnyType, Copyable, ImplicitlyCopyable, ImplicitlyDestructible, Movable, RegisterPassable, TrivialRegisterPassable

Methods

__init__

__init__[dtype: DType](gds_ptr: UnsafePointer[Scalar[dtype], address_space=gds_ptr.address_space], num_records: Int = Int(UInt32.MAX)) -> Self

Constructs an AMD buffer resource descriptor.

Parameters:

  • dtype (DType): Data type of the buffer elements.

Args:

  • gds_ptr (UnsafePointer[Scalar[dtype]]): Pointer to the buffer's base in global memory.
  • num_records (Int): Number of records in the buffer, used for hardware bounds checking. Defaults to UInt32.MAX.

__init__() -> Self

Constructs a zeroed AMD buffer resource descriptor.

get_base_ptr

get_base_ptr(self) -> Int

Gets the base pointer address from the buffer resource descriptor.

Returns:

Int: The base pointer address as an integer.

load

load[dtype: DType, width: Int, *, cache_policy: CacheOperation = CacheOperation.ALWAYS](self, vector_offset: Int32, *, scalar_offset: Int32 = Int32(0)) -> SIMD[dtype, width]

Loads data from the buffer using AMD buffer load intrinsic.

Parameters:

  • dtype (DType): Data type to load.
  • width (Int): Number of elements to load.
  • cache_policy (CacheOperation): Cache operation policy.

Args:

  • vector_offset (Int32): Offset in elements from the base pointer.
  • scalar_offset (Int32): Additional scalar offset in elements.

Returns:

SIMD[dtype, width]: SIMD vector containing the loaded data.
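A minimal usage sketch, assuming `ptr` is an `UnsafePointer[Scalar[DType.float32]]` into a global buffer of at least 1024 elements and `tid` is each thread's element offset (both hypothetical names):

```mojo
# Build a descriptor over the first 1024 elements of the buffer.
# Out-of-bounds lanes are bounded by the hardware's num_records check.
var resource = AMDBufferResource(ptr, num_records=1024)

# Each thread loads 4 contiguous float32 values starting at its offset,
# with the default cache policy (CacheOperation.ALWAYS).
var vals = resource.load[DType.float32, 4](vector_offset=tid)
```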

load_to_lds

load_to_lds[dtype: DType, *, width: Int = 1, cache_policy: CacheOperation = CacheOperation.ALWAYS, async_copies: Bool = False](self, vector_offset: Int32, shared_ptr: UnsafePointer[Scalar[dtype], address_space=AddressSpace.SHARED], *, scalar_offset: Int32 = Int32(0))

Loads data from global memory and stores to shared memory.

Copies data from global memory directly into shared memory (LDS), bypassing the register file.

Uses the .ptr. form (descriptor as ptr addrspace(8)) over the legacy <4 x i32> form. Both lower to the same MUBUF buffer_load_*_lds instruction on gfx9x/CDNA, but the .ptr. form exposes the descriptor as a typed pointer so ScopedNoAliasAA / SIInsertWaitcnts can reason about it. The legacy form produced a 0.76-abs MLA decode regression at output[0,0,0,0] when used from attention DMAs — the .ptr. form is MLA-safe.

Parameters:

  • dtype (DType): The dtype of the data to be loaded.
  • width (Int): The SIMD vector width.
  • cache_policy (CacheOperation): Cache operation policy controlling cache behavior at all levels.
  • async_copies (Bool): If True, attach the amdgpu.AsyncCopies alias scope to the load — unlocks ScopedNoAliasAA-driven optimizations (LICM, CSE, reordering) and, on AMDGPU, the vmcnt relaxation in SIInsertWaitcnts (LLVM PR #74537): a later ds_read tagged noalias against the same scope can skip s_waitcnt vmcnt(0). Only set True when (a) every LDS read of this data carries the matching noalias tag, AND (b) the kernel maintains an explicit runtime fence (s_waitcnt vmcnt(0) + s_barrier) — scheduling hints like s_sched_group_barrier do NOT qualify. Defaults to False (safe for all callers). Future extension: the backend can bucket up to 8 distinct scopes independently (LDSDMAStores slots), so more scope variants could be added here if a kernel wants multiple independent DMA streams.

Args:

  • vector_offset (Int32): Offset in elements from the base pointer (per thread).
  • shared_ptr (UnsafePointer[Scalar[dtype], address_space=AddressSpace.SHARED]): Destination pointer in shared memory (LDS).
  • scalar_offset (Int32): Additional scalar offset in elements.

store

store[dtype: DType, width: Int, *, cache_policy: CacheOperation = CacheOperation.ALWAYS](self, vector_offset: Int32, val: SIMD[dtype, width], *, scalar_offset: Int32 = Int32(0))

Stores a register value to global memory with cache operation control.

Writes to global memory from a register with high-level cache control.

Note:

  • Only supported on AMD GPUs.
  • Provides high-level cache control via CacheOperation enum values.
  • Maps directly to llvm.amdgcn.raw.buffer.store intrinsics.
  • Cache control bits:
      • SC[1:0] controls coherency scope: 0=wave, 1=group, 2=device, 3=system.
      • nt=True: Use streaming-optimized cache policies (recommended for streaming data).

Parameters:

  • dtype (DType): The data type.
  • width (Int): The SIMD vector width.
  • cache_policy (CacheOperation): Cache operation policy controlling cache behavior at all levels.

Args:

  • vector_offset (Int32): Vector memory offset in elements (per thread).
  • val (SIMD[dtype, width]): Value to write.
  • scalar_offset (Int32): Scalar memory offset in elements (shared across wave).
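
A short sketch of the store path, assuming `out_ptr` points at a global output buffer of `n` elements and `tid` is each thread's element offset (hypothetical names):

```mojo
var resource = AMDBufferResource(out_ptr, num_records=n)

# Broadcast a 4-wide float32 vector and write it back with the default
# cache policy (CacheOperation.ALWAYS).
var v = SIMD[DType.float32, 4](1.0)
resource.store[DType.float32, 4](vector_offset=tid, val=v)
```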