Thoughts on GPU abstractions
Triton is great, tile programming is great, but can we do better?
The things that make Triton great:
- A programming model that hits the sweet spot between high-level abstraction and low-level implementation concerns. It provides access to thread parallelism via bulk operations on arrays that users are already familiar with, and abstracts out inter-thread synchronization, the layout of arrays in memory, and async execution semantics. Programmers express the rough decomposition of the parallel problem at the tile/block level.
- The language is expressive enough to cover almost all machine learning operators and workloads, and prior to Hopper the compiler was good enough to achieve very high performance.
The things that Triton lacks:
- The programming model no longer captures the GPU memory hierarchy well, nor newer GPU features such as async memory loading, async synchronization and async tensor cores. More precisely, obtaining great performance requires finer control of warps, a level of parallelism smaller than a block, as well as use of distributed shared memory and communication across blocks in a cluster, where a cluster is a unit of parallelism larger than a block.
- The programming model is no longer expressive enough to handle the complex operators and fusions we care about. Newer architectures use sparse operations that do not map well onto tiles, and sparse-like algorithms such as top-k and sorting usually require direct access to memory. In addition, newer hardware features enable more complex fusions that are not expressible in the tile programming model as it exists in Triton, and we cannot expect the compiler to perform them automatically.
The role of a good programming language is to provide a uniform abstraction boundary that cleanly splits the responsibilities of the programmer and the compiler: the programmer is responsible for expressing logic that is made easier through the abstractions, while the compiler handles the lowering of those abstractions to the underlying primitives in a way that should be fairly deterministic and routine.
In programming languages for the CPU, where performance is not the biggest concern, the language designer can take far more liberties abstracting out hardware details, because some overhead in the lowering is acceptable. Some good abstractions that come to mind in this domain: the procedure abstraction over assembly, whose overhead is the calling convention, largely removable through function inlining. The abstraction over registers is a pure win: variables are treated uniformly as places in memory, while the compiler does a much better job assigning registers to variables by solving the mechanical register allocation problem. Treating functions as first-class citizens costs some memory and indirection in exchange for much greater expressivity and code reuse. Garbage collection is also a win for non-performance-sensitive tasks, automating the often routine placement of malloc and free at the cost of higher latencies (some garbage collectors may even outperform manually managed memory, depending on the workload, the collector and the manual allocation strategy). There are also wins in programming language design that do not affect runtime performance at all, the so-called zero-cost abstractions that trade compile time, and often code size, for the abstraction. These include type systems and generics.
Another boon for CPU programming language designers is that the hardware execution semantics and features do not change much. Suppose a hardware feature were suddenly to appear (in CPUs) that lifts performance by a significant factor but is necessarily low level and not captured by existing language features: a programmable L1 cache, say, or hardware-accelerated barriers, or a more exotic memory consistency model. Then language designers would face the difficult decision of exposing these features for the performance, or abstracting them away and relying on the compiler. For the programmable L1 cache example, I do not imagine functional programming languages exposing this detail, while lower-level languages reasonably could.
Unfortunately neither of these things is true for the GPU, because 1) performance is paramount, it is what justifies the use of a GPU in the first place, and 2) Nvidia and AMD come up with new and exotic hardware features nearly every generation, and using these features is required for maximum performance. This means that to abstract at a higher level than CUDA, language designers must contort their languages to expose each hardware feature in some way with each GPU generation. We can see this in Triton's case: its programming model sufficed for generations prior to Nvidia's Hopper, but Hopper and Blackwell introduced new hardware features that enable programming patterns like warp specialization (where each warp acts like a separate thread of execution within a block and executes different instructions), which Triton cannot express, so the lower-level Gluon programming model was introduced, exposing these features through instructions that map directly onto assembly instructions.
Programming for the GPU also has a different cost model with respect to developer time and maintenance risk. GPU kernels are often small; even the most complex kernels are at most 1-2k lines. Each kernel is encapsulated and performs some operation on its global memory inputs, independent of the inner state of other kernels. Compared to the CPU, which can perform IO and a myriad of other tasks, a GPU only computes mathematical functions over its inputs, so a library of operators covering common mathematical functions is sufficient. New kernels are needed when someone invents a new operator, which is unlikely, or when fusion is needed to perform a composition of existing operations faster, which is far more likely. All of this makes the common concerns of programming language designers (code reuse, composition, terseness) less important. The trade-off between engineering effort and performance was also very skewed until recently. On the low-engineering-effort tail sit scientists, for whom exploration and iteration time matter most: just compose existing operators in a library like PyTorch or Jax and let the compiler optimize what it can. On the high-engineering-effort tail, GPUs run a mathematical function that is used very widely, such as the recommendation algorithms behind Youtube, so we can manually specialize kernels for that function to achieve the highest performance. Recently, however, machine learning has become an industrial application, and both high performance and a large variety of operators are needed. We can see this in the open-source LLM space: there is a large variety of models using different architectures, and the size of these models requires optimization that the compilers of today cannot achieve.
There is now a need and a use case for better GPU programming models that trade off performance and engineering time more finely. As the number of operators requiring high performance grows, the need comes from both ends: the GPU programming side, where greater productivity is needed to write operators, and the compiler side, where higher performance is needed when fusing existing compositions of operators.
The atoms of GPU computation
What are the atoms of GPU computation? Before that, what do I mean by an atom?
By this word, I am questing after an idea or concept that all models of parallel GPU computation must operate on. And why do I specify GPU computation and not all models of parallel computation? Because GPU computation refers to a specific subset of parallel computation: a von Neumann model where each "context" of execution shares an instruction stream but has its own instruction pointer, and where instructions are separated from data, both needing to be fetched across a bus to the processor.
In such a model, units of execution may be grouped hierarchically; the hierarchy corresponds to their proximity in the hardware and determines the cost of various operations.
We may distinguish between two kinds of groups of execution units with different semantics: groups where each unit has its own instruction pointer, so execution between units can diverge, and groups where each unit strictly executes the same instruction and branching is handled by masking within the unit. Of course there are more variants than this, like CUDA's warp execution model where each lane in a warp gets its own instruction pointer and makes forward progress despite divergence, but very roughly the distinction speaks to the cost of thread divergence: the former has no cost while the latter imposes some.
Communication between different execution units is done through shared memory (not to be confused with __shared__ memory in CUDA): shared memory here simply means memory that is visible to all units in the communicating group. To communicate between units A and B, we must guarantee that the two units' instruction pointers have (or had) reached certain locations, and that the effects of instructions prior to those points have traversed the data path and are visible in the shared memory. The first is called synchronization and the second is called acquire-release semantics.
There are various mechanisms to achieve synchronization. One is a barrier, where units in a group wait until all units have arrived. Another is through memory, if the hardware provides guarantees: one unit arrives and modifies a piece of shared memory, and if another unit observes the memory change to a certain value, it knows that the first unit has reached a certain point. This is only possible if the hardware guarantees that the write will eventually become visible, and that the read actually reads from the shared memory and not from some cache in between. (This is always guaranteed on x86 CPUs, but not on Nvidia GPUs under regular load and store instructions, which may be served from a stale, incoherent L1 rather than going out to the L2, the GPU's point of coherence.)
Making sure the memory is visible to all units can be achieved through either a fence mechanism or an acquire-release tag mechanism. The fence mechanism, as on CPUs, ensures that caches are flushed to the shared memory and waits for that to happen (which is why fences must be placed after the synchronization point: the memory effects of units that have not yet reached it would otherwise not be flushed). The acquire-release tag mechanism, also as on CPUs, designates a piece of memory as a tag: for the releasing producer, the tag changes value only after its writes become visible, and for the acquiring reader, observing the tag's new value guarantees that the producer's writes are visible in shared memory.
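To make those two mechanisms concrete, here is a hedged CUDA sketch of one-way message passing between two cooperating blocks of the same grid. The names (data, flag, and the four functions) are illustrative, the globals are assumed to be zero-initialized before launch, and the second variant assumes a toolkit recent enough to ship libcu++'s cuda::atomic_ref.

#include <cuda/atomic>   // libcu++ scoped atomics

// Illustrative globals, assumed zero-initialized before the kernel launches.
__device__ int data;
__device__ int flag;

// Variant 1: fence mechanism. The producer makes its write visible at GPU
// scope with a fence, then publishes a flag with an atomic; atomics on global
// memory are performed at the L2, so they are not stranded in an incoherent L1.
__device__ void produce_with_fence() {
    data = 42;
    __threadfence();                        // prior writes become visible GPU-wide
    atomicExch(&flag, 1);                   // publish the flag
}
__device__ int consume_with_fence() {
    while (atomicAdd(&flag, 0) == 0) { }    // spin; the atomic read goes to the L2, not a stale cache
    __threadfence();                        // order the flag read before the data read
    return data;                            // observes 42
}

// Variant 2: acquire-release tag mechanism, using libcu++ scoped atomics.
__device__ void produce_with_release() {
    cuda::atomic_ref<int, cuda::thread_scope_device> tag(flag);
    data = 42;
    tag.store(1, cuda::memory_order_release);              // writes before this travel with the tag
}
__device__ int consume_with_acquire() {
    cuda::atomic_ref<int, cuda::thread_scope_device> tag(flag);
    while (tag.load(cuda::memory_order_acquire) != 1) { }  // once the tag is observed...
    return data;                                           // ...the producer's write is visible
}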
Oftentimes these two concepts, synchronization and memory consistency, are combined. For example, CUDA's __syncthreads() ensures both that the instruction pointers of all threads in the block have reached that location and that all prior writes to __shared__ memory are visible before continuing. In effect, it is a combination of a barrier and a fence.
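As a small illustrative sketch of that combination, one thread in a block stages a value in __shared__ memory, and the rest of the block can only read it safely after the __syncthreads(); the kernel and its arguments are hypothetical.

__global__ void broadcast_within_block(const int* in, int* out) {
    __shared__ int staged;                 // visible to every thread in the block
    if (threadIdx.x == 0) {
        staged = in[blockIdx.x];           // a single producer thread writes
    }
    __syncthreads();                       // barrier + fence: everyone has arrived,
                                           // and the write to staged is now visible
    out[blockIdx.x * blockDim.x + threadIdx.x] = staged;
}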
Finally, for performant parallel computation it makes sense to introduce a unit hierarchy where synchronization and communication costs scale with distance in the hierarchy.
The physical hierarchy is manifested in the hardware architecture, where two execution units may be placed in the same core, in different cores, or in different devices, with different data paths connecting them.
Meanwhile, a conceptual hierarchy exists to organize synchronization instructions according to their scope: logical groupings of execution units that correspond to the physical hierarchy. For example, Nvidia's newest-generation GPUs are physically grouped into 2 micro-GPUs, chiplets each containing a local slice of off-chip DRAM; the chiplets communicate through a stitch, and there is a penalty for accessing memory off the micro-GPU because the stitch bandwidth is lower than the chiplet-to-local-DRAM bandwidth. Each chiplet contains some number of streaming multiprocessors (SMs), individual cores each containing its own L1 cache, register file, ALU, FP and tensor core units. The SMs in each micro-GPU share a large L2 cache, but have no global hardware-accelerated barrier mechanism (for now). Within each SM there are a number of warp execution units that issue instructions, with their own instruction cache. The warp execution units contain their own register file and tensor memory (on Blackwell), while __shared__ memory resides in the L1 cache shared between the warp execution units. Nvidia's logical groups are as follows: sys, gpu, cluster, cta, which correspond to the memory hierarchy and the communication costs within it. The cta level corresponds to the L1 cache, cluster corresponds to the L2 cache, gpu corresponds to the L2 or DRAM, and sys corresponds to communication over NVLink or PCIe, between GPUs or with the CPU.
This hierarchy has implications for synchronization and acquire-release semantics. For one, synchronization between units in a narrower scope should be cheaper than in a wider one, while the hardware may or may not provide barriers at every scope. For another, each acquire-release pair must specify the scope at which the memory must be consistent: under cta scope it is the L1 cache, under cluster scope the L2 cache.
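This scoping is visible directly in the programming interface. As a rough sketch (assuming a recent CUDA toolkit whose libcu++ provides cuda::thread_scope_cluster; the counter argument is illustrative), the logical scopes are exactly what you name when picking a scoped atomic, and they map onto the PTX scope qualifiers .cta, .cluster, .gpu and .sys:

#include <cuda/atomic>

// Each atomic_ref names the scope at which the operation, and any
// acquire/release ordering attached to it, must be made consistent.
__device__ void scoped_increments(int* counter) {
    cuda::atomic_ref<int, cuda::thread_scope_block>   within_cta(*counter);     // PTX .cta,     roughly the L1 / __shared__ level
    cuda::atomic_ref<int, cuda::thread_scope_cluster> within_cluster(*counter); // PTX .cluster
    cuda::atomic_ref<int, cuda::thread_scope_device>  within_gpu(*counter);     // PTX .gpu,     roughly the L2 level
    cuda::atomic_ref<int, cuda::thread_scope_system>  across_system(*counter);  // PTX .sys,     NVLink / PCIe peers

    within_cta.fetch_add(1, cuda::memory_order_release);     // cheapest: consistency only within the block
    across_system.fetch_add(1, cuda::memory_order_release);  // most expensive: consistency with other GPUs and the CPU
}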
So there we have it, the atoms of GPU computation are:
- A memory and execution hierarchy that groups units of execution into logical scopes
- Synchronization operations between units at a logical scope through hardware barriers
- Synchronization operations between units at a logical scope through acquire-release memory transactions
- Communication operations between units in the same or different logical scopes - this is combined with the two synchronization mechanisms to achieve consistency.
I think that no matter what fancy hardware features come up, these atoms will always be present. The goal of a programming language is to find abstractions over these atoms that make expressing parallel algorithms easier. Abstractions may come and go, but the atoms will always be present. I think a good way of thinking about domain-specific languages in the GPU space, and programming languages in general, is this: if you are on the application side, welcome every tool and know that no tool is good for everything; if you are on the tools and frameworks side, your tool is the best and you should try to be better than everyone else. Triton and similar tile programming models simplify matters by flattening the GPU hierarchy to blocks only, providing a simplified view of intra-cta programming through a curated set of collective operations on tiles.
The abundance and poverty of abstractions
In the CPU language landscape there is an abundance of abstractions proposed by programming language (PL) PhDs. In that fertile land, where cycles are cheap and memory is abundant, we can reap higher-order functions, monads, algebraic effect handlers, software transactional memory, inheritance, classes, abstract interfaces and dynamic dispatch.
In GPU programming we have compile-time abstractions for organizing parallelism and automating some aspects of synchronization, but nowhere near the wealth of abstractions of general-purpose programming languages. This is not helped by the fact that hardware vendors like Nvidia keep introducing new damn features every couple of years.
The motivation behind the CPU PL abstractions (the ones I listed) is code reuse, developer productivity and correctness. They optimize for a regime where developer time is the most valuable resource. The new GPU features, the way I see it, aim to address the growing disparity between the capabilities of various pipelines: the classic compute-memory gap, where DRAM accesses are three orders of magnitude slower than compute; the bandwidth-latency gap, where GPU HBM bandwidth grows but the latency of access remains stagnant; and the mismatch between compute elements on the core itself, i.e. the tensor core pipeline in Blackwell having much higher throughput than the math pipelines. (TODO: provide references, benchmarks)
These features aim to provide a way to apply another atom of GPU computation, overlapping, to hide the resource mismatch. Classic techniques overlap computation with memory transfers to saturate both pipelines, but on newer GPUs it becomes necessary to apply overlapping within the CTA and across compute pipelines, as done in Flash Attention 4, which overlaps the tensor core with the floating point and transcendental units. Newer GPU features expose async semantics, multiple streams of execution, synchronization, and more hierarchies and primitives for cooperation, all for the purpose of overlap and latency hiding.
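To make overlap concrete, here is a hedged double-buffering sketch in CUDA (not FA4's scheme): it uses the cp.async primitives __pipeline_memcpy_async / __pipeline_commit / __pipeline_wait_prior (available on sm_80 and newer) so that the copy of the next tile into shared memory proceeds underneath the compute on the current tile. The kernel name, tile size and the toy computation are all illustrative, and it assumes a launch of a single block of TILE threads.

#include <cuda_pipeline_primitives.h>

constexpr int TILE = 128;   // illustrative tile size; launch with one block of TILE threads

__global__ void overlapped_tiles(const float* in, float* out, int ntiles) {
    __shared__ float buf[2][TILE];
    int t = threadIdx.x;

    // Stage tile 0 into buffer 0 before entering the loop.
    __pipeline_memcpy_async(&buf[0][t], &in[t], sizeof(float));
    __pipeline_commit();

    float acc = 0.f;
    for (int i = 0; i < ntiles; ++i) {
        int cur = i & 1, nxt = (i + 1) & 1;

        // Kick off the copy of the next tile before touching the current one,
        // so the memory pipeline runs underneath the compute below.
        if (i + 1 < ntiles) {
            __pipeline_memcpy_async(&buf[nxt][t], &in[(i + 1) * TILE + t], sizeof(float));
            __pipeline_commit();
        }

        // Wait only for the copy feeding this iteration (the next copy may
        // still be in flight), then make it visible to the whole block.
        __pipeline_wait_prior(i + 1 < ntiles ? 1 : 0);
        __syncthreads();

        acc += buf[cur][t] * buf[cur][(t + 1) % TILE];   // stand-in for real compute

        __syncthreads();   // everyone is done reading buf[cur] before it is overwritten
    }
    out[t] = acc;
}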
It's important to note that GPU overlapping techniques are only applicable to the class of workloads that can be statically scheduled in software: you must be able to describe a computation graph of what happens, because otherwise you would not know what comes next in order to overlap it. Meanwhile, a host of CPU techniques such as speculative execution and branch prediction were created for dynamic workloads where software scheduling is not possible or efficient. The GPU eliminates these hardware features in favor of software scheduling, at the cost of program complexity that is not handled well by current language design (namely the incumbent CUDA and CUDA-like APIs).
The GPU abstractions we have focus on collective abstractions over threads and blocks, where operations describe a set of threads or blocks operating as if single-threaded. For example, tiles collectively describe the threads in a block operating in parallel over data, with single-threaded semantics for the collective operations within the block. Tensor Comprehensions do this for blocks in a grid. What these techniques miss, leaving performance on the table, is better scheduling within their collective scope: within the block at the warp level for tiles, and within the grid at the block level for Tensor Comprehensions.
In response to the need for finer scheduling caused by the mismatch of throughput and latency within the SM, the compute core of the GPU, newer works like Tilelang and Tilus expose collective operations at the warp level. This works for dense compute workloads and intra-CTA scheduling, as the smallest unit of independent execution on a GPU is the warp. Lanes within a warp either execute in lockstep, as on AMD, or have their own instruction pointers and can thus execute concurrently but incur a penalty for divergence, as on Nvidia GPUs since Volta. What these works give up by modeling each warp as having a single instruction pointer, when in reality there are 32 (since Volta), is the ability to express thread-level operations where control flow is data-dependent. For example:
// on a thread
while (x < radius) {
    x = x * x + C;
}
Under the SIMT execution model, the lanes of each warp diverge and iterate an unknown number of times before reconverging somewhere later. I believe there are SASS instructions that denote when a warp is executing in a lockstep state or a diverged state, and the PTX-to-SASS compiler ptxas will insert these divergence and reconvergence instructions. Anyhow, this pattern cannot be directly expressed in the warp-tile model, but it can be simulated via a boolean mask:
// on a warp tile
// mask: a 32-bit mask (one bit per lane) or a 32-byte mask, initially all active
while (reduce_any(mask)) {
    x1 = x * x + C;
    x = where(mask, x1, x);
    mask = mask & (x < radius);   // retire lanes that have escaped
}
This may come at a performance cost due to the extra masking and synchronization across the warp's threads.
What I sense is a need that is not being addressed: the growing complexity that results from the growing GPU hierarchy and the breaking down of the original CUDA programming model. The original CUDA programming model treats threads and blocks as the basic units of parallelism, but increasingly new levels of parallelism are being introduced, such as thread-block clusters and multi-GPU execution. Threads are hidden behind the dynamic warp scheduler and blocks behind the dynamic block scheduler, but as pipelines become more imbalanced, these abstractions no longer hold water for the highest possible performance.
The complexity lies in handling the multiple layers of communication and synchronization at the lane, warp, block, cluster and GPU levels, across registers, shared memory (L1), distributed shared memory (L2), global and remote peer memory, as well as scheduling work to overlap as much as possible, which changes with the workload context. For example, there is a trade-off between latency and throughput in every workload that changes kernel design: low-latency situations require changes in instructions, scheduling and fusion to avoid costs that can be amortized in high-throughput situations, such as kernel launch overhead.
Truly, there is an abundance of abstractions in the single-threaded CPU case for structuring code in more maintainable, more readable, more correct and more convenient ways. There seems to be a poverty of abstractions for multithreaded, concurrent and performance-first use cases.
Where to go from here?
Concurrent programming is hard, but things like mutexes, condition variables, channels and software transactional memory, to name a few that come to mind, help a lot, at a runtime cost...
GPU abstractions have focused on collective operations that treat the units of parallelism as a single lockstep instruction stream, side-stepping the problems of concurrent semantics altogether, but this approach has its own performance costs and puts responsibility on an optimizing compiler. For maximum performance we need concurrent semantics that model both the execution and the consistency of memory across the hierarchy of caches, and this is hard. But I believe it is better than the alternative solution to this problem: letting an LLM generate all our code.
Arguably, the task of writing GPU kernels seems like the perfect case for LLM code synthesis. The goal is to generate an implementation that is faster than a baseline. The task is self-contained, and verification is extremely easy compared to other programming domains: just an accuracy check against the baseline.
This seems like the kind of task that fits reinforcement learning: the code, development tools and profilers are the environment, and the reward can be measured as a function of correctness and execution time, with a large reward for correctness and incremental rewards for speedup over a baseline, or relative to a maximum theoretical performance bound. So the technologists at various tech companies are already eagerly using their LLMs to automate away this work. If you search "ai kernel generation", you'll find numerous papers wielding LLMs like a sledgehammer on this problem, because these people have ingrained in their beliefs the "bitter lesson" as a fundamental law of the universe.
Here's a thought: suppose we hadn't invented the newfangled programming languages with features we take for granted today; suppose we didn't have garbage collection, type inference, inheritance, modules, or, god forbid, even functions. Suppose we just wrote assembly or some very primitive version of C. Do you think LLMs trained on that data would start generating procedures with calling conventions, types, classes, modules, etc.? And do you think the LLM would be nearly as effective as it is today at writing Python or JavaScript?
Can LLMs abstract? Can they start generating data that is syntactically and semantically different from all the material they have been trained on? From the premise it seems that this runs counter to the point of maximum likelihood estimation, i.e. the foundation of the frequentist machine learning of which LLMs are a part.
LLMs as assistants that generate boilerplate seem like fair use, but coding agents that generate everything end to end? That seems like the wrong software abstraction (for high-performance engineering, at least).