Some thoughts on the newest NVIDIA GPU features
While trying to write a GPU compiler, I had to familiarize myself with the newest features on Hopper and Blackwell, despite not owning one of those GPUs (getting one would cost a kidney); what I learned did not spark joy.
I would categorize these features into 1) communication/synchronization, 2) tensor, and 3) misc. They are rather obscure, and since I couldn't actually program with them for lack of hardware, I only aimed for a high-level understanding. So here goes a bird's-eye summary:
Communication
TMA - Tensor Memory Accelerator
This introduces hardware specifically for moving contiguous strided memory segments in and out of global memory asynchronously, interfacing with shared memory directly and not needing registers to calculate indices. This makes sense from a deep learning perspective, as most memory copies are of this form: contiguous chunks of tensors. What's not so nice, however, is the synchronization involved. Async loads and async stores go through different synchronization pathways (what Nvidia calls proxies), which are conceptually independent streams of execution and require different primitives to synchronize with the "main" stream of execution (the "default proxy").
Also of note is that only a single thread is required to launch a transfer (of a large block-sized tile) by specifying the subranges of memory through a tensor descriptor.
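To make this concrete, here is a minimal sketch of a TMA tile load, loosely following the libcu++ example in the CUDA programming guide. Treat the details as assumptions: the functions live in the experimental cuda::device::experimental namespace, the tile shape is made up, and the CUtensorMap must be built on the host (via cuTensorMapEncodeTiled) before launch, which is not shown.

```cpp
#include <cuda.h>          // CUtensorMap
#include <cuda/barrier>    // cuda::barrier and the experimental bulk-copy helpers
using barrier = cuda::barrier<cuda::thread_scope_block>;
namespace cde = cuda::device::experimental;

// Hedged sketch (sm_90+): one thread issues a TMA 2D tile load described by a
// tensor map; completion is signaled through an mbarrier-backed cuda::barrier.
__global__ void tma_load_tile(const __grid_constant__ CUtensorMap tensor_map,
                              int tile_x, int tile_y) {
  __shared__ alignas(128) float tile[64][64];   // TMA wants 128-byte aligned smem
#pragma nv_diag_suppress static_var_with_dynamic_init
  __shared__ barrier bar;

  if (threadIdx.x == 0) {
    init(&bar, blockDim.x);               // expected arrivals = whole block
    cde::fence_proxy_async_shared_cta();  // make the barrier visible to the async proxy
  }
  __syncthreads();

  barrier::arrival_token token;
  if (threadIdx.x == 0) {
    // a single thread launches the whole block-sized transfer
    cde::cp_async_bulk_tensor_2d_global_to_shared(&tile, &tensor_map,
                                                  tile_x, tile_y, bar);
    // arrive and tell the barrier how many bytes the TMA unit will deposit
    token = cuda::device::barrier_arrive_tx(bar, 1, sizeof(tile));
  } else {
    token = bar.arrive();
  }
  bar.wait(std::move(token));             // tile is now safe to read
  // ... compute on tile ...
}
```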
mbarriers
An arrive/wait barrier initialized in shared memory that synchronizes at the block/cluster level. Threads call arrive(&bar), do some work, then call wait(&bar), which unblocks once the expected number of threads (set when the barrier is initialized) have called arrive(&bar). From a compiler perspective this is a good direction, as it can be used flexibly to synchronize any subset of threads, and perhaps to implement some async abstractions on the GPU. What's annoying is that the synchronization only affects some proxies but not others, and some async instructions can be waited on but not others.
The prior alternative to mbarriers was to synchronize using atomics on shared memory, which I assume is slower.
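As a rough illustration of the arrive/wait split, here is a sketch using the libcu++ cuda::barrier wrapper, which lowers to mbarrier.* PTX on recent architectures; the kernel itself is a made-up toy that just reverses a tile.

```cpp
#include <cuda/barrier>
using barrier = cuda::barrier<cuda::thread_scope_block>;

// Hedged sketch: each thread arrives once its tile element is staged, can do
// unrelated work, and only waits when it actually needs the full tile.
// The expected arrival count is fixed at initialization. Launch with 256 threads.
__global__ void arrive_wait_demo(const float* in, float* out) {
  __shared__ float tile[256];
#pragma nv_diag_suppress static_var_with_dynamic_init
  __shared__ barrier bar;
  if (threadIdx.x == 0) init(&bar, blockDim.x);
  __syncthreads();

  int i = blockIdx.x * blockDim.x + threadIdx.x;
  tile[threadIdx.x] = in[i];                    // produce my element
  barrier::arrival_token token = bar.arrive();  // signal: my part is done
  // ... independent work that does not touch tile could overlap here ...
  bar.wait(std::move(token));                   // block until everyone has arrived
  out[i] = tile[blockDim.x - 1 - threadIdx.x];  // consume another thread's element
}
```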
Tensor
There are a lot of new tensor core instructions, some variations of very low precision mma instructions as well as block-scaled ones. I did not look too much into these, since fundamentally they do the same thing: a matrix multiplication asic, just in specialized variations. What is different, however, is the 'tensor memory' and the tcgen05 instructions introduced in Blackwell. Tensor memory is yet another memory space on the GPU, distinct from registers and shared memory, reserved for the tcgen05 matrix asics. To use them you have to first transfer global to shared, then shared to tensor memory, with appropriate synchronization in between. tcgen05 instructions also execute asynchronously: conceptually they are scheduled by the main thread of execution and then run independently, as opposed to the tensor-core instructions of Turing and Ampere, which block the main thread of execution.
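For contrast, the older synchronous path mentioned above looks roughly like this with the wmma API from Volta/Turing onward: the issuing warp blocks on each load and mma step instead of handing the work to an asynchronous unit. A toy sketch (kernel name and shapes are illustrative, launch with a single warp):

```cpp
#include <mma.h>
using namespace nvcuda;

// Hedged sketch: one warp computes a single 16x16x16 fp16 tile synchronously.
// Every load_matrix_sync / mma_sync blocks the warp until it completes.
__global__ void wmma_16x16x16(const half* a, const half* b, float* c) {
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;

  wmma::fill_fragment(fc, 0.0f);
  wmma::load_matrix_sync(fa, a, 16);   // blocking load into registers
  wmma::load_matrix_sync(fb, b, 16);
  wmma::mma_sync(fc, fa, fb, fc);      // blocking matrix multiply-accumulate
  wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
}
```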
All of this serves to complicate the process of achieving maximum performance and several other features are required to patch up the mess.
Misc
Warp specialization
Warp specialization is a technique, built around the setmaxnreg instruction, whereby individual warps take on divergent tasks in a pipeline, such as loading from global, storing back to global, and doing computation. It is needed because of the features introduced above: the TMA instructions cut the register usage of global loads and stores by quite a lot, whereas computation needs many more. setmaxnreg lets you adjust the number of registers allotted to each warp, specializing it for its task. Before Hopper, each warp had to do everything, as is typical of SIMT programs: load from global, compute, and store back. Now producer and consumer warps communicate through shared memory using mbarrier synchronization, each with its own register budget, each stressing a different pipeline of the GPU.
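A skeletal sketch of that producer/consumer split follows; the register counts and the warpgroup partitioning are illustrative assumptions, roughly in the style of CUTLASS Hopper kernels, and the actual TMA/MMA bodies are elided. Note that setmaxnreg acts on an aligned warpgroup of four warps, not on a single warp in isolation.

```cpp
// Hedged sketch (sm_90a): one warpgroup (4 warps, 128 threads) acts as the
// producer and gives registers back; the other warpgroup claims extra registers
// for the math. Launch with 256 threads so each branch covers a full warpgroup.
__global__ void warp_specialized_skeleton() {
  const int warpgroup = threadIdx.x / 128;

  if (warpgroup == 0) {
    // Producer: mostly issues TMA copies, so shrink its register budget.
    asm volatile("setmaxnreg.dec.sync.aligned.u32 40;");
    // loop: issue a TMA load into a shared-memory ring-buffer slot,
    //       arrive on that slot's "full" mbarrier, wait on its "empty" mbarrier
  } else {
    // Consumer: runs the MMAs, so grow its register budget.
    asm volatile("setmaxnreg.inc.sync.aligned.u32 232;");
    // loop: wait on the "full" mbarrier, compute on the tile,
    //       arrive on the "empty" mbarrier so the producer can reuse the slot
  }
}
```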
If TMA and tcgen05 weren't a thing, I doubt this feature would be useful; it seems to exist mainly to patch up the register imbalance those instructions create, and if your program cannot use them, I doubt warp specialization buys you much. If your program does use them, however, it almost seems like you have to use this technique to reach maximum performance: the Gluon tutorial demonstrates that on Blackwell even a simple elementwise add kernel benefits from warp specialization and a producer/consumer pipeline in shared memory. The end result is very complicated, though.
Cluster Launch Control
This feature allows blocks to cancel other blocks and steal their work. It is used in a niche way to alleviate deficiencies in the GPU block scheduler for certain workloads where a highly optimized kernel A assumes ownership of all SMs but gets thrashed by concurrent execution of another kernel B contending for the same SMs.
The Gluon tutorial shows that you need TMA + Warp specialization + Pipelining + Block scheduling and Cluster Launch Control to match cuBLAS performance.
Thoughts
It feels to me like Nvidia, on top of their uniform CUDA programming model, shoehorned in a bunch of asics only accessible through complex ptx instructions with obscure semantics. The premise of these features is that they speed up deep learning workloads but are also composable enough to maybe benefit other applications. In general they enable more offloading of work from the "main thread of computation" - TMA for loads and stores, tensor cores for MMA - so the main thread can do other useful work. This allows each GPU kernel to do more, be more programmatic and expressive, better overlap computation with communication, and so on.
But the reality is that I haven't seen these features used anywhere outside of matrix multiplication and maybe flash-attention 3. The various synchronization proxies, the specialized patterns of using them, and their semantics make them very difficult for a compiler to target; to paraphrase the justification for the Gluon language: the Triton compiler cannot achieve 80% of speed-of-light performance on Hopper/Blackwell (whereas on Ampere it could). Gluon was created to offer lower-level control, but at the cost of simplicity. Their code for GEMM with all the optimizations feels like a templating language designed specifically for matrix multiplication and contraction-esque workloads. Indeed, it is hyper-specialized to Nvidia and the semantics of their instructions: not a domain-specific language offering nice abstractions over the GPU, but a code generator using ad-hoc Python metaprogramming, only slightly better than C++ templates.
The lack of a uniform interface, or of consistent semantics, is what really irks me. Since Ampere there are at least 3 ways to synchronize asynchronous things: cp.async.commit/wait for the first-generation Ampere asynchronous memory copies, mbarriers (which can also synchronize cp.async requests), and fence_async_shared as a memory fence for TMA async stores. That's on top of at least 3 ways of synchronizing threads: __syncthreads() & __syncwarp(), mbarriers, and various atomic/load instructions with their own semantics and scopes (acquire/release, thread/CTA/gpu/system). What I'm not even sure of at this point is whether these instructions will stay backwards compatible down the line, because it feels like a hodgepodge that will be deprecated a few generations later, never mind being highly Nvidia-specific and therefore non-portable.
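For reference, the first of those pathways, the Ampere-era cp.async route, looks roughly like this through the __pipeline_* intrinsics. This is a sketch; the tile reversal is just a stand-in for real work, and it assumes a 256-thread block.

```cpp
#include <cuda_pipeline.h>

// Hedged sketch: the Ampere-era cp.async path, one of the several coexisting
// async-copy synchronization schemes listed above (commit/wait groups, not mbarriers).
__global__ void cp_async_demo(const float* __restrict__ in, float* out) {
  __shared__ float tile[256];
  int i = blockIdx.x * blockDim.x + threadIdx.x;

  // each thread issues a 4-byte asynchronous global->shared copy
  __pipeline_memcpy_async(&tile[threadIdx.x], &in[i], sizeof(float));
  __pipeline_commit();        // close the current commit group
  __pipeline_wait_prior(0);   // wait until all outstanding groups complete
  __syncthreads();            // then make the whole tile visible block-wide

  out[i] = tile[blockDim.x - 1 - threadIdx.x];
}
```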
This situation reminds me of CISC vs RISC back when that debate was relevant. CISC assumed that programmers liked writing assembly and thus made complex instructions with complex semantics, whereas what we really need, and perhaps deserve, is a simple instruction set with regular semantics, designed to be targetable by a compiler, on a device whose performance is not highly non-linear, where decent performance can be reached by a compiler with optimizations that are non-trivial but not impossible. Nvidia made the common case suffer by devoting silicon to asics for matrix-like workloads - which is arguably the worst of both worlds, because why not just make an asic like a TPU or something?