Fix typos in the text (#2417)
@@ -5,11 +5,11 @@ overlap their execution, named
[Programmatic Dependent Launch (PDL)](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#programmatic-dependent-launch-and-synchronization).
This allows kernels with conflict in global memory to programmatically and safely overlap portions
of their execution. Primary kernel can signal it is about to finish execution, and the next kernel is expected to
-programatically wait on the previous kernel to finish flushing its memory.
+programmatically wait on the previous kernel to finish flushing its memory.

We enable PDL by setting a flag through the extended CUDA launch APIs. All CUTLASS kernels with PDL support
will wait on the prior kernel to flush its output to memory and signal the next kernel to start. This means
-they can safely be dropped in with any other set of kernels using PDL as long as they also adhear to waiting on
+they can safely be dropped in with any other set of kernels using PDL as long as they also adhere to waiting on
the prior to flush its memory as well.

For more information, we refer you to the [PDL section in the CUDA Programming Guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#programmatic-dependent-launch-and-synchronization).
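For readers who want to see what "setting a flag through the extended CUDA launch APIs" looks like in plain CUDA, here is a minimal sketch assembled from the PDL section of the CUDA Programming Guide linked above. It is not CUTLASS code: the kernel names and launch shapes are placeholders; only the launch attribute (`cudaLaunchAttributeProgrammaticStreamSerialization`) and the device-side calls (`cudaTriggerProgrammaticLaunchCompletion()`, `cudaGridDependencySynchronize()`) are documented CUDA API.

```c++
// Minimal PDL sketch (Hopper / sm_90 or newer, CUDA 12+). Kernel names and
// launch shapes are illustrative placeholders, not CUTLASS code.
#include <cuda_runtime.h>

__global__ void primary_kernel(float* out) {
  out[blockIdx.x * blockDim.x + threadIdx.x] = 1.0f;
  // Signal that dependent grids may begin launching; any remaining work in
  // this kernel that the consumer does not depend on can still run afterwards.
  cudaTriggerProgrammaticLaunchCompletion();
}

__global__ void secondary_kernel(const float* in, float* out) {
  // Work that does not read the primary kernel's output may run here.
  // Before touching that output, wait for the primary kernel to flush it.
  cudaGridDependencySynchronize();
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  out[i] = in[i] + 1.0f;
}

void launch_pair_with_pdl(float* y, float* z, cudaStream_t stream) {
  primary_kernel<<<128, 128, 0, stream>>>(y);

  // Enable PDL for the dependent launch via the extended launch API.
  cudaLaunchAttribute attrs[1];
  attrs[0].id = cudaLaunchAttributeProgrammaticStreamSerialization;
  attrs[0].val.programmaticStreamSerializationAllowed = 1;

  cudaLaunchConfig_t config = {};
  config.gridDim  = dim3(128);
  config.blockDim = dim3(128);
  config.stream   = stream;
  config.attrs    = attrs;
  config.numAttrs = 1;

  cudaLaunchKernelEx(&config, secondary_kernel, y, z);
}
```

The flag described in the text above (and the `gemm.run(` call visible in the next hunk's context) is how CUTLASS exposes the same mechanism, so user code does not have to manage the raw launch attributes itself.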
@@ -36,10 +36,10 @@ gemm.run(
## Model-Aware Optimizations with PDL

In [example 63](https://github.com/NVIDIA/cutlass/tree/main/examples/63_hopper_gemm_with_weight_prefetch/README.md), we use PDL to explicitly optimize for
-performance of kernels where we know that one of the input matricies (our weights) will not be produced by a prior
+performance of kernels where we know that one of the input matrices (our weights) will not be produced by a prior
kernel. In that case, we only need to wait on the prior kernels memory flush in order to load the other input matrix
(our activations). During our prologue, we can prefetch our weights to improve performance for memory bandwidth-bound
-problem sizes. For more informations we refer the reader to [the example](https://github.com/NVIDIA/cutlass/tree/main/examples/63_hopper_gemm_with_weight_prefetch/README.md).
+problem sizes. For more information, we refer the reader to [the example](https://github.com/NVIDIA/cutlass/tree/main/examples/63_hopper_gemm_with_weight_prefetch/README.md).

## Copyright
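To make the "Model-Aware Optimizations" paragraph above more concrete, the sketch below shows the ordering it describes at the plain CUDA level: the weights, which no earlier kernel writes, are loaded before the grid dependency is resolved, and only the activation reads sit behind `cudaGridDependencySynchronize()`. This is an illustration only; example 63 performs the prefetch inside the CUTLASS GEMM prologue (with TMA) rather than in a standalone kernel, and the GEMV-style consumer below is a hypothetical stand-in.

```c++
#include <cuda_runtime.h>

// Hypothetical consumer kernel: one block per output row, launched with the
// PDL attribute from the previous sketch. Assumes k <= 8 * blockDim.x and
// that `out` is zero-initialized.
__global__ void consumer_with_weight_prefetch(
    const float* __restrict__ weights,      // static input, not written by a prior kernel
    const float* __restrict__ activations,  // produced by the prior kernel
    float* __restrict__ out, int k) {
  int row  = blockIdx.x;
  int lane = threadIdx.x;

  // 1) Issue the weight loads early: they do not depend on the prior kernel's
  //    output, so they may be read before the dependency is resolved.
  float w[8];
  for (int i = 0; i < 8; ++i) {
    int col = lane + i * blockDim.x;
    w[i] = (col < k) ? weights[row * k + col] : 0.0f;
  }

  // 2) Only the activation reads have to wait for the prior kernel to flush
  //    its writes to global memory.
  cudaGridDependencySynchronize();

  // 3) Read the activations and accumulate this thread's partial dot product.
  float acc = 0.0f;
  for (int i = 0; i < 8; ++i) {
    int col = lane + i * blockDim.x;
    acc += w[i] * ((col < k) ? activations[col] : 0.0f);
  }
  atomicAdd(&out[row], acc);
}
```

The window between issuing the weight loads and returning from the dependency wait is where the overlap comes from, which is why the text above singles out memory bandwidth-bound problem sizes as the beneficiaries.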