Round Egg 7: Compute shaders - GPU queues

2023-11-01

Pedro Burgos, Dominykas Jogela

Last time we explored how to manually initialize and run a compute shader. However, it once again became evident how much effort working directly with wgpu code was going to take. To simplify our iteration process, we switched to a compute shader library. The one reasonable option we found was bevy_app_compute.

Update: Recently, an alternative library that seems better maintained has come up. It might be worth a shot. For the moment we are not thinking of migrating to it. https://github.com/slyedoc/bevy_sly_compute

An important limitation of most platforms is the existence of a single command queue. To my knowledge, only the latest platforms (DirectX 12 and Vulkan) allow more than one. wgpu, however, does not yet support this feature.

The current implementation of wgpu only has one internal queue for submit() calls. Therefore, we still have to know where and when to put our compute submit() calls, as they will run sequentially with the render queue. I know that vulkan has a separate compute queue, and I'm hoping that one day, wgpu will too. I just don't know how feasible / mature this multi-queue system is for other backends wgpu implements.
- Kjolnyr @ Bevy's GitHub

This means that we cannot run our compute shader without blocking the render queue. This is not a big deal for us, since we only want to run our compute job once, at the beginning of the program.
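To make the consequence concrete, here is a minimal sketch of a single in-order queue. It is a toy model, not wgpu code: `SingleQueue`, `Work`, and the time units are all hypothetical names and values we made up for illustration. It only shows that whatever is submitted after the compute job cannot start until the compute job has finished.

```rust
use std::collections::VecDeque;

/// Kind of work in a submission (hypothetical model, not a wgpu type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Work {
    Compute,
    Render,
}

/// Toy stand-in for a device with exactly one queue: submissions are
/// executed strictly in the order in which they were submitted.
struct SingleQueue {
    pending: VecDeque<(Work, u32)>, // (kind, cost in arbitrary time units)
}

impl SingleQueue {
    fn new() -> Self {
        Self { pending: VecDeque::new() }
    }

    /// Appends a submission to the back of the queue.
    fn submit(&mut self, work: Work, cost: u32) {
        self.pending.push_back((work, cost));
    }

    /// Drains the queue and returns (kind, start_time, end_time) for each
    /// submission. Each submission starts when the previous one ends.
    fn run(&mut self) -> Vec<(Work, u32, u32)> {
        let mut clock = 0;
        let mut timeline = Vec::new();
        while let Some((work, cost)) = self.pending.pop_front() {
            timeline.push((work, clock, clock + cost));
            clock += cost;
        }
        timeline
    }
}
```

Submitting `Work::Compute` with cost 10 followed by `Work::Render` with cost 2 yields a timeline in which the render work starts at time 10, not at time 0. With a one-off compute job at startup, that delay is paid only once.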

# A small dive into `submit` calls and GPU queues

Let's dig a bit deeper into the nature of submit calls and GPU queues. This article does a good job of explaining the difference between asynchronous workload submission and parallel workload processing:

"Asynchronous workload submission encompasses the ability for the CPU host side to be able to do other work whilst the GPU is processing the workload."
"Parallel workload processing consists of the concurrent execution of two or more workloads by the GPU."
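The first definition can be sketched with plain threads. This is only an analogy, under the assumption that a worker thread stands in for the GPU and a channel stands in for a fence or callback; `submit_async`, `host_side_work`, and `overlap_host_and_device` are hypothetical names, not part of wgpu or Vulkan.

```rust
use std::sync::mpsc;
use std::thread;

/// Submits `workload` to a worker thread (a stand-in for the GPU) and returns
/// immediately with a receiver that will eventually yield the result.
fn submit_async<F>(workload: F) -> mpsc::Receiver<u64>
where
    F: FnOnce() -> u64 + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The "device" processes the workload on its own timeline.
        // The send only fails if the host dropped the receiver, so ignore it.
        let _ = tx.send(workload());
    });
    rx
}

/// Host-side work that proceeds while the submitted workload is in flight.
fn host_side_work() -> u64 {
    (1..=10u64).sum()
}

/// Submits one workload, does host work in the meantime, then waits for the
/// device result. Returns (host_result, device_result).
fn overlap_host_and_device() -> (u64, u64) {
    let pending = submit_async(|| (1..=1_000u64).map(|x| x * x).sum::<u64>());
    // The host is not blocked by the submission above.
    let host = host_side_work();
    // Only here does the host wait for the "device" to finish.
    let device = pending.recv().expect("worker thread dropped the sender");
    (host, device)
}
```

Parallel workload processing, by contrast, would correspond to the device side itself running two such workloads at the same time, which this sketch does not attempt to model.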

Recent GPUs have several "queue families" (this is Vulkan's naming convention), each with a different purpose (graphics, compute, transfer). Each queue family can have one or more queues.

For example, I believe an NVIDIA 1650 card has 3 queue families:

Conceptual Overview of Queues in NVIDIA 1650

We can actually check the capabilities of each queue family:

Queue family #0 supports transfer, graphics, compute, and presentation
Queue family #1 supports only transfer, and nothing else
Queue family #2 supports transfer, compute, and presentation
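Given a listing like this, a common heuristic is to pick the most specialised family that still supports what you need, for example a compute-capable family without graphics for compute work. The sketch below models the three families above with plain bit flags. The constants and the `QueueFamily` struct are our own illustrative names. In real Vulkan the flags are `VK_QUEUE_GRAPHICS_BIT`, `VK_QUEUE_COMPUTE_BIT`, and `VK_QUEUE_TRANSFER_BIT`, and presentation support is queried per surface rather than reported as a flag.

```rust
// Illustrative capability bits (not the real Vulkan constants).
const GRAPHICS: u8 = 1 << 0;
const COMPUTE: u8 = 1 << 1;
const TRANSFER: u8 = 1 << 2;
const PRESENT: u8 = 1 << 3;

/// Hypothetical description of one queue family.
struct QueueFamily {
    index: u32,
    caps: u8,
}

/// The three families from the listing above.
fn example_families() -> Vec<QueueFamily> {
    vec![
        QueueFamily { index: 0, caps: TRANSFER | GRAPHICS | COMPUTE | PRESENT },
        QueueFamily { index: 1, caps: TRANSFER },
        QueueFamily { index: 2, caps: TRANSFER | COMPUTE | PRESENT },
    ]
}

/// Returns the index of the family that supports every bit in `required`
/// while carrying the fewest additional capabilities, or `None` if no
/// family qualifies.
fn pick_family(families: &[QueueFamily], required: u8) -> Option<u32> {
    families
        .iter()
        .filter(|f| (f.caps & required) == required)
        .min_by_key(|f| (f.caps & !required).count_ones())
        .map(|f| f.index)
}
```

With these families, `pick_family(&example_families(), COMPUTE)` selects family #2, `TRANSFER` selects the dedicated family #1, and anything involving `GRAPHICS` falls back to family #0.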

Intra-queue parallelization is not supported at the moment, which means that we would in any case only be able to run one type of workload at a time on a single queue.

Any task you throw into a queue is going to attempt to use 100% of the available hardware to finish. However, you can generally have one compute task and one graphics task running side by side, as the two are usually separate from each other.

Furthermore, queues (at least in Vulkan) are distinguishable: they can be indexed and can have different priorities. This might be useful when submitting different types of workloads, as a hint to the Vulkan implementation about which tasks you would like to favour.
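In Vulkan, priorities are requested at device creation: `VkDeviceQueueCreateInfo` carries a family index and one normalized priority in the range 0.0 to 1.0 per requested queue, with higher values meaning higher priority. The sketch below mirrors only that shape in plain Rust. `QueueRequest` and its methods are hypothetical names of ours, and nothing here talks to a real driver.

```rust
/// Hypothetical mirror of the relevant fields of VkDeviceQueueCreateInfo:
/// which family to create queues from, and one priority per queue.
struct QueueRequest {
    pub family_index: u32,
    pub priorities: Vec<f32>,
}

impl QueueRequest {
    /// Builds a request, rejecting empty lists and priorities outside
    /// the normalized range [0.0, 1.0] (NaN is rejected as well).
    fn new(family_index: u32, priorities: Vec<f32>) -> Result<Self, String> {
        if priorities.is_empty() {
            return Err("at least one queue must be requested".to_string());
        }
        if let Some(bad) = priorities.iter().find(|p| !(0.0f32..=1.0).contains(*p)) {
            return Err(format!("priority {bad} is outside [0.0, 1.0]"));
        }
        Ok(Self { family_index, priorities })
    }

    /// Number of queues requested from this family.
    fn queue_count(&self) -> usize {
        self.priorities.len()
    }

    /// Index (within the family) of the queue with the highest priority.
    fn highest_priority_queue(&self) -> usize {
        self.priorities
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.total_cmp(b.1))
            .map(|(i, _)| i)
            .expect("constructor guarantees at least one queue")
    }
}
```

For instance, `QueueRequest::new(2, vec![1.0, 0.25])` asks for two queues from family #2, with queue 0 favoured over queue 1. How strongly the hint is honoured is up to the implementation.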

# Difference between Vulkan and DirectX 12

D3D12 has the same separation between compute pipelines and graphics pipelines that Vulkan does. However, when issuing commands, D3D12 has only one pipeline binding point, to which you can bind any kind of pipeline. By contrast, Vulkan has separate binding points for compute and graphics pipelines. Of course, Vulkan doesn't have different descriptor binding points for them, so the two pipelines can interfere with one another. But if you design their resource usage carefully, it is possible to invoke a dispatch operation without disturbing the needs of the graphics pipeline.
So overall, there's no real difference in pipeline architecture here.
- Nicol Bolas @ Stack Exchange

# Wrapping up

We recommend reading through these two GitHub issues if you are interested:

Bevy's Issue 8440: To get a better understanding on Bevy's renderer architecture.
Bevy's Issue 5024: Discussion on separating the render and compute queues. Note that some of the claims made there about Unity do not appear to be accurate. I do not believe that Unity automatically splits compute tasks into smaller pieces.

For our case, wgpu's single queue limitation was not a problem - mostly since we are using compute shaders for fun and since our implementation only needs to run the compute shader on initialization. In the next post we will explain how we ran our sphere deformation shader with bevy_app_compute.
