webgpu: Batch submitted command encoders #4888
PERF This patch replaces the old WEBGPU_IMMEDIATE_EXECUTION_ENABLED flag with a WEBGPU_COMMAND_ENCODER_COUNT_IN_QUEUE flag. When WEBGPU_COMMAND_ENCODER_COUNT_IN_QUEUE is one, the backend falls back to the old immediate mode. queue.submit has a fixed CPU overhead, so it is inefficient to call it too many times; on the other hand, submitting only once per frame is not a good balance either. Here we use an empirical value of 15, which gives an over-5% performance improvement for MobileNet/ResNet50 on different platforms.
@kainino0x @lina128 @jinjingforever Please take a look. Thanks.
  }
- if (env().get('WEBGPU_IMMEDIATE_EXECUTION_ENABLED')) {
+ if (env().get('WEBGPU_COMMAND_ENCODER_COUNT_IN_QUEUE') as number ===
Probably should be >= just in case multiple command buffers end up waiting for any reason, or the WEBGPU_COMMAND_ENCODER_COUNT_IN_QUEUE is set to 0.
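The reviewer's point can be shown with a tiny predicate (the function name shouldSubmit is illustrative, not a TF.js API): a strict === comparison would never fire if the counter overshoots the threshold or the flag is set to 0, while >= covers both cases.

```typescript
// Hypothetical helper illustrating why >= is safer than ===.
function shouldSubmit(pendingCount: number, batchSize: number): boolean {
  // === would skip the flush if pendingCount jumps past batchSize,
  // or never flush at all when batchSize is 0.
  return pendingCount >= batchSize;
}
```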
Done
  }
- if (env().get('WEBGPU_IMMEDIATE_EXECUTION_ENABLED')) {
+ if (env().get('WEBGPU_COMMAND_ENCODER_COUNT_IN_QUEUE') as number ===
nit: The fact that these are command encoders and not command buffers is, I think, an accident of history and not necessarily true in the future. I'd pick a more targeted name like WEBGPU_DEFERRED_SUBMIT_BATCH_SIZE or something.
Done
This is an interesting result. Your explanation:
makes sense to me. However (and I know we've talked about this several times in the distant past), that overhead shouldn't matter if more work can be accumulated in a single command encoder. Why do we need to submit many small command buffers instead of fewer, bigger ones?
I have another patch, #4776, that does what you describe. Its perf impact is similar to this one (though my testing is limited), so I think the main issue is that queue.submit is called too many times. Since both approaches reduce the submit count and this change is smaller than #4776, I chose the current one. In fact, maybe we need to do it at two levels in the future: one is to do more work in a single command encoder; the other is to have several command encoders per queue submit. My reference is the NVIDIA DX12 Do's and Don'ts: https://developer.nvidia.com/dx12-dos-and-donts What's your suggestion?
Please take another look, thanks.
lina128 left a comment:
LGTM
Reviewable status: complete! 2 of 1 approvals obtained (waiting on @jinjingforever)
What's your suggestion?
My suggestion is similar to D3D12 but of course our submit calls (and our createCommandEncoder/beginComputePass/endPass/finish calls!) are slower.
AFAIK, there is no benefit to smaller command buffers, and I don't think there's any need for many small command buffers in TF.js; it just requires some work to refactor - instead of configuring the commandbuffers-per-submit we would control the ops-per-submit (or even the min number of milliseconds between submits or something).
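One of the alternative knobs mentioned here, a minimum number of milliseconds between submits, could look roughly like this. The names (TimedSubmitter, maybeSubmit) are illustrative, not TF.js or WebGPU APIs, and the clock is injectable only to make the sketch testable.

```typescript
// Hypothetical sketch: rate-limit queue.submit by wall-clock time
// rather than by a count of queued command buffers.
class TimedSubmitter {
  private lastSubmit = -Infinity;  // so the very first call always submits
  submitCount = 0;

  // `now` is injectable for testing; defaults to the real clock.
  constructor(private minIntervalMs: number,
              private now: () => number = Date.now) {}

  // Submit only if enough time has passed since the previous submit;
  // a real backend would call device.queue.submit(pending) here.
  maybeSubmit(): boolean {
    const t = this.now();
    if (t - this.lastSubmit < this.minIntervalMs) {
      return false;  // too soon; keep accumulating work
    }
    this.lastSubmit = t;
    this.submitCount++;
    return true;
  }
}
```

Compared with a fixed commandbuffers-per-submit count, this trades a tunable integer for a latency bound, which may be easier to reason about when op costs vary widely.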
Thanks, Kai. Maybe commandbuffers-per-submit and ops-per-commandbuffer are both needed. I will land this patch first, then add the control for ops-per-commandbuffer. commandbuffers-per-submit is useful if we are processing more than one frame, since then they can be executed in parallel.
Do you mean in parallel on the GPU? I don't think separate command buffers gain any parallelism on the GPU - at least, from my understanding of Vulkan, it should be able to parallelize different passes within a single command buffer (barriers permitting) just as well. But I'm not 100% sure about this.
I just checked the D3D12 docs and found relevant sentences in https://docs.microsoft.com/en-us/windows/win32/direct3d12/executing-and-synchronizing-command-lists#executing-command-lists
PERF
This patch replaces the old WEBGPU_IMMEDIATE_EXECUTION_ENABLED flag with a
WEBGPU_DEFERRED_SUBMIT_BATCH_SIZE flag. When
WEBGPU_DEFERRED_SUBMIT_BATCH_SIZE is one, the backend falls back to the
old immediate mode.
queue.submit has a fixed CPU overhead, so it is inefficient to call it
too many times; on the other hand, submitting only once per frame is not
a good balance either. Here we use an empirical value of 15, which gives
an over-5% performance improvement for MobileNet/ResNet50 on different
platforms.