
Conversation


@qjia7 qjia7 commented Mar 31, 2021

PERF

This patch uses the WEBGPU_DEFERRED_SUBMIT_BATCH_SIZE flag instead of
the old WEBGPU_IMMEDIATE_EXECUTION_ENABLED flag. When
WEBGPU_DEFERRED_SUBMIT_BATCH_SIZE is one, it falls back to the old
immediate mode.

queue.submit has a fixed CPU overhead, so it is not efficient to call it
too many times. It is also not a good balance to submit only once per
frame. Here we use an empirical value of 15, which yields an observed
performance improvement of over 5% for mobilenet/resnet50 on different
platforms.
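As an illustration only (the `DeferredSubmitter` class and `MockQueue` below are hypothetical sketches, not the PR's actual code), the deferred-submit idea can be reduced to a counter that flushes once the pending batch reaches the configured size:

```typescript
// Hypothetical sketch of deferred submission batching. MockQueue stands in
// for GPUQueue so the example runs outside a browser.
interface CommandBuffer {
  id: number;
}

class MockQueue {
  submitCount = 0;
  submit(buffers: CommandBuffer[]): void {
    // Each call pays the fixed CPU overhead once, regardless of batch size.
    this.submitCount++;
  }
}

class DeferredSubmitter {
  private pending: CommandBuffer[] = [];
  constructor(private queue: MockQueue, private batchSize: number) {}

  enqueue(buffer: CommandBuffer): void {
    this.pending.push(buffer);
    // With batchSize === 1 this degenerates to the old immediate mode.
    if (this.pending.length >= this.batchSize) {
      this.flush();
    }
  }

  flush(): void {
    if (this.pending.length > 0) {
      this.queue.submit(this.pending);
      this.pending = [];
    }
  }
}

const queue = new MockQueue();
const submitter = new DeferredSubmitter(queue, 15);
for (let i = 0; i < 45; i++) {
  submitter.enqueue({id: i});
}
console.log(queue.submitCount);  // prints 3 (3 submits instead of 45)
```

Note the `>=` rather than `===` in `enqueue`, so the batch still flushes if extra buffers accumulate or the batch size is set to 0.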



qjia7 commented Apr 1, 2021

@kainino0x @lina128 @jinjingforever Please take a look. Thanks.

cc @haoyunfeix @axinging @xhcao @gyagp

```diff
   }
 
-  if (env().get('WEBGPU_IMMEDIATE_EXECUTION_ENABLED')) {
+  if (env().get('WEBGPU_COMMAND_ENCODER_COUNT_IN_QUEUE') as number ===
```

Probably should be >= just in case multiple command buffers end up waiting for any reason, or the WEBGPU_COMMAND_ENCODER_COUNT_IN_QUEUE is set to 0.


Done

```diff
   }
 
-  if (env().get('WEBGPU_IMMEDIATE_EXECUTION_ENABLED')) {
+  if (env().get('WEBGPU_COMMAND_ENCODER_COUNT_IN_QUEUE') as number ===
```

nit: The fact that these are command encoders and not command buffers is, I think, an accident of history and not necessarily true in the future. I'd pick a more targeted name like WEBGPU_DEFERRED_SUBMIT_BATCH_SIZE or something.


Done

@kainino0x

This is an interesting result. Your explanation:

> queue.submit has a fixed CPU overhead.

makes sense to me.

However, and I know we've talked about this several times in the distant past, that overhead shouldn't matter if more work can be accumulated in a single command encoder. Why do we need to submit many small command buffers instead of fewer bigger ones?


qjia7 commented Apr 1, 2021

> However, and I know we've talked about this several times in the distant past, that overhead shouldn't matter if more work can be accumulated in a single command encoder. Why do we need to submit many small command buffers instead of fewer bigger ones?

I have another patch #4776 doing what you suggested. The perf impact is similar to this one (though my testing is limited), so I think the main issue is that queue.submit is called too many times. Since both ways reduce the submit count and this one requires a smaller code change than #4776, I chose the current way. But in fact, maybe we need to do it at two levels in the future: one is to do more work in a single command encoder; the other is to have several command encoders in a queue. My reference is the dx12-dos-and-donts guide (https://developer.nvidia.com/dx12-dos-and-donts):

> Try to aim at a reasonable number of command lists in the range of 15-30 or below. Try to bundle those CLs into 5-10 ExecuteCommandLists() calls per frame.

What's your suggestion?
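The two-level scheme described above could be sketched as follows; every name here (`TwoLevelBatcher`, the mocks, and the 4/5 batch sizes) is a hypothetical illustration, not code from either PR:

```typescript
// Level 1: pack several ops into one command encoder before finish().
// Level 2: pack several finished command buffers into one queue.submit().
class MockEncoder {
  ops = 0;
  finish(): number {
    // Stands in for GPUCommandEncoder.finish(); the "command buffer" here
    // is just the recorded op count.
    return this.ops;
  }
}

class MockQueue {
  submitCount = 0;
  submit(buffers: number[]): void {
    this.submitCount++;
  }
}

class TwoLevelBatcher {
  private encoder = new MockEncoder();
  private buffers: number[] = [];
  constructor(
      private queue: MockQueue, private opsPerEncoder: number,
      private buffersPerSubmit: number) {}

  recordOp(): void {
    this.encoder.ops++;
    if (this.encoder.ops >= this.opsPerEncoder) {
      this.buffers.push(this.encoder.finish());
      this.encoder = new MockEncoder();
      if (this.buffers.length >= this.buffersPerSubmit) {
        this.queue.submit(this.buffers);
        this.buffers = [];
      }
    }
  }
}

const queue = new MockQueue();
const batcher = new TwoLevelBatcher(queue, 4, 5);
// 40 ops -> 10 command buffers of 4 ops each -> 2 submits of 5 buffers.
for (let i = 0; i < 40; i++) {
  batcher.recordOp();
}
console.log(queue.submitCount);  // prints 2
```

This mirrors the D3D12 guidance quoted above: many command lists per frame, bundled into a handful of submit calls.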


qjia7 commented Apr 1, 2021

Please take another look, thanks.

@qjia7 qjia7 changed the title webgpu: Use WEBGPU_COMMAND_ENCODER_COUNT_IN_QUEUE webgpu: Batch submitted command encoders Apr 1, 2021

@lina128 lina128 left a comment


LGTM



@kainino0x kainino0x left a comment


> What's your suggestion?

My suggestion is similar to D3D12 but of course our submit calls (and our createCommandEncoder/beginComputePass/endPass/finish calls!) are slower.

AFAIK, there is no benefit to smaller command buffers, and I don't think there's any need for many small command buffers in TF.js; it just requires some work to refactor - instead of configuring the commandbuffers-per-submit we would control the ops-per-submit (or even the min number of milliseconds between submits or something).
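A minimal sketch of the time-based variant mentioned above (the `TimedSubmitter` class, the mock queue, and the 8 ms threshold are all illustrative assumptions, not code from this PR):

```typescript
// Illustrative only: flush pending command buffers when a minimum number of
// milliseconds has elapsed since the last submit. MockQueue stands in for
// GPUQueue so this runs outside a browser.
class MockQueue {
  submitCount = 0;
  submit(buffers: number[]): void {
    this.submitCount++;
  }
}

class TimedSubmitter {
  private pending: number[] = [];
  private lastSubmitMs = 0;
  constructor(private queue: MockQueue, private minIntervalMs: number) {}

  // Timestamps are passed in explicitly so the sketch is deterministic.
  enqueue(buffer: number, nowMs: number): void {
    this.pending.push(buffer);
    if (nowMs - this.lastSubmitMs >= this.minIntervalMs) {
      this.queue.submit(this.pending);
      this.pending = [];
      this.lastSubmitMs = nowMs;
    }
  }
}

const queue = new MockQueue();
const submitter = new TimedSubmitter(queue, 8);
// One op per "millisecond" for 16 ms: submits happen at t = 8 and t = 16.
for (let t = 1; t <= 16; t++) {
  submitter.enqueue(t, t);
}
console.log(queue.submitCount);  // prints 2
```

Unlike a fixed batch size, this caps worst-case latency between submits regardless of how many ops a model records per frame.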


qjia7 commented Apr 2, 2021

> AFAIK, there is no benefit to smaller command buffers, and I don't think there's any need for many small command buffers in TF.js; it just requires some work to refactor - instead of configuring the commandbuffers-per-submit we would control the ops-per-submit (or even the min number of milliseconds between submits or something).

Thanks, Kai. Maybe commandbuffers-per-submit and ops-per-commandbuffer are both needed. I will land this patch first, then add the control for ops-per-commandbuffer.

commandbuffers-per-submit is useful if we are processing more than one frame; then they can be executed in parallel.

@qjia7 qjia7 merged commit a64888b into tensorflow:master Apr 2, 2021
@qjia7 qjia7 deleted the encoder_count branch April 2, 2021 01:26
@kainino0x

> commandbuffers-per-submit is useful if we are processing more than one frame; then they can be executed in parallel.

Do you mean in parallel on the GPU? I don't think separate command buffers gain any parallelism on the GPU - at least, from my understanding of Vulkan it should be able to parallelize different passes in a single command buffer (barriers permitting) as well. But not 100% sure about this.


qjia7 commented Apr 2, 2021

> Do you mean in parallel on the GPU? I don't think separate command buffers gain any parallelism on the GPU - at least, from my understanding of Vulkan it should be able to parallelize different passes in a single command buffer (barriers permitting) as well. But not 100% sure about this.

I just checked the D3D12 docs and found the following in https://docs.microsoft.com/en-us/windows/win32/direct3d12/executing-and-synchronizing-command-lists#executing-command-lists:

> Applications can submit command lists to any command queue from multiple threads. The runtime will perform the work of serializing these requests in the order of submission.

Does "serializing these requests in the order of submission" mean command buffers won't be executed in parallel? Anyway, we just sent a message to Rafael; hopefully we can get a clear answer.

