
Conversation


@qjia7 qjia7 commented Mar 31, 2021

PERF

This patch uses the WEBGPU_DEFERRED_SUBMIT_BATCH_SIZE flag instead of
the old WEBGPU_IMMEDIATE_EXECUTION_ENABLED flag. When
WEBGPU_DEFERRED_SUBMIT_BATCH_SIZE is one, it falls back to the old
immediate mode.

queue.submit has a fixed CPU overhead, so it is not efficient to call it
too many times. It is also not a good balance to submit only once per
frame. Here we use an empirical value of 15, which yields an observed
performance improvement of over 5% for mobilenet/resnet50 on different
platforms.
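As an illustration only (the `DeferredSubmitter` class and `MockQueue` below are hypothetical sketches, not the PR's actual code), the deferred-submit idea can be reduced to a counter that flushes once the pending batch reaches the configured size:

```typescript
// Hypothetical sketch of deferred submission batching. MockQueue stands in
// for GPUQueue so the example runs outside a browser.
interface CommandBuffer {
  id: number;
}

class MockQueue {
  submitCount = 0;
  submit(buffers: CommandBuffer[]): void {
    // Each call pays the fixed CPU overhead once, regardless of batch size.
    this.submitCount++;
  }
}

class DeferredSubmitter {
  private pending: CommandBuffer[] = [];
  constructor(private queue: MockQueue, private batchSize: number) {}

  enqueue(buffer: CommandBuffer): void {
    this.pending.push(buffer);
    // With batchSize === 1 this degenerates to the old immediate mode.
    if (this.pending.length >= this.batchSize) {
      this.flush();
    }
  }

  flush(): void {
    if (this.pending.length > 0) {
      this.queue.submit(this.pending);
      this.pending = [];
    }
  }
}

const queue = new MockQueue();
const submitter = new DeferredSubmitter(queue, 15);
for (let i = 0; i < 45; i++) {
  submitter.enqueue({id: i});
}
console.log(queue.submitCount);  // prints 3 (3 submits instead of 45)
```

Note the `>=` rather than `===` in `enqueue`, so the batch still flushes if extra buffers accumulate or the batch size is set to 0.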



qjia7 commented Apr 1, 2021

@kainino0x @lina128 @jinjingforever Please take a look. Thanks.

cc @haoyunfeix @axinging @xhcao @gyagp

```diff
   }
 
-  if (env().get('WEBGPU_IMMEDIATE_EXECUTION_ENABLED')) {
+  if (env().get('WEBGPU_COMMAND_ENCODER_COUNT_IN_QUEUE') as number ===
```

Probably should be >= just in case multiple command buffers end up waiting for any reason, or the WEBGPU_COMMAND_ENCODER_COUNT_IN_QUEUE is set to 0.


Done

```diff
   }
 
-  if (env().get('WEBGPU_IMMEDIATE_EXECUTION_ENABLED')) {
+  if (env().get('WEBGPU_COMMAND_ENCODER_COUNT_IN_QUEUE') as number ===
```

nit: The fact that these are command encoders and not command buffers is, I think, an accident of history and not necessarily true in the future. I'd pick a more targeted name like WEBGPU_DEFERRED_SUBMIT_BATCH_SIZE or something.


Done

@kainino0x

This is an interesting result. Your explanation:

> queue.submit has a fixed CPU overhead.

makes sense to me.

However, and I know we've talked about this several times in the distant past, that overhead shouldn't matter if more work can be accumulated in a single command encoder. Why do we need to submit many small command buffers instead of fewer bigger ones?


qjia7 commented Apr 1, 2021

> However, and I know we've talked about this several times in the distant past, that overhead shouldn't matter if more work can be accumulated in a single command encoder. Why do we need to submit many small command buffers instead of fewer bigger ones?

I have another patch #4776 doing what you suggested. The perf impact is similar to this one (though my testing is limited), so I think the main issue is that queue.submit is called too many times. Since both ways reduce the submit count and this one requires a smaller code change than #4776, I chose the current way. But in fact, maybe we need to do it at two levels in the future: one is to do more work in a single command encoder; the other is to have several command encoders in a queue. My reference is the dx12-dos-and-donts guide (https://developer.nvidia.com/dx12-dos-and-donts):

> Try to aim at a reasonable number of command lists in the range of 15-30 or below. Try to bundle those CLs into 5-10 ExecuteCommandLists() calls per frame.

What's your suggestion?
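The two-level scheme described above could be sketched as follows; every name here (`TwoLevelBatcher`, the mocks, and the 4/5 batch sizes) is a hypothetical illustration, not code from either PR:

```typescript
// Level 1: pack several ops into one command encoder before finish().
// Level 2: pack several finished command buffers into one queue.submit().
class MockEncoder {
  ops = 0;
  finish(): number {
    // Stands in for GPUCommandEncoder.finish(); the "command buffer" here
    // is just the recorded op count.
    return this.ops;
  }
}

class MockQueue {
  submitCount = 0;
  submit(buffers: number[]): void {
    this.submitCount++;
  }
}

class TwoLevelBatcher {
  private encoder = new MockEncoder();
  private buffers: number[] = [];
  constructor(
      private queue: MockQueue, private opsPerEncoder: number,
      private buffersPerSubmit: number) {}

  recordOp(): void {
    this.encoder.ops++;
    if (this.encoder.ops >= this.opsPerEncoder) {
      this.buffers.push(this.encoder.finish());
      this.encoder = new MockEncoder();
      if (this.buffers.length >= this.buffersPerSubmit) {
        this.queue.submit(this.buffers);
        this.buffers = [];
      }
    }
  }
}

const queue = new MockQueue();
const batcher = new TwoLevelBatcher(queue, 4, 5);
// 40 ops -> 10 command buffers of 4 ops each -> 2 submits of 5 buffers.
for (let i = 0; i < 40; i++) {
  batcher.recordOp();
}
console.log(queue.submitCount);  // prints 2
```

This mirrors the D3D12 guidance quoted above: many command lists per frame, bundled into a handful of submit calls.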


qjia7 commented Apr 1, 2021

Please take another look, thanks.

@qjia7 qjia7 changed the title webgpu: Use WEBGPU_COMMAND_ENCODER_COUNT_IN_QUEUE webgpu: Batch submitted command encoders Apr 1, 2021

@lina128 lina128 left a comment


LGTM



@kainino0x kainino0x left a comment


> What's your suggestion?

My suggestion is similar to D3D12 but of course our submit calls (and our createCommandEncoder/beginComputePass/endPass/finish calls!) are slower.

AFAIK, there is no benefit to smaller command buffers, and I don't think there's any need for many small command buffers in TF.js; it just requires some work to refactor - instead of configuring the commandbuffers-per-submit we would control the ops-per-submit (or even the min number of milliseconds between submits or something).
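A minimal sketch of the time-based variant mentioned above (the `TimedSubmitter` class, the mock queue, and the 8 ms threshold are all illustrative assumptions, not code from this PR):

```typescript
// Illustrative only: flush pending command buffers when a minimum number of
// milliseconds has elapsed since the last submit. MockQueue stands in for
// GPUQueue so this runs outside a browser.
class MockQueue {
  submitCount = 0;
  submit(buffers: number[]): void {
    this.submitCount++;
  }
}

class TimedSubmitter {
  private pending: number[] = [];
  private lastSubmitMs = 0;
  constructor(private queue: MockQueue, private minIntervalMs: number) {}

  // Timestamps are passed in explicitly so the sketch is deterministic.
  enqueue(buffer: number, nowMs: number): void {
    this.pending.push(buffer);
    if (nowMs - this.lastSubmitMs >= this.minIntervalMs) {
      this.queue.submit(this.pending);
      this.pending = [];
      this.lastSubmitMs = nowMs;
    }
  }
}

const queue = new MockQueue();
const submitter = new TimedSubmitter(queue, 8);
// One op per "millisecond" for 16 ms: submits happen at t = 8 and t = 16.
for (let t = 1; t <= 16; t++) {
  submitter.enqueue(t, t);
}
console.log(queue.submitCount);  // prints 2
```

Unlike a fixed batch size, this caps worst-case latency between submits regardless of how many ops a model records per frame.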


qjia7 commented Apr 2, 2021

> AFAIK, there is no benefit to smaller command buffers, and I don't think there's any need for many small command buffers in TF.js; it just requires some work to refactor - instead of configuring the commandbuffers-per-submit we would control the ops-per-submit (or even the min number of milliseconds between submits or something).

Thanks, Kai. Maybe commandbuffers-per-submit and ops-per-commandbuffer are both needed. I will land this patch first, then add the control for ops-per-commandbuffer.

commandbuffers-per-submit is useful if we are processing more than one frame; then they can be executed in parallel.

@qjia7 qjia7 merged commit a64888b into tensorflow:master Apr 2, 2021
@qjia7 qjia7 deleted the encoder_count branch April 2, 2021 01:26
@kainino0x

> commandbuffers-per-submit is useful if we are processing more than one frame; then they can be executed in parallel.

Do you mean in parallel on the GPU? I don't think separate command buffers gain any parallelism on the GPU - at least, from my understanding of Vulkan it should be able to parallelize different passes in a single command buffer (barriers permitting) as well. But not 100% sure about this.


qjia7 commented Apr 2, 2021

> Do you mean in parallel on the GPU? I don't think separate command buffers gain any parallelism on the GPU - at least, from my understanding of Vulkan it should be able to parallelize different passes in a single command buffer (barriers permitting) as well. But not 100% sure about this.

I just checked the D3D12 docs and found the following in https://docs.microsoft.com/en-us/windows/win32/direct3d12/executing-and-synchronizing-command-lists#executing-command-lists:

> Applications can submit command lists to any command queue from multiple threads. The runtime will perform the work of serializing these requests in the order of submission.

Does "serializing these requests in the order of submission" mean command buffers won't be executed in parallel? Anyway, we just sent a message to Rafael; hopefully we can get a clear answer.

