
Description
I am attempting to train a 3-billion-parameter model on two A100 GPUs with nvidia-tensorflow 1.15 (NGC container 21.07-tf1-py3), using a batch size of 24 and tf.distribute.MirroredStrategy.
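For context, the training setup follows the standard MirroredStrategy pattern. Below is a minimal sketch of what I am doing; the tiny Keras network and random data are only stand-ins for the real ~3B-parameter model and input pipeline:

```python
import numpy as np
import tensorflow as tf

# Minimal sketch of the setup; the small Dense network below is only a
# stand-in for the actual ~3B-parameter model.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)  # 2 on two A100s

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(128,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Global batch size 24 -> 12 examples per replica under MirroredStrategy.
xs = np.random.randn(240, 128).astype(np.float32)
ys = np.random.randn(240, 1).astype(np.float32)
model.fit(xs, ys, batch_size=24, epochs=1)
```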
The error message is:
```
2023-06-03 07:27:26.364872: F tensorflow/core/kernels/concat_lib_gpu_impl.cu.cc:161] Non-OK-status: GpuLaunchKernel( concat_variable_kernel<T, IntType, true>, config.block_count, config.thread_per_block, smem_usage, gpu_device.stream(), input_ptrs, output_scan, static_cast<IntType>(output->dimension(0)), static_cast<IntType>(output->dimension(1)), output->data()) status: Internal: invalid configuration argument
```
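If I read the failing check correctly, the CUDA driver rejects the launch configuration computed for the concat kernel ("invalid configuration argument"). My working hypothesis (an assumption on my part, not verified against the TF source) is 32-bit indexing: if the gradients are packed into a single flat tensor before all-reduce, the concatenated output holds roughly 3e9 elements, which no longer fits the int32 `IntType` used to index it:

```python
# Back-of-the-envelope check (assumption: the all-reduce packs gradients into
# one flat tensor, which is produced by the concat kernel in the error above).
num_elements = 3_000_000_000     # ~3B parameters -> ~3e9 float32 elements
int32_max = 2**31 - 1            # ~2.15e9, the 32-bit indexing limit
print(num_elements > int32_max)  # True -> a single packed concat overflows
```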
This appears to be an issue that occurs only when the model is large enough and distributed training is used: the same setup trains successfully on a single GPU with a batch size of 12, and on two GPUs with a 1.5B-parameter model.
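If the overflow hypothesis is right, one mitigation might be to split the gradient packing into more, smaller chunks via the strategy's cross-device ops, so that no single concatenated tensor crosses the 32-bit limit. A sketch (untested; num_packs=8 is an arbitrary guess, not a tuned value):

```python
import tensorflow as tf

# Sketch of a possible workaround (assumption: more packs keeps each packed
# gradient tensor under the 32-bit element limit during all-reduce).
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.NcclAllReduce(num_packs=8))
```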
I understand that TensorFlow may not be the best option for training large models, but for now I need to resolve this issue.