
Description
I am attempting to train a 3-billion-parameter model on two A100 GPUs with nvidia-tensorflow 1.15 (NGC container 21.07-tf1-py3), using a batch size of 24 and tf.distribute.MirroredStrategy.
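For context, the training setup follows the standard MirroredStrategy pattern. Below is a minimal sketch of what I am doing; the tiny Keras network and random data are only stand-ins for the real ~3B-parameter model and input pipeline:

```python
import numpy as np
import tensorflow as tf

# Minimal sketch of the setup; the small Dense network below is only a
# stand-in for the actual ~3B-parameter model.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)  # 2 on two A100s

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(128,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Global batch size 24 -> 12 examples per replica under MirroredStrategy.
xs = np.random.randn(240, 128).astype(np.float32)
ys = np.random.randn(240, 1).astype(np.float32)
model.fit(xs, ys, batch_size=24, epochs=1)
```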
The error message is:
```
2023-06-03 07:27:26.364872: F tensorflow/core/kernels/concat_lib_gpu_impl.cu.cc:161] Non-OK-status: GpuLaunchKernel( concat_variable_kernel<T, IntType, true>, config.block_count, config.thread_per_block, smem_usage, gpu_device.stream(), input_ptrs, output_scan, static_cast<IntType>(output->dimension(0)), static_cast<IntType>(output->dimension(1)), output->data()) status: Internal: invalid configuration argument
```
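If I read the failing check correctly, the CUDA driver rejects the launch configuration computed for the concat kernel ("invalid configuration argument"). My working hypothesis (an assumption on my part, not verified against the TF source) is 32-bit indexing: if the gradients are packed into a single flat tensor before all-reduce, the concatenated output holds roughly 3e9 elements, which no longer fits the int32 `IntType` used to index it:

```python
# Back-of-the-envelope check (assumption: the all-reduce packs gradients into
# one flat tensor, which is produced by the concat kernel in the error above).
num_elements = 3_000_000_000     # ~3B parameters -> ~3e9 float32 elements
int32_max = 2**31 - 1            # ~2.15e9, the 32-bit indexing limit
print(num_elements > int32_max)  # True -> a single packed concat overflows
```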
This appears to be an issue that occurs only when the model is large enough and distributed training is used: the same setup trains successfully on a single GPU with a batch size of 12, and on two GPUs with a 1.5B-parameter model.
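If the overflow hypothesis is right, one mitigation might be to split the gradient packing into more, smaller chunks via the strategy's cross-device ops, so that no single concatenated tensor crosses the 32-bit limit. A sketch (untested; num_packs=8 is an arbitrary guess, not a tuned value):

```python
import tensorflow as tf

# Sketch of a possible workaround (assumption: more packs keeps each packed
# gradient tensor under the 32-bit element limit during all-reduce).
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.NcclAllReduce(num_packs=8))
```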
I understand that TensorFlow may not be the best option for training large models, but for now I need to resolve this issue.