Add Support for GPT-2 Training on different Devices #551
Merged
Conversation
NPU (910B3):
[09/10 10:37:57 libai]: >>> done with building model. Building time: 0.282 seconds
WARNING [09/10 10:37:57 lb.scheduler.lr_scheduler]: warmup iters equals to zero, return CosineLR
[09/10 10:38:03 lb.engine.trainer]: Starting training from iteration 0
[09/10 10:40:56 lb.utils.events]: eta: 21:00:38 iteration: 19/10000 consumed_samples: 80 total_loss: 9.895 time: 7.5187 s/iter data_time: 0.0021 s/iter total_throughput: 0.53 samples/s lr: 1.50e-04
[09/10 10:43:32 lb.utils.events]: eta: 21:05:47 iteration: 39/10000 consumed_samples: 160 total_loss: 9.027 time: 7.6572 s/iter data_time: 0.0019 s/iter total_throughput: 0.52 samples/s lr: 1.50e-04
[09/10 10:46:05 lb.utils.events]: eta: 21:06:05 iteration: 59/10000 consumed_samples: 240 total_loss: 8.362 time: 7.6549 s/iter data_time: 0.0015 s/iter total_throughput: 0.52 samples/s lr: 1.50e-04
[09/10 10:48:42 lb.utils.events]: eta: 21:08:55 iteration: 79/10000 consumed_samples: 320 total_loss: 7.847 time: 7.7127 s/iter data_time: 0.0013 s/iter total_throughput: 0.52 samples/s lr: 1.50e-04
[09/10 10:51:22 lb.utils.events]: eta: 21:18:52 iteration: 99/10000 consumed_samples: 400 total_loss: 7.628 time: 7.7640 s/iter data_time: 0.0013 s/iter total_throughput: 0.52 samples/s lr: 1.50e-04
[09/10 10:53:53 lb.utils.events]: eta: 21:04:10 iteration: 119/10000 consumed_samples: 480 total_loss: 7.441 time: 7.7314 s/iter data_time: 0.0013 s/iter total_throughput: 0.52 samples/s lr: 1.50e-04

CUDA (A100):
[09/10 10:50:47 libai]: >>> done with building model. Building time: 5.722 seconds
WARNING [09/10 10:50:47 lb.scheduler.lr_scheduler]: warmup iters equals to zero, return CosineLR
[09/10 10:50:50 lb.engine.trainer]: Starting training from iteration 0
[09/10 10:50:54 lb.utils.events]: eta: 0:10:15 iteration: 19/10000 consumed_samples: 80 total_loss: 9.83 time: 0.0689 s/iter data_time: 0.0008 s/iter total_throughput: 58.05 samples/s lr: 1.50e-04
[09/10 10:50:58 lb.utils.events]: eta: 0:10:15 iteration: 39/10000 consumed_samples: 160 total_loss: 9.122 time: 0.1458 s/iter data_time: 0.0007 s/iter total_throughput: 27.43 samples/s lr: 1.50e-04
[09/10 10:51:00 lb.utils.events]: eta: 0:10:12 iteration: 59/10000 consumed_samples: 240 total_loss: 8.388 time: 0.1214 s/iter data_time: 0.0007 s/iter total_throughput: 32.94 samples/s lr: 1.50e-04
[09/10 10:51:03 lb.utils.events]: eta: 0:10:11 iteration: 79/10000 consumed_samples: 320 total_loss: 8.019 time: 0.1357 s/iter data_time: 0.0008 s/iter total_throughput: 29.48 samples/s lr: 1.50e-04
[09/10 10:51:05 lb.utils.events]: eta: 0:10:09 iteration: 99/10000 consumed_samples: 400 total_loss: 7.635 time: 0.1232 s/iter data_time: 0.0008 s/iter total_throughput: 32.47 samples/s lr: 1.50e-04
[09/10 10:51:06 lb.utils.events]: eta: 0:10:09 iteration: 119/10000 consumed_samples: 480 total_loss: 7.461 time: 0.1132 s/iter data_time: 0.0008 s/iter total_throughput: 35.34 samples/s lr: 1.50e-04
[09/10 10:51:08 lb.utils.events]: eta: 0:10:09 iteration: 139/10000 consumed_samples: 560 total_loss: 7.367 time: 0.1061 s/iter data_time: 0.0009 s/iter total_throughput: 37.72 samples/s lr: 1.50e-04
[09/10 10:51:09 lb.utils.events]: eta: 0:10:06 iteration: 159/10000 consumed_samples: 640 total_loss: 7.305 time: 0.1003 s/iter data_time: 0.0008 s/iter total_throughput: 39.88 samples/s lr: 1.50e-04
[09/10 10:51:10 lb.utils.events]: eta: 0:10:04 iteration: 179/10000 consumed_samples: 720 total_loss: 7.214 time: 0.0975 s/iter data_time: 0.0008 s/iter total_throughput: 41.02 samples/s lr: 1.50e-04
[09/10 10:51:12 lb.utils.events]: eta: 0:10:03 iteration: 199/10000 consumed_samples: 800 total_loss: 7.132 time: 0.0940 s/iter data_time: 0.0007 s/iter total_throughput: 42.55 samples/s lr: 1.50e-04
[09/10 10:51:13 lb.utils.events]: eta: 0:10:02 iteration: 219/10000 consumed_samples: 880 total_loss: 6.986 time: 0.0911 s/iter data_time: 0.0008 s/iter total_throughput: 43.93 samples/s lr: 1.50e-04
[09/10 10:51:14 lb.utils.events]: eta: 0:10:01 iteration: 239/10000 consumed_samples: 960 total_loss: 6.866 time: 0.0886 s/iter data_time: 0.0009 s/iter total_throughput: 45.15 samples/s lr: 1.50e-04
[09/10 10:51:18 lb.utils.events]: eta: 0:10:00 iteration: 259/10000 consumed_samples: 1040 total_loss: 6.764 time: 0.0958 s/iter data_time: 0.0008 s/iter total_throughput: 41.74 samples/s lr: 1.50e-04
[09/10 10:51:19 lb.utils.events]: eta: 0:09:58 iteration: 279/10000 consumed_samples: 1120 total_loss: 6.655 time: 0.0933 s/iter data_time: 0.0008 s/iter total_throughput: 42.85 samples/s lr: 1.50e-04
xiezipeng-ML approved these changes on Sep 11, 2024
Flowingsun007 approved these changes on Sep 12, 2024
fpzh2011 approved these changes on Sep 12, 2024
0x404 approved these changes on Sep 12, 2024
Getting Started
Prepare the Data and Vocabulary
Ensure the correct location of gpt_data:
Option 1: Create a symbolic link to the dataset directory.
Option 2: Modify the configuration file configs/gpt2_pretrain.py.
Adjust the following configuration based on your specific environment:
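Below is a minimal sketch of the kind of path entries that typically need adjusting inside configs/gpt2_pretrain.py. The variable names (vocab_file, merge_files, data_prefix), the file names under gpt_data, and the tokenization/dataloader objects imported from the common GPT dataset config are assumptions for illustration; verify them against the actual file in your checkout.

# Sketch only -- the names and paths below are assumptions, check configs/gpt2_pretrain.py.
from .common.data.gpt_dataset import dataloader, tokenization  # assumed existing import in the config

# Point these at wherever gpt_data actually lives (or keep the defaults
# if you created the symbolic link from Option 1).
vocab_file = "/path/to/gpt_data/gpt2-vocab.json"                  # assumed vocab file name
merge_files = "/path/to/gpt_data/gpt2-merges.txt"                 # assumed merges file name
data_prefix = "/path/to/gpt_data/loss_compara_content_sentence"   # assumed dataset prefix

# Route the tokenizer and dataloader to the paths above.
tokenization.tokenizer.vocab_file = vocab_file
tokenization.tokenizer.merges_file = merge_files
dataloader.train.dataset[0].data_prefix = data_prefix
dataloader.train.dataset[0].indexed_dataset.data_prefix = data_prefix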
How to Train the GPT-2 Model with NPU/XPU
python3 -m oneflow.distributed.launch \
    --nproc_per_node 1 \
    --nnodes 1 \
    --node_rank 0 \
    --master_addr 127.0.0.1 \
    --master_port 12345 \
    tools/train_net.py --config-file=configs/gpt2_pretrain.py \
    graph.enabled=False \
    train.input_placement_device="npu" \
    train.dist.device_type="npu" \
    train.amp.enabled=False \
    model.cfg.scale_mask_softmax_fusion=False \
    model.cfg.bias_gelu_fusion=False

To train on an XPU instead, change "npu" to "xpu" in train.input_placement_device and train.dist.device_type.