
Commit 2e2a943

Update distributed pytorch notebook
1 parent 94ba3f3 commit 2e2a943

File tree

2 files changed: +74 -29 lines changed


how-to-use-azureml/training-with-deep-learning/distributed-pytorch-with-horovod/distributed-pytorch-with-horovod.ipynb

Lines changed: 55 additions & 16 deletions
@@ -14,19 +14,16 @@
    "metadata": {},
    "source": [
     "# Distributed PyTorch with Horovod\n",
-    "In this tutorial, you will train a PyTorch model on the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset using distributed training via [Horovod](https://github.com/uber/horovod)."
+    "In this tutorial, you will train a PyTorch model on the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset using distributed training via [Horovod](https://github.com/uber/horovod) across a GPU cluster."
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## Prerequisites\n",
-    "* Understand the [architecture and terms](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture) introduced by Azure Machine Learning (AML)\n",
-    "* Go through the [00.configuration.ipynb](https://github.com/Azure/MachineLearningNotebooks/blob/master/00.configuration.ipynb) notebook to:\n",
-    "    * install the AML SDK\n",
-    "    * create a workspace and its configuration file (`config.json`)\n",
-    "* Review the [tutorial](https://aka.ms/aml-notebook-pytorch) on single-node PyTorch training using the SDK"
+    "* Go through the [Configuration](https://github.com/Azure/MachineLearningNotebooks/blob/master/configuration.ipynb) notebook to install the Azure Machine Learning Python SDK and create an Azure ML `Workspace`\n",
+    "* Review the [tutorial](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/training-with-deep-learning/train-hyperparameter-tune-deploy-with-pytorch/train-hyperparameter-tune-deploy-with-pytorch.ipynb) on single-node PyTorch training using Azure Machine Learning"
    ]
   },
   {
@@ -92,10 +89,12 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Create a remote compute target\n",
-    "You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) to execute your training script on. In this tutorial, you create an `AmlCompute` cluster as your training compute resource. This code creates a cluster for you if it does not already exist in your workspace.\n",
+    "## Create or attach existing AmlCompute\n",
+    "You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for training your model. In this tutorial, we use Azure ML managed compute ([AmlCompute](https://docs.microsoft.com/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute)) as our remote training compute resource. Specifically, the code below creates a `STANDARD_NC6` GPU cluster that autoscales from `0` to `4` nodes.\n",
     "\n",
-    "**Creation of the cluster takes approximately 5 minutes.** If the cluster is already in your workspace this code will skip the cluster creation process."
+    "**Creation of AmlCompute takes approximately 5 minutes.** If an AmlCompute with that name is already in your workspace, this code will skip the creation process.\n",
+    "\n",
+    "As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read [this article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) on the default limits and how to request more quota."
    ]
   },
   {
@@ -115,31 +114,31 @@
     "    print('Found existing compute target.')\n",
     "except ComputeTargetException:\n",
     "    print('Creating a new compute target...')\n",
-    "    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6', \n",
+    "    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6',\n",
     "                                                            max_nodes=4)\n",
     "\n",
     "    # create the cluster\n",
     "    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)\n",
     "\n",
     "    compute_target.wait_for_completion(show_output=True)\n",
     "\n",
-    "# Use the 'status' property to get a detailed status for the current cluster. \n",
+    "# Use the 'status' property to get a detailed status for the current AmlCompute. \n",
     "print(compute_target.status.serialize())"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "The above code creates a GPU cluster. If you instead want to create a CPU cluster, provide a different VM size to the `vm_size` parameter, such as `STANDARD_D2_V2`."
+    "The above code creates GPU compute. If you instead want to create CPU compute, provide a different VM size to the `vm_size` parameter, such as `STANDARD_D2_V2`."
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## Train model on the remote compute\n",
-    "Now that we have the cluster ready to go, let's run our distributed training job."
+    "Now that we have the AmlCompute ready to go, let's run our distributed training job."
    ]
   },
   {
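
The hunk above only shows the middle of the provisioning cell. A minimal sketch of the full cell as it reads after this commit is given below; the imports, the `ws` workspace object, and the `cluster_name` value are assumed from earlier cells of the notebook and are not part of this diff.

```Python
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# assumed cluster name; the actual value is defined in an earlier notebook cell
cluster_name = 'gpucluster'

try:
    # reuse the compute target if one with this name already exists in the workspace
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing compute target.')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6', max_nodes=4)

    # create the cluster and wait for provisioning to finish
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)
    compute_target.wait_for_completion(show_output=True)

# Use the 'status' property to get a detailed status for the current AmlCompute.
print(compute_target.status.serialize())
```
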
@@ -166,7 +165,27 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Copy the training script `pytorch_horovod_mnist.py` into this project directory."
+    "### Prepare training script\n",
+    "Now you will need to create your training script. In this tutorial, the script for distributed training of MNIST is already provided for you at `pytorch_horovod_mnist.py`. In practice, you should be able to take any custom PyTorch training script as is and run it with Azure ML without having to modify your code.\n",
+    "\n",
+    "However, if you would like to use Azure ML's [metric logging](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#logging) capabilities, you will have to add a small amount of Azure ML logic inside your training script. In this example, at each logging interval, we will log the loss for that minibatch to our Azure ML run.\n",
+    "\n",
+    "To do so, in `pytorch_horovod_mnist.py`, we will first access the Azure ML `Run` object within the script:\n",
+    "```Python\n",
+    "from azureml.core.run import Run\n",
+    "run = Run.get_context()\n",
+    "```\n",
+    "Later within the script, we log the loss metric to our run:\n",
+    "```Python\n",
+    "run.log('loss', loss.item())\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Once your script is ready, copy the training script `pytorch_horovod_mnist.py` into the project directory."
    ]
   },
   {
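
As a compact illustration of the logging pattern described in the cell above, the two calls sit in the training loop roughly as in the sketch below. The loop body here is a stand-in rather than the notebook's actual code; when the script runs outside a submitted Azure ML job, `Run.get_context()` returns an offline run object, so the `log` calls are harmless locally.

```Python
from azureml.core.run import Run

run = Run.get_context()  # handle to the current Azure ML run (offline run when executed locally)

# stand-in loop: in the real script this is the minibatch loop over train_loader
for batch_idx in range(100):
    loss_value = 1.0 / (batch_idx + 1)  # placeholder for the actual minibatch loss
    if batch_idx % 10 == 0:
        # each call appends one value to the 'loss' metric series of the run
        run.log('loss', loss_value)
```
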
@@ -205,7 +224,7 @@
    "metadata": {},
    "source": [
     "### Create a PyTorch estimator\n",
-    "The AML SDK's PyTorch estimator enables you to easily submit PyTorch training jobs for both single-node and distributed runs. For more information on the PyTorch estimator, refer [here](https://docs.microsoft.com/azure/machine-learning/service/how-to-train-pytorch)."
+    "The Azure ML SDK's PyTorch estimator enables you to easily submit PyTorch training jobs for both single-node and distributed runs. For more information on the PyTorch estimator, refer [here](https://docs.microsoft.com/azure/machine-learning/service/how-to-train-pytorch)."
    ]
   },
   {
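
The estimator cell itself is unchanged by this commit and therefore not shown in the diff. A rough sketch of such a cell is given below, using the node count, worker count, and `distributed_backend='mpi'` setting described in the next hunk; the `project_folder` variable is an assumption.

```Python
from azureml.train.dnn import PyTorch

estimator = PyTorch(source_directory=project_folder,        # folder containing pytorch_horovod_mnist.py
                    compute_target=compute_target,          # the AmlCompute created above
                    entry_script='pytorch_horovod_mnist.py',
                    node_count=2,                            # run on 2 nodes
                    process_count_per_node=1,                # one Horovod worker per node
                    distributed_backend='mpi',               # required for an MPI/Horovod distributed run
                    use_gpu=True)
```
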
@@ -232,6 +251,26 @@
     "The above code specifies that we will run our training script on `2` nodes, with one worker per node. In order to execute a distributed run using MPI/Horovod, you must provide the argument `distributed_backend='mpi'`. Using this estimator with these settings, PyTorch, Horovod and their dependencies will be installed for you. However, if your script also uses other packages, make sure to install them via the `PyTorch` constructor's `pip_packages` or `conda_packages` parameters."
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To use the latest version of PyTorch 1.0, run the following cell:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "estimator.conda_dependencies.remove_conda_package('pytorch=0.4.0')\n",
+    "estimator.conda_dependencies.remove_pip_package('horovod==0.13.11')\n",
+    "estimator.conda_dependencies.add_conda_package('pytorch-nightly')\n",
+    "estimator.conda_dependencies.add_channel('pytorch')\n",
+    "estimator.conda_dependencies.add_pip_package('horovod==0.15.2')"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -255,7 +294,7 @@
    "metadata": {},
    "source": [
     "### Monitor your run\n",
-    "You can monitor the progress of the run with a Jupyter widget. Like the run submission, the widget is asynchronous and provides live updates every 10-15 seconds until the job completes."
+    "You can monitor the progress of the run with a Jupyter widget. Like the run submission, the widget is asynchronous and provides live updates every 10-15 seconds until the job completes. You can see that the widget automatically plots and visualizes the loss metric that we logged to the Azure ML run."
    ]
   },
   {
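
The Jupyter widget referenced above comes from the `azureml.widgets` package; the monitoring cell is not part of this diff, but its use is roughly a one-liner, assuming `run` is the submitted run object.

```Python
from azureml.widgets import RunDetails

# render a live-updating view of the run, including a plot of the logged 'loss' metric
RunDetails(run).show()
```
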

how-to-use-azureml/training-with-deep-learning/distributed-pytorch-with-horovod/pytorch_horovod_mnist.py

Lines changed: 19 additions & 13 deletions
@@ -1,17 +1,16 @@
-# Copyright 2017 Uber Technologies, Inc.
-# Licensed under the Apache License, Version 2.0
-# Script from horovod/examples: https://github.com/uber/horovod/blob/master/examples/pytorch_mnist.py
-
 from __future__ import print_function
 import argparse
 import torch.nn as nn
 import torch.nn.functional as F
 import torch.optim as optim
 from torchvision import datasets, transforms
-from torch.autograd import Variable
 import torch.utils.data.distributed
 import horovod.torch as hvd
 
+from azureml.core.run import Run
+# get the Azure ML run object
+run = Run.get_context()
+
 # Training settings
 parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
 parser.add_argument('--batch-size', type=int, default=64, metavar='N',
@@ -30,6 +29,8 @@
                     help='random seed (default: 42)')
 parser.add_argument('--log-interval', type=int, default=10, metavar='N',
                     help='how many batches to wait before logging training status')
+parser.add_argument('--fp16-allreduce', action='store_true', default=False,
+                    help='use fp16 compression during allreduce')
 args = parser.parse_args()
 args.cuda = not args.no_cuda and torch.cuda.is_available()
 
@@ -97,9 +98,13 @@ def forward(self, x):
 optimizer = optim.SGD(model.parameters(), lr=args.lr * hvd.size(),
                       momentum=args.momentum)
 
+# Horovod: (optional) compression algorithm.
+compression = hvd.Compression.fp16 if args.fp16_allreduce else hvd.Compression.none
+
 # Horovod: wrap optimizer with DistributedOptimizer.
-optimizer = hvd.DistributedOptimizer(
-    optimizer, named_parameters=model.named_parameters())
+optimizer = hvd.DistributedOptimizer(optimizer,
+                                     named_parameters=model.named_parameters(),
+                                     compression=compression)
 
 
 def train(epoch):
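
For orientation within the script, the Horovod setup that precedes this hunk is unchanged by the commit and therefore not shown. In the upstream horovod/examples script on which this file is based, it follows the general pattern sketched below; `train_dataset` and `model` stand in for the script's actual MNIST dataset and `Net` model.

```Python
import torch
import torch.utils.data.distributed
import horovod.torch as hvd

hvd.init()                                    # one Horovod process per worker
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())   # pin each local process to its own GPU

# partition the training data so every worker sees a distinct shard
train_sampler = torch.utils.data.distributed.DistributedSampler(
    train_dataset, num_replicas=hvd.size(), rank=hvd.rank())

# start all workers from identical weights before wrapping the optimizer
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
```
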
@@ -108,7 +113,6 @@ def train(epoch):
     for batch_idx, (data, target) in enumerate(train_loader):
         if args.cuda:
             data, target = data.cuda(), target.cuda()
-        data, target = Variable(data), Variable(target)
         optimizer.zero_grad()
         output = model(data)
         loss = F.nll_loss(output, target)
@@ -117,13 +121,16 @@
         if batch_idx % args.log_interval == 0:
             print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                 epoch, batch_idx * len(data), len(train_sampler),
-                100. * batch_idx / len(train_loader), loss.data[0]))
+                100. * batch_idx / len(train_loader), loss.item()))
+
+            # log the loss to the Azure ML run
+            run.log('loss', loss.item())
 
 
 def metric_average(val, name):
-    tensor = torch.FloatTensor([val])
+    tensor = torch.tensor(val)
     avg_tensor = hvd.allreduce(tensor, name=name)
-    return avg_tensor[0]
+    return avg_tensor.item()
 
 
 def test():
@@ -133,10 +140,9 @@ def test():
     for data, target in test_loader:
         if args.cuda:
             data, target = data.cuda(), target.cuda()
-        data, target = Variable(data, volatile=True), Variable(target)
         output = model(data)
         # sum up batch loss
-        test_loss += F.nll_loss(output, target, size_average=False).data[0]
+        test_loss += F.nll_loss(output, target, size_average=False).item()
         # get the index of the max log-probability
         pred = output.data.max(1, keepdim=True)[1]
         test_accuracy += pred.eq(target.data.view_as(pred)).cpu().float().sum()
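
The remainder of `test()` is unchanged by this commit. In the upstream example it normalizes the accumulated metrics, averages them across workers with `metric_average` (updated above), and prints from rank 0 only, roughly as sketched here:

```Python
    # average metrics across all Horovod workers and report once, from rank 0
    test_loss /= len(test_sampler)
    test_accuracy /= len(test_sampler)
    test_loss = metric_average(test_loss, 'avg_loss')
    test_accuracy = metric_average(test_accuracy, 'avg_accuracy')

    if hvd.rank() == 0:
        print('\nTest set: Average loss: {:.4f}, Accuracy: {:.2f}%\n'.format(
            test_loss, 100. * test_accuracy))
```
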
