
Commit e7ce245

Merge pull request Azure#92 from dipankar-ray/master
updated pipeline notebooks with expanded tutorial
2 parents e039b98 + ef5844f commit e7ce245

19 files changed: +3780 −273 lines

pipeline/20news.pkl

5.35 MB
Binary file not shown.

pipeline/README.md

Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
# Azure Machine Learning Pipeline

## Overview

[Azure Machine Learning Pipelines](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-ml-pipelines) enable data scientists to create and manage multiple simple and complex workflows concurrently. A typical pipeline has multiple tasks to prepare data, train, deploy, and evaluate models. Individual steps in the pipeline can use diverse compute options (for example, CPU for data preparation and GPU for training) and languages.

The Python-based Azure Machine Learning Pipeline SDK provides interfaces to work with Azure Machine Learning Pipelines. To get started quickly, the SDK includes imperative constructs for sequencing and parallelizing steps. Declarative data dependencies let the service optimize how tasks are executed. The SDK can be used from Jupyter Notebook or any other preferred IDE, and it includes a framework of pre-built modules for common tasks such as data transfer and compute provisioning.
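A minimal sketch of these constructs, assuming an existing workspace configuration, an AmlCompute cluster named `cpu-cluster`, and two placeholder scripts `prepare.py` and `train.py` in a `./scripts` folder. The `PipelineData` object declares the data dependency that sequences the two steps:

```python
from azureml.core import Workspace, Experiment
from azureml.core.compute import ComputeTarget
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()
compute_target = ComputeTarget(workspace=ws, name="cpu-cluster")  # assumed existing cluster

# Intermediate data handed from the preparation step to the training step
prepared_data = PipelineData("prepared_data", datastore=ws.get_default_datastore())

prep_step = PythonScriptStep(
    name="prepare data",
    script_name="prepare.py",          # placeholder script
    source_directory="./scripts",
    outputs=[prepared_data],
    compute_target=compute_target)

train_step = PythonScriptStep(
    name="train model",
    script_name="train.py",            # placeholder script
    source_directory="./scripts",
    inputs=[prepared_data],            # declarative dependency on prep_step's output
    compute_target=compute_target)

# The data dependency sequences the steps; unrelated steps can run in parallel
pipeline = Pipeline(workspace=ws, steps=[train_step])
run = Experiment(ws, "pipeline-overview-demo").submit(pipeline)
```

Because `train_step` consumes `prepared_data`, submitting the pipeline automatically pulls in `prep_step` and runs it first.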
Data management and reuse across pipelines and pipeline runs is simplified by named, strictly versioned data sources and by named inputs and outputs for processing tasks. Pipelines also enable collaboration across teams of data scientists by recording all intermediate tasks and data.

### Why build pipelines?

With pipelines, you can optimize your workflow for simplicity, speed, portability, and reuse. When building pipelines with Azure Machine Learning, you can focus on what you know best, machine learning, rather than infrastructure.

Using distinct steps makes it possible to rerun only the steps you need as you tweak and test your workflow. Once a pipeline is designed, most fine-tuning happens around its training loop. When you rerun the pipeline, execution jumps to the steps that need to run again, such as an updated training script, and skips steps whose scripts and metadata have not changed.
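Each step's `allow_reuse` flag controls this behavior. A minimal sketch, reusing the placeholder names from the example above:

```python
from azureml.pipeline.steps import PythonScriptStep

train_step = PythonScriptStep(
    name="train model",
    script_name="train.py",            # placeholder script
    source_directory="./scripts",
    inputs=[prepared_data],
    compute_target=compute_target,
    allow_reuse=True)  # on reruns, skip this step if its script, inputs, and parameters are unchanged
```

Setting `allow_reuse=False` instead forces the step to execute on every pipeline submission.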
With Azure Machine Learning, you can use distinct toolkits and frameworks for each step in your pipeline. Azure coordinates the various compute targets you use, so your intermediate data can be shared with the downstream compute targets easily.

![MLLifecycle](aml-pipelines-concept.png)

### Azure Machine Learning Pipelines Features

Azure Machine Learning Pipelines optimize for simplicity, speed, and efficiency. The following key concepts make it possible for a data scientist to focus on ML rather than infrastructure.

**Unattended execution**: Schedule a few scripts to run in parallel or in sequence in a reliable and unattended manner. Since data preparation and modeling can last days or weeks, you can focus on other tasks while your pipeline is running.

**Mixed and diverse compute**: Use multiple pipelines that are reliably coordinated across heterogeneous and scalable compute and storage resources. Individual pipeline steps can run on different compute targets, such as HDInsight, GPU Data Science VMs, and Databricks, to make efficient use of the available compute options.

**Reusability**: Pipelines can be templatized for specific scenarios such as retraining and batch scoring, and they can be triggered from external systems via simple REST calls (see the sketch below).

**Tracking and versioning**: Instead of manually tracking data and result paths as you iterate, use the pipelines SDK to explicitly name and version your data sources, inputs, and outputs, and to manage scripts and data separately for increased productivity.
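As a sketch of the reusability point above: a pipeline built with the SDK can be published to obtain a REST endpoint, which an external system can then invoke with an authenticated POST request. The pipeline object, names, and interactive login below are illustrative assumptions, not part of the notebooks in this directory:

```python
import requests
from azureml.core.authentication import InteractiveLoginAuthentication

# Publish the pipeline built earlier to obtain a reusable REST endpoint
published_pipeline = pipeline.publish(
    name="retraining-pipeline",                        # illustrative name
    description="Template pipeline for retraining")

# Any caller with a valid Azure AD token for the workspace can trigger a run
auth = InteractiveLoginAuthentication()
headers = auth.get_authentication_header()

response = requests.post(
    published_pipeline.endpoint,
    headers=headers,
    json={"ExperimentName": "triggered-retraining"})   # illustrative experiment name
print("Submitted run:", response.json().get("Id"))
```

Because the trigger is a plain REST call, external schedulers or CI systems can start retraining or batch-scoring runs without using the Python SDK.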
### Notebooks

This directory contains two types of notebooks:

* The first type introduces the core Azure Machine Learning Pipelines features. These notebooks are designed to be followed in sequence:

1. aml-pipelines-getting-started.ipynb
2. aml-pipelines-with-data-dependency-steps.ipynb
3. aml-pipelines-publish-and-run-using-rest-endpoint.ipynb
4. aml-pipelines-data-transfer.ipynb
5. aml-pipelines-use-databricks-as-compute-target.ipynb
6. aml-pipelines-use-adla-as-compute-target.ipynb

* The second type illustrates more sophisticated scenarios; these notebooks are independent of each other and include:

- pipeline-batch-scoring.ipynb
- pipeline-style-transfer.ipynb

pipeline/aml-pipelines-concept.png

23.9 KB
Lines changed: 336 additions & 0 deletions
@@ -0,0 +1,336 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright (c) Microsoft Corporation. All rights reserved. \n",
    "Licensed under the MIT License."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Azure Machine Learning Pipeline with DataTransferStep\n",
    "This notebook demonstrates the use of DataTransferStep in an Azure Machine Learning Pipeline.\n",
    "\n",
    "In certain cases, you will need to transfer data from one data location to another. For example, your data may be in Files storage and you may want to move it to Blob storage, or your data may be in an ADLS account and you may want to make it available in Blob storage. The built-in **DataTransferStep** class helps you transfer data in these situations.\n",
    "\n",
    "The example below shows how to move data in an ADLS account to Blob storage."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Azure Machine Learning and Pipeline SDK-specific imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import azureml.core\n",
    "from azureml.core.compute import ComputeTarget, DatabricksCompute, DataFactoryCompute\n",
    "from azureml.exceptions import ComputeTargetException\n",
    "from azureml.core import Workspace, Run, Experiment\n",
    "from azureml.pipeline.core import Pipeline, PipelineData\n",
    "from azureml.pipeline.steps import AdlaStep\n",
    "from azureml.core.datastore import Datastore\n",
    "from azureml.data.data_reference import DataReference\n",
    "from azureml.data.sql_data_reference import SqlDataReference\n",
    "from azureml.core import attach_legacy_compute_target\n",
    "from azureml.data.stored_procedure_parameter import StoredProcedureParameter, StoredProcedureParameterType\n",
    "from azureml.pipeline.steps import DataTransferStep\n",
    "\n",
    "# Check core SDK version number\n",
    "print(\"SDK version:\", azureml.core.VERSION)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Initialize Workspace\n",
    "\n",
    "Initialize a workspace object from persisted configuration. Make sure the config file is present at .\\config.json\n",
    "\n",
    "If you don't have a config.json file, please go through the configuration Notebook located here:\n",
    "https://github.com/Azure/MachineLearningNotebooks. \n",
    "\n",
    "This sets you up with a working config file that has information on your workspace, subscription id, etc. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "create workspace"
    ]
   },
   "outputs": [],
   "source": [
    "ws = Workspace.from_config()\n",
    "print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Register Datastores\n",
    "\n",
    "In the code cell below, you will need to fill in the appropriate values for the workspace name, datastore name, subscription id, resource group, store name, tenant id, client id, and client secret that are associated with your ADLS datastore. \n",
    "\n",
    "For background on registering your data store, consult this article:\n",
    "\n",
    "https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-service-to-service-authenticate-using-active-directory"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# un-comment the following and replace the strings with the \n",
    "# correct values for your ADLS datastore\n",
    "\n",
    "# workspace = \"<my-workspace-name>\"\n",
    "# datastore_name = \"<my-datastore-name>\" # ADLS datastore name\n",
    "# subscription_id = \"<my-subscription-id>\" # subscription id of ADLS account\n",
    "# resource_group = \"<my-resource-group>\" # resource group of ADLS account\n",
    "# store_name = \"<my-storename>\" # ADLS account name\n",
    "# tenant_id = \"<my-tenant-id>\" # tenant id of service principal\n",
    "# client_id = \"<my-client-id>\" # client id of service principal\n",
    "# client_secret = \"<my-client-secret>\" # the secret of service principal\n",
    "\n",
    "\n",
    "try:\n",
    "    adls_datastore = Datastore.get(ws, datastore_name)\n",
    "    print(\"found datastore with name: %s\" % datastore_name)\n",
    "except:\n",
    "    adls_datastore = Datastore.register_azure_data_lake(\n",
    "        workspace=ws,\n",
    "        datastore_name=datastore_name,\n",
    "        subscription_id=subscription_id, # subscription id of ADLS account\n",
    "        resource_group=resource_group, # resource group of ADLS account\n",
    "        store_name=store_name, # ADLS account name\n",
    "        tenant_id=tenant_id, # tenant id of service principal\n",
    "        client_id=client_id, # client id of service principal\n",
    "        client_secret=client_secret) # the secret of service principal\n",
    "    print(\"registered datastore with name: %s\" % datastore_name)\n",
    "\n",
    "# un-comment the following and replace the strings with the\n",
    "# correct values for your blob datastore\n",
    "\n",
    "# blob_datastore_name = \"<my-blob-datastore-name>\"\n",
    "# account_name = \"<my-blob-account-name>\"\n",
    "# container_name = \"<my-blob-container-name>\"\n",
    "# account_key = \"<my-blob-account-key>\"\n",
    "\n",
    "try:\n",
    "    blob_datastore = Datastore.get(ws, blob_datastore_name)\n",
    "    print(\"found blob datastore with name: %s\" % blob_datastore_name)\n",
    "except:\n",
    "    blob_datastore = Datastore.register_azure_blob_container(\n",
    "        workspace=ws,\n",
    "        datastore_name=blob_datastore_name,\n",
    "        account_name=account_name, # Storage account name\n",
    "        container_name=container_name, # Name of Azure blob container\n",
    "        account_key=account_key) # Storage account key\n",
    "    print(\"registered blob datastore with name: %s\" % blob_datastore_name)\n",
    "\n",
    "# CLI:\n",
    "# az ml datastore register-blob -n <datastore-name> -a <account-name> -c <container-name> -k <account-key> [-t <sas-token>]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create DataReferences"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "adls_datastore = Datastore(workspace=ws, name=\"MyAdlsDatastore\")\n",
    "\n",
    "# adls\n",
    "adls_data_ref = DataReference(\n",
    "    datastore=adls_datastore,\n",
    "    data_reference_name=\"adls_test_data\",\n",
    "    path_on_datastore=\"testdata\")\n",
    "\n",
    "blob_datastore = Datastore(workspace=ws, name=\"MyBlobDatastore\")\n",
    "\n",
    "# blob data\n",
    "blob_data_ref = DataReference(\n",
    "    datastore=blob_datastore,\n",
    "    data_reference_name=\"blob_test_data\",\n",
    "    path_on_datastore=\"testdata\")\n",
    "\n",
    "print(\"obtained adls, blob data references\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup Data Factory Account"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_factory_name = 'adftest'\n",
    "\n",
    "def get_or_create_data_factory(workspace, factory_name):\n",
    "    try:\n",
    "        return DataFactoryCompute(workspace, factory_name)\n",
    "    except ComputeTargetException as e:\n",
    "        if 'ComputeTargetNotFound' in e.message:\n",
    "            print('Data factory not found, creating...')\n",
    "            provisioning_config = DataFactoryCompute.provisioning_configuration()\n",
    "            data_factory = ComputeTarget.create(workspace, factory_name, provisioning_config)\n",
    "            data_factory.wait_for_provisioning()\n",
    "            return data_factory\n",
    "        else:\n",
    "            raise e\n",
    "\n",
    "data_factory_compute = get_or_create_data_factory(ws, data_factory_name)\n",
    "\n",
    "print(\"setup data factory account complete\")\n",
    "\n",
    "# CLI:\n",
    "# Create: az ml computetarget setup datafactory -n <name>\n",
    "# BYOC: az ml computetarget attach datafactory -n <name> -i <resource-id>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create a DataTransferStep"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**DataTransferStep** is used to transfer data between Azure Blob storage, Azure Data Lake Store, and Azure SQL Database.\n",
    "\n",
    "- **name:** Name of the step\n",
    "- **source_data_reference:** Input connection that serves as the source of the data transfer operation.\n",
    "- **destination_data_reference:** Input connection that serves as the destination of the data transfer operation.\n",
    "- **compute_target:** Azure Data Factory account to use for transferring data.\n",
    "- **allow_reuse:** Whether the step should reuse results of a previous DataTransferStep when run with the same inputs. Set to False to force data to be transferred again.\n",
    "\n",
    "The following optional arguments explicitly specify whether a path corresponds to a file or a directory. They are useful when the storage contains both a file and a directory with the same name, or when creating a new destination path.\n",
    "\n",
    "- **source_reference_type:** An optional string specifying the type of source_data_reference. Possible values are 'file' and 'directory'. When not specified, the type of the existing path is used, or 'directory' if the path is new.\n",
    "- **destination_reference_type:** An optional string specifying the type of destination_data_reference. Possible values are 'file' and 'directory'. When not specified, the type of the existing path is used, or 'directory' if the path is new."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "transfer_adls_to_blob = DataTransferStep(\n",
    "    name=\"transfer_adls_to_blob\",\n",
    "    source_data_reference=adls_data_ref,\n",
    "    destination_data_reference=blob_data_ref,\n",
    "    compute_target=data_factory_compute)\n",
    "\n",
    "print(\"data transfer step created\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Build and Submit the Experiment"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pipeline = Pipeline(\n",
    "    description=\"data_transfer_101\",\n",
    "    workspace=ws,\n",
    "    steps=[transfer_adls_to_blob])\n",
    "\n",
    "pipeline_run = Experiment(ws, \"Data_Transfer_example\").submit(pipeline)\n",
    "pipeline_run.wait_for_completion()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### View Run Details"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from azureml.widgets import RunDetails\n",
    "RunDetails(pipeline_run).show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Next: Databricks as a Compute Target\n",
    "To use Databricks as a compute target from an Azure Machine Learning Pipeline, a DatabricksStep is used. This [notebook](./aml-pipelines-use-databricks-as-compute-target.ipynb) demonstrates the use of a DatabricksStep in an Azure Machine Learning Pipeline."
   ]
  }
 ],
 "metadata": {
  "authors": [
   {
    "name": "diray"
   }
  ],
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
