|
| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "metadata": {}, |
| 6 | + "source": [ |
| 7 | + "Copyright (c) Microsoft Corporation. All rights reserved. \n", |
| 8 | + "Licensed under the MIT License." |
| 9 | + ] |
| 10 | + }, |
| 11 | + { |
| 12 | + "cell_type": "markdown", |
| 13 | + "metadata": {}, |
| 14 | + "source": [ |
| 15 | + "# Azure Machine Learning Pipeline with DataTransferStep\n",
| 16 | + "This notebook demonstrates the use of DataTransferStep in an Azure Machine Learning Pipeline.\n",
| 17 | + "\n", |
| 18 | + "In certain cases, you will need to transfer data from one data location to another. For example, your data may be in Azure Files storage and you may want to move it to Blob storage, or your data may be in an ADLS account that you want to make available in Blob storage. The built-in **DataTransferStep** class helps you transfer data in these situations.\n",
| 19 | + "\n",
| 20 | + "The example below shows how to move data from an ADLS account to Blob storage."
| 21 | + ] |
| 22 | + }, |
| 23 | + { |
| 24 | + "cell_type": "markdown", |
| 25 | + "metadata": {}, |
| 26 | + "source": [ |
| 27 | + "## Azure Machine Learning and Pipeline SDK-specific imports" |
| 28 | + ] |
| 29 | + }, |
| 30 | + { |
| 31 | + "cell_type": "code", |
| 32 | + "execution_count": null, |
| 33 | + "metadata": {}, |
| 34 | + "outputs": [], |
| 35 | + "source": [ |
| 36 | + "import azureml.core\n",
| 37 | + "from azureml.core import Workspace, Experiment\n",
| 38 | + "from azureml.core.compute import ComputeTarget, DataFactoryCompute\n",
| 39 | + "from azureml.core.datastore import Datastore\n",
| 40 | + "from azureml.data.data_reference import DataReference\n",
| 41 | + "from azureml.exceptions import ComputeTargetException\n",
| 42 | + "from azureml.pipeline.core import Pipeline\n",
| 43 | + "from azureml.pipeline.steps import DataTransferStep\n",
| 49 | + "\n", |
| 50 | + "# Check core SDK version number\n", |
| 51 | + "print(\"SDK version:\", azureml.core.VERSION)" |
| 52 | + ] |
| 53 | + }, |
| 54 | + { |
| 55 | + "cell_type": "markdown", |
| 56 | + "metadata": {}, |
| 57 | + "source": [ |
| 58 | + "## Initialize Workspace\n", |
| 59 | + "\n", |
| 60 | + "Initialize a workspace object from persisted configuration. Make sure the config file is present at .\\config.json\n", |
| 61 | + "\n", |
| 62 | + "If you don't have a config.json file, please go through the configuration Notebook located here:\n", |
| 63 | + "https://github.com/Azure/MachineLearningNotebooks. \n", |
| 64 | + "\n", |
| 65 | + "This sets you up with a working config file that has information on your workspace, subscription id, etc. " |
| 66 | + ] |
| 67 | + }, |
| 68 | + { |
| 69 | + "cell_type": "code", |
| 70 | + "execution_count": null, |
| 71 | + "metadata": { |
| 72 | + "tags": [ |
| 73 | + "create workspace" |
| 74 | + ] |
| 75 | + }, |
| 76 | + "outputs": [], |
| 77 | + "source": [ |
| 78 | + "ws = Workspace.from_config()\n", |
| 79 | + "print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep='\\n')"
| 80 | + ] |
| 81 | + }, |
| 82 | + { |
| 83 | + "cell_type": "markdown", |
| 84 | + "metadata": {}, |
| 85 | + "source": [ |
| 86 | + "## Register Datastores\n", |
| 87 | + "\n", |
| 88 | + "In the code cell below, you will need to fill in the appropriate values for the workspace name, datastore name, subscription id, resource group, store name, tenant id, client id, and client secret that are associated with your ADLS datastore. \n", |
| 89 | + "\n", |
| 90 | + "For background on registering your data store, consult this article:\n", |
| 91 | + "\n", |
| 92 | + "https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-service-to-service-authenticate-using-active-directory" |
| 93 | + ] |
| 94 | + }, |
| 95 | + { |
| 96 | + "cell_type": "code", |
| 97 | + "execution_count": null, |
| 98 | + "metadata": {}, |
| 99 | + "outputs": [], |
| 100 | + "source": [ |
| 101 | + "# uncomment the following and replace the strings with the\n",
| 102 | + "# correct values for your ADLS datastore\n",
| 103 | + "\n", |
| 104 | + "# workspace = \"<my-workspace-name>\"\n", |
| 105 | + "# datastore_name = \"<my-datastore-name>\" # ADLS datastore name\n", |
| 106 | + "# subscription_id = \"<my-subscription-id>\" # subscription id of ADLS account\n", |
| 107 | + "# resource_group = \"<my-resource-group>\" # resource group of ADLS account\n", |
| 108 | + "# store_name = \"<my-storename>\" # ADLS account name\n", |
| 109 | + "# tenant_id = \"<my-tenant-id>\" # tenant id of service principal\n", |
| 110 | + "# client_id = \"<my-client-id>\" # client id of service principal\n", |
| 111 | + "# client_secret = \"<my-client-secret>\" # the secret of service principal\n", |
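| 112 | + "# (note: the service principal must already have been granted access\n",
| 113 | + "# to the ADLS account, or the registration and transfer will fail)\n",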
| 112 | + "\n", |
| 113 | + "\n", |
| 114 | + "try:\n", |
| 115 | + " adls_datastore = Datastore.get(ws, datastore_name)\n", |
| 116 | + " print(\"found datastore with name: %s\" % datastore_name)\n", |
| 117 | + "except Exception:\n",
| 118 | + " adls_datastore = Datastore.register_azure_data_lake(\n", |
| 119 | + " workspace=ws,\n", |
| 120 | + " datastore_name=datastore_name,\n", |
| 121 | + " subscription_id=subscription_id, # subscription id of ADLS account\n", |
| 122 | + " resource_group=resource_group, # resource group of ADLS account\n", |
| 123 | + " store_name=store_name, # ADLS account name\n", |
| 124 | + " tenant_id=tenant_id, # tenant id of service principal\n", |
| 125 | + " client_id=client_id, # client id of service principal\n", |
| 126 | + " client_secret=client_secret) # the secret of service principal\n", |
| 127 | + " print(\"registered datastore with name: %s\" % datastore_name)\n", |
| 128 | + "\n", |
| 129 | + "# uncomment the following and replace the strings with the\n",
| 130 | + "# correct values for your blob datastore\n",
| 131 | + "\n", |
| 132 | + "# blob_datastore_name = \"<my-blob-datastore-name>\"\n", |
| 133 | + "# account_name = \"<my-blob-account-name>\"\n", |
| 134 | + "# container_name = \"<my-blob-container-name>\"\n", |
| 135 | + "# account_key = \"<my-blob-account-key>\"\n", |
| 136 | + "\n", |
| 137 | + "try:\n", |
| 138 | + " blob_datastore = Datastore.get(ws, blob_datastore_name)\n", |
| 139 | + " print(\"found blob datastore with name: %s\" % blob_datastore_name)\n", |
| 140 | + "except Exception:\n",
| 141 | + " blob_datastore = Datastore.register_azure_blob_container(\n", |
| 142 | + " workspace=ws,\n", |
| 143 | + " datastore_name=blob_datastore_name,\n", |
| 144 | + " account_name=account_name, # Storage account name\n", |
| 145 | + " container_name=container_name, # Name of Azure blob container\n", |
| 146 | + " account_key=account_key) # Storage account key\n",
| 147 | + " print(\"registered blob datastore with name: %s\" % blob_datastore_name)\n", |
| 148 | + "\n", |
| 149 | + "# CLI:\n", |
| 150 | + "# az ml datastore register-blob -n <datastore-name> -a <account-name> -c <container-name> -k <account-key> [-t <sas-token>]" |
| 151 | + ] |
| 152 | + }, |
| 153 | + { |
| 154 | + "cell_type": "markdown", |
| 155 | + "metadata": {}, |
| 156 | + "source": [ |
| 157 | + "## Create DataReferences" |
| 158 | + ] |
| 159 | + }, |
| 160 | + { |
| 161 | + "cell_type": "code", |
| 162 | + "execution_count": null, |
| 163 | + "metadata": {}, |
| 164 | + "outputs": [], |
| 165 | + "source": [ |
| 166 | + "adls_datastore = Datastore(workspace=ws, name=datastore_name)\n",
| 167 | + "\n", |
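| 168 | + "# a DataReference identifies a path on a registered datastore;\n",
| 169 | + "# path_on_datastore is relative to the root of the underlying storage\n",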
| 168 | + "# adls\n", |
| 169 | + "adls_data_ref = DataReference(\n", |
| 170 | + " datastore=adls_datastore,\n", |
| 171 | + " data_reference_name=\"adls_test_data\",\n", |
| 172 | + " path_on_datastore=\"testdata\")\n", |
| 173 | + "\n", |
| 174 | + "blob_datastore = Datastore(workspace=ws, name=blob_datastore_name)\n",
| 175 | + "\n", |
| 176 | + "# blob data\n", |
| 177 | + "blob_data_ref = DataReference(\n", |
| 178 | + " datastore=blob_datastore,\n", |
| 179 | + " data_reference_name=\"blob_test_data\",\n", |
| 180 | + " path_on_datastore=\"testdata\")\n", |
| 181 | + "\n", |
| 182 | + "print(\"obtained adls, blob data references\")" |
| 183 | + ] |
| 184 | + }, |
| 185 | + { |
| 186 | + "cell_type": "markdown", |
| 187 | + "metadata": {}, |
| 188 | + "source": [ |
| 189 | + "## Setup Data Factory Account" |
| 190 | + ] |
| 191 | + }, |
| 192 | + { |
| 193 | + "cell_type": "code", |
| 194 | + "execution_count": null, |
| 195 | + "metadata": {}, |
| 196 | + "outputs": [], |
| 197 | + "source": [ |
| 198 | + "data_factory_name = 'adftest'\n", |
| 199 | + "\n", |
| 200 | + "def get_or_create_data_factory(workspace, factory_name):\n", |
| 201 | + " try:\n", |
| 202 | + " return DataFactoryCompute(workspace, factory_name)\n", |
| 203 | + " except ComputeTargetException as e:\n", |
| 204 | + " if 'ComputeTargetNotFound' in e.message:\n", |
| 205 | + " print('Data factory not found, creating...')\n", |
| 206 | + " provisioning_config = DataFactoryCompute.provisioning_configuration()\n", |
| 207 | + " data_factory = ComputeTarget.create(workspace, factory_name, provisioning_config)\n", |
| 208 | + " data_factory.wait_for_completion()\n",
| 209 | + " return data_factory\n", |
| 210 | + " else:\n", |
| 211 | + " raise e\n", |
| 212 | + " \n", |
| 213 | + "data_factory_compute = get_or_create_data_factory(ws, data_factory_name)\n", |
| 214 | + "\n", |
| 215 | + "print(\"setup data factory account complete\")\n", |
| 216 | + "\n", |
| 217 | + "# CLI:\n", |
| 218 | + "# Create: az ml computetarget setup datafactory -n <name>\n", |
| 219 | + "# BYOC: az ml computetarget attach datafactory -n <name> -i <resource-id>" |
| 220 | + ] |
| 221 | + }, |
| 222 | + { |
| 223 | + "cell_type": "markdown", |
| 224 | + "metadata": {}, |
| 225 | + "source": [ |
| 226 | + "## Create a DataTransferStep" |
| 227 | + ] |
| 228 | + }, |
| 229 | + { |
| 230 | + "cell_type": "markdown", |
| 231 | + "metadata": {}, |
| 232 | + "source": [ |
| 233 | + "**DataTransferStep** is used to transfer data between Azure Blob storage, Azure Data Lake Store, and Azure SQL Database.\n",
| 234 | + "\n",
| 235 | + "- **name:** Name of the step\n",
| 236 | + "- **source_data_reference:** Input connection that serves as the source of the data transfer operation.\n",
| 237 | + "- **destination_data_reference:** Input connection that serves as the destination of the data transfer operation.\n",
| 238 | + "- **compute_target:** Azure Data Factory account to use for transferring data.\n",
| 239 | + "- **allow_reuse:** Whether the step should reuse the results of a previous DataTransferStep when run with the same inputs. Set to False to force the data to be transferred again.\n",
| 240 | + "\n",
| 241 | + "Two optional arguments explicitly specify whether a path corresponds to a file or a directory. They are useful when the storage contains both a file and a directory with the same name, or when creating a new destination path (see the commented sketch in the cell below).\n",
| 242 | + "\n",
| 243 | + "- **source_reference_type:** An optional string specifying the type of source_data_reference. Possible values: 'file' and 'directory'. When not specified, the type of the existing path is used; a new path is treated as a directory.\n",
| 244 | + "- **destination_reference_type:** An optional string specifying the type of destination_data_reference. Possible values: 'file' and 'directory'. When not specified, the type of the existing path is used; a new path is treated as a directory."
| 245 | + ] |
| 246 | + }, |
| 247 | + { |
| 248 | + "cell_type": "code", |
| 249 | + "execution_count": null, |
| 250 | + "metadata": {}, |
| 251 | + "outputs": [], |
| 252 | + "source": [ |
| 253 | + "transfer_adls_to_blob = DataTransferStep(\n", |
| 254 | + " name=\"transfer_adls_to_blob\",\n", |
| 255 | + " source_data_reference=adls_data_ref,\n", |
| 256 | + " destination_data_reference=blob_data_ref,\n", |
| 257 | + " compute_target=data_factory_compute)\n", |
| 258 | + "\n", |
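| 259 | + "# a sketch of the optional arguments described above; the values shown\n",
| 260 | + "# are illustrative only (move them inside the call above to use them):\n",
| 261 | + "#     source_reference_type='directory',\n",
| 262 | + "#     destination_reference_type='directory',\n",
| 263 | + "#     allow_reuse=False,\n",
| 264 | + "\n",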
| 259 | + "print(\"data transfer step created\")" |
| 260 | + ] |
| 261 | + }, |
| 262 | + { |
| 263 | + "cell_type": "markdown", |
| 264 | + "metadata": {}, |
| 265 | + "source": [ |
| 266 | + "## Build and Submit the Experiment" |
| 267 | + ] |
| 268 | + }, |
| 269 | + { |
| 270 | + "cell_type": "code", |
| 271 | + "execution_count": null, |
| 272 | + "metadata": {}, |
| 273 | + "outputs": [], |
| 274 | + "source": [ |
| 275 | + "pipeline = Pipeline(\n", |
| 276 | + " description=\"data_transfer_101\",\n", |
| 277 | + " workspace=ws,\n", |
| 278 | + " steps=[transfer_adls_to_blob])\n", |
| 279 | + "\n", |
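| 280 | + "# submit the pipeline as an experiment run; wait_for_completion() blocks\n",
| 281 | + "# until the pipeline run finishes\n",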
| 280 | + "pipeline_run = Experiment(ws, \"Data_Transfer_example\").submit(pipeline)\n", |
| 281 | + "pipeline_run.wait_for_completion()" |
| 282 | + ] |
| 283 | + }, |
| 284 | + { |
| 285 | + "cell_type": "markdown", |
| 286 | + "metadata": {}, |
| 287 | + "source": [ |
| 288 | + "### View Run Details" |
| 289 | + ] |
| 290 | + }, |
| 291 | + { |
| 292 | + "cell_type": "code", |
| 293 | + "execution_count": null, |
| 294 | + "metadata": {}, |
| 295 | + "outputs": [], |
| 296 | + "source": [ |
| 297 | + "from azureml.widgets import RunDetails\n", |
| 298 | + "RunDetails(pipeline_run).show()" |
| 299 | + ] |
| 300 | + }, |
| 301 | + { |
| 302 | + "cell_type": "markdown", |
| 303 | + "metadata": {}, |
| 304 | + "source": [ |
| 305 | + "# Next: Databricks as a Compute Target\n", |
| 306 | + "To use Databricks as a compute target in an Azure Machine Learning Pipeline, use a DatabricksStep. This [notebook](./aml-pipelines-use-databricks-as-compute-target.ipynb) demonstrates the use of a DatabricksStep in an Azure Machine Learning Pipeline."
| 307 | + ] |
| 308 | + } |
| 309 | + ], |
| 310 | + "metadata": { |
| 311 | + "authors": [ |
| 312 | + { |
| 313 | + "name": "diray" |
| 314 | + } |
| 315 | + ], |
| 316 | + "kernelspec": { |
| 317 | + "display_name": "Python 3", |
| 318 | + "language": "python", |
| 319 | + "name": "python3" |
| 320 | + }, |
| 321 | + "language_info": { |
| 322 | + "codemirror_mode": { |
| 323 | + "name": "ipython", |
| 324 | + "version": 3 |
| 325 | + }, |
| 326 | + "file_extension": ".py", |
| 327 | + "mimetype": "text/x-python", |
| 328 | + "name": "python", |
| 329 | + "nbconvert_exporter": "python", |
| 330 | + "pygments_lexer": "ipython3", |
| 331 | + "version": "3.6.7" |
| 332 | + } |
| 333 | + }, |
| 334 | + "nbformat": 4, |
| 335 | + "nbformat_minor": 2 |
| 336 | +} |