Azure · vizhur · Oct 23, 2019 · Oct 23, 2019
diff --git a/contrib/gbdt/lightgbm/binary0.test b/contrib/gbdt/lightgbm/binary0.test
diff --git a/contrib/gbdt/lightgbm/binary0.train b/contrib/gbdt/lightgbm/binary0.train
diff --git a/contrib/gbdt/lightgbm/binary1.test b/contrib/gbdt/lightgbm/binary1.test
diff --git a/contrib/gbdt/lightgbm/binary1.train b/contrib/gbdt/lightgbm/binary1.train
diff --git a/contrib/gbdt/lightgbm/lightgbm-example.ipynb b/contrib/gbdt/lightgbm/lightgbm-example.ipynb
@@ -0,0 +1,270 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Copyright (c) Microsoft Corporation. All rights reserved.  \n",
+    "Licensed under the MIT License."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/contrib/gbdt/lightgbm/lightgbm-example.png)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Use LightGBM Estimator in Azure Machine Learning\n",
+    "This notebook demonstrates how to run a training job with the LightGBM estimator. [LightGBM](https://lightgbm.readthedocs.io/en/latest/) is a gradient boosting framework that uses tree-based learning algorithms."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Prerequisites\n",
+    "This notebook uses the azureml-contrib-gbdt package. If you don't already have it, install it by uncommenting and running the cell below."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#!pip install azureml-contrib-gbdt --extra-index-url https://azuremlsdktestpypi.azureedge.net/LightGBMPrivateRelease"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from azureml.core import Workspace, Run, Experiment\n",
+    "import shutil, os\n",
+    "from azureml.widgets import RunDetails\n",
+    "from azureml.contrib.gbdt import LightGBM\n",
+    "from azureml.train.dnn import Mpi\n",
+    "from azureml.core.compute import AmlCompute, ComputeTarget\n",
+    "from azureml.core.compute_target import ComputeTargetException"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If you are using an AzureML Compute Instance, you are all set. Otherwise, go through the [configuration.ipynb](../../../configuration.ipynb) notebook to install the Azure Machine Learning Python SDK and create an Azure ML Workspace."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Set up machine learning resources"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "ws = Workspace.from_config()\n",
+    "\n",
+    "print('Workspace name: ' + ws.name, \n",
+    "      'Azure region: ' + ws.location, \n",
+    "      'Subscription id: ' + ws.subscription_id, \n",
+    "      'Resource group: ' + ws.resource_group, sep = '\\n')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "cluster_vm_size = \"STANDARD_DS14_V2\"\n",
+    "cluster_min_nodes = 0\n",
+    "cluster_max_nodes = 20\n",
+    "cpu_cluster_name = 'TrainingCompute' \n",
+    "\n",
+    "try:\n",
+    "    cpu_cluster = AmlCompute(ws, cpu_cluster_name)\n",
+    "    if cpu_cluster and type(cpu_cluster) is AmlCompute:\n",
+    "        print('found compute target: ' + cpu_cluster_name)\n",
+    "except ComputeTargetException:\n",
+    "    print('creating a new compute target...')\n",
+    "    provisioning_config = AmlCompute.provisioning_configuration(vm_size = cluster_vm_size, \n",
+    "                                                                vm_priority = 'lowpriority', \n",
+    "                                                                min_nodes = cluster_min_nodes, \n",
+    "                                                                max_nodes = cluster_max_nodes)\n",
+    "    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, provisioning_config)\n",
+    "    \n",
+    "    # wait_for_completion can poll for a minimum number of nodes and for a specific timeout.\n",
+    "    # If no min_node_count is provided, it uses the scale settings for the cluster.\n",
+    "    cpu_cluster.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)\n",
+    "    \n",
+    "    # For a more detailed view of the current Azure Machine Learning Compute status, use get_status()\n",
+    "    print(cpu_cluster.get_status().serialize())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "From this point, you can either upload the training data files directly or use a Datastore to store the training data.\n",
+    "## Upload training files from the local machine"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "scripts_folder = \"scripts_folder\"\n",
+    "if not os.path.isdir(scripts_folder):\n",
+    "    os.mkdir(scripts_folder)\n",
+    "shutil.copy('./train.conf', os.path.join(scripts_folder, 'train.conf'))\n",
+    "shutil.copy('./binary0.train', os.path.join(scripts_folder, 'binary0.train'))\n",
+    "shutil.copy('./binary1.train', os.path.join(scripts_folder, 'binary1.train'))\n",
+    "shutil.copy('./binary0.test', os.path.join(scripts_folder, 'binary0.test'))\n",
+    "shutil.copy('./binary1.test', os.path.join(scripts_folder, 'binary1.test'))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "training_data_list = [\"binary0.train\", \"binary1.train\"]\n",
+    "validation_data_list = [\"binary0.test\", \"binary1.test\"]\n",
+    "lgbm = LightGBM(source_directory=scripts_folder, \n",
+    "                compute_target=cpu_cluster, \n",
+    "                distributed_training=Mpi(),\n",
+    "                node_count=2,\n",
+    "                lightgbm_config='train.conf',\n",
+    "                data=training_data_list,\n",
+    "                valid=validation_data_list\n",
+    "               )\n",
+    "experiment_name = 'lightgbm-estimator-test'\n",
+    "experiment = Experiment(ws, name=experiment_name)\n",
+    "run = experiment.submit(lgbm, tags={\"test public docker image\": None})\n",
+    "RunDetails(run).show()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "run.wait_for_completion(show_output=True)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Use data reference"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from azureml.core.datastore import Datastore\n",
+    "from azureml.data.data_reference import DataReference\n",
+    "datastore = ws.get_default_datastore()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "datastore.upload(src_dir='.',\n",
+    "                 target_path='.',\n",
+    "                 show_progress=True)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "training_data_list = [\"binary0.train\", \"binary1.train\"]\n",
+    "validation_data_list = [\"binary0.test\", \"binary1.test\"]\n",
+    "lgbm = LightGBM(source_directory='.', \n",
+    "                compute_target=cpu_cluster, \n",
+    "                distributed_training=Mpi(),\n",
+    "                node_count=2,\n",
+    "                inputs=[datastore.as_mount()],\n",
+    "                lightgbm_config='train.conf',\n",
+    "                data=training_data_list,\n",
+    "                valid=validation_data_list\n",
+    "               )\n",
+    "experiment_name = 'lightgbm-estimator-test'\n",
+    "experiment = Experiment(ws, name=experiment_name)\n",
+    "run = experiment.submit(lgbm, tags={\"use datastore.as_mount()\": None})\n",
+    "RunDetails(run).show()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "run.wait_for_completion(show_output=True)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# uncomment below and run if compute resources are no longer needed\n",
+    "# cpu_cluster.delete() "
+   ]
+  }
+ ],
+ "metadata": {
+  "authors": [
+   {
+    "name": "jingywa"
+   }
+  ],
+  "kernelspec": {
+   "display_name": "Python 3.6",
+   "language": "python",
+   "name": "python36"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.6.9"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/contrib/gbdt/lightgbm/train.conf b/contrib/gbdt/lightgbm/train.conf
@@ -0,0 +1,111 @@
+# task type, support train and predict
+task = train
+
+# boosting type, support gbdt for now, alias: boosting, boost
+boosting_type = gbdt
+
+# application type, support following application
+# regression , regression task
+# binary , binary classification task
+# lambdarank , lambdarank task
+# alias: application, app
+objective = binary
+
+# eval metrics, supports multiple metrics delimited by ','; the following metrics are supported
+# l1 
+# l2 , default metric for regression
+# ndcg , default metric for lambdarank
+# auc 
+# binary_logloss , default metric for binary
+# binary_error
+metric = binary_logloss,auc
+
+# frequency for metric output
+metric_freq = 1
+
+# set to true to output metrics for the training data, alias: training_metric, train_metric
+is_training_metric = true
+
+# number of bins for feature bucketing; 255 is a recommended setting that saves memory and still gives good accuracy
+max_bin = 255
+
+# training data
+# if a weight file exists, it should be named "binary.train.weight"
+# alias: train_data, train
+data = binary.train
+
+# validation data, supports multiple validation data files, separated by ','
+# if a weight file exists, it should be named "binary.test.weight"
+# alias: valid, test, test_data
+valid_data = binary.test
+
+# number of trees (iterations), alias: num_tree, num_iteration, num_iterations, num_round, num_rounds
+num_trees = 100
+
+# shrinkage rate , alias: shrinkage_rate
+learning_rate = 0.1
+
+# number of leaves for one tree, alias: num_leaf
+num_leaves = 63
+
+# type of tree learner, support following types:
+# serial , single machine version
+# feature , use feature parallel to train
+# data , use data parallel to train
+# voting , use voting based parallel to train
+# alias: tree
+tree_learner = feature
+
+# number of threads for multi-threading. One thread uses one CPU; the default is the number of CPUs.
+# num_threads = 8
+
+# feature sub-sampling, randomly selects 80% of the features to train on at each iteration
+# alias: sub_feature
+feature_fraction = 0.8
+
+# bagging (data sub-sampling), performed every 5 iterations
+bagging_freq = 5
+
+# bagging fraction, randomly selects 80% of the data when bagging
+# alias: sub_row
+bagging_fraction = 0.8
+
+# minimum number of data points in one leaf, use this to reduce over-fitting
+# alias : min_data_per_leaf, min_data
+min_data_in_leaf = 50
+
+# minimum sum of hessians in one leaf, use this to reduce over-fitting
+min_sum_hessian_in_leaf = 5.0
+
+# saves memory and speeds up training for sparse features, alias: is_sparse
+is_enable_sparse = true
+
+# set this to true when the data is bigger than memory; otherwise, false gives faster loading
+# alias: two_round_loading, two_round
+use_two_round_loading = false
+
+# set to true to save data to a binary file; the application will automatically load data from the binary file next time
+# alias: is_save_binary, save_binary
+is_save_binary_file = false
+
+# output model file
+output_model = LightGBM_model.txt
+
+# supports continued training from a trained gbdt model
+# input_model= trained_model.txt
+
+# output prediction file for predict task
+# output_result= prediction.txt
+
+# supports continued training from an initial score file
+# input_init_score= init_score.txt
+
+
+# number of machines in parallel training, alias: num_machine
+num_machines = 2
+
+# local listening port in parallel training, alias: local_port
+local_listen_port = 12400
+
+# machines list file for parallel training, alias: mlist
+machine_list_file = mlist.txt