Merged
500 changes: 500 additions & 0 deletions contrib/gbdt/lightgbm/binary0.test

Large diffs are not rendered by default.

7,000 changes: 7,000 additions & 0 deletions contrib/gbdt/lightgbm/binary0.train

Large diffs are not rendered by default.

500 changes: 500 additions & 0 deletions contrib/gbdt/lightgbm/binary1.test

Large diffs are not rendered by default.

7,000 changes: 7,000 additions & 0 deletions contrib/gbdt/lightgbm/binary1.train

Large diffs are not rendered by default.

270 changes: 270 additions & 0 deletions contrib/gbdt/lightgbm/lightgbm-example.ipynb
@@ -0,0 +1,270 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved. \n",
"Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/contrib/gbdt/lightgbm/lightgbm-example.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Use LightGBM Estimator in Azure Machine Learning\n",
"This notebook demonstrates how to run a training job using the LightGBM Estimator. [LightGBM](https://lightgbm.readthedocs.io/en/latest/) is a gradient boosting framework that uses tree-based learning algorithms."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prerequisites\n",
"This notebook uses the azureml-contrib-gbdt package. If you don't already have it, install it by uncommenting and running the cell below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#!pip install azureml-contrib-gbdt --extra-index-url https://azuremlsdktestpypi.azureedge.net/LightGBMPrivateRelease"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import Workspace, Run, Experiment\n",
"import shutil, os\n",
"from azureml.widgets import RunDetails\n",
"from azureml.contrib.gbdt import LightGBM\n",
"from azureml.train.dnn import Mpi\n",
"from azureml.core.compute import AmlCompute, ComputeTarget\n",
"from azureml.core.compute_target import ComputeTargetException"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you are using an AzureML Compute Instance, you are all set. Otherwise, go through the [configuration.ipynb](../../../configuration.ipynb) notebook to install the Azure Machine Learning Python SDK and create an Azure ML Workspace."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Set up machine learning resources"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ws = Workspace.from_config()\n",
"\n",
"print('Workspace name: ' + ws.name, \n",
" 'Azure region: ' + ws.location, \n",
" 'Subscription id: ' + ws.subscription_id, \n",
" 'Resource group: ' + ws.resource_group, sep = '\\n')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cluster_vm_size = \"STANDARD_DS14_V2\"\n",
"cluster_min_nodes = 0\n",
"cluster_max_nodes = 20\n",
"cpu_cluster_name = 'TrainingCompute'\n",
"\n",
"try:\n",
"    cpu_cluster = AmlCompute(ws, cpu_cluster_name)\n",
"    if cpu_cluster and type(cpu_cluster) is AmlCompute:\n",
"        print('found compute target: ' + cpu_cluster_name)\n",
"except ComputeTargetException:\n",
"    print('creating a new compute target...')\n",
"    provisioning_config = AmlCompute.provisioning_configuration(vm_size = cluster_vm_size,\n",
"                                                                vm_priority = 'lowpriority',\n",
"                                                                min_nodes = cluster_min_nodes,\n",
"                                                                max_nodes = cluster_max_nodes)\n",
"    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, provisioning_config)\n",
"\n",
"    # can poll for a minimum number of nodes and for a specific timeout.\n",
"    # if no min node count is provided it will use the scale settings for the cluster\n",
"    cpu_cluster.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)\n",
"\n",
"    # For a more detailed view of current Azure Machine Learning Compute status, use get_status()\n",
"    print(cpu_cluster.get_status().serialize())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From this point, you can either upload the training data files directly or use a Datastore for training data storage.\n",
"## Upload training files from local"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"scripts_folder = \"scripts_folder\"\n",
"if not os.path.isdir(scripts_folder):\n",
"    os.mkdir(scripts_folder)\n",
"shutil.copy('./train.conf', os.path.join(scripts_folder, 'train.conf'))\n",
"shutil.copy('./binary0.train', os.path.join(scripts_folder, 'binary0.train'))\n",
"shutil.copy('./binary1.train', os.path.join(scripts_folder, 'binary1.train'))\n",
"shutil.copy('./binary0.test', os.path.join(scripts_folder, 'binary0.test'))\n",
"shutil.copy('./binary1.test', os.path.join(scripts_folder, 'binary1.test'))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"training_data_list=[\"binary0.train\", \"binary1.train\"]\n",
"validation_data_list = [\"binary0.test\", \"binary1.test\"]\n",
"lgbm = LightGBM(source_directory=scripts_folder, \n",
" compute_target=cpu_cluster, \n",
" distributed_training=Mpi(),\n",
" node_count=2,\n",
" lightgbm_config='train.conf',\n",
" data=training_data_list,\n",
" valid=validation_data_list\n",
" )\n",
"experiment_name = 'lightgbm-estimator-test'\n",
"experiment = Experiment(ws, name=experiment_name)\n",
"run = experiment.submit(lgbm, tags={\"test public docker image\": None})\n",
"RunDetails(run).show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"run.wait_for_completion(show_output=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Use data reference"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.datastore import Datastore\n",
"from azureml.data.data_reference import DataReference\n",
"datastore = ws.get_default_datastore()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"datastore.upload(src_dir='.',\n",
" target_path='.',\n",
" show_progress=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"training_data_list=[\"binary0.train\", \"binary1.train\"]\n",
"validation_data_list = [\"binary0.test\", \"binary1.test\"]\n",
"lgbm = LightGBM(source_directory='.', \n",
" compute_target=cpu_cluster, \n",
" distributed_training=Mpi(),\n",
" node_count=2,\n",
" inputs=[datastore.as_mount()],\n",
" lightgbm_config='train.conf',\n",
" data=training_data_list,\n",
" valid=validation_data_list\n",
" )\n",
"experiment_name = 'lightgbm-estimator-test'\n",
"experiment = Experiment(ws, name=experiment_name)\n",
"run = experiment.submit(lgbm, tags={\"use datastore.as_mount()\": None})\n",
"RunDetails(run).show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"run.wait_for_completion(show_output=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# uncomment below and run if compute resources are no longer needed\n",
"# cpu_cluster.delete() "
]
}
],
"metadata": {
"authors": [
{
"name": "jingywa"
}
],
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
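
The staging cell in the notebook above copies `train.conf` and the four data files into `scripts_folder` one call at a time. As a rough sketch only, the same step could be written as a loop that also fails early when an input is missing. The helper name `stage_files` and the `FILES` list are illustrative assumptions and are not part of this PR or of the Azure ML SDK.

```python
import os
import shutil

# File names taken from the notebook's staging cell.
FILES = ["train.conf", "binary0.train", "binary1.train", "binary0.test", "binary1.test"]


def stage_files(file_names, source_dir, target_dir):
    """Copy each named file from source_dir into target_dir (hypothetical helper)."""
    # exist_ok=True replaces the explicit os.path.isdir check.
    os.makedirs(target_dir, exist_ok=True)
    staged = []
    for name in file_names:
        src = os.path.join(source_dir, name)
        if not os.path.isfile(src):
            raise FileNotFoundError(f"missing input file: {src}")
        # shutil.copy returns the path of the newly created file.
        staged.append(shutil.copy(src, os.path.join(target_dir, name)))
    return staged
```

In the notebook this would be called as `stage_files(FILES, ".", scripts_folder)` before constructing the estimator.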
111 changes: 111 additions & 0 deletions contrib/gbdt/lightgbm/train.conf
@@ -0,0 +1,111 @@
# task type, support train and predict
task = train

# boosting type, support gbdt for now, alias: boosting, boost
boosting_type = gbdt

# application type, supports the following applications
# regression , regression task
# binary , binary classification task
# lambdarank , lambdarank task
# alias: application, app
objective = binary

# eval metrics, supports multiple metrics delimited by ','; the following metrics are supported
# l1
# l2 , default metric for regression
# ndcg , default metric for lambdarank
# auc
# binary_logloss , default metric for binary
# binary_error
metric = binary_logloss,auc

# frequency of metric output
metric_freq = 1

# true to output metrics for training data, alias: training_metric, train_metric
is_training_metric = true

# number of bins for feature buckets; 255 is a recommended setting that saves memory while keeping good accuracy.
max_bin = 255

# training data
# if a weight file exists, it should be named "binary.train.weight"
# alias: train_data, train
data = binary.train

# validation data, support multi validation data, separated by ','
# if a weight file exists, it should be named "binary.test.weight"
# alias: valid, test, test_data
valid_data = binary.test

# number of trees(iterations), alias: num_tree, num_iteration, num_iterations, num_round, num_rounds
num_trees = 100

# shrinkage rate , alias: shrinkage_rate
learning_rate = 0.1

# number of leaves for one tree, alias: num_leaf
num_leaves = 63

# type of tree learner, support following types:
# serial , single machine version
# feature , use feature parallel to train
# data , use data parallel to train
# voting , use voting based parallel to train
# alias: tree
tree_learner = feature

# number of threads for multi-threading. One thread uses one CPU; the default is the number of CPUs.
# num_threads = 8

# feature sub-sample, will randomly select 80% of the features to train on in each iteration
# alias: sub_feature
feature_fraction = 0.8

# Support bagging (data sub-sample), will perform bagging every 5 iterations
bagging_freq = 5

# Bagging fraction, will randomly select 80% of the data on bagging
# alias: sub_row
bagging_fraction = 0.8

# minimal number of data points in one leaf, use this to deal with overfitting
# alias : min_data_per_leaf, min_data
min_data_in_leaf = 50

# minimal sum of hessians in one leaf, use this to deal with overfitting
min_sum_hessian_in_leaf = 5.0

# saves memory and speeds up training for sparse features, alias: is_sparse
is_enable_sparse = true

# when the data is bigger than memory, set this to true; otherwise, setting it to false is faster
# alias: two_round_loading, two_round
use_two_round_loading = false

# true to save data to a binary file; the application will then auto-load data from the binary file next time
# alias: is_save_binary, save_binary
is_save_binary_file = false

# output model file
output_model = LightGBM_model.txt

# support continuous train from trained gbdt model
# input_model= trained_model.txt

# output prediction file for predict task
# output_result= prediction.txt

# support continuous train from initial score file
# input_init_score= init_score.txt


# number of machines in parallel training, alias: num_machine
num_machines = 2

# local listening port in parallel training, alias: local_port
local_listen_port = 12400

# machines list file for parallel training, alias: mlist
machine_list_file = mlist.txt
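
The config above sets `num_machines = 2`, and the notebook submits the estimator with `node_count=2`. Whether the estimator overrides config values at submission time is not shown in this diff, so a quick local consistency check may be useful. The following is a minimal sketch that assumes the file uses only `key = value` lines and `#` comments, as it does here. The function names `parse_lightgbm_conf` and `check_node_count` are hypothetical and belong to neither LightGBM nor the Azure ML SDK.

```python
def parse_lightgbm_conf(text):
    """Parse 'key = value' lines into a dict, skipping blanks and '#' comments."""
    params = {}
    for raw_line in text.splitlines():
        # Assumption: everything after '#' on a line is a comment.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected 'key = value', got: {raw_line!r}")
        params[key.strip()] = value.strip()
    return params


def check_node_count(params, node_count):
    """Return True when num_machines (taken as 1 if absent) equals node_count."""
    return int(params.get("num_machines", 1)) == node_count


# A few lines copied from train.conf above, used as sample input.
SAMPLE = """
task = train
tree_learner = feature
# num_threads = 8
num_machines = 2
local_listen_port = 12400
"""

params = parse_lightgbm_conf(SAMPLE)
print(params["tree_learner"], check_node_count(params, node_count=2))
```

Reading the real file would be `parse_lightgbm_conf(open("train.conf").read())`. Values are kept as strings, and the sketch does not resolve aliases such as `num_machine`.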