[SYSTEMML-234] [SYSTEMML-208] Updating the documentation
Niketan Pansare committed Aug 8, 2016
commit d4aff09030624e0cee8aed269bd20d64f422b8a8
139 changes: 139 additions & 0 deletions docs/algorithms-classification.md
@@ -127,6 +127,15 @@ Eqs. (1) and (2).
### Usage

<div class="codetabs">
<div data-lang="Python" markdown="1">
import SystemML as sml
# C = 1/reg, i.e., the inverse of the regularization strength
logistic = sml.mllearn.LogisticRegression(sqlCtx, fit_intercept=True, max_iter=100, max_inner_iter=0, tol=0.000001, C=1.0)
# X_train, y_train and X_test can be NumPy arrays or Pandas DataFrames
y_test = logistic.fit(X_train, y_train).predict(X_test)
# df_train is a DataFrame with two columns: "features" (of type Vector) and "label"; df_test is a DataFrame with a "features" column
y_test = logistic.fit(df_train).transform(df_test)
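# Hedged evaluation sketch: y_actual is a hypothetical array of true test
# labels (not part of the API above); accuracy can then be checked with NumPy.
import numpy as np
print('Accuracy: %f' % np.mean(y_test == y_actual))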
</div>
<div data-lang="Hadoop" markdown="1">
hadoop jar SystemML.jar -f MultiLogReg.dml
-nvargs X=<file>
@@ -214,6 +223,56 @@ SystemML Language Reference for details.
### Examples

<div class="codetabs">
<div data-lang="Python" markdown="1">
# Scikit-learn way
from sklearn import datasets
import SystemML as sml
from pyspark.sql import SQLContext
sqlCtx = SQLContext(sc)
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target + 1  # shift labels to be 1-based, as expected by mllearn
n_samples = len(X_digits)
n_train = int(.9 * n_samples)  # slice indices must be integers, not floats
X_train = X_digits[:n_train]
y_train = y_digits[:n_train]
X_test = X_digits[n_train:]
y_test = y_digits[n_train:]
logistic = sml.mllearn.LogisticRegression(sqlCtx)
print('LogisticRegression score: %f' % logistic.fit(X_train, y_train).score(X_test, y_test))

# MLPipeline way
from pyspark.ml import Pipeline
import SystemML as sml
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.sql import SQLContext
sqlCtx = SQLContext(sc)
training = sqlCtx.createDataFrame([
(0, "a b c d e spark", 1.0),
(1, "b d", 2.0),
(2, "spark f g h", 1.0),
(3, "hadoop mapreduce", 2.0),
(4, "b spark who", 1.0),
(5, "g d a y", 2.0),
(6, "spark fly", 1.0),
(7, "was mapreduce", 2.0),
(8, "e spark program", 1.0),
(9, "a e c l", 2.0),
(10, "spark compile", 1.0),
(11, "hadoop software", 2.0)
], ["id", "text", "label"])
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features", numFeatures=20)
lr = sml.mllearn.LogisticRegression(sqlCtx)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
model = pipeline.fit(training)
test = sqlCtx.createDataFrame([
(12, "spark i j k"),
(13, "l m n"),
(14, "mapreduce spark"),
(15, "apache hadoop")], ["id", "text"])
prediction = model.transform(test)
prediction.show()
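# Hypothetical follow-up: inspect just the predicted labels. This assumes the
# default Spark ML output column name "prediction".
prediction.select("id", "text", "prediction").show()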
</div>
<div data-lang="Hadoop" markdown="1">
hadoop jar SystemML.jar -f MultiLogReg.dml
-nvargs X=/user/ml/X.mtx
@@ -393,6 +452,15 @@ support vector machine (`y` with domain size `2`).
**Binary-Class Support Vector Machines**:

<div class="codetabs">
<div data-lang="Python" markdown="1">
import SystemML as sml
# C = 1/reg, i.e., the inverse of the regularization strength
svm = sml.mllearn.SVM(sqlCtx, fit_intercept=True, max_iter=100, tol=0.000001, C=1.0, is_multi_class=False)
# X_train and y_train can be NumPy arrays or Pandas DataFrames
model = svm.fit(X_train, y_train)
# df_train is a DataFrame with two columns: "features" (of type Vector) and "label"
model = svm.fit(df_train)
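# fit() returns the estimator itself, so training and prediction can be
# chained in the scikit-learn style (a sketch, mirroring the
# LogisticRegression example above):
y_pred = svm.fit(X_train, y_train).predict(X_test)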
</div>
<div data-lang="Hadoop" markdown="1">
hadoop jar SystemML.jar -f l2-svm.dml
-nvargs X=<file>
@@ -428,6 +496,12 @@ support vector machine (`y` with domain size `2`).
**Binary-Class Support Vector Machines Prediction**:

<div class="codetabs">
<div data-lang="Python" markdown="1">
# X_test can be a NumPy array or a Pandas DataFrame
y_test = svm.predict(X_test)
# df_test is a DataFrame with a "features" column of type Vector
y_test = svm.transform(df_test)
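# A scikit-learn-style score() is also available (a sketch; y_actual is a
# hypothetical array holding the true test labels):
print('Accuracy: %f' % svm.score(X_test, y_actual))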
</div>
<div data-lang="Hadoop" markdown="1">
hadoop jar SystemML.jar -f l2-svm-predict.dml
-nvargs X=<file>
@@ -630,6 +704,15 @@ class labels.
**Multi-Class Support Vector Machines**:

<div class="codetabs">
<div data-lang="Python" markdown="1">
import SystemML as sml
# C = 1/reg, i.e., the inverse of the regularization strength
svm = sml.mllearn.SVM(sqlCtx, fit_intercept=True, max_iter=100, tol=0.000001, C=1.0, is_multi_class=True)
# X_train and y_train can be NumPy arrays or Pandas DataFrames
model = svm.fit(X_train, y_train)
# df_train is a DataFrame with two columns: "features" (of type Vector) and "label"
model = svm.fit(df_train)
</div>
<div data-lang="Hadoop" markdown="1">
hadoop jar SystemML.jar -f m-svm.dml
-nvargs X=<file>
@@ -665,6 +748,12 @@ class labels.
**Multi-Class Support Vector Machines Prediction**:

<div class="codetabs">
<div data-lang="Python" markdown="1">
# X_test can be a NumPy array or a Pandas DataFrame
y_test = svm.predict(X_test)
# df_test is a DataFrame with a "features" column of type Vector
y_test = svm.transform(df_test)
</div>
<div data-lang="Hadoop" markdown="1">
hadoop jar SystemML.jar -f m-svm-predict.dml
-nvargs X=<file>
@@ -747,6 +836,56 @@ SystemML Language Reference for details.
**Multi-Class Support Vector Machines**:

<div class="codetabs">
<div data-lang="Python" markdown="1">
# Scikit-learn way
from sklearn import datasets
import SystemML as sml
from pyspark.sql import SQLContext
sqlCtx = SQLContext(sc)
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target + 1  # shift labels to be 1-based, as expected by mllearn
n_samples = len(X_digits)
n_train = int(.9 * n_samples)  # slice indices must be integers, not floats
X_train = X_digits[:n_train]
y_train = y_digits[:n_train]
X_test = X_digits[n_train:]
y_test = y_digits[n_train:]
svm = sml.mllearn.SVM(sqlCtx, is_multi_class=True)
print('SVM score: %f' % svm.fit(X_train, y_train).score(X_test, y_test))

# MLPipeline way
from pyspark.ml import Pipeline
import SystemML as sml
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.sql import SQLContext
sqlCtx = SQLContext(sc)
training = sqlCtx.createDataFrame([
(0, "a b c d e spark", 1.0),
(1, "b d", 2.0),
(2, "spark f g h", 1.0),
(3, "hadoop mapreduce", 2.0),
(4, "b spark who", 1.0),
(5, "g d a y", 2.0),
(6, "spark fly", 1.0),
(7, "was mapreduce", 2.0),
(8, "e spark program", 1.0),
(9, "a e c l", 2.0),
(10, "spark compile", 1.0),
(11, "hadoop software", 2.0)
], ["id", "text", "label"])
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features", numFeatures=20)
svm = sml.mllearn.SVM(sqlCtx, is_multi_class=True)
pipeline = Pipeline(stages=[tokenizer, hashingTF, svm])
model = pipeline.fit(training)
test = sqlCtx.createDataFrame([
(12, "spark i j k"),
(13, "l m n"),
(14, "mapreduce spark"),
(15, "apache hadoop")], ["id", "text"])
prediction = model.transform(test)
prediction.show()
</div>
<div data-lang="Hadoop" markdown="1">
hadoop jar SystemML.jar -f m-svm.dml
-nvargs X=/user/ml/X.mtx
62 changes: 62 additions & 0 deletions docs/algorithms-regression.md
@@ -80,6 +80,15 @@ efficient when the number of features $m$ is relatively small
**Linear Regression - Direct Solve**:

<div class="codetabs">
<div data-lang="Python" markdown="1">
import SystemML as sml
# C = 1/reg, i.e., the inverse of the regularization strength
lr = sml.mllearn.LinearRegression(sqlCtx, fit_intercept=True, max_iter=100, tol=0.000001, C=1.0, solver='direct-solve')
# X_train and y_train can be NumPy arrays or Pandas DataFrames
model = lr.fit(X_train, y_train)
# df_train is a DataFrame with two columns: "features" (of type Vector) and "label"
model = lr.fit(df_train)
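# Sketch of using the fitted model, assuming it follows the same pattern as
# the classifiers above: predict() for NumPy/Pandas input, transform() for
# a DataFrame with a "features" column.
y_pred = lr.predict(X_test)
df_pred = lr.transform(df_test)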
</div>
<div data-lang="Hadoop" markdown="1">
hadoop jar SystemML.jar -f LinearRegDS.dml
-nvargs X=<file>
@@ -111,6 +120,15 @@ efficient when the number of features $m$ is relatively small
**Linear Regression - Conjugate Gradient**:

<div class="codetabs">
<div data-lang="Python" markdown="1">
import SystemML as sml
# C = 1/reg, i.e., the inverse of the regularization strength
lr = sml.mllearn.LinearRegression(sqlCtx, fit_intercept=True, max_iter=100, tol=0.000001, C=1.0, solver='newton-cg')
# X_train and y_train can be NumPy arrays or Pandas DataFrames
model = lr.fit(X_train, y_train)
# df_train is a DataFrame with two columns: "features" (of type Vector) and "label"
model = lr.fit(df_train)
</div>
<div data-lang="Hadoop" markdown="1">
hadoop jar SystemML.jar -f LinearRegCG.dml
-nvargs X=<file>
@@ -196,6 +214,28 @@ SystemML Language Reference for details.
**Linear Regression - Direct Solve**:

<div class="codetabs">
<div data-lang="Python" markdown="1">
import numpy as np
from sklearn import datasets
import SystemML as sml
from pyspark.sql import SQLContext
sqlCtx = SQLContext(sc)
# Load the diabetes dataset
diabetes = datasets.load_diabetes()
# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]
# Create linear regression object
regr = sml.mllearn.LinearRegression(sqlCtx, solver='direct-solve')
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
# The mean squared error on the test set
print("Mean squared error: %.2f" % np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2))
</div>
<div data-lang="Hadoop" markdown="1">
hadoop jar SystemML.jar -f LinearRegDS.dml
-nvargs X=/user/ml/X.mtx
@@ -227,6 +267,28 @@ SystemML Language Reference for details.
**Linear Regression - Conjugate Gradient**:

<div class="codetabs">
<div data-lang="Python" markdown="1">
import numpy as np
from sklearn import datasets
import SystemML as sml
from pyspark.sql import SQLContext
sqlCtx = SQLContext(sc)
# Load the diabetes dataset
diabetes = datasets.load_diabetes()
# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]
# Create linear regression object
regr = sml.mllearn.LinearRegression(sqlCtx, solver='newton-cg')
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
# The mean squared error on the test set
print("Mean squared error: %.2f" % np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2))
</div>
<div data-lang="Hadoop" markdown="1">
hadoop jar SystemML.jar -f LinearRegCG.dml
-nvargs X=/user/ml/X.mtx