31 commits
ce73c63
added Bernoulli option to niave bayes model in mllib, added optional …
leahmcguire Jan 16, 2015
4a3676d
Updated changes re-comments. Got rid of verbose populateMatrix method…
leahmcguire Jan 21, 2015
0313c0c
fixed style error in NaiveBayes.scala
leahmcguire Jan 21, 2015
76e5b0f
removed unnecessary sort from test
leahmcguire Jan 26, 2015
d9477ed
removed old inaccurate comment from test suite for mllib naive bayes
leahmcguire Feb 26, 2015
3891bf2
synced with apache spark and resolved merge conflict
leahmcguire Feb 27, 2015
5a4a534
fixed scala style error in NaiveBayes
leahmcguire Feb 27, 2015
b61b5e2
added back compatable constructor to NaiveBayesModel to fix MIMA test…
leahmcguire Mar 2, 2015
3730572
modified NB model type to be more Java-friendly
jkbradley Mar 3, 2015
b93aaf6
Merge pull request #1 from jkbradley/nb-model-type
leahmcguire Mar 5, 2015
7622b0c
added comments and fixed style as per rb
leahmcguire Mar 5, 2015
dc65374
integrated model type fix
leahmcguire Mar 5, 2015
85f298f
Merge remote-tracking branch 'upstream/master'
leahmcguire Mar 5, 2015
e016569
updated test suite with model type fix
leahmcguire Mar 5, 2015
ea09b28
Merge remote-tracking branch 'upstream/master'
leahmcguire Mar 5, 2015
900b586
fixed model call so that uses type argument
leahmcguire Mar 5, 2015
b85b0c9
Merge remote-tracking branch 'upstream/master'
leahmcguire Mar 5, 2015
c298e78
fixed scala style errors
leahmcguire Mar 5, 2015
2d0c1ba
fixed typo in NaiveBayes
leahmcguire Mar 5, 2015
e2d925e
fixed nonserializable error that was causing naivebayes test failures
leahmcguire Mar 7, 2015
fb0a5c7
removed typo
leahmcguire Mar 9, 2015
01baad7
made fixes from code review
leahmcguire Mar 11, 2015
bea62af
put back in constructor for NaiveBayes
leahmcguire Mar 12, 2015
18f3219
removed private from naive bayes constructor for lambda only
leahmcguire Mar 12, 2015
a22d670
changed NaiveBayesModel modelType parameter back to NaiveBayes.ModelT…
leahmcguire Mar 17, 2015
852a727
merged with upstream master
leahmcguire Mar 21, 2015
6a8f383
Added new model save/load format 2.0 for NaiveBayesModel after modelT…
jkbradley Mar 22, 2015
9ad89ca
removed old code
jkbradley Mar 22, 2015
2224b15
Merge pull request #2 from jkbradley/leahmcguire-master
leahmcguire Mar 24, 2015
acb69af
removed enum type and replaces all modelType parameters with strings
leahmcguire Mar 28, 2015
f3c8994
changed checks on model type to requires
leahmcguire Mar 31, 2015
made fixes from code review
leahmcguire committed Mar 11, 2015
commit 01baad70f44fa12ad37a743d5d0fba861d89f149
4 changes: 2 additions & 2 deletions docs/mllib-naive-bayes.md
@@ -15,12 +15,12 @@ and use it for prediction.
MLlib supports [multinomial naive
Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier#Multinomial_naive_Bayes)
and [Bernoulli naive Bayes] (http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html).
-Which are typically used for [document classification]
+These models are typically used for [document classification]
(http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html).
Within that context, each observation is a document and each
feature represents a term whose value is the frequency of the term (in multinomial naive Bayes) or
a zero or one indicating whether the term was found in the document (in Bernoulli naive Bayes).
-Feature values must be nonnegative.The model type is selected with on optional parameter
+Feature values must be nonnegative. The model type is selected with an optional parameter
"Multinomial" or "Bernoulli" with "Multinomial" as the default.
[Additive smoothing](http://en.wikipedia.org/wiki/Lidstone_smoothing) can be used by
setting the parameter $\lambda$ (default to $1.0$). For document classification, the input feature
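
As a rough illustration of the doc text in this hunk, the sketch below builds the two feature encodings (term frequencies for multinomial, 0/1 indicators for Bernoulli) and a Lidstone-smoothed estimate for the multinomial case. The object and function names (`FeatureEncodingSketch`, `termCounts`, `termPresence`, `smoothedLogTheta`) are hypothetical and not part of MLlib; the smoothing formula is the standard textbook form, and the diff does not show whether MLlib computes it exactly this way.

```scala
object FeatureEncodingSketch {
  /** Term-frequency vector over a fixed vocabulary (multinomial-style features). */
  def termCounts(vocab: Seq[String], tokens: Seq[String]): Array[Double] =
    vocab.map(term => tokens.count(_ == term).toDouble).toArray

  /** 0/1 presence vector over the same vocabulary (Bernoulli-style features). */
  def termPresence(vocab: Seq[String], tokens: Seq[String]): Array[Double] =
    termCounts(vocab, tokens).map(c => if (c > 0.0) 1.0 else 0.0)

  /**
   * Lidstone-smoothed log probabilities for one class (multinomial case):
   * log((count_j + lambda) / (total + lambda * D)), where D is the vocabulary size.
   * Assumption: textbook formula, not copied from the MLlib source.
   */
  def smoothedLogTheta(classCounts: Array[Double], lambda: Double): Array[Double] = {
    val logDenom = math.log(classCounts.sum + lambda * classCounts.length)
    classCounts.map(c => math.log(c + lambda) - logDenom)
  }
}
```

Both encodings produce nonnegative values, which matches the requirement stated in the doc text.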
@@ -49,15 +49,15 @@ class NaiveBayesModel private[mllib] (
val modelType: String)
Member

It might be nice to expose this as the enum-like type instead of a String. Does that sound reasonable, since users already use it when calling NaiveBayes?

It would be good to avoid using "ModelType.fromString" in the predict() method.

Contributor Author

I had to change this from the enum-like type to the string to fix the unit test failures. An actual enum worked, but the substitute that you suggested was throwing a non-serializable error on all of the NaiveBayes tests.

Member

Oh, that may have been because I didn't make those types extend Serializable. Does that work?

Contributor Author

Yep that fixes it :P
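
To make the exchange above concrete, here is a minimal sketch of an enum-like type built from anonymous subclasses, once without and once with `Serializable`, round-tripped through plain Java serialization. All names (`ModelTypeSketch`, `PlainType`, `SerType`, the helpers) are hypothetical; this assumes the test failures came from Java serialization of the model type, as the thread suggests, and it does not reproduce Spark's own serializer setup.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.io.{NotSerializableException, ObjectInputStream, ObjectOutputStream}

object ModelTypeSketch {
  // Enum-like type without Serializable: Java serialization rejects its instances.
  sealed abstract class PlainType
  val PlainMultinomial: PlainType = new PlainType {
    override def toString: String = "multinomial"
  }

  // Same shape, but the base class extends Serializable.
  sealed abstract class SerType extends Serializable
  val SerMultinomial: SerType = new SerType {
    override def toString: String = "multinomial"
  }

  /** Java-serializes `value` into a byte array. */
  def serialize(value: AnyRef): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    try out.writeObject(value) finally out.close()
    bytes.toByteArray
  }

  /** Reads back a single object written by `serialize`. */
  def deserialize(data: Array[Byte]): AnyRef = {
    val in = new ObjectInputStream(new ByteArrayInputStream(data))
    try in.readObject() finally in.close()
  }

  /** True if Java serialization rejects the value. */
  def failsToSerialize(value: AnyRef): Boolean =
    try { serialize(value); false }
    catch { case _: NotSerializableException => true }
}
```

One caveat with this pattern: a deserialized copy is a new instance, so comparing it to the original constant by reference will not match unless the type also defines `readResolve` or a value-based `equals`.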

extends ClassificationModel with Serializable with Saveable {

-def this(labels: Array[Double], pi: Array[Double], theta: Array[Array[Double]]) =
+private[mllib] def this(labels: Array[Double], pi: Array[Double], theta: Array[Array[Double]]) =
this(labels, pi, theta, NaiveBayes.Multinomial.toString)

private val brzPi = new BDV[Double](pi)
private val brzTheta = new BDM(theta(0).length, theta.length, theta.flatten).t

-// Bernoulli scoring requires log(condprob) if 1 log(1-condprob) if 0
-// this precomputes log(1.0 - exp(theta)) and its sum for linear algebra application
-// of this condition in predict function
+// Bernoulli scoring requires log(condprob) if 1, log(1-condprob) if 0.
+// This precomputes log(1.0 - exp(theta)) and its sum which are used for the linear algebra
+// application of this condition (in predict function).
private val (brzNegTheta, brzNegThetaSum) = NaiveBayes.ModelType.fromString(modelType) match {
case NaiveBayes.Multinomial => (None, None)
case NaiveBayes.Bernoulli =>
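
The precomputation described in the comment above can be sketched with plain arrays. This is an illustration of the algebra only, not the code from this PR: `BernoulliScoreSketch` and its methods are hypothetical names, `theta` is assumed to hold log conditional probabilities for a single class, and `x` is assumed to be a 0/1 feature vector.

```scala
object BernoulliScoreSketch {
  /** Direct form: log(p_j) where feature j is 1, log(1 - p_j) where it is 0. */
  def directScore(logPrior: Double, theta: Array[Double], x: Array[Double]): Double =
    logPrior + theta.indices.map { j =>
      if (x(j) == 1.0) theta(j) else math.log(1.0 - math.exp(theta(j)))
    }.sum

  /**
   * Precomputed form: with negTheta_j = log(1 - exp(theta_j)),
   * sum_j [x_j * theta_j + (1 - x_j) * negTheta_j]
   *   = (theta - negTheta) . x + sum(negTheta).
   * A model would compute negTheta and its sum once; here they are
   * recomputed per call to keep the sketch short.
   */
  def precomputedScore(logPrior: Double, theta: Array[Double], x: Array[Double]): Double = {
    val negTheta = theta.map(t => math.log(1.0 - math.exp(t)))
    val negThetaSum = negTheta.sum
    val dot = theta.indices.map(j => (theta(j) - negTheta(j)) * x(j)).sum
    logPrior + dot + negThetaSum
  }
}
```

The second form turns the per-feature branch into one dot product plus a constant, which is what makes it suitable for a matrix-vector formulation across all classes.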
@@ -186,8 +186,6 @@ class NaiveBayes private (
private var lambda: Double,
private var modelType: NaiveBayes.ModelType) extends Serializable with Logging {
Member

Add getModelType method


def this(lambda: Double) = this(lambda, NaiveBayes.Multinomial)

def this() = this(1.0, NaiveBayes.Multinomial)

/** Set the smoothing parameter. Default: 1.0. */
@@ -202,6 +200,7 @@ class NaiveBayes private (
this
}

+def getModelType(): NaiveBayes.ModelType = this.modelType
Member

Getters normally don't have parentheses in Spark


Member

Remove extra space

/**
* Run the algorithm with the configured parameters on an input RDD of LabeledPoint entries.
@@ -301,10 +300,9 @@ object NaiveBayes {
* @param lambda The smoothing parameter
*/
def train(input: RDD[LabeledPoint], lambda: Double): NaiveBayesModel = {
-new NaiveBayes(lambda).run(input)
+new NaiveBayes(lambda, NaiveBayes.Multinomial).run(input)
}


/**
* Trains a Naive Bayes model given an RDD of `(label, features)` pairs.
*
@@ -327,11 +325,7 @@ object NaiveBayes {
new NaiveBayes(lambda, MODELTYPE.fromString(modelType)).run(input)
}


-/**
-* Model types supported in Naive Bayes:
-* multinomial and Bernoulli currently supported
-*/
+/** Provides static methods for using ModelType. */
sealed abstract class ModelType

object MODELTYPE {
@@ -348,10 +342,12 @@ object NaiveBayes {

final val ModelType = MODELTYPE
Member

Add doc, perhaps something like "Provides static methods for using ModelType"


+/** Constant for specifying ModelType parameter: multinomial model */
final val Multinomial: ModelType = new ModelType {
Member

Add doc, perhaps something like "Constant for specifying ModelType parameter: Multinomial model"

override def toString: String = ModelType.MULTINOMIAL_STRING
}

+/** Constant for specifying ModelType parameter: bernoulli model */
final val Bernoulli: ModelType = new ModelType {
Member

Add doc, perhaps something like "Constant for specifying ModelType parameter: Bernoulli model"

override def toString: String = ModelType.BERNOULLI_STRING
}
@@ -58,7 +58,7 @@ object NaiveBayesSuite {
for (i <- 0 until nPoints) yield {
val y = calcLabel(rnd.nextDouble(), _pi)
val xi = dataModel match {
-case NaiveBayes.Bernoulli => Array.tabulate[Double] (D) {j =>
+case NaiveBayes.Bernoulli => Array.tabulate[Double] (D) { j =>
if (rnd.nextDouble () < _theta(y)(j) ) 1 else 0
}
case NaiveBayes.Multinomial =>
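
For readers unfamiliar with the Bernoulli branch of the test data generator in this hunk, a standalone sketch of the same sampling idea follows. `BernoulliSampleSketch` and `sampleFeatures` are hypothetical names, and `probs` is assumed to hold plain probabilities (not logs) for one class.

```scala
import scala.util.Random

object BernoulliSampleSketch {
  /** Draws one 0/1 feature vector; feature j is 1 with probability probs(j). */
  def sampleFeatures(probs: Array[Double], rnd: Random): Array[Double] =
    Array.tabulate[Double](probs.length) { j =>
      // nextDouble() is uniform on [0, 1), so this is true with probability probs(j).
      if (rnd.nextDouble() < probs(j)) 1.0 else 0.0
    }
}
```

Passing a seeded `Random`, as the suite does with fixed seeds such as 42 and 17, keeps the generated data reproducible between runs.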
@@ -118,23 +118,15 @@ class NaiveBayesSuite extends FunSuite with MLlibTestSparkContext {
).map(_.map(math.log))

val testData = NaiveBayesSuite.generateNaiveBayesInput(
-pi,
-theta,
-nPoints,
-42,
-NaiveBayes.Multinomial)
+pi, theta, nPoints, 42, NaiveBayes.Multinomial)
val testRDD = sc.parallelize(testData, 2)
testRDD.cache()

val model = NaiveBayes.train(testRDD, 1.0, "multinomial")
validateModelFit(pi, theta, model)

val validationData = NaiveBayesSuite.generateNaiveBayesInput(
-pi,
-theta,
-nPoints,
-17,
-NaiveBayes.Multinomial)
+pi, theta, nPoints, 17, NaiveBayes.Multinomial)
val validationRDD = sc.parallelize(validationData, 2)

// Test prediction on RDD.