Changes from 1 commit
comments and doc refine
YY-OnCall committed Mar 15, 2017
commit de1bfc8eb48015ea629dea5bdc72ba913b76d234
11 changes: 6 additions & 5 deletions docs/ml-frequent-pattern-mining.md
@@ -35,15 +35,16 @@ We refer users to the papers for more details.

* `minSupport`: the minimum support for an itemset to be identified as frequent.
For example, if an item appears 3 out of 5 transactions, it has a support of 3/5=0.6.
* `minConfidence`: minimum confidence for generating Association Rule. The parameter has no effect during `fit`, but specify
the minimum confidence for generating association rules from frequent itemsets.
* `numPartitions`: the number of partitions used to distribute the work.
* `minConfidence`: minimum confidence for generating Association Rule. The parameter will not affect the mining
for frequent itemsets,, but specify the minimum confidence for generating association rules from frequent itemsets.
Contributor:
It might be good to give an example for confidence as well since one has been given for support

Contributor:
also, there are two commas after itemsets
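
A worked example of confidence, with hypothetical numbers in the spirit of the support example above: if itemset {A} appears in 4 out of 5 transactions and {A, B} appears in 3 of them, the rule A => B has a confidence of (3/5)/(4/5) = 3/4 = 0.75; it would be kept with minConfidence = 0.6 but dropped under the default of 0.8.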

* `numPartitions`: the number of partitions used to distribute the work. By default the param is not set, and
partition number of the input dataset is used.
Contributor:
the number of partitions of the input dataset
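
A minimal sketch of the behavior discussed in this thread (assuming the `setNumPartitions` setter that pairs with the `getNumPartitions` shown later in this diff; the values are hypothetical):

```scala
import org.apache.spark.ml.fpm.FPGrowth

// Unset (the default): the FP-Growth run keeps the input dataset's partition count.
val fpDefault = new FPGrowth().setMinSupport(0.5).setMinConfidence(0.6)

// Explicitly set: redistribute the mining work across 16 partitions.
val fpTuned = new FPGrowth()
  .setMinSupport(0.5)
  .setMinConfidence(0.6)
  .setNumPartitions(16)
```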


The `FPGrowthModel` provides:

* `freqItemsets`: frequent itemsets in the format of DataFrame("items"[Seq], "freq"[Long])
* `freqItemsets`: frequent itemsets in the format of DataFrame("items"[Array], "freq"[Long])
* `associationRules`: association rules generated with confidence above `minConfidence`, in the format of
DataFrame("antecedent"[Seq], "consequent"[Seq], "confidence"[Double]).
DataFrame("antecedent"[Array], "consequent"[Array], "confidence"[Double]).
* `transform`: The transform method examines the input items against all the association rules and
summarizes the consequents as prediction. The prediction column has the same data type as the
Contributor:
I don't think this really explains what transform does or maybe it's just me?

I would have said something like:

The transform method will produce predictionCol containing all the consequents of the association rules containing the items in itemsCol as their antecedents. The prediction column...

Contributor Author:
Thanks for the suggestion. I do wish to have a better illustration here. But the two "containing"s in your version make it not that straightforward, and actually it should be: the items in itemsCol contain the antecedents of the association rules.

I extend it to a longer version,

For each record in itemsCol, the transform method will compare its items against the antecedents of each association rule. If the record contains all the antecedents of a specific association rule, the rule will be considered as applicable and its consequents will be added to the prediction result. The transform method will summarize the consequents from all the applicable rules as prediction.

Contributor:
even better 👍

input column and does not contain existing items in the input column.
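
To make the transform semantics concrete, here is a minimal Scala sketch mirroring the Scala example later in this diff (the rule discussed in the comments is illustrative for this dataset):

```scala
import org.apache.spark.ml.fpm.FPGrowth
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("FPGrowthTransformSketch").getOrCreate()
import spark.implicits._

val dataset = spark.createDataset(Seq(
  "1 2 5",
  "1 2 3 5",
  "1 2")
).map(t => t.split(" ")).toDF("features")

val model = new FPGrowth()
  .setMinSupport(0.5)
  .setMinConfidence(0.6)
  .fit(dataset)

// DataFrame("items"[Array], "freq"[Long])
model.freqItemsets.show()
// DataFrame("antecedent"[Array], "consequent"[Array], "confidence"[Double])
model.associationRules.show()

// Suppose mining produced the rule [1] => [5] with confidence 2/3 (illustrative).
// For the record [1, 2], the antecedent [1] is contained in the record, so the
// rule applies and 5 joins the prediction; for [1, 2, 5], item 5 is already
// present in the input column and is not repeated in the prediction.
model.transform(dataset).show()

spark.stop()
```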
@@ -48,24 +48,20 @@ public static void main(String[] args) {
});
Dataset<Row> itemsDF = spark.createDataFrame(data, schema);

// Learn a mapping from words to Vectors.
FPGrowth fpgrowth = new FPGrowth()
FPGrowthModel model = new FPGrowth()
.setMinSupport(0.5)
.setMinConfidence(0.6);
.setMinConfidence(0.6)
.fit(itemsDF);

FPGrowthModel model = fpgrowth.fit(itemsDF);

// get frequent itemsets.
// Display frequent itemsets.
model.freqItemsets().show();

// get generated association rules.
// Display generated association rules.
model.associationRules().show();

// transform examines the input items against all the association rules and summarize the
// consequents as prediction
Dataset<Row> result = model.transform(itemsDF);

result.show();
model.transform(itemsDF).show();
// $example off$

spark.stop();
@@ -47,7 +47,6 @@ object FPGrowthExample {
"1 2")
).map(t => t.split(" ")).toDF("features")
Member @BryanCutler (Mar 2, 2017):
I think it is better to explicitly declare the data instead of manipulating strings; that way it is very clear what the input data is for the example. On second thought, never mind this comment - it's pretty clear the way it is


// Trains a FPGrowth model.
val fpgrowth = new FPGrowth().setMinSupport(0.5).setMinConfidence(0.6)
val model = fpgrowth.fit(dataset)

22 changes: 10 additions & 12 deletions mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala
@@ -17,7 +17,6 @@

package org.apache.spark.ml.fpm

import scala.collection.mutable.ArrayBuffer
import scala.reflect.ClassTag

import org.apache.hadoop.fs.Path
@@ -41,7 +40,7 @@ private[fpm] trait FPGrowthParams extends Params with HasFeaturesCol with HasPre

/**
* Minimal support level of the frequent pattern. [0.0, 1.0]. Any pattern that appears
* more than (minSupport * size-of-the-dataset) times will be output
* more than (minSupport * size-of-the-dataset) times will be output in the frequent itemsets.
* Default: 0.3
* @group param
*/
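
A concrete reading of this threshold (hypothetical numbers): with the default minSupport of 0.3 and a dataset of 1,000 transactions, an itemset must appear in more than 0.3 * 1000 = 300 transactions to be output as frequent.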
@@ -69,8 +68,8 @@ private[fpm] trait FPGrowthParams extends Params with HasFeaturesCol with HasPre
def getNumPartitions: Int = $(numPartitions)

/**
* Minimal confidence for generating Association Rule.
* Note that minConfidence has no effect during fitting.
* Minimal confidence for generating Association Rule. MinConfidence will not affect the mining
Member:
lower case minConfidence?

Member:
ping

Contributor Author:
got it.

* for frequent itemsets, but will affect the association rules generation.
* Default: 0.8
* @group param
*/
@@ -154,7 +153,6 @@ class FPGrowth @Since("2.2.0") (
}
val parentModel = mllibFP.run(items)
val rows = parentModel.freqItemsets.map(f => Row(f.items, f.freq))

val schema = StructType(Seq(
StructField("items", dataset.schema($(featuresCol)).dataType, nullable = false),
StructField("freq", LongType, nullable = false)))
@@ -183,7 +181,7 @@ object FPGrowth extends DefaultParamsReadable[FPGrowth] {
* :: Experimental ::
* Model fitted by FPGrowth.
*
* @param freqItemsets frequent items in the format of DataFrame("items"[Seq], "freq"[Long])
* @param freqItemsets frequent itemsets in the format of DataFrame("items"[Array], "freq"[Long])
*/
@Since("2.2.0")
@Experimental
@@ -303,13 +301,13 @@ private[fpm] object AssociationRules {

/**
* Computes the association rules with confidence above minConfidence.
* @param dataset DataFrame("items", "freq") containing frequent itemset obtained from
* algorithms like [[FPGrowth]].
* @param dataset DataFrame("items"[Array], "freq"[Long]) containing frequent itemsets obtained
* from algorithms like [[FPGrowth]].
* @param itemsCol column name for frequent itemsets
* @param freqCol column name for frequent itemsets count
* @param minConfidence minimum confidence for the result association rules
* @return a DataFrame("antecedent", "consequent", "confidence") containing the association
* rules.
* @param freqCol column name for appearance count of the frequent itemsets
* @param minConfidence minimum confidence for generating the association rules
* @return a DataFrame("antecedent"[Array], "consequent"[Array], "confidence"[Double])
* containing the association rules.
*/
def getAssociationRulesFromFP[T: ClassTag](
dataset: Dataset[_],
@@ -19,7 +19,7 @@ package org.apache.spark.ml.fpm
import org.apache.spark.SparkFunSuite
import org.apache.spark.ml.util.DefaultReadWriteTest
import org.apache.spark.mllib.util.MLlibTestSparkContext
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

@@ -91,6 +91,8 @@ class FPGrowthSuite extends SparkFunSuite with MLlibTestSparkContext with Defaul
.setMinConfidence(0.5678)
assert(fpGrowth.getMinSupport === 0.4567)
assert(model.getMinConfidence === 0.5678)
// numPartitions should not have default value.
assert(fpGrowth.isDefined(fpGrowth.numPartitions) === false)
}

test("read/write") {