[SPARK-6487][MLlib] Add sequential pattern mining algorithm PrefixSpan to Spark MLlib #7258
Conversation
Can you make it a generic type so that it can accept more than just Array[Int] as input?
The Int data type is smaller than some other data types, such as String, so inside the algorithm we use Int to get better performance.
If a user has another data type, like String, they can encode the Strings to Ints and decode the results back to Strings. For example:
val sequences = Array(
  Array("3", "1", "3", "4", "5"),
  Array("2", "3", "1"),
  Array("3", "4", "4", "3"),
  Array("1", "3", "4", "5"),
  Array("2", "4", "1"),
  Array("6", "5", "3"))
val rdd = sc.parallelize(sequences, 2).cache()
// create coder and decoder
val letterMap = rdd.flatMap(x => x.distinct).distinct().zipWithIndex().mapValues(_.toInt).collect
val coder = letterMap.toMap
val decoder = letterMap.map(x => (x._2, x._1)).toMap
// code
val intRdd = rdd.map(x => x.map(y => coder(y)))
val prefixspan1 = new Prefixspan(intRdd, 2, 50)
val result = prefixspan1.run()
// decode
val stringResult = result.map(x => (x._1.map(y => decoder(y)), x._2))
I think this is a general task, and some other algorithms may need this functionality too, so could we add the coder and decoder as a separate module for all of them?
FPGrowth accepts a generic item type and encodes items into indices after the first step. We can do that in a follow-up PR.
The problem definition in the referenced paper suggests this should actually be Array[Array[Item]], where Item is a generic type. I agree that this extension can be deferred to a later PR and that the current code is fine as is.
Actually, can the change Array[Int] -> Array[Array[Int]] be made in this PR? It affects the type of a parameter exposed in a public API and would make the later generalization easier (for now we can just flatten the array and use the existing implementation, as sketched below).
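For illustration, the temporary shim could be as small as this (run and runInternal are hypothetical names, not the PR's actual API):

import org.apache.spark.rdd.RDD

// Expose itemset sequences publicly, flatten to the current format internally.
def run(data: RDD[Array[Array[Int]]]): RDD[(Array[Int], Long)] =
  runInternal(data.map(_.flatten))

// Placeholder for the existing single-item implementation.
def runInternal(data: RDD[Array[Int]]): RDD[(Array[Int], Long)] = ???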
@feynmanliang The generalization for non-temporal items would happen in a follow-up PR before 1.5. This is discussed on the JIRA page.
@mengxr This PR actually implements non-temporal (by temporal you mean consecutive, correct?) sequential pattern mining; see L86 and L149. My suggestion was to make the definition of a sequence consistent with the Han et al. paper (an itemset is a subset of items; a sequence is an ordered list of itemsets), since this PR currently defines a sequence to be an ordered list of singleton itemsets.
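To make the distinction concrete (values are illustrative):

// Current PR: a sequence is an ordered list of single items.
val singletons: Array[Int] = Array(1, 2, 3)
// Han et al.: a sequence is an ordered list of itemsets, e.g. <(1)(2,3)(4)>,
// where items 2 and 3 occur together in the same itemset.
val itemsets: Array[Array[Int]] = Array(Array(1), Array(2, 3), Array(4))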
Synced offline with @mengxr, Array[Int] OK for this PR.
OK, it will be fixed in a follow-up PR before 1.5. That's soon.
add to whitelist
@feynmanliang Could you take a look?
Test build #36686 has finished for PR 7258 at commit
Add ::Experimental:: here and @Experimental to class PrefixSpan.
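For reference, this annotation pattern in Spark 1.x looks roughly like the following (the constructor parameters shown are illustrative, not necessarily the PR's exact signature):

import org.apache.spark.annotation.Experimental

/**
 * :: Experimental ::
 * Parallel PrefixSpan algorithm for mining frequent sequential patterns.
 */
@Experimental
class PrefixSpan(private var minSupport: Double, private var maxPatternLength: Int)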
Fixed.
@zhangjiajin Thanks for making this PR minimal! I made some comments about code style. @feynmanliang will make a pass on the implementation. @jackylk Could you make a pass as well?
We should be consistent with FPGrowth and define minSupport to be a double in [0, 1] indicating the proportion of transactions containing the sequential pattern. (I realize that Han et al. define it as the total count, but Wikipedia, which is referenced below, follows the definition I am suggesting.)
This just means that we will need to multiply by sequences.count() before doing any of the filtering by minSupport.
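Concretely, the conversion might look like this sketch (sequences and patternCounts are assumed names for the input RDD and the candidate pattern counts):

// Turn the relative threshold into an absolute count once, up front.
val minCount = math.ceil(minSupport * sequences.count()).toLong
// Filtering then proceeds on absolute counts, as the code already does.
val frequent = patternCounts.filter { case (_, count) => count >= minCount }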
Fixed; changed the minSupport data type to a double in [0, 1].
This is too expensive (it creates many temporary objects). You only need the counts, so create a PrimitiveKeyOpenHashMap (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/PrimitiveKeyOpenHashMap.scala) and count one by one.
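A sketch of the counting idea, assuming PrimitiveKeyOpenHashMap's changeValue(key, default, mergeFn) API; note the class is private[spark], so this compiles only inside Spark's own source tree:

import org.apache.spark.util.collection.PrimitiveKeyOpenHashMap

// Count how many sequences in this partition contain each item,
// without allocating an intermediate (item, 1) tuple per occurrence.
def countItems(sequences: Iterator[Array[Int]]): Iterator[(Int, Int)] = {
  val counts = new PrimitiveKeyOpenHashMap[Int, Int]()
  sequences.foreach { seq =>
    // distinct: an item is counted at most once per sequence
    seq.distinct.foreach(item => counts.changeValue(item, 1, _ + 1))
  }
  counts.iterator
}

Per-partition maps could then be combined via mapPartitions followed by reduceByKey.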
For example, the sequences are: , and the frequent items and counts are: (a:2), (b:2), (c:1), (d:3), (e:2), (f:1), (g:1). I have no idea how to use PrimitiveKeyOpenHashMap here.
@zhangjiajin Since you already collected the frequent items (length-1 patterns) to the driver, you don't need to keep the RDD of length-1 patterns. When generating the final patterns, recomputing the RDD is more expensive than parallelizing the collected ones. Another comment is to separate the local computation from the distributed one; it makes the implementation easier to read. We can create a private object called LocalPrefixSpan.
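A rough illustration of such a split: a local miner that recursively extends a prefix over an already-projected database. This is a sketch of the idea for single-item sequences, not the PR's actual LocalPrefixSpan:

import scala.collection.mutable

private object LocalPrefixSpan {
  // Emit (pattern, count) for every frequent extension of `prefix`.
  def run(
      minCount: Long,
      prefix: List[Int],
      database: Array[Array[Int]]): Iterator[(List[Int], Long)] = {
    // Count each item at most once per projected sequence.
    val counts = mutable.Map.empty[Int, Long].withDefaultValue(0L)
    database.foreach(seq => seq.distinct.foreach(item => counts(item) += 1))
    counts.iterator.filter(_._2 >= minCount).flatMap { case (item, count) =>
      val newPrefix = item :: prefix
      // Project: keep the suffix after the first occurrence of `item`.
      val projected = database.flatMap { seq =>
        val i = seq.indexOf(item)
        if (i >= 0 && i + 1 < seq.length) Some(seq.drop(i + 1)) else None
      }
      Iterator((newPrefix.reverse, count)) ++ run(minCount, newPrefix, projected)
    }
  }
}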
Test build #36936 has finished for PR 7258 at commit
@mengxr I don't know why method 2 does projection before filtering. I think method 2 is exactly what you want. The only functionality that needs to be added to the current code is step 4 (Do we have enough candidates to distribute the work? If not, go to step 1 and generate candidates with length + 1.)
Test build #37035 has finished for PR 7258 at commit
"the key of pair is pattern (a list of elements)," -> "the key of pair is sequential pattern (a list of items),"
Fixed.
@feynmanliang comments: Delete makePrefixProjectedDatabases and move the groupByKey() to the last call in this method (no need to include the two map()s on L161 and L163 since they don't do anything). Because the pair's key is an Array, groupByKey() doesn't work well, so the Array must be converted to a Seq before groupByKey and converted back after groupByKey.
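The root cause is that JVM arrays use reference equality and identity hash codes, so they cannot serve as shuffle keys directly. A sketch of the round-trip described above (projected is an assumed name and shape, not the PR's exact code):

import org.apache.spark.rdd.RDD

// Group projected postfixes by their prefix, converting the Array key
// to a Seq (value equality) for the shuffle and back afterwards.
def groupByPrefix(projected: RDD[(Array[Int], Array[Int])])
  : RDD[(Array[Int], Iterable[Array[Int]])] = {
  projected
    .map { case (prefix, postfix) => (prefix.toSeq, postfix) }
    .groupByKey()
    .map { case (prefix, postfixes) => (prefix.toArray, postfixes) }
}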
@zhangjiajin Yep, you're right. Thanks for pointing it out!
LGTM pending tests
Test build #37073 has finished for PR 7258 at commit
@zhangjiajin Let's merge this version and make improvements in follow-up PRs: 1) generalize the input to itemset sequences with a generic item type, and 2) performance improvements.
I will make JIRAs for each of them. @feynmanliang will work on 1). @zhangjiajin Could you work on 2) and do some performance benchmarking? Thanks!
Merged into master. Thanks for contributing PrefixSpan!
OK
[SPARK-6487][MLlib] Add sequential pattern mining algorithm PrefixSpan to Spark MLlib

Add parallel PrefixSpan algorithm and test file.
Support non-temporal sequences.

Author: zhangjiajin <[email protected]>
Author: zhang jiajin <[email protected]>

Closes #7258 from zhangjiajin/master and squashes the following commits:

ca9c4c8 [zhangjiajin] Modified the code according to the review comments.
574e56c [zhangjiajin] Add new object LocalPrefixSpan, and do some optimization.
ba5df34 [zhangjiajin] Fix a Scala style error.
4c60fb3 [zhangjiajin] Fix some Scala style errors.
1dd33ad [zhangjiajin] Modified the code according to the review comments.
89bc368 [zhangjiajin] Fixed a Scala style error.
a2eb14c [zhang jiajin] Delete PrefixspanSuite.scala
951fd42 [zhang jiajin] Delete Prefixspan.scala
575995f [zhangjiajin] Modified the code according to the review comments.
91fd7e6 [zhangjiajin] Add new algorithm PrefixSpan and test file.