
Conversation

@zhangjiajin
Contributor

Add parallel PrefixSpan algorithm and test file.
Support non-temporal sequences.

Contributor

Can you make it generic type so that it can accept not only Array[Int] as input?

Contributor Author

The Int data type is smaller than some other data types, such as String. So, inside the algorithm, we use Int to get better performance.

If a user has another data type, like String, they can encode the Strings to Ints and decode the results back to Strings. For example:

val sequences = Array(
  Array("3", "1", "3", "4", "5"),
  Array("2", "3", "1"),
  Array("3", "4", "4", "3"),
  Array("1", "3", "4", "5"),
  Array("2", "4", "1"),
  Array("6", "5", "3"))
val rdd = sc.parallelize(sequences, 2).cache()

// create coder and decoder
val letterMap = rdd.flatMap(x => x.distinct).distinct().zipWithIndex().mapValues(_.toInt).collect
val coder = letterMap.toMap
val decoder = letterMap.map(x => (x._2, x._1)).toMap

// code
val intRdd = rdd.map(x => x.map(y => coder(y)))

val prefixspan1 = new Prefixspan(intRdd, 2, 50)
val result = prefixspan1.run()

// decode
val stringResult = result.map(x => (x._1.map(y => decoder(y)), x._2))

I think this is a general need; some other algorithms may need this function too. Can we add the coder and decoder as a separate module shared by all algorithms?

Contributor

FPGrowth accepts generic item type and encode items into indices after the first step. We can do that in a follow-up PR.

Contributor

The problem definition in the referenced paper suggests this should actually be Array[Array[Item]], where Item is a generic type. I agree that this extension can be deferred to a later PR and that the current code is fine as is.

Contributor

Actually, can the change Array[Int] -> Array[Array[Int]] be made in this PR? It affects the kind of a parameter exposed in a public API, and it will be easier to generalize later (for now we can just flatten the array and use the existing implementation).

Contributor

@feynmanliang The generalization for non-temporal items would happen in a follow-up PR before 1.5. This is discussed on the JIRA page.

Contributor

@mengxr This PR actually implements non-temporal (by temporal you mean consecutive, correct?) sequential pattern mining, see L86 and L149. My suggestion was to make the definition of a sequence consistent with the Han et al. paper (an itemset is a subset of items; a sequence is an ordered list of itemsets), since this PR currently defines a sequence to be an ordered list of singleton itemsets.

Contributor

Synced offline with @mengxr, Array[Int] OK for this PR.

Contributor Author

OK, it will be fixed in a follow-up PR before 1.5, which is coming soon.

@mengxr
Contributor

mengxr commented Jul 7, 2015

add to whitelist

@mengxr
Contributor

mengxr commented Jul 7, 2015

@feynmanliang Could you take a look?

@SparkQA

SparkQA commented Jul 7, 2015

Test build #36686 has finished for PR 7258 at commit 91fd7e6.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Prefixspan(

Contributor

Add ::Experimental:: here and @Experimental to class PrefixSpan.
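For reference, the requested change would look roughly like the sketch below. The constructor parameter names and types are hypothetical, inferred from the earlier `new Prefixspan(intRdd, 2, 50)` example; only the `:: Experimental ::` Scaladoc tag and the `@Experimental` annotation are what this comment asks for.

```scala
import org.apache.spark.annotation.Experimental
import org.apache.spark.rdd.RDD

/**
 * :: Experimental ::
 * A parallel PrefixSpan algorithm to mine sequential patterns.
 */
@Experimental
class PrefixSpan(
    val sequences: RDD[Array[Int]],   // hypothetical parameter names/types
    val minSupport: Double,
    val maxPatternLength: Int)
```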

Contributor Author

Fixed.

@mengxr
Contributor

mengxr commented Jul 7, 2015

@zhangjiajin Thanks for making this PR minimal! I made some comments about code style. @feynmanliang will make a pass on the implementation. @jackylk Could you make a pass as well?

Contributor

We should be consistent with FPGrowth and define minSupport to be a double in [0,1] indicating the proportion of transactions containing the sequential pattern (I realize that Han et al define it as the total count but Wikipedia, which is referenced below, follows the same definition I am suggesting).

This just means that we will need to multiply by sequences.count() before doing any of the filtering by minSupport.
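Concretely, the conversion could be a one-time computation before filtering; a sketch, where `patternCounts` is a hypothetical collection of (pattern, count) pairs:

```scala
// Convert the fractional minSupport in [0, 1] into an absolute count once,
// then filter candidate patterns against that absolute threshold.
val minCount = math.ceil(minSupport * sequences.count()).toLong
val frequent = patternCounts.filter { case (_, count) => count >= minCount }
```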

Contributor Author

Fixed; changed the minSupport data type to a double in [0, 1].

Contributor

This is too expensive (it creates many temp objects). You only need the counts, so create a PrimitiveKeyOpenHashMap (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/PrimitiveKeyOpenHashMap.scala) and count one by one.
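A sketch of the suggested counting loop, assuming the `changeValue(key, default, merge)` API of Spark's internal open hash map family; `sequences` here is a hypothetical local `Array[Array[Int]]`:

```scala
import org.apache.spark.util.collection.PrimitiveKeyOpenHashMap

// Count, one item at a time, how many sequences contain each item,
// without materializing intermediate (item, 1) pair objects.
val counts = new PrimitiveKeyOpenHashMap[Int, Int]()
sequences.foreach { seq =>
  seq.distinct.foreach { item =>
    counts.changeValue(item, 1, _ + 1)
  }
}
```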

Contributor Author

For example, if the sequences are: , then the frequent items and counts are: (a:2), (b:2), (c:1), (d:3), (e:2), (f:1), (g:1). I have no idea how to use PrimitiveKeyOpenHashMap here.

@mengxr
Contributor

mengxr commented Jul 9, 2015

@zhangjiajin Since you already collected the frequent items (length-1 patterns) to driver, you don't need to keep the RDD of length-1 patterns. When generating the final patterns, recomputing the RDD is more expensive than parallelizing the collected ones.

Another comment is to separate local computation from the distributed ones. It makes the implementation easier to read. We can create a private object called LocalPrefixSpan, with run(sequences: Array[Array[Int]], minCount: Int): Array[(Array[Int], Int)], then put all local methods under this object.
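One possible shape for that object (a sketch under the stated signature, not the code that was merged): count frequent items in the current projected database, then recurse on each item's suffix projections.

```scala
private object LocalPrefixSpan {

  /** Mine sequential patterns locally; minCount is the absolute support threshold. */
  def run(sequences: Array[Array[Int]], minCount: Int): Array[(Array[Int], Int)] = {
    // Support of a length-1 pattern = number of sequences containing the item.
    val counts = scala.collection.mutable.Map.empty[Int, Int].withDefaultValue(0)
    sequences.foreach(seq => seq.distinct.foreach(item => counts(item) += 1))
    counts.toArray.filter(_._2 >= minCount).flatMap { case (item, count) =>
      // Project each sequence onto the suffix after the item's first occurrence.
      val projected = sequences.flatMap { seq =>
        val i = seq.indexOf(item)
        if (i >= 0) Some(seq.drop(i + 1)) else None
      }
      val prefix = Array(item)
      (prefix, count) +: run(projected, minCount).map { case (suffix, c) =>
        (prefix ++ suffix, c)
      }
    }
  }
}
```

The recursion terminates because each projection strictly shortens the sequences; a maxPatternLength cutoff, as in the PR's public API, would bound it explicitly.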

@SparkQA

SparkQA commented Jul 9, 2015

Test build #36936 has finished for PR 7258 at commit ba5df34.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhangjiajin
Contributor Author

@mengxr I don't see why method 2 does projection before filtering. I think method 2 is exactly what you want. The only functionality that needs to be added to the current code is step 4 (Do we have enough candidates to distribute the work? If not, go to 1 and generate candidates with length + 1.)

[two attached images illustrating the candidate methods]

@SparkQA

SparkQA commented Jul 10, 2015

Test build #37035 has finished for PR 7258 at commit 574e56c.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

"the key of pair is pattern (a list of elements)," -> "the key of pair is sequential pattern (a list of items),"

Contributor Author

Fixed.

@zhangjiajin
Contributor Author

@feynmanliang comments: Delete makePrefixProjectedDatabases; move the groupByKey() to the last call in this method (no need to include the two map()s on L161 and L163 since they don't do anything).

Because the pair's key is an Array, groupByKey() doesn't work well, so the Array must be converted to a Seq before groupByKey and converted back afterward.
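The underlying issue is that Scala Arrays are Java arrays, whose equals/hashCode are identity-based, so equal-content Array keys are never merged by groupByKey. A sketch of the workaround described above, assuming a SparkContext `sc` and a hypothetical RDD of (pattern, value) pairs:

```scala
import org.apache.spark.rdd.RDD

// Two keys with identical contents, but distinct identities: groupByKey
// on the raw Array keys would leave them in separate groups.
val pairs: RDD[(Array[Int], Int)] =
  sc.parallelize(Seq((Array(1, 2), 1), (Array(1, 2), 1)))

// Convert the Array key to a Seq (structural equality) before groupByKey,
// then convert back afterwards.
val grouped = pairs
  .map { case (pattern, v) => (pattern.toSeq, v) }
  .groupByKey()
  .map { case (pattern, vs) => (pattern.toArray, vs) }
```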

@feynmanliang
Contributor

@zhangjiajin Yep, you're right. Thanks for pointing it out!

@feynmanliang
Contributor

LGTM pending tests

@SparkQA

SparkQA commented Jul 11, 2015

Test build #37073 has finished for PR 7258 at commit ca9c4c8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Contributor

mengxr commented Jul 11, 2015

@zhangjiajin Let's merge this version and make improvements in follow-up PRs:

  1. LocalPrefixSpan performance
    a. run should output Iterator instead of Array.
    b. Local count shouldn't use groupBy, which creates too many arrays. We can use PrimitiveKeyOpenHashMap.
    c. We can use list to avoid materialize frequent sequences.
  2. If there are not enough length-1 patterns to distribute the work, we make another pass to find length-2 patterns.
  3. Support non-temporal sequences.

I will make JIRAs for each of them. @feynmanliang will work on 1). @zhangjiajin Could you work on 2) and do some performance benchmark? Thanks!

@mengxr
Contributor

mengxr commented Jul 11, 2015

Merged into master. Thanks for contributing PrefixSpan!

@zhangjiajin
Contributor Author

OK

zhangjiajin added a commit that referenced this pull request Jul 11, 2015
…an to Spark MLlib

Add parallel PrefixSpan algorithm and test file.
Support non-temporal sequences.

Author: zhangjiajin <[email protected]>
Author: zhang jiajin <[email protected]>

Closes #7258 from zhangjiajin/master and squashes the following commits:

ca9c4c8 [zhangjiajin] Modified the code according to the review comments.
574e56c [zhangjiajin] Add new object LocalPrefixSpan, and do some optimization.
ba5df34 [zhangjiajin] Fix a Scala style error.
4c60fb3 [zhangjiajin] Fix some Scala style errors.
1dd33ad [zhangjiajin] Modified the code according to the review comments.
89bc368 [zhangjiajin] Fixed a Scala style error.
a2eb14c [zhang jiajin] Delete PrefixspanSuite.scala
951fd42 [zhang jiajin] Delete Prefixspan.scala
575995f [zhangjiajin] Modified the code according to the review comments.
91fd7e6 [zhangjiajin] Add new algorithm PrefixSpan and test file.
zhangjiajin reopened this Jul 15, 2015