[SPARK-20114][ML] spark.ml parity for sequential pattern mining - PrefixSpan #20973
Conversation
Test build #88873 has finished for PR 20973 at commit
Test build #88885 has finished for PR 20973 at commit
jkbradley left a comment:
Thanks for the PR!
For the tests, can you please structure them as follows? (A small sketch of this layout follows after the list.)
- If a test dataset is only used in one test, it's fine to put it in the test() call itself, rather than in a shared location.
- If you copied a test from spark.mllib, please copy other info for that test, such as R code to reproduce the expected results.
Thanks!
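A minimal sketch of that layout, assuming a suite that mixes in MLlibTestSparkContext (so `spark` is in scope), `import org.apache.spark.ml.fpm.PrefixSpan`, and the setter-based API discussed later in this thread; the dataset and expected column names are only illustrative:

```scala
test("PrefixSpan on a tiny, test-local dataset") {
  import spark.implicits._

  // The dataset is used by this test only, so it lives inside the test() call.
  // In a test copied from spark.mllib, paste the R (or spark.mllib) snippet that
  // produced the expected results right here as a comment.
  val dataset = Seq(
    Seq(Seq(1, 2), Seq(3)),
    Seq(Seq(1), Seq(3, 2), Seq(1, 2))
  ).toDF("sequence")

  val result = new PrefixSpan()
    .setMinSupport(0.5)
    .setMaxPatternLength(5)
    .findFrequentSequentialPatterns(dataset)

  assert(result.columns.toSeq === Seq("sequence", "freq"))
}
```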
We never want to use default arguments in Scala APIs since they are not Java-friendly. Let's just state recommended values in the docstrings. We can add defaults when we create an Estimator.
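As a hypothetical illustration (not the actual API), Java callers cannot see Scala default arguments, so the recommended values belong in the docstring instead:

```scala
// From Java, run() must always be called with both arguments, so the Scala
// defaults buy Java users nothing while still locking in the signature.
object WithDefaults {
  def run(minSupport: Double = 0.1, maxPatternLength: Int = 10): Unit = ()
}

// Preferred: no defaults; recommended values are stated in the doc instead.
object WithoutDefaults {
  /**
   * @param minSupport recommended value: 0.1
   * @param maxPatternLength recommended value: 10
   */
  def run(minSupport: Double, maxPatternLength: Int): Unit = ()
}
```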
Let's fix this phrasing by just saying "the maximal length of the sequential pattern." (The other part, "any pattern that appears...", does not make sense.) Feel free to fix that in the old API doc too.
Please be very explicit about the output schema: for each column, provide the name and DataType.
rename: findFrequentSequentPatterns -> findFrequentSequentialPatterns
We don't really need this handlePersistence logic here since it's handled by the spark.mllib implementation.
Let's check the input schema and throw a clear exception if it's not OK.
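A rough sketch of the kind of check being asked for (the helper name and message wording are made up here, not the merged code):

```scala
import org.apache.spark.sql.types.{ArrayType, StructType}

// Fail fast with a readable error instead of a confusing cast failure later on.
def validateInputSchema(schema: StructType, sequenceCol: String): Unit = {
  require(schema.fieldNames.contains(sequenceCol),
    s"Input dataset must contain a column named '$sequenceCol'.")
  schema(sequenceCol).dataType match {
    case ArrayType(ArrayType(_, _), _) => // OK: a sequence of itemsets
    case other => throw new IllegalArgumentException(
      s"Column '$sequenceCol' must be an array of arrays (a sequence of itemsets), " +
        s"but its type is $other.")
  }
}
```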
It'd be nice to document that rows with nulls in this column are ignored.
You could add a unit test for that too.
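A sketch of such a test, under the same assumptions as the earlier sketch (MLlibTestSparkContext, the setter-based API) and with a made-up dataset:

```scala
test("rows with a null sequence are ignored") {
  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types.{ArrayType, IntegerType, StructType}

  // Three rows, one of which has a null sequence; it should contribute nothing.
  val schema = new StructType().add("sequence", ArrayType(ArrayType(IntegerType)))
  val rows = Seq(
    Row(Seq(Seq(1, 2), Seq(3))),
    Row(null),
    Row(Seq(Seq(1), Seq(3, 2)))
  )
  val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)

  val withNullRow = new PrefixSpan().setMinSupport(0.5)
    .findFrequentSequentialPatterns(df)
  val withoutNullRow = new PrefixSpan().setMinSupport(0.5)
    .findFrequentSequentialPatterns(df.na.drop(Seq("sequence")))

  // Dropping the null row by hand must not change the mined patterns.
  assert(withNullRow.collect().toSet === withoutNullRow.collect().toSet)
}
```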
nit: I'd prefer to call the column "frequency"
Test build #4158 has finished for PR 20973 at commit
dc7d779 to acbf9e4
Test build #89836 has finished for PR 20973 at commit
Test build #89837 has finished for PR 20973 at commit
jkbradley left a comment:
Thanks for the updates! They look good. I just noticed one small mistake I made in the original review.
 * @return A `DataFrame` that contains columns of sequence and corresponding frequency.
 *         The schema of it will be:
 *          - `sequence: Seq[Seq[T]]` (T is the item type)
 *          - `frequency: Long`
I had asked for this change from "freq" to "frequency", but I belatedly realized that this conflicts with the existing FPGrowth API, which uses "freq". It would be best to maintain consistency. Would you mind reverting to "freq"?
sure!
LGTM pending Jenkins tests
Test build #4162 has finished for PR 20973 at commit
Rerunning tests in case the R CRAN failure was from flakiness
Test build #4165 has finished for PR 20973 at commit
Jenkins, test this please.
Test build #90040 has finished for PR 20973 at commit
Jenkins, test this please.
Test build #90116 has finished for PR 20973 at commit
Merging with master
| @Since("2.4.0") | ||
| def findFrequentSequentialPatterns( | ||
| dataset: Dataset[_], | ||
| sequenceCol: String, |
@WeichenXu123 @jkbradley The static method doesn't scale with parameters: if we add a new param, we have to keep the old method around for binary compatibility. Why not use setters? I think we only need to avoid using the fit and transform names.
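A toy illustration of that scaling problem (hypothetical names): every new parameter forces a new overload of the static method, and the old overload must be kept for binary compatibility, whereas setters let the surface grow one method at a time.

```scala
object StaticStyle {
  // Original signature: must be kept forever once released.
  def findPatterns(data: Seq[String], minSupport: Double, maxPatternLength: Int): Unit = ()

  // Adding one knob means adding (and maintaining) a whole new overload.
  def findPatterns(data: Seq[String], minSupport: Double, maxPatternLength: Int,
      maxLocalProjDBSize: Long): Unit = ()
}

class SetterStyle {
  private var minSupport = 0.1
  def setMinSupport(value: Double): this.type = { minSupport = value; this }
  // A new knob is just one new setter; existing callers are untouched.
  def findPatterns(data: Seq[String]): Unit = ()
}
```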
I agree with using setters. @jkbradley What do you think?
I agree in general, but I don’t think it’s a big deal for PrefixSpan. I think of our current static method as a temporary workaround until we do the work to build a Model which can make meaningful predictions. This will mean that further PrefixSpan improvements may be blocked on this Model work, but I think that’s OK since predictions should be the next priority for PrefixSpan. Once we have a Model, I recommend we deprecate the current static method.
I'm also OK with changing this to use setters, but then we should name it something else so that we can replace it with an Estimator + Model pair later on. I'd suggest "PrefixSpanBuilder."
It should be easier to keep the PrefixSpan name and make it an Estimator later. For example:

```scala
final class PrefixSpan(override val uid: String) extends Params {
  // param, setters, getters
  def findFrequentSequentialPatterns(dataset: Dataset[_]): DataFrame
}
```

Later we can add Estimator.fit and PrefixSpanModel.transform. Any issue with this approach?
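For illustration, a call under that shape might look like the following; the setter names follow the usual spark.ml convention and are assumptions here, not quotes from the patch:

```scala
// `df` is a DataFrame with a `sequence: array<array<item>>` column.
val patterns = new PrefixSpan()
  .setMinSupport(0.5)
  .setMaxPatternLength(5)
  .setSequenceCol("sequence")
  .findFrequentSequentialPatterns(df)

patterns.show()  // columns: sequence (array of itemsets), freq (long)
```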
This way, wouldn't `final class PrefixSpan(override val uid: String) extends Params` break binary compatibility if we later change it into an Estimator?
Adding `extends Estimator` later should only introduce new methods to the class, with no breaking changes.
Oh, I think you're right, @mengxr. That approach sounds good.
@WeichenXu123 Do you have time to send a PR to update this API?
Sure. Will update soon!
What changes were proposed in this pull request?
Adds a PrefixSpan API to spark.ml. This is a new implementation that supersedes #20810.
How was this patch tested?
Test suite added.