@WeichenXu123
Contributor

@WeichenXu123 WeichenXu123 commented Apr 4, 2018

What changes were proposed in this pull request?

Adds a PrefixSpan API to spark.ml. This is a new implementation that replaces #20810.

How was this patch tested?

TestSuite added.

@SparkQA

SparkQA commented Apr 4, 2018

Test build #88873 has finished for PR 20973 at commit d563c8f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 4, 2018

Test build #88885 has finished for PR 20973 at commit bd0ce07.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • class PrefixSpanSuite extends MLTest

Member

@jkbradley jkbradley left a comment

Thanks for the PR!

For the tests, can you please structure them as follows?

  • If a test dataset is only used in one test, it's fine to put it in the test() call itself, rather than in a shared location.
  • If you copied a test from spark.mllib, please copy other info for that test, such as R code to reproduce the expected results.

Thanks!

Member

We never want to use default arguments in Scala APIs since they are not Java-friendly. Let's just state recommended values in the docstrings. We can add defaults when we create an Estimator.
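
To illustrate the Java-friendliness concern, here is a minimal sketch in plain Scala. The names `MinerApi`, `mineWithDefaults`, and `mine`, and the values 0.1 and 10, are hypothetical and are not taken from the PR. Scala compiles default arguments into synthetic `name$default$N` methods, so a Java caller cannot simply omit them; stating the recommended values in the docstring keeps one explicit signature that works from both languages.

```scala
object MinerApi {
  // Default arguments: convenient from Scala, but a Java caller
  // cannot leave them out and must pass every argument anyway.
  def mineWithDefaults(minSupport: Double = 0.1, maxPatternLength: Int = 10): String =
    s"minSupport=$minSupport, maxPatternLength=$maxPatternLength"

  /**
   * Java-friendly variant: every argument is explicit and the
   * recommended values live in the documentation instead.
   *
   * @param minSupport       minimal support level (illustrative recommended value: 0.1)
   * @param maxPatternLength maximal pattern length (illustrative recommended value: 10)
   */
  def mine(minSupport: Double, maxPatternLength: Int): String =
    s"minSupport=$minSupport, maxPatternLength=$maxPatternLength"
}
```

From Scala, `MinerApi.mineWithDefaults()` and `MinerApi.mine(0.1, 10)` produce the same result; only the second form reads the same way from Java.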

Member

Let's fix this phrasing by just saying "the maximal length of the sequential pattern." The other part ("any pattern that appears...") does not make sense. Feel free to fix that in the old API doc too.

Member

Be very explicit about the output schema please: For each column, provide the name and DataType.

Member

rename: findFrequentSequentPatterns -> findFrequentSequentialPatterns

Member

We don't really need this handlePersistence logic here since it's handled by the spark.mllib implementation.

Member

Let's check the input schema and throw a clear exception if it's not OK.
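
One way such a check could look, as a sketch only: the helper `requireSequenceColumn` and its error messages are hypothetical and not part of the PR. It relies on Spark SQL's `StructType` and `ArrayType` (so spark-sql must be on the classpath) and assumes the expected input type is an array of arrays, matching the `Seq[Seq[T]]` shape of a sequence column.

```scala
import org.apache.spark.sql.types.{ArrayType, StructType}

object SequenceSchemaCheck {
  /**
   * Hypothetical helper: fails fast with a descriptive message unless
   * `sequenceCol` exists and has type array<array<T>>.
   */
  def requireSequenceColumn(schema: StructType, sequenceCol: String): Unit = {
    // require throws IllegalArgumentException when the condition is false.
    require(schema.fieldNames.contains(sequenceCol),
      s"Column '$sequenceCol' does not exist. " +
        s"Available columns: ${schema.fieldNames.mkString(", ")}")

    schema(sequenceCol).dataType match {
      case ArrayType(ArrayType(_, _), _) => // OK: a sequence of itemsets
      case other =>
        throw new IllegalArgumentException(
          s"Column '$sequenceCol' must be of type array<array<T>>, " +
            s"but was ${other.catalogString}")
    }
  }
}
```

A caller would run `SequenceSchemaCheck.requireSequenceColumn(dataset.schema, sequenceCol)` before handing the data to the underlying implementation, so a wrong column type surfaces as a clear message rather than a cast error deep inside the job.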

Member

It'd be nice to document that rows with nulls in this column are ignored.

Member

You could add a unit test for that too.

Member

nit: I'd prefer to call the column "frequency"

@SparkQA

SparkQA commented Apr 25, 2018

Test build #4158 has finished for PR 20973 at commit bd0ce07.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
      • class PrefixSpanSuite extends MLTest

@SparkQA

SparkQA commented Apr 25, 2018

Test build #89836 has finished for PR 20973 at commit dc7d779.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 25, 2018

Test build #89837 has finished for PR 20973 at commit acbf9e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@jkbradley jkbradley left a comment

Thanks for the updates! They look good. I just noticed one small mistake I made in the original review.

* @return A `DataFrame` that contains columns of sequence and corresponding frequency.
* The schema of it will be:
* - `sequence: Seq[Seq[T]]` (T is the item type)
* - `frequency: Long`
Member

I had asked for this change from "freq" to "frequency", but I belatedly realized that this conflicts with the existing FPGrowth API, which uses "freq". It would be best to maintain consistency. Would you mind reverting to "freq"?

Contributor Author

sure!

@jkbradley
Member

LGTM pending jenkins tests

@SparkQA

SparkQA commented May 1, 2018

Test build #4162 has finished for PR 20973 at commit 76d4119.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

Rerunning tests in case the R CRAN failure was from flakiness

@SparkQA

SparkQA commented May 1, 2018

Test build #4165 has finished for PR 20973 at commit 76d4119.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123
Contributor Author

Jenkins, test this please.

@SparkQA

SparkQA commented May 2, 2018

Test build #90040 has finished for PR 20973 at commit 76d4119.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123
Contributor Author

Jenkins, test this please.

@SparkQA

SparkQA commented May 3, 2018

Test build #90116 has finished for PR 20973 at commit 76d4119.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

jkbradley commented May 7, 2018

Merging with master
Thanks, @WeichenXu123!

@asfgit asfgit closed this in 76ecd09 May 7, 2018

@Since("2.4.0")
def findFrequentSequentialPatterns(
    dataset: Dataset[_],
    sequenceCol: String,
Contributor

@WeichenXu123 @jkbradley The static method doesn't scale with parameters. If we add a new param, we have to keep the old method for binary compatibility. Why not use setters? I think we only need to avoid using the names fit and transform.
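
The scaling problem can be sketched in plain Scala without Spark. Everything here is hypothetical (`StaticMiner`, `SetterMiner`, the parameter names and the defaults 0.1 and 10); it only contrasts the two API shapes and is not the PR's code.

```scala
// Static style: each new parameter forces a new overload, and every
// older overload has to stay to preserve binary compatibility.
object StaticMiner {
  // Original signature, kept only so existing callers still link.
  def find(data: Seq[String], minSupport: Double): String =
    find(data, minSupport, 10)

  // Signature added when maxPatternLength was introduced.
  def find(data: Seq[String], minSupport: Double, maxPatternLength: Int): String =
    s"n=${data.size}, minSupport=$minSupport, maxPatternLength=$maxPatternLength"
}

// Setter style: a new parameter adds one field and one setter, while
// the signature of find() and all existing call sites stay unchanged.
final class SetterMiner {
  private var minSupport: Double = 0.1
  private var maxPatternLength: Int = 10

  def setMinSupport(value: Double): this.type = { minSupport = value; this }
  def setMaxPatternLength(value: Int): this.type = { maxPatternLength = value; this }

  def find(data: Seq[String]): String =
    s"n=${data.size}, minSupport=$minSupport, maxPatternLength=$maxPatternLength"
}
```

With the setter style, a call such as `new SetterMiner().setMinSupport(0.5).find(data)` keeps compiling and linking no matter how many parameters are added later.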

Contributor Author

I agree with using setters. @jkbradley What do you think?

Member

I agree in general, but I don’t think it’s a big deal for PrefixSpan. I think of our current static method as a temporary workaround until we do the work to build a Model which can make meaningful predictions. This will mean that further PrefixSpan improvements may be blocked on this Model work, but I think that’s OK since predictions should be the next priority for PrefixSpan. Once we have a Model, I recommend we deprecate the current static method.

I'm also OK with changing this to use setters, but then we should name it something else so that we can replace it with an Estimator + Model pair later on. I'd suggest "PrefixSpanBuilder."

Contributor

It should be easier to keep the PrefixSpan name and make it an Estimator later. For example:

final class PrefixSpan(override val uid: String) extends Params {
  // param, setters, getters
  def findFrequentSequentialPatterns(dataset: Dataset[_]): DataFrame
}

Later we can add Estimator.fit and PrefixSpanModel.transform. Any issue with this approach?
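
For a caller, the class sketched above might be used roughly as follows. This assumes the follow-up exposes setters named `setSequenceCol`, `setMinSupport`, and `setMaxPatternLength` (names chosen here after Spark's usual `setX` convention, not confirmed by this thread) and that the result columns are `sequence` and `freq` as discussed earlier. The toy dataset, the local `SparkSession`, and the object name are illustrative.

```scala
import org.apache.spark.ml.fpm.PrefixSpan
import org.apache.spark.sql.SparkSession

object PrefixSpanSetterExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("PrefixSpanSetterExample")
      .getOrCreate()
    import spark.implicits._

    // Each row is one sequence; each inner Seq is one itemset.
    val df = Seq(
      Seq(Seq(1, 2), Seq(3)),
      Seq(Seq(1), Seq(3, 2), Seq(1, 2)),
      Seq(Seq(1, 2), Seq(5)),
      Seq(Seq(6))
    ).toDF("sequence")

    // Parameters are configured through setters (assumed names), and
    // the mining method takes only the dataset.
    val patterns = new PrefixSpan()
      .setSequenceCol("sequence")
      .setMinSupport(0.5)
      .setMaxPatternLength(5)
      .findFrequentSequentialPatterns(df)

    patterns.printSchema()
    patterns.show(truncate = false)

    spark.stop()
  }
}
```

Because the method name is neither `fit` nor `transform`, this shape leaves room to add an Estimator and Model pair to the same class name later.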

Contributor Author

Wouldn't declaring it this way, as final class PrefixSpan(override val uid: String) extends Params, break binary compatibility if we later change it into an Estimator?

Contributor

Adding extends Estimator later should only introduce new methods to the class but no breaking changes.

Member

Oh, I think you're right, @mengxr. That approach sounds good.

Member

@WeichenXu123 Do you have time to send a PR to update this API?

Contributor Author

Sure. Will update soon!

4 participants