[SPARK-20114][ML] spark.ml parity for sequential pattern mining - PrefixSpan #20973
Conversation
Test build #88873 has finished for PR 20973 at commit
Test build #88885 has finished for PR 20973 at commit
jkbradley left a comment:
Thanks for the PR!
For the tests, can you please structure them as follows? (A small sketch of this layout follows after the list.)
- If a test dataset is only used in one test, it's fine to put it in the test() call itself, rather than in a shared location.
- If you copied a test from spark.mllib, please copy other info for that test, such as R code to reproduce the expected results.
Thanks!
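A minimal sketch of that layout, assuming a suite that mixes in MLlibTestSparkContext (so `spark` is in scope), `import org.apache.spark.ml.fpm.PrefixSpan`, and the setter-based API discussed later in this thread; the dataset and expected column names are only illustrative:

```scala
test("PrefixSpan on a tiny, test-local dataset") {
  import spark.implicits._

  // The dataset is used by this test only, so it lives inside the test() call.
  // In a test copied from spark.mllib, paste the R (or spark.mllib) snippet that
  // produced the expected results right here as a comment.
  val dataset = Seq(
    Seq(Seq(1, 2), Seq(3)),
    Seq(Seq(1), Seq(3, 2), Seq(1, 2))
  ).toDF("sequence")

  val result = new PrefixSpan()
    .setMinSupport(0.5)
    .setMaxPatternLength(5)
    .findFrequentSequentialPatterns(dataset)

  assert(result.columns.toSeq === Seq("sequence", "freq"))
}
```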
We never want to use default arguments in Scala APIs since they are not Java-friendly. Let's just state recommended values in the docstrings. We can add defaults when we create an Estimator.
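As a hypothetical illustration (not the actual API), Java callers cannot see Scala default arguments, so the recommended values belong in the docstring instead:

```scala
// From Java, run() must always be called with both arguments, so the Scala
// defaults buy Java users nothing while still locking in the signature.
object WithDefaults {
  def run(minSupport: Double = 0.1, maxPatternLength: Int = 10): Unit = ()
}

// Preferred: no defaults; recommended values are stated in the doc instead.
object WithoutDefaults {
  /**
   * @param minSupport recommended value: 0.1
   * @param maxPatternLength recommended value: 10
   */
  def run(minSupport: Double, maxPatternLength: Int): Unit = ()
}
```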
Let's fix this phrasing by just saying "the maximal length of the sequential pattern." (The other part, "any pattern that appears...", does not make sense.) Feel free to fix that in the old API doc too.
Please be very explicit about the output schema: for each column, provide the name and DataType.
rename: findFrequentSequentPatterns -> findFrequentSequentialPatterns
We don't really need this handlePersistence logic here since it's handled by the spark.mllib implementation.
Let's check the input schema and throw a clear exception if it's not OK.
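A rough sketch of the kind of check being asked for (the helper name and message wording are made up here, not the merged code):

```scala
import org.apache.spark.sql.types.{ArrayType, StructType}

// Fail fast with a readable error instead of a confusing cast failure later on.
def validateInputSchema(schema: StructType, sequenceCol: String): Unit = {
  require(schema.fieldNames.contains(sequenceCol),
    s"Input dataset must contain a column named '$sequenceCol'.")
  schema(sequenceCol).dataType match {
    case ArrayType(ArrayType(_, _), _) => // OK: a sequence of itemsets
    case other => throw new IllegalArgumentException(
      s"Column '$sequenceCol' must be an array of arrays (a sequence of itemsets), " +
        s"but its type is $other.")
  }
}
```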
It'd be nice to document that rows with nulls in this column are ignored.
You could add a unit test for that too.
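A sketch of such a test, under the same assumptions as the earlier sketch (MLlibTestSparkContext, the setter-based API) and with a made-up dataset:

```scala
test("rows with a null sequence are ignored") {
  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types.{ArrayType, IntegerType, StructType}

  // Three rows, one of which has a null sequence; it should contribute nothing.
  val schema = new StructType().add("sequence", ArrayType(ArrayType(IntegerType)))
  val rows = Seq(
    Row(Seq(Seq(1, 2), Seq(3))),
    Row(null),
    Row(Seq(Seq(1), Seq(3, 2)))
  )
  val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)

  val withNullRow = new PrefixSpan().setMinSupport(0.5)
    .findFrequentSequentialPatterns(df)
  val withoutNullRow = new PrefixSpan().setMinSupport(0.5)
    .findFrequentSequentialPatterns(df.na.drop(Seq("sequence")))

  // Dropping the null row by hand must not change the mined patterns.
  assert(withNullRow.collect().toSet === withoutNullRow.collect().toSet)
}
```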
nit: I'd prefer to call the column "frequency"
Test build #4158 has finished for PR 20973 at commit
dc7d779 to acbf9e4
Test build #89836 has finished for PR 20973 at commit
Test build #89837 has finished for PR 20973 at commit
jkbradley left a comment:
Thanks for the updates! They look good. I just noticed one small mistake I made in the original review.
 * @return A `DataFrame` that contains columns of sequence and corresponding frequency.
 *         The schema of it will be:
 *          - `sequence: Seq[Seq[T]]` (T is the item type)
 *          - `frequency: Long`
I had asked for this change from "freq" to "frequency", but I belatedly realized that this conflicts with the existing FPGrowth API, which uses "freq". It would be best to maintain consistency. Would you mind reverting to "freq"?
sure!
LGTM pending Jenkins tests
Test build #4162 has finished for PR 20973 at commit
Rerunning tests in case the R CRAN failure was from flakiness
Test build #4165 has finished for PR 20973 at commit
Jenkins, test this please.
Test build #90040 has finished for PR 20973 at commit
Jenkins, test this please.
Test build #90116 has finished for PR 20973 at commit
Merging with master
| @Since("2.4.0") | ||
| def findFrequentSequentialPatterns( | ||
| dataset: Dataset[_], | ||
| sequenceCol: String, |
@WeichenXu123 @jkbradley The static method doesn't scale with parameters: if we add a new param, we have to keep the old method around for binary compatibility. Why not use setters? I think we only need to avoid using the fit and transform names.
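A toy illustration of that scaling problem (hypothetical names): every new parameter forces a new overload of the static method, and the old overload must be kept for binary compatibility, whereas setters let the surface grow one method at a time.

```scala
object StaticStyle {
  // Original signature: must be kept forever once released.
  def findPatterns(data: Seq[String], minSupport: Double, maxPatternLength: Int): Unit = ()

  // Adding one knob means adding (and maintaining) a whole new overload.
  def findPatterns(data: Seq[String], minSupport: Double, maxPatternLength: Int,
      maxLocalProjDBSize: Long): Unit = ()
}

class SetterStyle {
  private var minSupport = 0.1
  def setMinSupport(value: Double): this.type = { minSupport = value; this }
  // A new knob is just one new setter; existing callers are untouched.
  def findPatterns(data: Seq[String]): Unit = ()
}
```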
I agree with using setters. @jkbradley What do you think?
I agree in general, but I don’t think it’s a big deal for PrefixSpan. I think of our current static method as a temporary workaround until we do the work to build a Model which can make meaningful predictions. This will mean that further PrefixSpan improvements may be blocked on this Model work, but I think that’s OK since predictions should be the next priority for PrefixSpan. Once we have a Model, I recommend we deprecate the current static method.
I'm also OK with changing this to use setters, but then we should name it something else so that we can replace it with an Estimator + Model pair later on. I'd suggest "PrefixSpanBuilder."
It should be easier to keep the PrefixSpan name and make it an Estimator later. For example:

```scala
final class PrefixSpan(override val uid: String) extends Params {
  // param, setters, getters
  def findFrequentSequentialPatterns(dataset: Dataset[_]): DataFrame
}
```

Later we can add Estimator.fit and PrefixSpanModel.transform. Any issue with this approach?
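For illustration, a call under that shape might look like the following; the setter names follow the usual spark.ml convention and are assumptions here, not quotes from the patch:

```scala
// `df` is a DataFrame with a `sequence: array<array<item>>` column.
val patterns = new PrefixSpan()
  .setMinSupport(0.5)
  .setMaxPatternLength(5)
  .setSequenceCol("sequence")
  .findFrequentSequentialPatterns(df)

patterns.show()  // columns: sequence (array of itemsets), freq (long)
```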
This way, wouldn't `final class PrefixSpan(override val uid: String) extends Params` break binary compatibility if we later change it into an Estimator?
Adding `extends Estimator` later should only introduce new methods to the class, with no breaking changes.
Oh, I think you're right, @mengxr. That approach sounds good.
@WeichenXu123 Do you have time to send a PR to update this API?
Sure. Will update soon!
What changes were proposed in this pull request?
Adds a PrefixSpan API to spark.ml. This is a new implementation that supersedes #20810.
How was this patch tested?
Test suite added.