[SPARK-7242][SQL][MLLIB] Frequent items for DataFrames #5799
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -33,7 +33,8 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) { | |
| * [[http://dx.doi.org/10.1145/762471.762473, proposed by Karp, Schenker, and Papadimitriou]]. | ||
| * | ||
| * @param cols the names of the columns to search frequent items in | ||
| * @param support The minimum frequency for an item to be considered `frequent` | ||
| * @param support The minimum frequency for an item to be considered `frequent` Should be greater | ||
| * than 1e-4. | ||
| * @return A Local DataFrame with the Array of frequent items for each column. | ||
| */ | ||
| def freqItems(cols: Seq[String], support: Double): DataFrame = { | ||
|
Contributor
don't forget to add java.util.List ones
Contributor
also make sure you add a test to the JavaDataFrameSuite
Contributor
mention |
||
|
|
@@ -44,12 +45,39 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) { | |
| * Finding frequent items for columns, possibly with false positives. Using the | ||
| * frequent element count algorithm described in | ||
| * [[http://dx.doi.org/10.1145/762471.762473, proposed by Karp, Schenker, and Papadimitriou]]. | ||
| * Returns items more frequent than 1/1000'th of the time. | ||
| * Returns items more frequent than 1 percent. | ||
| * | ||
| * @param cols the names of the columns to search frequent items in | ||
| * @return A Local DataFrame with the Array of frequent items for each column. | ||
| */ | ||
| def freqItems(cols: Seq[String]): DataFrame = { | ||
| FrequentItems.singlePassFreqItems(df, cols, 0.001) | ||
| FrequentItems.singlePassFreqItems(df, cols, 0.01) | ||
| } | ||
|
|
||
| /** | ||
| * Finding frequent items for columns, possibly with false positives. Using the | ||
| * frequent element count algorithm described in | ||
| * [[http://dx.doi.org/10.1145/762471.762473, proposed by Karp, Schenker, and Papadimitriou]]. | ||
| * | ||
| * @param cols the names of the columns to search frequent items in | ||
| * @param support The minimum frequency for an item to be considered `frequent` Should be greater | ||
| * than 1e-4. | ||
| * @return A Local DataFrame with the Array of frequent items for each column. | ||
| */ | ||
| def freqItems(cols: List[String], support: Double): DataFrame = { | ||
|
Contributor
I think we can just use Seq here, since Python has helper functions that can convert List into Seq. |
||
| FrequentItems.singlePassFreqItems(df, cols, support) | ||
| } | ||
|
|
||
| /** | ||
| * Finding frequent items for columns, possibly with false positives. Using the | ||
| * frequent element count algorithm described in | ||
| * [[http://dx.doi.org/10.1145/762471.762473, proposed by Karp, Schenker, and Papadimitriou]]. | ||
| * Returns items more frequent than 1 percent of the time. | ||
| * | ||
| * @param cols the names of the columns to search frequent items in | ||
| * @return A Local DataFrame with the Array of frequent items for each column. | ||
| */ | ||
| def freqItems(cols: List[String]): DataFrame = { | ||
| FrequentItems.singlePassFreqItems(df, cols, 0.01) | ||
| } | ||
| } | ||
make sure you document the range of support allowed.
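
For reference, here is a minimal standalone sketch of how the support range could be documented and enforced. It is not the code in this PR and does not use Spark's `FrequentItems`; the object `FreqItemsSketch`, the constants `MinSupport` and `MaxSupport`, and the in-memory `Iterable` input are all hypothetical. The lower bound of 1e-4 comes from the doc comment in the diff; the upper bound of 1.0 is an assumption (a relative frequency cannot exceed 1). The counting loop follows the general Karp, Schenker, and Papadimitriou frequent-elements scheme cited in the doc comment, simplified to a single column.

```scala
import scala.collection.mutable

// Hypothetical standalone helper, not Spark's FrequentItems.
object FreqItemsSketch {
  // Lower bound taken from the doc comment in the diff.
  val MinSupport = 1e-4
  // Assumed upper bound: a relative frequency cannot exceed 1.
  val MaxSupport = 1.0

  /**
   * Single pass over one column's values.
   *
   * @param items   the values of one column
   * @param support minimum relative frequency, must lie in [1e-4, 1.0] (assumed range)
   * @return every item occurring more than `support * n` times, possibly with false positives
   */
  def freqItems[T](items: Iterable[T], support: Double): Set[T] = {
    require(support >= MinSupport && support <= MaxSupport,
      s"support ($support) must be in [$MinSupport, $MaxSupport]")

    // Track at most ceil(1 / support) candidate items at a time.
    val capacity = math.ceil(1.0 / support).toInt
    val counts = mutable.Map.empty[T, Long]

    for (item <- items) {
      if (counts.contains(item)) {
        counts(item) += 1L
      } else if (counts.size < capacity) {
        counts(item) = 1L
      } else {
        // Map is full: drop the new item and decrement every tracked counter,
        // removing the counters that reach zero.
        for (key <- counts.keys.toList) {
          val c = counts(key) - 1L
          if (c == 0L) counts.remove(key) else counts(key) = c
        }
      }
    }
    counts.keySet.toSet
  }
}
```

Under these assumptions, calling `FreqItemsSketch.freqItems(values, 0.0)` fails fast with an `IllegalArgumentException` instead of attempting to size the map from `1 / 0.0`. The returned set may contain items below the threshold, which matches the "possibly with false positives" wording in the doc comment.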