diff --git a/docs/ml-frequent-pattern-mining.md b/docs/ml-frequent-pattern-mining.md
index 80c9580c2895..6e6ae410cb7d 100644
--- a/docs/ml-frequent-pattern-mining.md
+++ b/docs/ml-frequent-pattern-mining.md
@@ -46,6 +46,8 @@ PFP distributes the work of growing FP-trees based on the suffixes of transactio
 and hence is more scalable than a single-machine implementation.
 We refer users to the papers for more details.
 
+FP-growth operates on _itemsets_. An itemset is an unordered collection of unique items. Spark does not have a _set_ type, so itemsets are represented as arrays.
+
 `spark.ml`'s FP-growth implementation takes the following (hyper-)parameters:
 
 * `minSupport`: the minimum support for an itemset to be identified as frequent.
@@ -60,9 +62,15 @@ We refer users to the papers for more details.
 
 The `FPGrowthModel` provides:
 
-* `freqItemsets`: frequent itemsets in the format of DataFrame("items"[Array], "freq"[Long])
-* `associationRules`: association rules generated with confidence above `minConfidence`, in the format of
-  DataFrame("antecedent"[Array], "consequent"[Array], "confidence"[Double]).
+* `freqItemsets`: frequent itemsets in the format of a DataFrame with the following columns:
+  - `items: array`: A given itemset.
+  - `freq: long`: A count of how many times this itemset was seen, given the configured model parameters.
+* `associationRules`: association rules generated with confidence above `minConfidence`, in the format of a DataFrame with the following columns:
+  - `antecedent: array`: The itemset that is the hypothesis of the association rule.
+  - `consequent: array`: An itemset that always contains a single element representing the conclusion of the association rule.
+  - `confidence: double`: Refer to `minConfidence` above for a definition of `confidence`.
+  - `lift: double`: A measure of how well the antecedent predicts the consequent, calculated as `support(antecedent U consequent) / (support(antecedent) x support(consequent))`
+  - `support: double`: Refer to `minSupport` above for a definition of `support`.
 * `transform`: For each transaction in `itemsCol`, the `transform` method will compare its items against the antecedents of each association rule. If the record contains all the antecedents of a specific association rule, the rule will be considered as applicable and its consequents will be added to the prediction result. The transform