Skip to content
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
Show all changes
73 commits
Select commit Hold shift + click to select a range
e14b545
[SPARK-7977] [BUILD] Disallowing println
jonalter Jul 10, 2015
11e22b7
[SPARK-7944] [SPARK-8013] Remove most of the Spark REPL fork for Scal…
dragos Jul 10, 2015
5dd45bd
[SPARK-8958] Dynamic allocation: change cached timeout to infinity
Jul 10, 2015
db6d57f
[CORE] [MINOR] change the log level to info
chenghao-intel Jul 10, 2015
c185f3a
[SPARK-8675] Executors created by LocalBackend won't get the same cla…
coderplay Jul 10, 2015
05ac023
[HOTFIX] fix flaky test in PySpark SQL
Jul 10, 2015
0772026
[SPARK-8923] [DOCUMENTATION, MLLIB] Add @since tags to mllib.fpm
rahulpalamuttam Jul 10, 2015
fb8807c
[SPARK-7078] [SPARK-7079] Binary processing sort for Spark SQL
JoshRosen Jul 10, 2015
857e325
[SPARK-8990] [SQL] SPARK-8990 DataFrameReader.parquet() should respec…
liancheng Jul 10, 2015
b6fc0ad
add inline comment for python tests
davies Jul 11, 2015
3363088
[SPARK-8961] [SQL] Makes BaseWriterContainer.outputWriterForRow accep…
liancheng Jul 11, 2015
6e1c7e2
[SPARK-7735] [PYSPARK] Raise Exception on non-zero exit from pipe com…
megatron-me-uk Jul 11, 2015
9c50757
[SPARK-8598] [MLLIB] Implementation of 1-sample, two-sided, Kolmogoro…
Jul 11, 2015
7f6be1f
[SPARK-6487] [MLLIB] Add sequential pattern mining algorithm PrefixSp…
zhangjiajin Jul 11, 2015
0c5207c
[SPARK-8994] [ML] tiny cleanups to Params, Pipeline
jkbradley Jul 11, 2015
c472eb1
[SPARK-8970][SQL] remove unnecessary abstraction for ExtractValue
cloud-fan Jul 11, 2015
3009088
[SPARK-8880] Fix confusing Stage.attemptId member variable
kayousterhout Jul 13, 2015
20b4743
[SPARK-9006] [PYSPARK] fix microsecond loss in Python 3
Jul 13, 2015
92540d2
[SPARK-8203] [SPARK-8204] [SQL] conditional function: least/greatest
adrian-wang Jul 13, 2015
6b89943
[SPARK-8944][SQL] Support casting between IntervalType and StringType
cloud-fan Jul 13, 2015
a5bc803
[SPARK-8596] Add module for rstudio link to spark
koaning Jul 13, 2015
7f487c8
[SPARK-6797] [SPARKR] Add support for YARN cluster mode.
Jul 13, 2015
9b62e93
[SPARK-8706] [PYSPARK] [PROJECT INFRA] Add pylint checks to PySpark
MechCoder Jul 13, 2015
5ca26fb
[SPARK-8950] [WEBUI] Correct the calculation of SchedulerDelay in Sta…
carsonwang Jul 13, 2015
79c3582
Revert "[SPARK-8706] [PYSPARK] [PROJECT INFRA] Add pylint checks to P…
davies Jul 13, 2015
5c41691
[SPARK-8954] [BUILD] Remove unneeded deb repository from Dockerfile t…
yongtang Jul 13, 2015
714fc55
[SPARK-8991] [ML] Update SharedParamsCodeGen's Generated Documentation
Jul 13, 2015
4c797f2
[SPARK-8636] [SQL] Fix equalNullSafe comparison
Jul 13, 2015
0aed38e
[SPARK-8533] [STREAMING] Upgrade Flume to 1.6.0
harishreedharan Jul 13, 2015
b7bcbe2
[SPARK-8743] [STREAMING] Deregister Codahale metrics for streaming wh…
Jul 13, 2015
408b384
[SPARK-6910] [SQL] Support for pushing predicates down to metastore f…
Jul 14, 2015
20c1434
[SPARK-9001] Fixing errors in javadocs that lead to failed build/sbt doc
jegonzal Jul 14, 2015
c1feebd
[SPARK-9010] [DOCUMENTATION] Improve the Spark Configuration document…
stanzhai Jul 14, 2015
257236c
[SPARK-6851] [SQL] function least/greatest follow up
adrian-wang Jul 14, 2015
59d820a
[SPARK-9029] [SQL] shortcut CaseKeyWhen if key is null
cloud-fan Jul 14, 2015
37f2d96
[SPARK-9027] [SQL] Generalize metastore predicate pushdown
marmbrus Jul 14, 2015
c4e98ff
[SPARK-8933] [BUILD] Provide a --force flag to build/mvn that always …
Jul 14, 2015
8fb3a65
[SPARK-8911] Fix local mode endless heartbeats
Jul 14, 2015
d267c28
[SPARK-9031] Merge BlockObjectWriter and DiskBlockObject writer to re…
JoshRosen Jul 14, 2015
0a4071e
[SPARK-8718] [GRAPHX] Improve EdgePartition2D for non perfect square …
aray Jul 14, 2015
fb1d06f
[SPARK-4072] [CORE] Display Streaming blocks in Streaming UI
zsxwing Jul 14, 2015
4b5cfc9
[SPARK-8800] [SQL] Fix inaccurate precision/scale of Decimal division…
viirya Jul 14, 2015
740b034
[SPARK-4362] [MLLIB] Make prediction probability available in NaiveBa…
srowen Jul 14, 2015
11e5c37
[SPARK-8962] Add Scalastyle rule to ban direct use of Class.forName; …
JoshRosen Jul 14, 2015
e965a79
[SPARK-9045] Fix Scala 2.11 build break in UnsafeExternalRowSorter
JoshRosen Jul 15, 2015
cc57d70
[SPARK-9050] [SQL] Remove unused newOrdering argument from Exchange (…
JoshRosen Jul 15, 2015
f957796
[SPARK-8820] [STREAMING] Add a configuration to set checkpoint dir.
SaintBacchus Jul 15, 2015
bb870e7
[SPARK-5523] [CORE] [STREAMING] Add a cache for hostname in TaskMetri…
jerryshao Jul 15, 2015
5572fd0
[HOTFIX] Adding new names to known contributors
pwendell Jul 15, 2015
f650a00
[SPARK-8808] [SPARKR] Fix assignments in SparkR.
Jul 15, 2015
f23a721
[SPARK-8993][SQL] More comprehensive type checking in expressions.
rxin Jul 15, 2015
c6b1a9e
Revert SPARK-6910 and SPARK-9027
marmbrus Jul 15, 2015
4692769
[SPARK-6259] [MLLIB] Python API for LDA
yu-iskw Jul 15, 2015
3f6296f
[SPARK-8018] [MLLIB] KMeans should accept initial cluster centers as …
FlytxtRnD Jul 15, 2015
f0e1297
[SPARK-8279][SQL]Add math function round
yjshen Jul 15, 2015
1bb8acc
[SPARK-8997] [MLLIB] Performance improvements in LocalPrefixSpan
Jul 15, 2015
14935d8
[HOTFIX][SQL] Unit test breaking.
rxin Jul 15, 2015
adb33d3
[SPARK-9012] [WEBUI] Escape Accumulators in the task table
zsxwing Jul 15, 2015
20bb10f
[SPARK-8706] [PYSPARK] [PROJECT INFRA] Add pylint checks to PySpark
MechCoder Jul 15, 2015
6f69025
[SPARK-8840] [SPARKR] Add float coercion on SparkR
viirya Jul 15, 2015
fa4ec36
[SPARK-9020][SQL] Support mutable state in code gen expressions
cloud-fan Jul 15, 2015
a938527
[SPARK-8221][SQL]Add pmod function
zhichao-li Jul 15, 2015
9716a72
[Minor][SQL] Allow spaces in the beginning and ending of string for I…
viirya Jul 15, 2015
303c120
[SPARK-7555] [DOCS] Add doc for elastic net in ml-guide and mllib-guide
coderxiang Jul 15, 2015
ec9b621
SPARK-9070 JavaDataFrameSuite teardown NPEs if setup failed
steveloughran Jul 15, 2015
536533c
[SPARK-9005] [MLLIB] Fix RegressionMetrics computation of explainedVa…
Jul 15, 2015
b9a922e
[SPARK-6602][Core]Replace Akka Serialization with Spark Serializer
zsxwing Jul 15, 2015
674eb2a
[SPARK-8974] Catch exceptions in allocation schedule task.
Jul 15, 2015
affbe32
[SPARK-9071][SQL] MonotonicallyIncreasingID and SparkPartitionID shou…
rxin Jul 15, 2015
5599cc4
Predicate pushdown to hive metastore
Jul 15, 2015
b3cb5af
Synchronize getPartitionsByFilter
Jul 17, 2015
acf96d1
Synchronize on hive client
Jul 17, 2015
f897087
Synchronize on this
Jul 17, 2015
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
[SPARK-6487] [MLLIB] Add sequential pattern mining algorithm PrefixSp…
…an to Spark MLlib

Add parallel PrefixSpan algorithm and test file.
Support non-temporal sequences.

Author: zhangjiajin <[email protected]>
Author: zhang jiajin <[email protected]>

Closes apache#7258 from zhangjiajin/master and squashes the following commits:

ca9c4c8 [zhangjiajin] Modified the code according to the review comments.
574e56c [zhangjiajin] Add new object LocalPrefixSpan, and do some optimization.
ba5df34 [zhangjiajin] Fix a Scala style error.
4c60fb3 [zhangjiajin] Fix some Scala style errors.
1dd33ad [zhangjiajin] Modified the code according to the review comments.
89bc368 [zhangjiajin] Fixed a Scala style error.
a2eb14c [zhang jiajin] Delete PrefixspanSuite.scala
951fd42 [zhang jiajin] Delete Prefixspan.scala
575995f [zhangjiajin] Modified the code according to the review comments.
91fd7e6 [zhangjiajin] Add new algorithm PrefixSpan and test file.
  • Loading branch information
zhangjiajin authored and mengxr committed Jul 11, 2015
commit 7f6be1f24d4f2fcb3d3ec181b5abf241709a8b6d
113 changes: 113 additions & 0 deletions mllib/src/main/scala/org/apache/spark/mllib/fpm/LocalPrefixSpan.scala
Original file line number Diff line number Diff line change
@@ -0,0 +1,113 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.spark.mllib.fpm

import org.apache.spark.Logging
import org.apache.spark.annotation.Experimental

/**
*
* :: Experimental ::
*
* Calculate all patterns of a projected database in local.
*/
@Experimental
private[fpm] object LocalPrefixSpan extends Logging with Serializable {

/**
* Calculate all patterns of a projected database.
* @param minCount minimum count
* @param maxPatternLength maximum pattern length
* @param prefix prefix
* @param projectedDatabase the projected dabase
* @return a set of sequential pattern pairs,
* the key of pair is sequential pattern (a list of items),
* the value of pair is the pattern's count.
*/
def run(
minCount: Long,
maxPatternLength: Int,
prefix: Array[Int],
projectedDatabase: Array[Array[Int]]): Array[(Array[Int], Long)] = {
val frequentPrefixAndCounts = getFreqItemAndCounts(minCount, projectedDatabase)
val frequentPatternAndCounts = frequentPrefixAndCounts
.map(x => (prefix ++ Array(x._1), x._2))
val prefixProjectedDatabases = getPatternAndProjectedDatabase(
prefix, frequentPrefixAndCounts.map(_._1), projectedDatabase)

val continueProcess = prefixProjectedDatabases.nonEmpty && prefix.length + 1 < maxPatternLength
if (continueProcess) {
val nextPatterns = prefixProjectedDatabases
.map(x => run(minCount, maxPatternLength, x._1, x._2))
.reduce(_ ++ _)
frequentPatternAndCounts ++ nextPatterns
} else {
frequentPatternAndCounts
}
}

/**
* calculate suffix sequence following a prefix in a sequence
* @param prefix prefix
* @param sequence sequence
* @return suffix sequence
*/
def getSuffix(prefix: Int, sequence: Array[Int]): Array[Int] = {
val index = sequence.indexOf(prefix)
if (index == -1) {
Array()
} else {
sequence.drop(index + 1)
}
}

/**
* Generates frequent items by filtering the input data using minimal count level.
* @param minCount the absolute minimum count
* @param sequences sequences data
* @return array of item and count pair
*/
private def getFreqItemAndCounts(
minCount: Long,
sequences: Array[Array[Int]]): Array[(Int, Long)] = {
sequences.flatMap(_.distinct)
.groupBy(x => x)
.mapValues(_.length.toLong)
.filter(_._2 >= minCount)
.toArray
}

/**
* Get the frequent prefixes' projected database.
* @param prePrefix the frequent prefixes' prefix
* @param frequentPrefixes frequent prefixes
* @param sequences sequences data
* @return prefixes and projected database
*/
private def getPatternAndProjectedDatabase(
prePrefix: Array[Int],
frequentPrefixes: Array[Int],
sequences: Array[Array[Int]]): Array[(Array[Int], Array[Array[Int]])] = {
val filteredProjectedDatabase = sequences
.map(x => x.filter(frequentPrefixes.contains(_)))
frequentPrefixes.map { x =>
val sub = filteredProjectedDatabase.map(y => getSuffix(x, y)).filter(_.nonEmpty)
(prePrefix ++ Array(x), sub)
}.filter(x => x._2.nonEmpty)
}
}
157 changes: 157 additions & 0 deletions mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala
Original file line number Diff line number Diff line change
@@ -0,0 +1,157 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.spark.mllib.fpm

import org.apache.spark.Logging
import org.apache.spark.annotation.Experimental
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

/**
*
* :: Experimental ::
*
* A parallel PrefixSpan algorithm to mine sequential pattern.
* The PrefixSpan algorithm is described in
* [[http://doi.org/10.1109/ICDE.2001.914830]].
*
* @param minSupport the minimal support level of the sequential pattern, any pattern appears
* more than (minSupport * size-of-the-dataset) times will be output
* @param maxPatternLength the maximal length of the sequential pattern, any pattern appears
* less than maxPatternLength will be output
*
* @see [[https://en.wikipedia.org/wiki/Sequential_Pattern_Mining Sequential Pattern Mining
* (Wikipedia)]]
*/
@Experimental
class PrefixSpan private (
private var minSupport: Double,
private var maxPatternLength: Int) extends Logging with Serializable {

/**
* Constructs a default instance with default parameters
* {minSupport: `0.1`, maxPatternLength: `10`}.
*/
def this() = this(0.1, 10)

/**
* Sets the minimal support level (default: `0.1`).
*/
def setMinSupport(minSupport: Double): this.type = {
require(minSupport >= 0 && minSupport <= 1,
"The minimum support value must be between 0 and 1, including 0 and 1.")
this.minSupport = minSupport
this
}

/**
* Sets maximal pattern length (default: `10`).
*/
def setMaxPatternLength(maxPatternLength: Int): this.type = {
require(maxPatternLength >= 1,
"The maximum pattern length value must be greater than 0.")
this.maxPatternLength = maxPatternLength
this
}

/**
* Find the complete set of sequential patterns in the input sequences.
* @param sequences input data set, contains a set of sequences,
* a sequence is an ordered list of elements.
* @return a set of sequential pattern pairs,
* the key of pair is pattern (a list of elements),
* the value of pair is the pattern's count.
*/
def run(sequences: RDD[Array[Int]]): RDD[(Array[Int], Long)] = {
if (sequences.getStorageLevel == StorageLevel.NONE) {
logWarning("Input data is not cached.")
}
val minCount = getMinCount(sequences)
val lengthOnePatternsAndCounts =
getFreqItemAndCounts(minCount, sequences).collect()
val prefixAndProjectedDatabase = getPrefixAndProjectedDatabase(
lengthOnePatternsAndCounts.map(_._1), sequences)
val groupedProjectedDatabase = prefixAndProjectedDatabase
.map(x => (x._1.toSeq, x._2))
.groupByKey()
.map(x => (x._1.toArray, x._2.toArray))
val nextPatterns = getPatternsInLocal(minCount, groupedProjectedDatabase)
val lengthOnePatternsAndCountsRdd =
sequences.sparkContext.parallelize(
lengthOnePatternsAndCounts.map(x => (Array(x._1), x._2)))
val allPatterns = lengthOnePatternsAndCountsRdd ++ nextPatterns
allPatterns
}

/**
* Get the minimum count (sequences count * minSupport).
* @param sequences input data set, contains a set of sequences,
* @return minimum count,
*/
private def getMinCount(sequences: RDD[Array[Int]]): Long = {
if (minSupport == 0) 0L else math.ceil(sequences.count() * minSupport).toLong
}

/**
* Generates frequent items by filtering the input data using minimal count level.
* @param minCount the absolute minimum count
* @param sequences original sequences data
* @return array of item and count pair
*/
private def getFreqItemAndCounts(
minCount: Long,
sequences: RDD[Array[Int]]): RDD[(Int, Long)] = {
sequences.flatMap(_.distinct.map((_, 1L)))
.reduceByKey(_ + _)
.filter(_._2 >= minCount)
}

/**
* Get the frequent prefixes' projected database.
* @param frequentPrefixes frequent prefixes
* @param sequences sequences data
* @return prefixes and projected database
*/
private def getPrefixAndProjectedDatabase(
frequentPrefixes: Array[Int],
sequences: RDD[Array[Int]]): RDD[(Array[Int], Array[Int])] = {
val filteredSequences = sequences.map { p =>
p.filter (frequentPrefixes.contains(_) )
}
filteredSequences.flatMap { x =>
frequentPrefixes.map { y =>
val sub = LocalPrefixSpan.getSuffix(y, x)
(Array(y), sub)
}.filter(_._2.nonEmpty)
}
}

/**
* calculate the patterns in local.
* @param minCount the absolute minimum count
* @param data patterns and projected sequences data data
* @return patterns
*/
private def getPatternsInLocal(
minCount: Long,
data: RDD[(Array[Int], Array[Array[Int]])]): RDD[(Array[Int], Long)] = {
data.flatMap { x =>
LocalPrefixSpan.run(minCount, maxPatternLength, x._1, x._2)
}
}
}
120 changes: 120 additions & 0 deletions mllib/src/test/scala/org/apache/spark/mllib/fpm/PrefixSpanSuite.scala
Original file line number Diff line number Diff line change
@@ -0,0 +1,120 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.spark.mllib.fpm

import org.apache.spark.SparkFunSuite
import org.apache.spark.mllib.util.MLlibTestSparkContext
import org.apache.spark.rdd.RDD

class PrefixspanSuite extends SparkFunSuite with MLlibTestSparkContext {

test("PrefixSpan using Integer type") {

/*
library("arulesSequences")
prefixSpanSeqs = read_baskets("prefixSpanSeqs", info = c("sequenceID","eventID","SIZE"))
freqItemSeq = cspade(
prefixSpanSeqs,
parameter = list(support =
2 / length(unique(transactionInfo(prefixSpanSeqs)$sequenceID)), maxlen = 2 ))
resSeq = as(freqItemSeq, "data.frame")
resSeq
*/

val sequences = Array(
Array(1, 3, 4, 5),
Array(2, 3, 1),
Array(2, 4, 1),
Array(3, 1, 3, 4, 5),
Array(3, 4, 4, 3),
Array(6, 5, 3))

val rdd = sc.parallelize(sequences, 2).cache()

def compareResult(
expectedValue: Array[(Array[Int], Long)],
actualValue: Array[(Array[Int], Long)]): Boolean = {
val sortedExpectedValue = expectedValue.sortWith{ (x, y) =>
x._1.mkString(",") + ":" + x._2 < y._1.mkString(",") + ":" + y._2
}
val sortedActualValue = actualValue.sortWith{ (x, y) =>
x._1.mkString(",") + ":" + x._2 < y._1.mkString(",") + ":" + y._2
}
sortedExpectedValue.zip(sortedActualValue)
.map(x => x._1._1.mkString(",") == x._2._1.mkString(",") && x._1._2 == x._2._2)
.reduce(_&&_)
}

val prefixspan = new PrefixSpan()
.setMinSupport(0.33)
.setMaxPatternLength(50)
val result1 = prefixspan.run(rdd)
val expectedValue1 = Array(
(Array(1), 4L),
(Array(1, 3), 2L),
(Array(1, 3, 4), 2L),
(Array(1, 3, 4, 5), 2L),
(Array(1, 3, 5), 2L),
(Array(1, 4), 2L),
(Array(1, 4, 5), 2L),
(Array(1, 5), 2L),
(Array(2), 2L),
(Array(2, 1), 2L),
(Array(3), 5L),
(Array(3, 1), 2L),
(Array(3, 3), 2L),
(Array(3, 4), 3L),
(Array(3, 4, 5), 2L),
(Array(3, 5), 2L),
(Array(4), 4L),
(Array(4, 5), 2L),
(Array(5), 3L)
)
assert(compareResult(expectedValue1, result1.collect()))

prefixspan.setMinSupport(0.5).setMaxPatternLength(50)
val result2 = prefixspan.run(rdd)
val expectedValue2 = Array(
(Array(1), 4L),
(Array(3), 5L),
(Array(3, 4), 3L),
(Array(4), 4L),
(Array(5), 3L)
)
assert(compareResult(expectedValue2, result2.collect()))

prefixspan.setMinSupport(0.33).setMaxPatternLength(2)
val result3 = prefixspan.run(rdd)
val expectedValue3 = Array(
(Array(1), 4L),
(Array(1, 3), 2L),
(Array(1, 4), 2L),
(Array(1, 5), 2L),
(Array(2, 1), 2L),
(Array(2), 2L),
(Array(3), 5L),
(Array(3, 1), 2L),
(Array(3, 3), 2L),
(Array(3, 4), 3L),
(Array(3, 5), 2L),
(Array(4), 4L),
(Array(4, 5), 2L),
(Array(5), 3L)
)
assert(compareResult(expectedValue3, result3.collect()))
}
}