[MLlib] [SPARK-2510] Word2Vec: Distributed Representation of Words #1719
Changes from 1 commit (8d6befe)
```diff
@@ -50,24 +50,27 @@ private case class VocabWord(
  * natural language processing and machine learning algorithms.
  *
  * We used skip-gram model in our implementation and hierarchical softmax
- * method to train the model.
+ * method to train the model. The variable names in the implementation
+ * match the original C implementation.
  *
  * For original C implementation, see https://code.google.com/p/word2vec/
  * For research papers, see
  * Efficient Estimation of Word Representations in Vector Space
  * and
- * Distributed Representations of Words and Phrases and their Compositionality
+ * Distributed Representations of Words and Phrases and their Compositionality.
  * @param size vector dimension
  * @param startingAlpha initial learning rate
  * @param window context words from [-window, window]
  * @param minCount minimum frequency to consider a vocabulary word
+ * @param parallelism number of partitions to run Word2Vec
  */
 @Experimental
 class Word2Vec(
     val size: Int,
     val startingAlpha: Double,
```
**Contributor:** Is word2vec sensitive to alpha? If not, we should try to expose fewer parameters to users.

**Contributor (author):** word2vec is sensitive to alpha. A larger alpha may generate meaningless results.

**Contributor:** Maybe we can suggest a reasonable default value in the doc.
```diff
     val window: Int,
-    val minCount: Int)
+    val minCount: Int,
+    val parallelism: Int = 1)
   extends Serializable with Logging {
```
**Contributor:** Please leave a note that the variable/method names are to match the original C implementation. Then people understand why, e.g., we map …
```diff
   private val EXP_TABLE_SIZE = 1000
```
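The `EXP_TABLE_SIZE` constant mirrors the original C implementation, which precomputes a sigmoid lookup table so the training inner loop never calls `exp`. A minimal sketch of that idea (the `MAX_EXP = 6` bound and the clamped lookup follow the C code; this is illustrative, not the PR's exact code):

```scala
// Precomputed sigmoid table, as in the original C word2vec.
// MAX_EXP = 6 and EXP_TABLE_SIZE = 1000 follow the C implementation.
val EXP_TABLE_SIZE = 1000
val MAX_EXP = 6

val expTable: Array[Double] = Array.tabulate(EXP_TABLE_SIZE) { i =>
  // x sweeps the range [-MAX_EXP, MAX_EXP)
  val x = (2.0 * i / EXP_TABLE_SIZE - 1.0) * MAX_EXP
  val e = math.exp(x)
  e / (e + 1.0) // sigmoid(x) = e^x / (e^x + 1)
}

// Lookup used during training: clamp the dot product f to the table range.
def sigmoid(f: Double): Double =
  if (f >= MAX_EXP) 1.0
  else if (f <= -MAX_EXP) 0.0
  else expTable(((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2)).toInt)
```

The table trades a tiny amount of precision (about 0.006 per bucket near zero) for removing a transcendental call from the hottest loop of training.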
```diff
@@ -237,7 +240,7 @@ class Word2Vec(
       }
     }

-    val newSentences = sentences.repartition(1).cache()
+    val newSentences = sentences.repartition(parallelism).cache()
     val temp = Array.fill[Double](vocabSize * layer1Size)((Random.nextDouble - 0.5) / layer1Size)
```
**Contributor:** Try to fix the seed to make the computation reproducible.
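The suggestion above could be addressed by threading a seed through the weight initialization. A sketch under that assumption — the `seed` parameter and `initWeights` helper are hypothetical, not part of this PR, which fills `temp` from the unseeded `Random` singleton:

```scala
import scala.util.Random

// Hypothetical: seed the RNG so two runs produce identical initial weights.
// Mirrors the PR's init: uniform in (-0.5, 0.5) scaled by 1 / layer1Size.
def initWeights(vocabSize: Int, layer1Size: Int, seed: Long): Array[Double] = {
  val rng = new Random(seed)
  Array.fill(vocabSize * layer1Size)((rng.nextDouble() - 0.5) / layer1Size)
}

// Same seed => byte-identical initialization, so training is reproducible
// up to the nondeterminism of partition merge order.
val a = initWeights(10, 5, seed = 42L)
val b = initWeights(10, 5, seed = 42L)
```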
```diff
     val (aggSyn0, _, _, _) =
       // TODO: broadcast temp instead of serializing it directly
```
```diff
@@ -248,7 +251,7 @@ class Word2Vec(
         var wc = wordCount
         if (wordCount - lastWordCount > 10000) {
           lwc = wordCount
-          alpha = startingAlpha * (1 - wordCount.toDouble / (trainWordsCount + 1))
+          alpha = startingAlpha * (1 - parallelism * wordCount.toDouble / (trainWordsCount + 1))
           if (alpha < startingAlpha * 0.0001) alpha = startingAlpha * 0.0001
           logInfo("wordCount = " + wordCount + ", alpha = " + alpha)
         }
```
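The changed line scales the decay by `parallelism` because each partition only processes roughly `trainWordsCount / parallelism` words; without that factor, a partition's local `wordCount` could never drive `alpha` down its full schedule. The schedule in isolation (variable names follow the diff; `nextAlpha` is a hypothetical wrapper for illustration):

```scala
// Linear learning-rate decay with a floor, as in the diff above.
// Each partition sees ~trainWordsCount / parallelism words, so the
// per-partition wordCount is scaled back up by parallelism.
def nextAlpha(startingAlpha: Double,
              wordCount: Long,
              trainWordsCount: Long,
              parallelism: Int): Double = {
  var alpha =
    startingAlpha * (1 - parallelism * wordCount.toDouble / (trainWordsCount + 1))
  if (alpha < startingAlpha * 0.0001) alpha = startingAlpha * 0.0001
  alpha
}
```

With the C implementation's customary `startingAlpha = 0.025`, alpha decays linearly from 0.025 toward a floor of 0.025 × 0.0001 as the partition works through its share of the corpus.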
```diff
@@ -296,7 +299,7 @@ class Word2Vec(
         val n = syn0_1.length
         blas.daxpy(n, 1.0, syn0_2, 1, syn0_1, 1)
```
**Contributor:** Should use weighted sum to handle imbalanced partitions.
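The concern here: `daxpy` with coefficient 1.0 gives every partition's partial vectors equal influence regardless of how many words that partition actually processed. A sketch of the weighted average the reviewer suggests — this is a hypothetical helper, not what the PR does:

```scala
// Hypothetical weighted merge of two partial weight vectors, weighting
// each by the number of words its partition processed (wc1, wc2).
// With equal counts this reduces to a plain average.
def weightedMerge(syn1: Array[Double], wc1: Long,
                  syn2: Array[Double], wc2: Long): Array[Double] = {
  require(syn1.length == syn2.length, "partial vectors must match in length")
  val total = (wc1 + wc2).toDouble
  val w1 = wc1 / total
  val w2 = wc2 / total
  Array.tabulate(syn1.length)(i => w1 * syn1(i) + w2 * syn2(i))
}
```

A partition that saw three times as many words then contributes three quarters of the merged value, rather than half.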
```diff
         blas.daxpy(n, 1.0, syn1_2, 1, syn1_1, 1)
-        (syn0_1, syn0_2, lwc_1 + lwc_2, wc_1 + wc_2)
+        (syn0_1, syn1_1, lwc_1 + lwc_2, wc_1 + wc_2)
       })

     val wordMap = new Array[(String, Array[Double])](vocabSize)
```
```diff
@@ -309,7 +312,7 @@ class Word2Vec(
       i += 1
     }
     val modelRDD = sc.parallelize(wordMap, modelPartitionNum)
-      .partitionBy(new HashPartitioner(modelPartitionNum))
+      .partitionBy(new HashPartitioner(modelPartitionNum)).cache()
```
```diff
     new Word2VecModel(modelRDD)
```
**Contributor:** please call …
```diff
   }
 }
```
**Contributor:** We need more docs here, for example, link to the C implementation and the original papers for word2vec.

**Contributor:** And briefly explain what it does.

**Contributor:** Btw, this is definitely an experimental feature. Please add an `@Experimental` tag. Example: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala#L45