
Conversation

@ygcao

@ygcao ygcao commented Dec 5, 2015

Add support for arbitrary-length sentences by using the natural representation of sentences in the input.

Add new similarity functions and a normalization option for distances in synonym finding.
Add new accessors for internal structure (the vocabulary and word index) for convenience.

I need instructions on how to set the value of the @Since annotation for the newly added public functions. 1.5.3?

jira link: https://issues.apache.org/jira/browse/SPARK-12153

@srowen
Member

srowen commented Dec 5, 2015

@ygcao back up a moment and read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark first. You need to update this PR and connect it to the JIRA. You have some code style problems here, and I'm not clear that some of the ancillary changes make sense. This class is not intended to provide Euclidean distance or a cosine measure.

Member

You can't change the API. This also isn't documented, or motivated.

Author

I will add Javadoc to explain it.
For backward compatibility, I can add a default value in this function instead of the newly introduced one.

@ygcao ygcao changed the title add support of arbitrary length sentence by using the nature representation of sentences in the input and other minor tunings [SPARK-12153][MLLib]add support of arbitrary length sentence and other tunings Dec 5, 2015
@ygcao ygcao changed the title [SPARK-12153][MLLib]add support of arbitrary length sentence and other tunings [SPARK-12153][MLlib]add support of arbitrary length sentence and other tunings Dec 5, 2015
@ygcao
Author

ygcao commented Dec 5, 2015

Hi Sean, thanks for the comment. Sorry for not noticing the title requirement for pull requests; I've fixed the title. I did check the guidelines, but didn't have time to check every line of them.
As to the style, I'm pretty much following the existing code. Could you be more specific about what you found that violates the style guide, other than the @Since annotation, on which I need guidance?
As to the distance functions, agreed that separation is preferred. Since cosineSimilarity is already in place in the class, I added more alongside it just to minimize the footprint of my changes. If you can point out an existing class where these functions could live, that would be great. I don't mind adding a new class just for distance functions, though that means bigger changes. And I don't mind pulling out my new distance functions either, since those changes are cosmetic. The essential change is to discard the fixed-size sentence splitting, which does not make sense, and to give the caller flexibility in sentence splitting.

Updated the code with more Javadoc to make the motivation and usage clearer. Also made the change strictly backward compatible for existing public functions.
Adding @mengxr, @rxin and @MechCoder, following the contribution guide, for more eyes on this.
Thanks!

@ygcao ygcao changed the title [SPARK-12153][MLlib]add support of arbitrary length sentence and other tunings [SPARK-12153][MLlib]add support of arbitrary length sentence and other tuning for Word2Vec Dec 6, 2015
Member

Run dev/lint-scala to see errors. For instance, this is missing spaces. You don't need the return.

Contributor

@ygcao what is the specific need for getVocabulary?

Author

getVocabulary and getWordVectors are useful when you need to join or iterate over the built vectors in batch, while getVectors is only useful for looking up the vector of one specific known word in the vocabulary, and it throws an exception when the word is out of vocabulary. The usages are very different.
I will look into the DataFrame version to see whether it can cover the batch use case, to decide whether we can work around adding a getter here.

Member

Regardless, as others have said, these extra methods should come in a separate PR.

@MLnick
Contributor

MLnick commented Dec 7, 2015

@ygcao can we remove the code changes related to the cosineSimilarity and findSynonyms? This will make it cleaner so we can focus on the core issue (see my next comment).

Also, I think that things like distance metrics should live in a centralized place, and so changes to these are outside the scope of this particular model and the PR. In fact, the original cosineSimilarity method (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L472) is no longer used, as it's been replaced by a more efficient version, so we should consider deprecating it (or indeed just removing it as it's private and always has been).

@MLnick
Contributor

MLnick commented Dec 7, 2015

Pinging @mengxr @MechCoder @jkbradley (and I think @Ishiihara was the original author of Word2Vec?)

So let's focus this PR in on making the max sentence size configurable, if this is desirable?

Looking a bit deeper, the sentence structure of the input is essentially discarded in https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L273. This dates back to the original implementation, and it does match the original Google implementation that treats end-of-line as a word boundary, then imposes a MAX_SENTENCE_LENGTH of 1000 when processing the word stream.

It's interesting to note that e.g. Gensim's implementation respects the sentence structure of the input data (https://github.com/piskvorky/gensim/blob/develop/gensim/models/word2vec.py#L120). Deeplearning4j seems to do the same.

It does seem a little strange to me thinking about it to discard sentence boundaries. It does make sense for very large text corpora. But Word2Vec is more general than that, and can be applied e.g. in recommendation settings, where the boundary between "sentences" (say, user activity histories) is more patently "discontinuous".

Thoughts? On the face of it we can leave the implementation as is (as it is true to the original), optionally making the max sentence length a configurable param. Or we can look at using the "sentence" structure of the input data (perhaps making the behaviour configurable between this and the original impl).

@jkbradley
Member

@MLnick

So let's focus this PR in on making the max sentence size configurable, if this is desirable?

+1

It does seem a little strange to me thinking about it to discard sentence boundaries.

How are they being discarded? Each row in the data given to MLlib is treated as a separate sentence; i.e., we aren't trying to model similarity across sentences (as far as I can tell glancing at it). If people want to work with a unit other than sentences, then each unit can be on a separate row. Am I misunderstanding?

@MLnick
Contributor

MLnick commented Dec 8, 2015

@jkbradley as far as I can see:

The input to fit is dataset: RDD[Iterable[String]]. However, in L273, we create val words = dataset.flatMap(x => x), flattening out the RDD of sentences to an RDD of words, i.e. RDD[Iterable[String]] -> RDD[String].

Then in L285, words is used in mapPartitions (as opposed to dataset). In the mapPartitions block, the behaviour of next is to advance through the iterator (which is now the flatMapped stream of words, not a stream of sentences), and have a hard cut off to create each sentence Array[Int] after MAX_SENTENCE_LENGTH is reached.

So the way I read the current code, it indeed does just treat the input as a stream of words, discards sentence boundaries, and uses a hard 1000 word limit on "sentences". Let me know if I missed something here.

This is in fact matching what the Google impl does (from my quick look through the C code, e.g. L373-405 and L70 in ReadWord).

So purely technically the current code is "correct" as it matches the original, but I'm not sure if it was intentional or not to use words in the mapPartitions block. But to me it's an open question whether this approach, or keeping sentence structure / boundaries, is better.
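The word-stream behaviour described above can be sketched with plain Scala collections standing in for the RDD (the names, data, and the tiny chunk size here are illustrative, not the actual Spark source): flattening discards the sentence boundaries, and the flat stream is then cut into fixed-size chunks.

```scala
object HardCutSketch {
  def main(args: Array[String]): Unit = {
    // Stand-in for MAX_SENTENCE_LENGTH (the real code uses 1000)
    val maxSentenceLength = 4

    val dataset = Seq(Seq("the", "cat", "sat"), Seq("dogs", "bark"))

    // dataset.flatMap(x => x): the sentence boundaries disappear
    val words = dataset.flatten

    // the flat word stream is chunked with a hard cutoff
    val chunks = words.grouped(maxSentenceLength).map(_.toArray).toList

    // "dogs" now sits in the same chunk as the tail of the first sentence
    chunks.foreach(c => println(c.mkString(" ")))
  }
}
```

Running this prints "the cat sat dogs" and then "bark": the first word of the second sentence has been glued onto the first sentence, which is exactly the boundary loss under discussion.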

@jkbradley
Member

@MLnick Oops, you're right; we are completely ignoring sentence boundaries. I didn't look carefully enough at the code. I'll correct my comment above too.

it's an open question whether this approach, or keeping sentence structure / boundaries, is better

Same here; I haven't looked at the follow-up literature sufficiently to say. If anyone watching has references, that'd be useful.

@jkbradley
Member

Overall, I'd say it's unclear whether we need to modify our implementation. How about we look for use cases and see if people have reported differences between following and ignoring sentence boundaries, before continuing with lots of code changes?

@ygcao
Author

ygcao commented Dec 10, 2015

I have to say Word2Vec, or skip-gram, is absolutely affected by the training data, just like any other ML algorithm. I can also tell you that I observed big differences when I applied different massaging of the data at the scale of millions to billions of sentences.
Researchers often try to simplify engineering details. One obvious example is that phrase2vec is highly simplified to show the algorithm's effectiveness rather than relying on a complex entity-recognition engine; that doesn't mean we should not do more advanced phrase construction for training.
The benefit of making use of sentence boundaries is quite intuitive when you think in terms of the expected output of skip-gram and how backpropagation works. The beginning words of the next sentence are garbage input when used as context words for the tail word of a sentence at the training stage. From the other side, in the original version, the words around the fixed-size cutting points of the document lose semantically meaningful context words. Words at sentence endings or around cutting points are a minority, so you may not notice a huge impact in the hard-cut version, but that's not a reason not to improve it further. Again, model building is absolutely sensitive to the input data; we humans just aren't sensitive to issues caused by a minority of cases without a deep dive, but the good thing is that we still have theory to think with. What harm could come from respecting sentence boundaries?

@ygcao
Author

ygcao commented Dec 10, 2015

To see is to believe; comparison is the key. You are encouraged to use my version (with a simple sentence splitter on periods and question marks) and the old version to build models from the same text/data set, and then compare them to see the differences. By the way, if your data is not text: any sequence data has a natural boundary, just like sentences. For example, a user session's natural boundary is the time span of continuous operations.

@MLnick
Contributor

MLnick commented Dec 10, 2015

Ok, so after some more digging into this and the original Google code and mailing list, actually I was incorrect in my original read of the Google impl - sorry for the confusion. It does in fact treat newlines (\n) as a sentence boundary, and additionally imposes the limit of 1000 for MAX_SENTENCE_LENGTH.

This happens in L81-84, where \n is allocated a special sentence delimiter of </s>, which is later kept at position 0 in the vocab (see comment L148), and in turn results in breaking out of the sentence construction loop in L393.

See also these two Google group posts which make it more clear that sentences are newline-delimited - https://groups.google.com/forum/#!searchin/word2vec-toolkit/line$20braks/word2vec-toolkit/2elkT3cOMqo/DL_CsF1p8H8J and https://groups.google.com/forum/#!topic/word2vec-toolkit/3LAooMdrCl0

Given this, in fact our current implementation is not correct and using words instead of dataset in the mapPartitions block is a bug.

added configurable maxSentenceLength constraint
added more documentation for the getter functions
@ygcao
Author

ygcao commented Dec 12, 2015

Updated the code:

  1. Passed the lint-scala check by adding spaces and cutting comment line lengths.
  2. Removed the distance functions as we discussed, including the existing zombie one (not used and not accessible from outside).
  3. Added back maxSentenceLength as a configurable variable while still respecting sentence boundaries by default, to meet the demands of people who construct sentences in unimaginable ways. I'll personally always set it large enough (up to document size) to never affect natural sentence boundaries. Anyway, it's good to have an option for different people's needs.
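The boundary-respecting behaviour in point 3 can be sketched like this (plain Scala collections in place of the RDD; names and data are illustrative): grouping happens within each sentence, so no chunk ever straddles a sentence boundary, and only an over-long sentence is split.

```scala
object BoundarySketch {
  def main(args: Array[String]): Unit = {
    val maxSentenceLength = 3

    val dataset = Seq(
      Seq("the", "cat", "sat", "on", "the", "mat"), // longer than the limit
      Seq("dogs", "bark")                           // shorter than the limit
    )

    // split within each sentence: an over-long sentence becomes chunks
    // of at most maxSentenceLength words; a short one stays whole
    val sentences = dataset.flatMap(_.grouped(maxSentenceLength).map(_.toArray))

    sentences.foreach(s => println(s.mkString(" ")))
  }
}
```

This prints "the cat sat", "on the mat", and "dogs bark": the six-word sentence is cut at the configurable limit, while the two-word sentence is untouched.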

Contributor

Please match the style of this comment to the other parameter setters (e.g. https://github.com/apache/spark/pull/10152/files#diff-88f4b62c382b26ef8e856b23f5167ccdL129), including the default parameter setting.

Contributor

@ygcao could you update the comment style here to match the parameter setters in this class?

fixed a logic issue: sentence boundaries were not really respected in the previous version.
@ygcao
Author

ygcao commented Feb 8, 2016

Added braces to make lint happy. Jenkins should be happy now.

@srowen
Member

srowen commented Feb 8, 2016

Jenkins, retest this please

// will be translated into arrays of Index integer
val sentences: RDD[Array[Int]] = dataset.mapPartitions {
// Each sentence will map to 0 or more Array[Int]
sentenceIter =>
Contributor

Generally I prefer the style:

dataset.mapPartitions { sentenceIter =>
  sentenceIter.flatMap { sentence =>
    ...

I don't think it's a hard rule for Spark but most other code follows this convention.

Member

This could even be dataset.mapPartitions { _.flatMap { sentence =>, which I kind of like, but I don't know how much it matters.
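For illustration, the suggested shorthand reads like this against a plain Iterator; the `mapPartitions` helper below is just a stand-in for the RDD method, and the data and multiplier are arbitrary.

```scala
object ClosureStyleSketch {
  def main(args: Array[String]): Unit = {
    // Tiny stand-in for RDD.mapPartitions: applies f to the partition iterator
    def mapPartitions[A, B](iter: Iterator[A])(f: Iterator[A] => Iterator[B]): Iterator[B] =
      f(iter)

    // The suggested style: dataset.mapPartitions { _.flatMap { sentence => ... } }
    val out = mapPartitions(Iterator(Seq(1, 2), Seq(3))) { _.flatMap { sentence =>
      sentence.iterator.map(_ * 10)
    }}.toList

    println(out) // List(10, 20, 30)
  }
}
```

The underscore binds the partition iterator directly, so the `sentenceIter =>` name disappears and the nesting stays on one line.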

@SparkQA

SparkQA commented Feb 8, 2016

Test build #50915 has finished for PR 10152 at commit 84a0bc4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ygcao
Author

ygcao commented Feb 10, 2016

It's getting down to personal taste now~~, but I adopted the suggestion anyway. Personally, I would always let the machine do the formatting and length limits (even adding braces for if statements, if we want to make that a rule). If we don't like the machine's default way, we can create a template for the Spark project so the machine does what the majority of the Spark community wants (supporting Eclipse is enough; IntelliJ and others can adopt the Eclipse formatter's template). The key is that the machine should be the one doing the 'stupid' and repetitive work for us ;)
It seems we could create an issue for Spark about this: create an IDE formatter template for Spark contributors.

@srowen
Member

srowen commented Feb 10, 2016

Agree, I'm ready to merge this. I'll CC @mengxr or @jkbradley in case they want a final comment today

@mengxr
Contributor

mengxr commented Feb 10, 2016

@srowen Thanks! I will make a quick pass.

* up to `maxSentenceLength` size (default: 1000)
*/
@Since("2.0.0")
def setMaxSentenceLength(maxSentenceLength: Int): this.type = {
Contributor

It is not clear from the doc what "sentence length" means, number of words or number of characters. We can either update the doc or change the param name to maxWordsPerSentence to make this clear from the name.

Contributor

The param name comes from the original Google implementation. Either option (or both) works, but I guess I'd be marginally more in favour of amending the first line of doc to read ... maximum length (in words) of each ..., or something similar.

Contributor

Sounds good.

@mengxr
Contributor

mengxr commented Feb 10, 2016

Made one pass; only minor comments.

@ygcao
Author

ygcao commented Feb 12, 2016

Addressed the new comments, but still kept the if statement, as I explained with sample code.
Reran the tests and lint. Jenkins should still be happy 🎆

if (wordIndexes.nonEmpty) {
// break wordIndexes into chunks of maxSentenceLength when it has more
val sentenceSplit = wordIndexes.grouped(maxSentenceLength)
sentenceSplit.map(_.toArray)
Contributor

sentenceSplit should be an Iterator[Array[Int]]. So this line might be unnecessary.

Author

Good catch! I'll double check and change.

Author

Sorry again, we can't make that change. The compiler will complain: the Iterator can't be used for flatMap there, which expects a GenTraversableOnce.

@ygcao
Author

ygcao commented Feb 16, 2016

Addressed the 'final' comment, and checked lint and the test cases. Shall we do the merge then? Thanks!

sentenceIter.flatMap { sentence =>
// Sentence of words, some of which map to a word index
val wordIndexes = sentence.flatMap(bcVocabHash.value.get)
if (wordIndexes.nonEmpty) {
Contributor

@ygcao you have kept the if statement here, which I believe both @mengxr and @srowen have shown is not necessary.

@ygcao
Author

ygcao commented Feb 17, 2016

Done! Sorry for missing the comments.

@srowen
Member

srowen commented Feb 18, 2016

Jenkins, retest this please

@SparkQA

SparkQA commented Feb 18, 2016

Test build #51483 has finished for PR 10152 at commit a4abd40.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Feb 20, 2016

At long last, I think this is ready to go. @mengxr any more comments? Or @MLnick?

@srowen
Member

srowen commented Feb 22, 2016

Merged to master

@asfgit asfgit closed this in ef1047f Feb 22, 2016
@MLnick
Contributor

MLnick commented Feb 22, 2016

@srowen thanks for merging it.

@ygcao thanks for the PR!

@ygcao
Author

ygcao commented Feb 22, 2016

Thanks everybody for the review and help! Cheers!

7 participants