[SPARK-12153][SPARK-7617][MLlib] add support of arbitrary length sentence and other tuning for Word2Vec #10152
Add support for arbitrary-length sentences by using the natural representation of sentences in the input. Add new similarity functions and a normalization option for distances in synonym finding. Add a new accessor for internal structure (the vocabulary and word index) for convenience.
@ygcao back up a moment and read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark first. You need to update this PR and connect it to the JIRA. You have some code style problems here, and I'm not clear that some of the ancillary changes make sense. This class is not intended to provide Euclidean distance or cosine measures.
You can't change the API. This also isn't documented or motivated.
I will add a javadoc to explain it.
For backward compatibility, I can add a default value in this function instead of the newly introduced one.
Hi Sean, thanks for the comment. Sorry for not noticing the title requirement for the pull request; I've fixed the title. I did check the guidelines, but didn't have time to check every line of them. I updated the code with more javadocs to make the motivation and usage clearer, and also made the change strictly backward compatible for existing public functions.
Run dev/lint-scala to see errors. For instance, this is missing spaces. You don't need `return`.
@ygcao what is the specific need for getVocabulary?
getVocabulary and getWordVectors are useful when you need to join or iterate over the built vectors in batch, while getVectors is only useful for looking up the vector of one specific known word in the vocabulary, and it throws an exception when the word is out of vocabulary. The usages are very different.
I will look into the DataFrame version to see whether it can cover the batch use case, to decide whether we can avoid adding a getter here.
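To illustrate the distinction being discussed (a minimal sketch in Python, not the Spark API; the names `lookup` and `iter_word_vectors` are hypothetical stand-ins for the accessors in question):

```python
# Hypothetical sketch of why a batch accessor differs from a
# single-word lookup: the lookup fails on out-of-vocabulary words,
# while the batch accessor yields every (word, vector) pair for joins.
vocab = {"spark": 0, "scala": 1}
vectors = [[0.1, 0.2], [0.3, 0.4]]

def lookup(word):
    """Single-word lookup: raises when the word is out of vocabulary."""
    if word not in vocab:
        raise KeyError(f"{word!r} not in vocabulary")
    return vectors[vocab[word]]

def iter_word_vectors():
    """Batch accessor: yields every (word, vector) pair, e.g. for joins."""
    for word, idx in vocab.items():
        yield word, vectors[idx]

print(lookup("spark"))            # [0.1, 0.2]
print(dict(iter_word_vectors()))  # both entries, suitable for batch use
```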
Regardless, as others have said, these extra methods should come in a separate PR.
@ygcao can we remove the code changes related to the … Also, I think that things like distance metrics should live in a centralized place, so changes to these are outside the scope of this particular model and this PR. In fact, the original …
Pinging @mengxr @MechCoder @jkbradley (and I think @Ishiihara was the original author of Word2Vec?). So let's focus this PR on making the max sentence size configurable, if this is desirable?

Looking a bit deeper, the sentence structure of the input is essentially discarded in https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L273. This dates back to the original implementation, and it does match the original Google implementation, which treats end-of-line as a word boundary and then imposes a …

It's interesting to note that e.g. Gensim's implementation respects the sentence structure of the input data (https://github.com/piskvorky/gensim/blob/develop/gensim/models/word2vec.py#L120). Deeplearning4j seems to do the same.

Thinking about it, it does seem a little strange to me to discard sentence boundaries. It makes sense for very large text corpora. But Word2Vec is more general than that, and can be applied e.g. in recommendation settings, where the boundary between "sentences" (say, a "user activity history") is more patently "discontinuous". Thoughts?

On the face of it, we can leave the implementation as is (as it is true to the original), optionally making the max sentence length a configurable param. Or we can look at using the "sentence" structure of the input data (perhaps making the behaviour configurable between this and the original impl).
+1
How are they being discarded? Each row in the data given to MLlib is treated as a separate sentence; i.e., we aren't trying to model similarity across sentences (as far as I can tell, glancing at it). If people want to work with a unit other than sentences, then each unit can be on a separate row. Am I misunderstanding?
@jkbradley as far as I can see: the input to … Then in L285, … So the way I read the current code, it indeed does just treat the input as a stream of words, discards sentence boundaries, and uses a hard 1000-word limit on "sentences". Let me know if I missed something here.

This is in fact matching what the Google impl does (from my quick look through the C code, e.g. L373-405 and L70 in …).

So, purely technically, the current code is "correct" in that it matches the original, but I'm not sure if it was intentional or not to use …
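The behaviour described above can be sketched as follows (a language-agnostic illustration written in Python, not the actual Scala source; the function name is hypothetical):

```python
# Sketch of the described behaviour: sentence boundaries in the input
# are discarded by flattening, and the resulting word stream is re-cut
# into fixed-size "sentences" of at most max_len words.
MAX_SENTENCE_LENGTH = 1000  # the hard limit mentioned above

def to_training_sentences(input_sentences, max_len=MAX_SENTENCE_LENGTH):
    # Flatten the input: the original sentence boundaries disappear here.
    stream = [w for sentence in input_sentences for w in sentence]
    # Re-cut the stream into chunks of at most max_len words.
    return [stream[i:i + max_len] for i in range(0, len(stream), max_len)]

# Two short input sentences become one training "sentence":
print(to_training_sentences([["a", "b"], ["c"]], max_len=4))  # [['a', 'b', 'c']]
```

The point of contention in this thread is exactly the flattening step: whether the word stream should instead be cut at the input's own sentence boundaries.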
@MLnick Oops, you're right; we are completely ignoring sentence boundaries. I didn't look carefully enough at the code. I'll correct my comment above too.
Same here; I haven't looked at the follow-up literature sufficiently to say. If anyone watching has references, that'd be useful.
Overall, I'd say it's unclear whether we need to modify our implementation. How about we look for use cases and see if people have reported differences between following and ignoring sentence boundaries, before continuing with lots of code changes?
I have to say word2vec (or skip-gram) is absolutely affected by the training data, just like any other ML algorithm. I can also tell you I observed big differences when I applied different massaging of the data at the scale of millions to billions of sentences.
Seeing is believing; comparison is the key. You are encouraged to use my version (which uses a simple sentence splitter on periods and question marks) and the old version to build models from the same text/data set, and then compare them to see the differences. BTW: if your data is not text, I'd say any sequence data has a natural boundary just like a sentence; e.g. a user session's natural boundary is the time span of continuous operations.
OK, so after some more digging into this and the original Google code and mailing list, I was actually incorrect in my original reading of the Google impl; sorry for the confusion. It does in fact treat newlines (…) as sentence boundaries. This happens in L81-84, where …

See also these two Google group posts, which make it clearer that sentences are newline-delimited: https://groups.google.com/forum/#!searchin/word2vec-toolkit/line$20braks/word2vec-toolkit/2elkT3cOMqo/DL_CsF1p8H8J and https://groups.google.com/forum/#!topic/word2vec-toolkit/3LAooMdrCl0

Given this, our current implementation is in fact not correct, and using …
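A rough sketch of the reading logic described above (in Python, not the C source; the `</s>` token name follows the Google word2vec convention, and the rest is a simplified illustration):

```python
# Sketch: the Google implementation maps a newline to the special </s>
# token, and a sentence ends either at </s> or after MAX_SENTENCE_LENGTH
# words, whichever comes first.
MAX_SENTENCE_LENGTH = 1000

def read_sentences(tokens, max_len=MAX_SENTENCE_LENGTH):
    sentence, out = [], []
    for tok in tokens:
        if tok == "</s>":                  # newline: end the current sentence
            if sentence:
                out.append(sentence)
                sentence = []
        else:
            sentence.append(tok)
            if len(sentence) >= max_len:   # hard cap per sentence
                out.append(sentence)
                sentence = []
    if sentence:
        out.append(sentence)
    return out

print(read_sentences(["a", "b", "</s>", "c", "d", "e"], max_len=2))
# [['a', 'b'], ['c', 'd'], ['e']]
```

So the cap acts as an upper bound within newline-delimited sentences, rather than replacing the sentence boundaries entirely.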
Added a configurable maxSentenceLength constraint; added more documentation of the getter functions.
update codes:
Please match the style of this comment to the other parameter setters (e.g. https://github.com/apache/spark/pull/10152/files#diff-88f4b62c382b26ef8e856b23f5167ccdL129), including the default parameter setting.
@ygcao could you update the comment style here to match the parameter setters in this class?
Fixed a logic issue: sentence boundaries were not really respected in the previous version.
Added braces to make lint happy. Jenkins should be happy now.
Jenkins, retest this please
```scala
// will be translated into arrays of Index integer
val sentences: RDD[Array[Int]] = dataset.mapPartitions {
  // Each sentence will map to 0 or more Array[Int]
  sentenceIter =>
```
Generally I prefer the style:
```scala
dataset.mapPartitions { sentenceIter =>
  sentenceIter.flatMap { sentence =>
    ...
```
I don't think it's a hard rule for Spark, but most other code follows this convention.
This could even be `dataset.mapPartitions { _.flatMap { sentence =>`, which I kind of like, but I don't know how much it matters.
Test build #50915 has finished for PR 10152 at commit
It's getting into personal taste now; I adopted the suggestion anyway. Personally, I would always let the machine do the formatting and enforce length limits (even adding braces for if statements, if we want to make that a rule). If we don't like the machine's defaults, we can create a template for the Spark project so the machine does what the majority of the Spark community wants (supporting Eclipse is enough; IntelliJ and others can adopt the Eclipse formatter's template). The key is that the machine should be the one doing the 'stupid' and repetitive work for us ;)
Agree, I'm ready to merge this. I'll CC @mengxr or @jkbradley in case they want a final comment today.
@srowen Thanks! I will make a quick pass.
```scala
 * up to `maxSentenceLength` size (default: 1000)
 */
@Since("2.0.0")
def setMaxSentenceLength(maxSentenceLength: Int): this.type = {
```
It is not clear from the doc what "sentence length" means: number of words or number of characters. We can either update the doc or change the param name to `maxWordsPerSentence` to make this clear from the name.
The param name comes from the original Google implementation. Either option (or both) works, but I guess I'd be marginally more in favour of amending the first line of the doc to read `... maximum length (in words) of each ...`, or something similar.
Sounds good.
Made one pass; only minor comments.
Addressed the new comments. I still kept the if statement, as I explained with sample code.
```scala
if (wordIndexes.nonEmpty) {
  // break wordIndexes into chunks of maxSentenceLength when it has more
  val sentenceSplit = wordIndexes.grouped(maxSentenceLength)
  sentenceSplit.map(_.toArray)
```
`sentenceSplit` should be an `Iterator[Array[Int]]`, so this line might be unnecessary.
Good catch! I'll double check and change.
Sorry again, we can't make that change. The compiler will complain: an `Iterator` can't be used for `flatMap` there; it expects a `GenTraversableOnce`.
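For readers unfamiliar with Scala's `grouped`, its chunking behaviour (the operation being discussed) can be sketched like this (in Python for illustration; the helper name mirrors the Scala method and is not part of any library):

```python
# Sketch of what wordIndexes.grouped(maxSentenceLength) does:
# split a sequence into successive chunks of at most `size` elements.
def grouped(xs, size):
    """Yield chunks of at most `size` elements, like Scala's Iterator.grouped."""
    for i in range(0, len(xs), size):
        yield xs[i:i + size]

word_indexes = [3, 1, 4, 1, 5, 9, 2]
print(list(grouped(word_indexes, 3)))  # [[3, 1, 4], [1, 5, 9], [2]]
```

Note that, like the Scala version, this yields lazily (a generator here, an `Iterator` there), which is what the type discussion above is about.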
Addressed the 'final' comment, and checked lint and the test cases. Shall we do the merge then? Thanks!
```scala
sentenceIter.flatMap { sentence =>
  // Sentence of words, some of which map to a word index
  val wordIndexes = sentence.flatMap(bcVocabHash.value.get)
  if (wordIndexes.nonEmpty) {
```
Done! Sorry for missing the comments.
Jenkins, retest this please
Test build #51483 has finished for PR 10152 at commit
Merged to master |
Thanks everybody for the review and help! Cheers!
Add support for arbitrary-length sentences by using the natural representation of sentences in the input.
Add new similarity functions and a normalization option for distances in synonym finding.
Add a new accessor for internal structure (the vocabulary and word index) for convenience.
I need instructions on how to set the value of the @Since annotation for the newly added public functions. 1.5.3?
JIRA link: https://issues.apache.org/jira/browse/SPARK-12153