
Conversation

@ygcao

@ygcao ygcao commented Dec 5, 2015

Add support for arbitrary-length sentences by using the natural representation of sentences in the input.

Add new similarity functions and a normalization option for distances in synonym finding.
Add new accessors for internal structure (the vocabulary and word index) for convenience.

I need instructions on how to set the value of the @Since annotation for the newly added public functions. 1.5.3?

jira link: https://issues.apache.org/jira/browse/SPARK-12153

@srowen
Member

srowen commented Dec 5, 2015

@ygcao back up a moment and read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark first. You need to update this PR and connect it to the JIRA. You have some code style problems here, and I'm not clear that some of the ancillary changes make sense. This class is not intended to provide Euclidean distance or a cosine measure.

Member

You can't change the API. This also isn't documented, or motivated.

Author

I will add Javadoc to explain it.
For backward compatibility, I can add a default value in this function instead of the newly introduced one.

@ygcao ygcao changed the title add support of arbitrary length sentence by using the nature representation of sentences in the input and other minor tunings [SPARK-12153][MLLib]add support of arbitrary length sentence and other tunings Dec 5, 2015
@ygcao ygcao changed the title [SPARK-12153][MLLib]add support of arbitrary length sentence and other tunings [SPARK-12153][MLlib]add support of arbitrary length sentence and other tunings Dec 5, 2015
@ygcao
Author

ygcao commented Dec 5, 2015

Hi Sean, thanks for the comment. Sorry for not noticing the title requirement for pull requests; I've fixed the title. I did check the guidelines, but didn't have time to check every line of them.
As to the style, I'm pretty much following the existing code. Could you be more specific about what you found that violates the style guide, other than the @Since annotation, on which I need guidance?
As to the distance functions, agreed that separation is preferred. Since cosineSimilarity is already in place in the class, I added more alongside it just to minimize the footprint of my changes. If you can point out an existing class where these functions could live, that would be great. I don't mind adding a new class just for distance functions, though that means bigger changes. And I don't mind pulling out my new distance functions either, since those changes are cosmetic. The essential change is to discard the fixed-size sentence splitting, which does not make sense, and to give the caller flexibility in sentence splitting.

Updated the code with more Javadoc to make the motivation and usage clearer. Also made the change strictly backward compatible for existing public functions.
Adding @mengxr, @rxin and @MechCoder, following the contribution guide, for more eyes on this.
Thanks!

@ygcao ygcao changed the title [SPARK-12153][MLlib]add support of arbitrary length sentence and other tunings [SPARK-12153][MLlib]add support of arbitrary length sentence and other tuning for Word2Vec Dec 6, 2015
Member

Run dev/lint-scala to see errors. For instance, this is missing spaces. You don't need the return.

Contributor

@ygcao what is the specific need for getVocabulary?

Author

getVocabulary and getWordVectors are useful when you need to join or iterate over the built vectors in batch, while getVectors is only useful for looking up the vector of one specific known word in the vocabulary, and it throws an exception when the word is out of vocabulary. The usages are very different.
I will look into the DataFrame version to see whether it can cover the batch use case, to decide whether we can work around adding a getter here.

Member

Regardless, as others have said, these extra methods should come in a separate PR.

@MLnick
Contributor

MLnick commented Dec 7, 2015

@ygcao can we remove the code changes related to the cosineSimilarity and findSynonyms? This will make it cleaner so we can focus on the core issue (see my next comment).

Also, I think that things like distance metrics should live in a centralized place, and so changes to these are outside the scope of this particular model and the PR. In fact, the original cosineSimilarity method (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L472) is no longer used, as it's been replaced by a more efficient version, so we should consider deprecating it (or indeed just removing it as it's private and always has been).

@MLnick
Contributor

MLnick commented Dec 7, 2015

Pinging @mengxr @MechCoder @jkbradley (and I think @Ishiihara was the original author of Word2Vec?)

So let's focus this PR in on making the max sentence size configurable, if this is desirable?

Looking a bit deeper, the sentence structure of the input is essentially discarded in https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L273. This dates back to the original implementation, and it does match the original Google implementation that treats end-of-line as a word boundary, then imposes a MAX_SENTENCE_LENGTH of 1000 when processing the word stream.

It's interesting to note that e.g. Gensim's implementation respects the sentence structure of the input data (https://github.com/piskvorky/gensim/blob/develop/gensim/models/word2vec.py#L120). Deeplearning4j seems to do the same.

It does seem a little strange to me thinking about it to discard sentence boundaries. It does make sense for very large text corpora. But Word2Vec is more general than that, and can be applied e.g. in recommendation settings, where the boundary between "sentences" (say, user activity histories) is more patently "discontinuous".

Thoughts? On the face of it we can leave the implementation as is (as it is true to the original), optionally making the max sentence length a configurable param. Or we can look at using the "sentence" structure of the input data (perhaps making the behaviour configurable between this and the original impl).

@jkbradley
Member

@MLnick

So let's focus this PR in on making the max sentence size configurable, if this is desirable?

+1

It does seem a little strange to me thinking about it to discard sentence boundaries.

How are they being discarded? Each row in the data given to MLlib is treated as a separate sentence; i.e., we aren't trying to model similarity across sentences (as far as I can tell glancing at it). If people want to work with a unit other than sentences, then each unit can be on a separate row. Am I misunderstanding?

@MLnick
Contributor

MLnick commented Dec 8, 2015

@jkbradley as far as I can see:

The input to fit is dataset: RDD[Iterable[String]]. However, in L273, we create val words = dataset.flatMap(x => x), flattening out the RDD of sentences to an RDD of words, i.e. RDD[Iterable[String]] -> RDD[String].

Then in L285, words is used in mapPartitions (as opposed to dataset). In the mapPartitions block, the behaviour of next is to advance through the iterator (which is now the flatMapped stream of words, not a stream of sentences), and have a hard cut off to create each sentence Array[Int] after MAX_SENTENCE_LENGTH is reached.

So the way I read the current code, it indeed does just treat the input as a stream of words, discards sentence boundaries, and uses a hard 1000 word limit on "sentences". Let me know if I missed something here.

This is in fact matching what the Google impl does (from my quick look through the C code, e.g. L373-405 and L70 in ReadWord).

So purely technically the current code is "correct" as it matches the original, but I'm not sure if it was intentional or not to use words in the mapPartitions block. But to me it's an open question whether this approach, or keeping sentence structure / boundaries, is better.
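The word-stream behaviour described above can be sketched with plain Scala collections standing in for the RDD (the names, data, and the tiny chunk size here are illustrative, not the actual Spark source): flattening discards the sentence boundaries, and the flat stream is then cut into fixed-size chunks.

```scala
object HardCutSketch {
  def main(args: Array[String]): Unit = {
    // Stand-in for MAX_SENTENCE_LENGTH (the real code uses 1000)
    val maxSentenceLength = 4

    val dataset = Seq(Seq("the", "cat", "sat"), Seq("dogs", "bark"))

    // dataset.flatMap(x => x): the sentence boundaries disappear
    val words = dataset.flatten

    // the flat word stream is chunked with a hard cutoff
    val chunks = words.grouped(maxSentenceLength).map(_.toArray).toList

    // "dogs" now sits in the same chunk as the tail of the first sentence
    chunks.foreach(c => println(c.mkString(" ")))
  }
}
```

Running this prints "the cat sat dogs" and then "bark": the first word of the second sentence has been glued onto the first sentence, which is exactly the boundary loss under discussion.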

@jkbradley
Member

@MLnick Oops, you're right; we are completely ignoring sentence boundaries. I didn't look carefully enough at the code. I'll correct my comment above too.

it's an open question whether this approach, or keeping sentence structure / boundaries, is better

Same here; I haven't looked at the follow-up literature sufficiently to say. If anyone watching has references, that'd be useful.

@jkbradley
Member

Overall, I'd say it's unclear whether we need to modify our implementation. How about we look for use cases and see if people have reported differences between following and ignoring sentence boundaries, before continuing with lots of code changes?

@ygcao
Author

ygcao commented Dec 10, 2015

I have to say Word2Vec, or skip-gram, is absolutely affected by the training data, just like any other ML algorithm. I can also tell you that I observed big differences when I applied different massaging of the data at the scale of millions to billions of sentences.
Researchers often try to simplify engineering details. One obvious example is that phrase2vec is highly simplified to show the algorithm's effectiveness rather than relying on a complex entity-recognition engine; that doesn't mean we should not do more advanced phrase construction for training.
The benefit of making use of sentence boundaries is quite intuitive when you think in terms of the expected output of skip-gram and how backpropagation works. The beginning words of the next sentence are garbage input when used as context words for the tail word of a sentence at the training stage. From the other side, in the original version, the words around the fixed-size cutting points of the document lose semantically meaningful context words. Words at sentence endings or around cutting points are a minority, so you may not notice a huge impact in the hard-cut version, but that's not a reason not to improve it further. Again, model building is absolutely sensitive to the input data; we humans just aren't sensitive to issues caused by a minority of cases without a deep dive, but the good thing is that we still have theory to think with. What harm could come from respecting sentence boundaries?

@ygcao
Author

ygcao commented Dec 10, 2015

To see is to believe; comparison is the key. You are encouraged to use my version (with a simple sentence splitter on periods and question marks) and the old version to build models from the same text/data set, and then compare them to see the differences. By the way, if your data is not text: any sequence data has a natural boundary, just like sentences. For example, a user session's natural boundary is the time span of continuous operations.

@MLnick
Contributor

MLnick commented Dec 10, 2015

Ok, so after some more digging into this and the original Google code and mailing list, actually I was incorrect in my original read of the Google impl - sorry for the confusion. It does in fact treat newlines (\n) as a sentence boundary, and additionally imposes the limit of 1000 for MAX_SENTENCE_LENGTH.

This happens in L81-84, where \n is allocated a special sentence delimiter of </s>, which is later kept at position 0 in the vocab (see comment L148), and in turn results in breaking out of the sentence construction loop in L393.

See also these two Google group posts which make it more clear that sentences are newline-delimited - https://groups.google.com/forum/#!searchin/word2vec-toolkit/line$20braks/word2vec-toolkit/2elkT3cOMqo/DL_CsF1p8H8J and https://groups.google.com/forum/#!topic/word2vec-toolkit/3LAooMdrCl0

Given this, in fact our current implementation is not correct and using words instead of dataset in the mapPartitions block is a bug.

added configurable maxSentenceLength constraint
added more documentation for the getter functions
@ygcao
Author

ygcao commented Dec 12, 2015

Updated the code:

  1. Passed the lint-scala check by adding spaces and cutting comment line lengths.
  2. Removed the distance functions as we discussed, including the existing zombie one (not used and not accessible from outside).
  3. Added back maxSentenceLength as a configurable variable while still respecting sentence boundaries by default, to meet the demands of people who construct sentences in unimaginable ways. I'll personally always set it large enough (up to document size) to never affect natural sentence boundaries. Anyway, it's good to have an option for different people's needs.
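The boundary-respecting behaviour in point 3 can be sketched like this (plain Scala collections in place of the RDD; names and data are illustrative): grouping happens within each sentence, so no chunk ever straddles a sentence boundary, and only an over-long sentence is split.

```scala
object BoundarySketch {
  def main(args: Array[String]): Unit = {
    val maxSentenceLength = 3

    val dataset = Seq(
      Seq("the", "cat", "sat", "on", "the", "mat"), // longer than the limit
      Seq("dogs", "bark")                           // shorter than the limit
    )

    // split within each sentence: an over-long sentence becomes chunks
    // of at most maxSentenceLength words; a short one stays whole
    val sentences = dataset.flatMap(_.grouped(maxSentenceLength).map(_.toArray))

    sentences.foreach(s => println(s.mkString(" ")))
  }
}
```

This prints "the cat sat", "on the mat", and "dogs bark": the six-word sentence is cut at the configurable limit, while the two-word sentence is untouched.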

Contributor

Please match the style of this comment to the other parameter setters (e.g. https://github.com/apache/spark/pull/10152/files#diff-88f4b62c382b26ef8e856b23f5167ccdL129), including the default parameter setting.

Contributor

@ygcao could you update the comment style here to match the parameter setters in this class?

fixed a logic issue: sentence boundaries were not really respected in the previous version.
@ygcao
Author

ygcao commented Feb 8, 2016

Added braces to make lint happy. Jenkins should be happy now.

@srowen
Member

srowen commented Feb 8, 2016

Jenkins, retest this please

// will be translated into arrays of Index integer
val sentences: RDD[Array[Int]] = dataset.mapPartitions {
// Each sentence will map to 0 or more Array[Int]
sentenceIter =>
Contributor

Generally I prefer the style:

dataset.mapPartitions { sentenceIter =>
  sentenceIter.flatMap { sentence =>
    ...

I don't think it's a hard rule for Spark but most other code follows this convention.

Member

This could even be dataset.mapPartitions { _.flatMap { sentence =>, which I kind of like, but I don't know how much it matters.
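For illustration, the suggested shorthand reads like this against a plain Iterator; the `mapPartitions` helper below is just a stand-in for the RDD method, and the data and multiplier are arbitrary.

```scala
object ClosureStyleSketch {
  def main(args: Array[String]): Unit = {
    // Tiny stand-in for RDD.mapPartitions: applies f to the partition iterator
    def mapPartitions[A, B](iter: Iterator[A])(f: Iterator[A] => Iterator[B]): Iterator[B] =
      f(iter)

    // The suggested style: dataset.mapPartitions { _.flatMap { sentence => ... } }
    val out = mapPartitions(Iterator(Seq(1, 2), Seq(3))) { _.flatMap { sentence =>
      sentence.iterator.map(_ * 10)
    }}.toList

    println(out) // List(10, 20, 30)
  }
}
```

The underscore binds the partition iterator directly, so the `sentenceIter =>` name disappears and the nesting stays on one line.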

@SparkQA

SparkQA commented Feb 8, 2016

Test build #50915 has finished for PR 10152 at commit 84a0bc4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ygcao
Author

ygcao commented Feb 10, 2016

It's getting down to personal taste now~~, but I adopted the suggestion anyway. Personally, I would always let the machine do the formatting and length limits (even adding braces for if statements, if we want to make that a rule). If we don't like the machine's default way, we can create a template for the Spark project so the machine does what the majority of the Spark community wants (supporting Eclipse is enough; IntelliJ and others can adopt the Eclipse formatter's template). The key is that the machine should be the one doing the 'stupid' and repetitive work for us ;)
It seems we could create an issue for Spark about this: create an IDE formatter template for Spark contributors.

@srowen
Member

srowen commented Feb 10, 2016

Agree, I'm ready to merge this. I'll CC @mengxr or @jkbradley in case they want a final comment today

@mengxr
Contributor

mengxr commented Feb 10, 2016

@srowen Thanks! I will make a quick pass.

* up to `maxSentenceLength` size (default: 1000)
*/
@Since("2.0.0")
def setMaxSentenceLength(maxSentenceLength: Int): this.type = {
Contributor

It is not clear from the doc what "sentence length" means, number of words or number of characters. We can either update the doc or change the param name to maxWordsPerSentence to make this clear from the name.

Contributor

The param name comes from the original Google implementation. Either option (or both) works, but I guess I'd be marginally more in favour of amending the first line of doc to read ... maximum length (in words) of each ..., or something similar.

Contributor

Sounds good.

@mengxr
Contributor

mengxr commented Feb 10, 2016

Made one pass; only minor comments.

@ygcao
Author

ygcao commented Feb 12, 2016

Addressed the new comments, but still kept the if statement, as I explained with sample code.
Reran the tests and lint. Jenkins should still be happy 🎆

if (wordIndexes.nonEmpty) {
// break wordIndexes into chunks of maxSentenceLength when it has more
val sentenceSplit = wordIndexes.grouped(maxSentenceLength)
sentenceSplit.map(_.toArray)
Contributor

sentenceSplit should be an Iterator[Array[Int]]. So this line might be unnecessary.

Author

Good catch! I'll double check and change.

Author

Sorry again, we can't make that change. The compiler will complain: the Iterator can't be used for flatMap there, which expects a GenTraversableOnce.

@ygcao
Author

ygcao commented Feb 16, 2016

Addressed the 'final' comment, and checked lint and the test cases. Shall we do the merge then? Thanks!

sentenceIter.flatMap { sentence =>
// Sentence of words, some of which map to a word index
val wordIndexes = sentence.flatMap(bcVocabHash.value.get)
if (wordIndexes.nonEmpty) {
Contributor

@ygcao you have kept the if statement here, which I believe both @mengxr and @srowen have shown is not necessary.

@ygcao
Author

ygcao commented Feb 17, 2016

Done! Sorry for missing the comments.

@srowen
Member

srowen commented Feb 18, 2016

Jenkins, retest this please

@SparkQA

SparkQA commented Feb 18, 2016

Test build #51483 has finished for PR 10152 at commit a4abd40.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Feb 20, 2016

At long last, I think this is ready to go. @mengxr any more comments? Or @MLnick?

@srowen
Member

srowen commented Feb 22, 2016

Merged to master

@asfgit asfgit closed this in ef1047f Feb 22, 2016
@MLnick
Contributor

MLnick commented Feb 22, 2016

@srowen thanks for merging it.

@ygcao thanks for the PR!

@ygcao
Author

ygcao commented Feb 22, 2016

Thanks everybody for the review and help! Cheers!

7 participants