@MechCoder
Contributor

  1. Avoid creating a map of the data just to find numFeatures
  2. If the model is empty, initialize it with a zero vector of size numFeatures
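
As a rough illustration of the two points above, here is a minimal plain-Scala sketch with no Spark dependency. The names `LabeledPoint` and `initialWeights`, and the use of `Seq` and `Array[Double]` in place of an RDD and an MLlib vector, are assumptions made for illustration; this is not the code in the patch.

```scala
object InitWeightsSketch {
  // Hypothetical stand-in for a labeled example; not the MLlib class.
  final case class LabeledPoint(label: Double, features: Array[Double])

  // If no weights exist yet, start from a zero vector whose length is read
  // from the first example only, instead of from a mapped copy of the data.
  // Assumes the batch is non-empty; `head` throws on an empty batch.
  def initialWeights(current: Option[Array[Double]],
                     batch: Seq[LabeledPoint]): Array[Double] =
    current.getOrElse {
      val numFeatures = batch.head.features.length
      Array.fill(numFeatures)(0.0)
    }

  def main(args: Array[String]): Unit = {
    val batch = Seq(LabeledPoint(1.0, Array(0.5, 1.5, 2.5)))
    // No weights yet: falls back to a zero vector of length 3.
    println(initialWeights(None, batch).mkString(","))
    // Existing weights are returned unchanged.
    println(initialWeights(Some(Array(9.0)), batch).mkString(","))
  }
}
```

Using `getOrElse` keeps the fallback lazy, so the first example is only inspected when no weights are present.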

@MechCoder MechCoder changed the title from "[SPARK-8140] [MLlib] Minor internal improvements in Streaming MLlib Algorithms" to "[SPARK-8140] [Minor] [MLlib] Minor internal improvements in Streaming MLlib Algorithms" on Jun 6, 2015
Member

I see your intent here, but if the vectors are very large, it might actually be faster to compute the size remotely and return only the size to the driver. That does entail a small distributed operation, though. I'm not sure which is better.

Contributor Author

There is also a good chance that the input is very large. Is accessing the first element of an RDD slower than running a distributed operation across the entire RDD?

Member

It would only evaluate one element in this case. first() reads the first partition and takes the first element of its iterator, so the map function should only be applied once.
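
The laziness described here can be illustrated without Spark. The sketch below models a single partition as a plain Scala `Iterator`, on the assumption that a mapped RDD applies its function lazily over the parent partition's iterator. The helper `sizeOfFirst` and the call counter are hypothetical names added for illustration, and this is not a substitute for checking the behaviour on a real RDD.

```scala
object LazyFirstSketch {
  // Returns (size of the first vector, number of times the mapped function ran).
  def sizeOfFirst(partition: Iterator[Array[Double]]): (Int, Int) = {
    var calls = 0
    // Iterator.map is lazy: the function runs only when an element is pulled.
    val sizes = partition.map { v => calls += 1; v.length }
    // Pulling one element sends exactly one vector through the map.
    val first = sizes.next()
    (first, calls)
  }

  def main(args: Array[String]): Unit = {
    // A stand-in "partition" of 1000 vectors with 10 features each.
    val partition = Iterator.fill(1000)(Array.fill(10)(0.0))
    val (size, calls) = sizeOfFirst(partition)
    println(s"size=$size, map calls=$calls")
  }
}
```

Under this model the remaining 999 vectors are never touched, which is the behaviour the comment above expects from `input.map(_.features.size).first()`.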

Contributor Author

I see. I had thought that input.map(_.features.size) was evaluated in full first, and that the first element was then extracted from the result. Should I revert this?

Member

You could double-check with a very quick/small benchmark to see if it is in fact evaluated that way and if there is much difference at all. I suspect both are OK and don't know which is better.

@MechCoder MechCoder force-pushed the spark-8140 branch 2 times, most recently from d5d6178 to 4f214fc on June 6, 2015 14:02
@SparkQA
SparkQA commented Jun 6, 2015

Test build #34361 has finished for PR 6684 at commit e043b36.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
SparkQA commented Jun 6, 2015

Test build #34362 has finished for PR 6684 at commit 1fa8501.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
SparkQA commented Jun 6, 2015

Test build #34363 has finished for PR 6684 at commit d5d6178.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MechCoder
Contributor Author

@srowen I removed the None check and restored the model.isEmpty check.

@SparkQA
SparkQA commented Jun 6, 2015

Test build #34364 has finished for PR 6684 at commit 4f214fc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MechCoder MechCoder changed the title from "[SPARK-8140] [Minor] [MLlib] Minor internal improvements in Streaming MLlib Algorithms" to "[SPARK-8140] [MLlib] Minor internal improvements in Streaming MLlib Algorithms" on Jun 6, 2015
@MechCoder MechCoder changed the title from "[SPARK-8140] [MLlib] Minor internal improvements in Streaming MLlib Algorithms" to "[SPARK-8140] [MLlib] Remove empty model check in StreamingLinearAlgorithm" on Jun 6, 2015
@SparkQA
SparkQA commented Jun 6, 2015

Test build #34369 has finished for PR 6684 at commit 50ba5e2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
SparkQA commented Jun 6, 2015

Test build #34370 has finished for PR 6684 at commit 7fbf5f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Jun 6, 2015

I'm OK with this. CC @mengxr and/or @jkbradley for a second look, so I'll leave it open a little while.

@MechCoder
Contributor Author

Also, it would be good to get an opinion on whether input.first().features.size is better than input.map(_.features.size).first().

@asfgit asfgit closed this in e3e9c70 Jun 8, 2015
@MechCoder MechCoder deleted the spark-8140 branch June 8, 2015 14:47
@jkbradley
Member

Sorry for the slow response! I think it looks OK, though I wonder if the incomplete match causes a compilation warning.

@srowen
Member

srowen commented Jun 8, 2015

Hm, good point. That does result in a warning, but there is no runtime problem, since the model must exist here. Really, this whole construct can be removed now; I should have seen that. We can follow up with that additional change to simplify further and remove the warning.
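
For readers unfamiliar with the warning being discussed, here is a minimal sketch of an incomplete match on an `Option` and one way to drop the construct when the value is known to be set. `Model`, `weightsIncomplete` and `weightsDirect` are hypothetical names chosen for illustration; the actual code in StreamingLinearAlgorithm is not reproduced here and may differ.

```scala
object IncompleteMatchSketch {
  // Hypothetical stand-in for a fitted model; not the MLlib class.
  final case class Model(weights: Vector[Double])

  // Only the Some case is handled. Option is sealed, so the compiler warns
  // that the match may not be exhaustive; a None would throw
  // scala.MatchError at runtime.
  def weightsIncomplete(model: Option[Model]): Vector[Double] = model match {
    case Some(m) => m.weights
  }

  // If the caller guarantees the model is set, the match can be dropped
  // entirely, which also removes the warning.
  def weightsDirect(model: Option[Model]): Vector[Double] = {
    require(model.isDefined, "model must be initialized before training")
    model.get.weights
  }

  def main(args: Array[String]): Unit = {
    val model = Some(Model(Vector(1.0, 2.0)))
    println(weightsIncomplete(model))
    println(weightsDirect(model))
  }
}
```

Both functions return the same weights when the model is present; they differ only in the compile-time warning and in the error raised for `None`.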

nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
…ithm

1. Avoid creating a map of the data just to find numFeatures
2. If the model is empty, initialize it with a zero vector of size numFeatures

Author: MechCoder <[email protected]>

Closes apache#6684 from MechCoder/spark-8140 and squashes the following commits:

7fbf5f9 [MechCoder] [SPARK-8140] Remove empty model check in StreamingLinearAlgorithm And other minor cosmits