Skip to content

Conversation

@MrBago
Copy link
Contributor

@MrBago MrBago commented Mar 15, 2018

What changes were proposed in this pull request?

This PR is meant to show how we could better utilize the Instrumentation class in spark.ml.

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

@SparkQA
Copy link

SparkQA commented Mar 15, 2018

Test build #88274 has finished for PR 20837 at commit 172bf27.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • instr.logWarning(s\"All labels belong to a single class and fitIntercept=false. It's a \" +

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I doubt it will log repeatedly here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not add a level param into Instrumentation : def log(msg: String). instead of the logWarning method.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not log the whole histogram ( each label -> its weightSum ).
Only log min/max weightSum seems useless and user even do not know they related to which label.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just proxy for balance in the dataset. We can log more, I just wanted to start by logging something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be useful to add a utility for logging arrays and vectors (we can json encode them), but in the meantime I wanted to capture at least minimal information about the data balance.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK with not logging the full histogram here. There's a typo, where "highestLabelWeight" is actually logging the min (not max)

@MrBago MrBago force-pushed the better-instrumentation branch from 172bf27 to f8f379a Compare March 30, 2018 23:13
@MrBago MrBago force-pushed the better-instrumentation branch from f8f379a to 8a3ce3e Compare March 30, 2018 23:34
@SparkQA
Copy link

SparkQA commented Mar 31, 2018

Test build #88768 has finished for PR 20837 at commit f8f379a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 31, 2018

Test build #88770 has finished for PR 20837 at commit 8a3ce3e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • instr.logWarning(s\"All labels belong to a single class and fitIntercept=false. It's a \" +

Copy link
Member

@jkbradley jkbradley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to others: I've asked @WeichenXu123 to take this over from @MrBago who is temporarily unavailable.

@WeichenXu123 When you take this PR over, can you please use the subtask SPARK-23859 instead of SPARK-23686?

In LogisticRegression, there is one use of logError. Let's add logError to Instrumentation and use it in logreg.

Thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK with not logging the full histogram here. There's a typo, where "highestLabelWeight" is actually logging the min (not max)

@WeichenXu123
Copy link
Contributor

No problem. I will take over this. Thanks!

@jkbradley
Copy link
Member

We can close this issue now that it's been replaced by #20982

@asfgit asfgit closed this in d23a805 Apr 6, 2018
robert3005 pushed a commit to palantir/spark that referenced this pull request Apr 7, 2018
…nd logging levels

## What changes were proposed in this pull request?

Initial PR for Instrumentation improvements: UUID and logging levels.
This PR takes over apache#20837

Closes apache#20837

## How was this patch tested?

Manual.

Author: Bago Amirbekian <[email protected]>
Author: WeichenXu <[email protected]>

Closes apache#20982 from WeichenXu123/better-instrumentation.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants