[SPARK-23686][ML][WIP] Better instrumentation #20837
Test build #88274 has finished for PR 20837 at commit
I doubt it will log repeatedly here.
Why not add a level param to Instrumentation's def log(msg: String), instead of adding a separate logWarning method?
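For illustration, a minimal sketch of what a single log method with a level parameter might look like. The names InstrumentationSketch, LogLevel, and sink are hypothetical and are not the actual spark.ml Instrumentation API; the sink stands in for whatever logger the real class delegates to.

```scala
// Hypothetical sketch only; not the real spark.ml Instrumentation class.
sealed trait LogLevel
object LogLevel {
  case object Debug extends LogLevel
  case object Info extends LogLevel
  case object Warning extends LogLevel
  case object Error extends LogLevel
}

// `sink` is an assumed stand-in for the underlying logger; it defaults to stdout.
class InstrumentationSketch(
    prefix: String,
    sink: String => Unit = (s: String) => println(s)) {

  // One entry point: the level is an argument instead of part of the method name.
  def log(msg: String, level: LogLevel = LogLevel.Info): Unit =
    sink(s"[$level] $prefix: $msg")
}
```

With this shape, a warning would be written as instr.log("labels are imbalanced", LogLevel.Warning), and a new level would not require a new method.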
Why not log the whole histogram (each label -> its weightSum)?
Logging only the min/max weightSum seems useless, and users would not even know which labels they relate to.
This is just a proxy for balance in the dataset. We can log more; I just wanted to start by logging something.
It might be useful to add a utility for logging arrays and vectors (we could JSON-encode them), but in the meantime I wanted to capture at least minimal information about the data balance.
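As a rough sketch of the utility discussed here, the following encodes an array of doubles and a label -> weightSum histogram as JSON strings suitable for a single log line. JsonLogSketch and its methods are hypothetical names; the encoding is hand-rolled to keep the example self-contained, and it assumes all values are finite (NaN and Infinity are not valid JSON).

```scala
// Hypothetical helper; a real implementation would likely use a JSON library.
object JsonLogSketch {

  // Encodes an array of finite doubles as a JSON array.
  def encodeArray(values: Array[Double]): String =
    values.mkString("[", ",", "]")

  // Encodes label -> weightSum as a JSON object, sorted by label so the
  // output is stable across runs. Labels become string keys.
  def encodeHistogram(hist: Map[Double, Double]): String =
    hist.toSeq
      .sortBy(_._1)
      .map { case (label, weightSum) => s""""$label":$weightSum""" }
      .mkString("{", ",", "}")
}
```

Logging the encoded histogram instead of only the min and max would tell the user which label each weight belongs to.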
I'm OK with not logging the full histogram here. There's a typo: "highestLabelWeight" is actually logging the min, not the max.
172bf27 to f8f379a
f8f379a to 8a3ce3e
Test build #88768 has finished for PR 20837 at commit
Test build #88770 has finished for PR 20837 at commit
jkbradley left a comment
Note to others: I've asked @WeichenXu123 to take this over from @MrBago who is temporarily unavailable.
@WeichenXu123 When you take this PR over, can you please use the subtask SPARK-23859 instead of SPARK-23686?
In LogisticRegression, there is one use of logError. Let's add logError to Instrumentation and use it in logreg.
Thanks!
No problem. I will take over this. Thanks!
We can close this issue now that it's been replaced by #20982
…nd logging levels
## What changes were proposed in this pull request?
Initial PR for Instrumentation improvements: UUID and logging levels. This PR takes over apache#20837
Closes apache#20837
## How was this patch tested?
Manual.
Author: Bago Amirbekian <[email protected]>
Author: WeichenXu <[email protected]>
Closes apache#20982 from WeichenXu123/better-instrumentation.
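The commit message above mentions a UUID alongside logging levels. One plausible reading, sketched below under that assumption, is that each instrumentation instance carries a random UUID and prefixes its messages with it, so that log lines from one training run can be grouped together. TaggedLoggerSketch and format are hypothetical names, not taken from the actual patch.

```scala
import java.util.UUID

// Hypothetical sketch; the real change in #20982 may differ.
class TaggedLoggerSketch(estimatorName: String) {

  // A fresh identifier per instance, i.e. per training run in this sketch.
  val id: UUID = UUID.randomUUID()

  // Prefixes every message with the run identifier and the estimator name.
  def format(msg: String): String = s"[$id] $estimatorName: $msg"
}
```

Two concurrent fits of the same estimator would then produce lines with different prefixes, which makes interleaved logs easier to separate.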
What changes were proposed in this pull request?
This PR is meant to show how we could better utilize the Instrumentation class in spark.ml.
How was this patch tested?