-
Notifications
You must be signed in to change notification settings - Fork 126
RSDK-12690 Reorganize/name Subloggers for diagnostic logs #5565
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RSDK-12690 Reorganize/name Subloggers for diagnostic logs #5565
Conversation
|
Let's say I start With this PR, they now look like (just drawing attention to the new "rdk.config" logger): When app is running new diagnostic logs logic, the "Logs" tab contains only the following logs of the above by default: |
|
Spoke with @cheukt offline: I'm going to invert the list in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
edit
Just saw this:
Spoke with @cheukt offline: I'm going to invert the list in app to be diagnosticLoggerNames instead of userFacingLoggerNames and try to make fewer changes in this PR (keep rdk as top-level logger and do less segmenting).
Feel free to read over the feedback -- but don't feel compelled to make a specific change for stuff getting tweaked for the above.
original
Sorry I just stream of consciousness-bombed this PR. Trying to get thoughts out on a limited timeline.
Summarizing my thoughts:
- The high level PR description of what the top of the logger heirarchy is and what is user-facing/diagnostics makes sense.
- I see we're again trying to have an empty string logger. I still think
rdk.networkingis preferable to justnetworking. Especially cause we changed some logs to sayviam-serverbecause, without the context ofrdkin the logger name, no one really knows what the log refers to.- Not to mention this will be backwards breaking for existing logger patterns.
- Some backwards breaking is inevitable -- but one of my favorites to disable is
rdk.networking.... And I don't think that one needed to happen. - We probably have to circle back and consider whether log patterns should actually skip sending logs to app. Or just control whether app renders logs in the logs tab? Definitely not a now problem to solve.
- Some backwards breaking is inevitable -- but one of my favorites to disable is
- I might very well be the only person taking advantage of that anyways.
- But most importantly, I want to easily verify that nothing is ever outputted over the empty logger.
- Not to mention this will be backwards breaking for existing logger patterns.
- The 3 kinds of module loggers in the modmanager package is confusing
- There's a bunch of copy changes (read: tweaks to the words in existing log messages). Some make sense given the context of this PR. Some feel like they're drivebys. Many I think are a regression in some aspect of communicating important information. Even if they are an improvement in a different aspect.
- Can we split this PR into two pieces? The first piece being the logger name refactor. And then a second piece where we can debate copy? I think the logger name refactor is important for moving the project forward. And the copy changes are just a distraction.
module/modmanager/manager.go
Outdated
| } | ||
|
|
||
| mod.logger.CInfow(ctx, "Module configuration changed. Stopping the existing module process", "module", conf.Name) | ||
| mgr.modulesLogger.CInfow(ctx, "Now reconfiguring module", "module", conf.Name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is better? Why go from the module logger to the manager? I don't think the distinction matters to the user.
And what's the motivation for the copy change? "reconfiguring" isn't well defined. But "configuration changed" and "stopping process" are.
module/modmanager/manager.go
Outdated
| mod.logger.CInfow(ctx, "Resources handled by reconfigured module will be re-added to new module process", | ||
| "module", mod.cfg.Name, "resources", handledResourceNameStrings) | ||
|
|
||
| mgr.modulesLogger.CInfow(ctx, "Module successfully reconfigured", "module", conf.Name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this tell us that the above two logs are not?
module/modmanager/manager.go
Outdated
| go func(i int, conf config.Module) { | ||
| defer wg.Done() | ||
| moduleLogger := mgr.logger.Sublogger(conf.Name) | ||
| mgr.modulesLogger.CInfow(ctx, "Now adding module", "module", conf.Name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why change which logger this goes to? I'm not sure why this is considered preferable.
module/modmanager/module.go
Outdated
| oue pexec.UnexpectedExitHandler, | ||
| viamHomeDir string, | ||
| packagesDir string, | ||
| modulesLogger logging.Logger, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's three loggers at play:
- manager logger
- manager module logger
- module logger (
m.logger)
Given lines 261/262 are sublogger-ing with the module name, I'm guessing this input is the manager module logger and not the module logger? Why are we not using m.logger?
| jobManager *jobmanager.JobManager | ||
| localPackages packages.ManagerSyncer | ||
| cloudConnSvc icloud.ConnectionService | ||
| logger logging.Logger |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we creating a convention where logger always refers to a diagnostics logger? The high level PR description makes sense, but I think keeping track of which loggers are which here is going to be rough.
robot/web/web.go
Outdated
| err = utils.FilterOutError(err, context.Canceled) | ||
| if err != nil { | ||
| logger.Errorw("error running web", "error", err) | ||
| shutdownLogger.Errorw("error running web", "error", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It wasn't obvious to me this change was right. As in, is this really a shutdown log? But it is.
Can we undo this weird defer thing? And just assume StartWeb is not being weird and panicing? And maybe change the message to be understandable(-ish) by a user: Error running RPC server. Shutting down.
At least I think startweb's primary purpose is to allow for (g)RPC invocations?
web/server/entrypoint.go
Outdated
| } | ||
| if len(versionFields) != 0 { | ||
| logger.Infow("Viam RDK", versionFields...) | ||
| logger.Infow("viam-server", versionFields...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Grepping for Viam RDK is my goto for finding startup/shutdown periods in logs. Relying heavily on viam-server in logs I feel is going to greatly complicate that workflow. Maybe I'll commit "environment variables" to memory?
I'm not sure when this project decided to declare war on "RDK" (which I can half understand -- it's not really user-friendly).
module/modmanager/manager.go
Outdated
| // modulesLogger is used for adding, removing, and reconfiguring modules as well as | ||
| // module log output and is user-facing in app. | ||
| modulesLogger logging.Logger | ||
| // logger is used for all other module manager logs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we write out whether this is user facing or not?
There's also the third logger of mod.logger. I'm not sure what the rule is for which of the three to use for logging to. I'm hoping we can get away with just two? The individual module loggers and the mod manager (when not logging about any specific module -- for no modules or potentially multiple modules).
Not to mention the 4th module logger for stdout/stderr and the 5th module logger for module logs sent to the viam-server over gRPC.
config/logging_level.go
Outdated
| globalLogger.mu.Lock() | ||
| defer globalLogger.mu.Unlock() | ||
|
|
||
| globalLogger.logger = logger |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm guessing we're never supposed to write logs to logger. But this is the logger we call SetLevel on? Can we enhance the name.
My rule of thumb for naming variables is that -- if only one variable has some type, can just name it after the type. But once two variables have some type, it's easy and effective to come up with non-generic names/adjectives that describe their unique purpose and how the two (or more) variables co-exist.
This reverts commit edcfbbba74c924150907b6ec52aabba7a46d1a64.
0ad291e to
9e2161c
Compare
benjirewis
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks both! Took @cheukt's offline feedback and read through all your comments, @dgottlieb. I won't respond to the GH comments individually since quite a bit has changed in this PR, but I've updated the PR description and my previous comment showing example logs. Generally, I now have a list of hard-coded diagnosticLoggerNames in the app PR instead of userFacingLoggerNames, and have avoided unnecessary segmentation/copy changes. Let me know what you think!
benjirewis
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed further offline: I've removed "rdk.modmanager.", "rdk.package_manager", and "rdk.resource_manager" from the list of diagnostic logger names in the app PR. I've also removed all changes to module/modmanager/manager.go (no need to move anything from rdk.modmanager.[modulename] to rdk.modmanager). Finally, I updated the PR description and my previous comment showing example logs once more. This PR basically just adds a "rdk.config" Sublogger now.
@cheukt I found that rdk.package_manager should probably be user-facing; it does indeed log the download percentage of all packages, and I was making it diagnostic only to hide a single startup and a single shutdown log.
Let me know what you think! The final set of user-facing logs seems succinct enough but also more conservatively includes information.
web/server/entrypoint.go
Outdated
| golog.ReplaceGloabl(logger.AsZap()) | ||
| // Have goutils use the networking Sublogger of the root logger we've just configured. | ||
| // most of goutils is related to networking logic. | ||
| golog.ReplaceGloabl(networkingLogger.AsZap()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see no usages of the golog dependency anywhere but here in rdk, @dgottlieb.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the change to the networking logger here is correct. In fact I only see one networking log (by number of lines -- not necessarily by number of invocations):
rpc/dialer.go: golog.Global().Debugw("close referenced conn", "proto", rc.proto)
I'm not sure what the remark is suggesting, but I think the question is -- is this "safe" to remove?
If I remember correctly, I had (presumably also) thought we could remove this (and did). But then learned that there were logs in goutils that used the golog global logger. And those logs stopped getting sent to app. I think the most damaging logs to omit were those from panic capturing go. But there's also UncheckedError (and family).
I don't know if those need to be user-facing, so networkingLogger here might be fine. But want to make an intentional choice here.
W.r.t potentially removing this: I think we can only remove this for real once goutils stops relying on the global golog logger.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking more closely at the usages of the global logger in goutils, it does seem like it's NOT necessarily used for "networking"-related messages in goutils:
testutils/ext/verify.go:
70: golog.Global().Fatalw("error opening artifact", "error", err)
76: golog.Global().Errorw("error closing artifact", "error", err)
testutils/env.go:
15: logger = golog.Global().Named("test")
runtime.go:
153: golog.Global().Errorw("panic while running function", "error", err)
158: golog.Global().Infow("waiting a bit to call callback", "wait", waitDur.String())
logger.go
15:var Logger = golog.Global()
perf/statz/summations.go:
96: golog.Global().Errorf("faild to write metric %s", err)
web/templates.go:
219: golog.Global().Errorf("protoJson failed to marshal: %s", err)
perf/statz/counters.go:
112: golog.Global().Errorf("faild to write metric %s", err)
error.go:
65: golog.Global().Desugar().WithOptions(zap.AddCallerSkip(2)).Sugar().Debugw("unchecked error", "error", err)
perf/statz/statztest/recorder.go:
138: golog.Global().Panic("Expected even number of keypairs")
perf/statz/gauges.go:
81: golog.Global().Errorf("faild to write metric %s", err)
rpc/dialer.go:
215: golog.Global().Debugw("close referenced conn", "proto", rc.proto)
perf/statz/wrapper.go:
19: golog.Global().Panic("Should never happen where the label lengths do not match")
perf/statz/internal/global.go:
39: golog.Global().Panicf(`Metric %s was already defined and is trying to register again. It may be registered at: %s
rpc/examples/echo/client/main.go:
22:var logger utils.ZapCompatibleLogger = golog.Global().Named("client")
perf/statz/distributions.go:
90: golog.Global().Errorf("faild to write metric %s", err)
perf/statz/example/example.go:
40: golog.Global().Panicf("Failed to start: %s", err)
51: golog.Global().Infof("Latency %d: %f", i, ms)
perf/statz/lables.go:
48: golog.Global().Fatalf("Invalid type to string, should never happen with the type contraints defined.")
58: golog.Global().Fatalf("error creating metric label", err)
perf/statz/statz.go:
255: golog.Global().Panicf("Failed to register metric name not valid: %s", err)
261: golog.Global().Panicf("Failed to register metric label not valid: %s", err)
286: golog.Global().Fatalf("Failed to register the views: %v", err)
298: golog.Global().Panic("Regex failed, this should not happen")
312: golog.Global().Panic("Regex failed, this should not happen")
rpc/examples/echo/server/cmd/main.go:
36: logger utils.ZapCompatibleLogger = golog.Global().Named("server")
Of those search results, I think the only instances that represent non-testing and non-example logs are:
runtime.go:
153: golog.Global().Errorw("panic while running function", "error", err)
158: golog.Global().Infow("waiting a bit to call callback", "wait", waitDur.String())
logger.go
15:var Logger = golog.Global()
perf/statz/summations.go:
96: golog.Global().Errorf("faild to write metric %s", err)
web/templates.go:
219: golog.Global().Errorf("protoJson failed to marshal: %s", err)
perf/statz/counters.go:
112: golog.Global().Errorf("faild to write metric %s", err)
error.go:
65: golog.Global().Desugar().WithOptions(zap.AddCallerSkip(2)).Sugar().Debugw("unchecked error", "error", err)
perf/statz/statztest/recorder.go:
138: golog.Global().Panic("Expected even number of keypairs")
perf/statz/gauges.go:
81: golog.Global().Errorf("faild to write metric %s", err)
rpc/dialer.go:
215: golog.Global().Debugw("close referenced conn", "proto", rc.proto)
perf/statz/wrapper.go:
19: golog.Global().Panic("Should never happen where the label lengths do not match")
perf/statz/internal/global.go:
39: golog.Global().Panicf(`Metric %s was already defined and is trying to register again. It may be registered at: %s
perf/statz/distributions.go:
90: golog.Global().Errorf("faild to write metric %s", err)
perf/statz/example/example.go:
40: golog.Global().Panicf("Failed to start: %s", err)
51: golog.Global().Infof("Latency %d: %f", i, ms)
perf/statz/lables.go:
48: golog.Global().Fatalf("Invalid type to string, should never happen with the type contraints defined.")
58: golog.Global().Fatalf("error creating metric label", err)
perf/statz/statz.go:
255: golog.Global().Panicf("Failed to register metric name not valid: %s", err)
261: golog.Global().Panicf("Failed to register metric label not valid: %s", err)
286: golog.Global().Fatalf("Failed to register the views: %v", err)
298: golog.Global().Panic("Regex failed, this should not happen")
312: golog.Global().Panic("Regex failed, this should not happen")
Note that that var Logger = golog.Global() variable is never referenced, so that declaration is a no-op. The rest of the instances are primarily from metrics-related logic, with some random runtime, error-checking, "template" (app-related auth concept I think), and networking logs thrown in. It's probably best to just use rdk for this random assortment of logs. If they end up being noisy/diagnostic, we can create a separate logger here, but I agree with you that rdk.networking does not seem correct.
TL;DR I'm going to change this line to be golog.ReplaceGloabl(rootLogger.AsZap())
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with that change
| return | ||
| } | ||
| globalLogger.logger.Info("New log level:", newLevelZap) | ||
| globalLogger.levelChangeLogger.Info("New log level:", newLevelZap) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Err, I'll take the blame for this. I guess this was a perceived bugfix?
Now that I see this more clearly(read: without the other code change distractions), I see why this second logger was introduced onto the globalLogger. I'm less happy about my ask to better name what's now globalLogger.actualGlobalLogger.
No need to change this. These loggers aren't explicitly referenced in many spots. So whatever blemish I perceive is inconsequential. Putting myself in your shoes, I would have tried to brainstorm options for avoiding needing a different logger here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha; I'm gonna leave the names of actualGlobalLogger and levelChangeLogger as-is (IIUC your comment correctly), but I do agree with the "grossness" of globalLogger.actualGlobalLogger 😆 .
web/server/entrypoint.go
Outdated
| golog.ReplaceGloabl(logger.AsZap()) | ||
| // Have goutils use the networking Sublogger of the root logger we've just configured. | ||
| // most of goutils is related to networking logic. | ||
| golog.ReplaceGloabl(networkingLogger.AsZap()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the change to the networking logger here is correct. In fact I only see one networking log (by number of lines -- not necessarily by number of invocations):
rpc/dialer.go: golog.Global().Debugw("close referenced conn", "proto", rc.proto)
I'm not sure what the remark is suggesting, but I think the question is -- is this "safe" to remove?
If I remember correctly, I had (presumably also) thought we could remove this (and did). But then learned that there were logs in goutils that used the golog global logger. And those logs stopped getting sent to app. I think the most damaging logs to omit were those from panic capturing go. But there's also UncheckedError (and family).
I don't know if those need to be user-facing, so networkingLogger here might be fine. But want to make an intentional choice here.
W.r.t potentially removing this: I think we can only remove this for real once goutils stops relying on the global golog logger.
| golog.ReplaceGloabl(networkingLogger.AsZap()) | ||
|
|
||
| go nc.RunNetworkChecks(ctx, logger, true /* continueRunningTestDNS */) | ||
| go nc.RunNetworkChecks(ctx, rootLogger, true /* continueRunningTestDNS */) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume this is the root logger so it's user facing? Not sure if I agree with that (certainly more interesting to users than weird internal RDK stuff, but not sure they can do much about).
I think being conservative and leaving as rootLogger here is fine. Can we leave a comment reminding readers these are considered user-facing. And clarifying there's no hard opinion on whether that's right? Or maybe there is a hard opinion -- just speculating here to reduce round trips in PR review in case I guessed correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume this is the root logger so it's user facing?
While we are passing the rootLogger to RunNetworkChecks, that function immediately creates a sublogger rdk.network-checks which is listed as diagnostic in the app PR. Does that resolve your concern? Or, would a comment of some sort still be helpful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a comment here would be good. I've struggled on where to create subloggers that cross method boundaries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a comment.
I've struggled on where to create subloggers that cross method boundaries.
Same... creating a Sublogger before the method invocation or at the top of the method body seem equally valid to me. I suppose it's more of a question of consistency, but we clearly already have both patterns lying around. Documenting which logger will be used when it's a question of user-facing vs. diagnostic seems like a fair solution for now.
web/server/entrypoint.go
Outdated
| } | ||
| if host == "" || host == "0.0.0.0" || host == "::" { | ||
| s.logger.Warn("binding to all interfaces without authentication") | ||
| s.networkingLogger.Warn("binding to all interfaces without authentication") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nuanced take -- keep this as user-facing. Either as the root logger or probably the config logger. This log has important security implications that users have control over.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either as the root logger or probably the config logger.
I made the log user-facing under the root logger, as the config logger is diagnostic. Let me know if that seems off. Also left a comment above the log explaining why it's under the root logger.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood re: config logger. Root is good
web/server/entrypoint.go
Outdated
| // handling restart checking for us (relevant environment variable is unset). | ||
| if fullProcessedConfig.Cloud != nil && os.Getenv(rutils.ViamAgentHandlesNeedsRestartChecking) == "" { | ||
| s.logger.CInfo(ctx, "Agent does not handle checking needs restart functionality; will handle in server") | ||
| s.networkingLogger.CInfo(ctx, "Agent does not handle checking needs restart functionality; will handle in server") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seemed fine to keep as the rootLogger.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched to rootLogger.
|
|
||
| if mustRestart { | ||
| logStackTraceAndCancel(cancel, s.logger) | ||
| logStackTraceAndCancel(cancel, s.rootLogger) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we wanted to make these diagnostic logs? Perfectly happy to do this in a different ticket (certainly using the available networkLogger is a bit of a semantic stretch -- and this ticket doesn't need to be scoped to include that).
Just double checking we're on the same page about the long-term goal here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we wanted to make these diagnostic logs?
We do! Cheuk and I agree they should be diagnostic. Mind if I handle that in https://viam.atlassian.net/browse/RSDK-12813?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfectly fine!
|
|
||
| shutdownCallbackOpt := robotimpl.WithShutdownCallback(func() { | ||
| logStackTraceAndCancel(cancel, s.logger) | ||
| logStackTraceAndCancel(cancel, s.rootLogger) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same remark re: long term goal.
RSDK-12690
Reorganizes/names Subloggers such that the following list of logger name prefixes can be used to check if a logger name is "diagnostic" in app:
"rdk.config""rdk.framesystem""rdk.ftdc""rdk.job_manager""rdk.networking""rdk.network-checks"The changes in this PR are mostly related to creating a new
"rdk.config"logger to move some diagnostic config-related logs out of the user-facing"rdk"logger.The actual message of all logs remains unchanged and no new logs have been added.