-
-
Notifications
You must be signed in to change notification settings - Fork 465
Profiling - Deduplication and cleanup #4681
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Profiling - Deduplication and cleanup #4681
Conversation
…with cross platform frameworks
🚨 Detected changes in high risk code 🚨High-risk code has higher potential to break the SDK and may be hard to test. To prevent severe bugs, apply the rollout process for releasing such changes and be extra careful when changing and reviewing these files:
|
Instructions and example for changelogPlease add an entry to Example: ## Unreleased
- Profiling - Deduplication and cleanup ([#4681](https://github.com/getsentry/sentry-java/pull/4681))If none of the above apply, you can opt out of this check by adding |
🚨 Detected changes in high risk code 🚨High-risk code has higher potential to break the SDK and may be hard to test. To prevent severe bugs, apply the rollout process for releasing such changes and be extra careful when changing and reviewing these files:
|
Performance metrics 🚀
|
| Revision | Plain | With Sentry | Diff |
|---|---|---|---|
| 42206f4 | 419.73 ms | 494.82 ms | 75.09 ms |
| d6f8356 | 392.08 ms | 454.42 ms | 62.34 ms |
| 066d89d | 377.51 ms | 434.20 ms | 56.69 ms |
| 7dcbbbd | 459.22 ms | 509.85 ms | 50.62 ms |
| f2cd43a | 415.48 ms | 485.28 ms | 69.80 ms |
App size
| Revision | Plain | With Sentry | Diff |
|---|---|---|---|
| 42206f4 | 1.58 MiB | 2.09 MiB | 521.84 KiB |
| d6f8356 | 1.58 MiB | 2.09 MiB | 521.68 KiB |
| 066d89d | 1.58 MiB | 2.09 MiB | 521.85 KiB |
| 7dcbbbd | 1.58 MiB | 2.09 MiB | 521.68 KiB |
| f2cd43a | 1.58 MiB | 2.09 MiB | 521.84 KiB |
…deduplicate stacks
…entry-java into feat/continuous-profiling-03
🚨 Detected changes in high risk code 🚨High-risk code has higher potential to break the SDK and may be hard to test. To prevent severe bugs, apply the rollout process for releasing such changes and be extra careful when changing and reviewing these files:
|
🚨 Detected changes in high risk code 🚨High-risk code has higher potential to break the SDK and may be hard to test. To prevent severe bugs, apply the rollout process for releasing such changes and be extra careful when changing and reviewing these files:
|
|
@sentry review |
🚨 Detected changes in high risk code 🚨High-risk code has higher potential to break the SDK and may be hard to test. To prevent severe bugs, apply the rollout process for releasing such changes and be extra careful when changing and reviewing these files:
|
adinauer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WIP
| } | ||
|
|
||
| private long resolveThreadId(int eventThreadId) { | ||
| return jfr.threads.get(eventThreadId) != null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
l could extract a var here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
m what does it mean if it can't be found in javaThreads? Does it mean it's simply not remapped from the original ID? Could we be reporting garbage data by using the original ID directly if remapping data is missing? e.g. eventThreadId = 9, we can't find it in javaThreads but the real thread ID was actually 117
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, potentially, if the orig. threadId somehow matches a different javaThreadId then we might map it to the wrong thread.
We could set it to a default thread id, like -1 if we can't resolve it. Or drop the event. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sending negative ids seems to work and Sentry Backend correctly matches them if both the Thread in the profile and in the Transaction have negative ids.
So both options seem to be valid. However, I'm not sure if it helps to have a "Fallback" Thread that basically gets all unmapped events, as that graph would not make any sense.
I think discarding the event makes more sense, WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Set threadId to -1 if it cannot be found in the map in follow-up PR: #4746
|
|
||
| @Override | ||
| public void visit(Event event, long value) { | ||
| public void visit(Event event, long samples, long value) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
h any exception here, e.g. when processing a single stack frame would cause the whole chunk to be thrown away if I understand correctly. Is that what we want? Can we only drop broken frames instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How detailed do you think we should do that? just drop the whole event? i.e. catch any exception inside the visit method and just keep going?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
opted to drop the event if an error occurs
.../src/main/java/io/sentry/asyncprofiler/convert/JfrAsyncProfilerToSentryProfileConverter.java
Outdated
Show resolved
Hide resolved
|
|
||
| File profileDir = new File(profilingTracesDirPath); | ||
|
|
||
| if (!profileDir.canWrite() || !profileDir.exists()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
l can we split this up to provide a message that more easily helps pinpoint the problem?
| private @NotNull String filename = ""; | ||
|
|
||
| private final @NotNull AsyncProfiler profiler; | ||
| private @Nullable AsyncProfiler profiler; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
l an alternative to making this nullable could have been to handle the exception in AsyncProfilerContinuousProfilerProvider and simply return a NoOpContinuousProfiler. This would keep the code here simpler.
...y-async-profiler/src/main/java/io/sentry/asyncprofiler/profiling/JavaContinuousProfiler.java
Outdated
Show resolved
Hide resolved
|
|
||
| @SuppressWarnings("ReferenceEquality") | ||
| @Override | ||
| public void startProfiler( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
m should we catch here just in case something goes wrong?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure where we would run into an exception that is not caught, but having a top-level safety net sounds like a good idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added exception handling in follow-up: #4746
| @@ -0,0 +1,4 @@ | |||
| # Vendored AsyncProfiler code for converting JFR Files | |||
| - Vendored-in from commit fe1bc66d4b6181413847f6bbe5c0db805f3e9194 of repository: [email protected]:async-profiler/async-profiler.git | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| - Vendored-in from commit fe1bc66d4b6181413847f6bbe5c0db805f3e9194 of repository: git@github.com:async-profiler/async-profiler.git | |
| - Vendored-in from commit https://github.com/async-profiler/async-profiler/tree/fe1bc66d4b6181413847f6bbe5c0db805f3e9194 |
| // No need to check exact decimal places as this depends on JFR precision | ||
| assertTrue( | ||
| timestamp < System.currentTimeMillis() / 1000.0 + 360, | ||
| "Timestamp should be recent", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused here. It can't be recent if we use a pre-existing JFR file.
Is the message wrong here?
Why do we need to add 360?
Is this intended for creating the file within the test and asserting it's at most 10s old? Then this test might fail on slow CI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you're right, this test doesn't make much sense. I'm reworking it
🚨 Detected changes in high risk code 🚨High-risk code has higher potential to break the SDK and may be hard to test. To prevent severe bugs, apply the rollout process for releasing such changes and be extra careful when changing and reviewing these files:
|
🚨 Detected changes in high risk code 🚨High-risk code has higher potential to break the SDK and may be hard to test. To prevent severe bugs, apply the rollout process for releasing such changes and be extra careful when changing and reviewing these files:
|
🚨 Detected changes in high risk code 🚨High-risk code has higher potential to break the SDK and may be hard to test. To prevent severe bugs, apply the rollout process for releasing such changes and be extra careful when changing and reviewing these files:
|
* add skipProfiling flag to TransactionOptions to be able to skip profiling and handle cases where profiling has been started by otel * add profilerId to spanContext so that otel span processor can propagate this to the exporter and SentryTracer * immediately end profiling when stopProfiler is called * bump api, fix android api 24 code * catch all exception happening when converting from jfr * simplify JavaContinuous profiler by catching AsyncProfiler instantiation exceptions in provider * add exists and writable info to log message * add method to safely delete file * remove setNative call * fix test * fix reference to commit we vendored from * drop event if it cannot be processed to not lose the whole chunk * Format code * fix test * Format code * fix test * catch exceptions in startProfiler/stopProfiler * fallback to threadId -1 if it cannot be resolved --------- Co-authored-by: Sentry Github Bot <[email protected]>
🚨 Detected changes in high risk code 🚨High-risk code has higher potential to break the SDK and may be hard to test. To prevent severe bugs, apply the rollout process for releasing such changes and be extra careful when changing and reviewing these files:
|
* delete unused JfrFrame and JfrToSentryProfileconverter * use passed-in profilingTracesHz parameter instead of hardcoded value * start profiler before starting the transaction when ProfileLifecycle.TRACE is used to have the profile ID when SentryTracer is created * use improved way to calculate timestamp of sample * api dump * let profile-lifecycle be set from external_options, add tests for SpringBoot autoconfig * initialize stackTraceFactory only once per chunk * rename profile data classes, add deserialization and tests * extract methods in ProfileConverter, fix SentryProfile serialization and make fields private * use wall=[interval] instead of setting the event to wall and setting the interval separately, this seems to work better and create more samples * start/stop profiler in OtelSentrySpanProcesser in trace mode for root spans * add profiler dependency to jakarta-opentelemetry sample, add needed configs * add dependenies and config to spring-boot-jakarta sample * remove connection status check * extract event visitor * Add enum for ProfileChunk platform * fallback to default temp directory for profiling on jvm if directory is not configured * cleanup some minor things * remove ProfilingInitializer, fix comments * Format code * add getter/setter to sample and metadata * fix compile error * add comment/todo for deleteOnExit * Profiling - Deduplication and cleanup (#4681) * add readme and info about commit of the source repository * delete jfr file on jvm exit * further split into smaller methods * deduplicate frames in order to save bandwidth, add converter tests * remove Platform Enum, use string constants instead for compatibility with cross platform frameworks * implement equals and hashcode for SentryStackFrame to make frame deduplication work * bump api * improve error handling, fix start stop start flow * add new testfile * calculate ticksPerNanosecond in constructor * adapt Ratelimiter to check for both ProfileChunk and ProfileChunkUi ratelimiting * update ratelimiter test to check for both profileChunk and profileChunkUi drops * use string constant instead of string * Format code * add non aggregating event collector to send each event individually, deduplicate stacks * adapt converter tests to new non-aggregated converter * Format code * add logging to loadProfileConverter * Format code * fix duplication of events * catch all exception happening when converting from jfr * add exists and writable info to log message * add method to safely delete file * remove setNative call * fix test * fix reference to commit we vendored from * drop event if it cannot be processed to not lose the whole chunk * make format * fix test * Format code * Profiling - OTEL profiling fix, Stabilization, Logging (#4746) * add skipProfiling flag to TransactionOptions to be able to skip profiling and handle cases where profiling has been started by otel * add profilerId to spanContext so that otel span processor can propagate this to the exporter and SentryTracer * immediately end profiling when stopProfiler is called * bump api, fix android api 24 code * catch all exception happening when converting from jfr * simplify JavaContinuous profiler by catching AsyncProfiler instantiation exceptions in provider * add exists and writable info to log message * add method to safely delete file * remove setNative call * fix test * fix reference to commit we vendored from * drop event if it cannot be processed to not lose the whole chunk * Format code * fix test * Format code * fix test * catch exceptions in startProfiler/stopProfiler * fallback to threadId -1 if it cannot be resolved --------- Co-authored-by: Sentry Github Bot <[email protected]> --------- Co-authored-by: Sentry Github Bot <[email protected]> --------- Co-authored-by: Sentry Github Bot <[email protected]>
* wip * re-add data category profile_chunk, fix json naming, new converter based on jfr converter bundled with asyncprofiler * read java thread ids from jfr and use those instead of os thread ids, use existing SentryStackFrame instead of JfrFrame, * adhere to sentry conventions re format null safety etc, fix compilation * add profile-session-sample-rate to external options * add platform as constructor param to ProfileChunk, wip: set java continuous profiler in sentry init * add doubleToBigDecimal in JfrSample and ProfileChunk, same as we do it in SentrySpan to work around scientific notation of double, use wall clock profiling * rename JfrProfile to SentryProfile * move java profiling into its own module, load using SPI * [WIP] use getProfilingTracesDirPath * add missing build.gradle.kts for profiler module * WIP continuous profiling in trace mode * cleanup unused classes from vendor, cleanup packages * allow setting profiling-traces-dir-path independently from cache dir using external options * use profileChunk.platform to decide how to deal with the chunk instead of file extension * port relevant AndroidContinuousProfilerTest tests to JavaContinuousProfilerTest * add service loader tests for profiler and profile converter * remove old jfr test files * Support ProfileLifecycle.TRACE (#4576) * delete unused JfrFrame and JfrToSentryProfileconverter * use passed-in profilingTracesHz parameter instead of hardcoded value * start profiler before starting the transaction when ProfileLifecycle.TRACE is used to have the profile ID when SentryTracer is created * use improved way to calculate timestamp of sample * api dump * let profile-lifecycle be set from external_options, add tests for SpringBoot autoconfig * initialize stackTraceFactory only once per chunk * rename profile data classes, add deserialization and tests * extract methods in ProfileConverter, fix SentryProfile serialization and make fields private * use wall=[interval] instead of setting the event to wall and setting the interval separately, this seems to work better and create more samples * start/stop profiler in OtelSentrySpanProcesser in trace mode for root spans * add profiler dependency to jakarta-opentelemetry sample, add needed configs * add dependenies and config to spring-boot-jakarta sample * remove connection status check * extract event visitor * Add enum for ProfileChunk platform * fallback to default temp directory for profiling on jvm if directory is not configured * cleanup some minor things * remove ProfilingInitializer, fix comments * Format code * add getter/setter to sample and metadata * fix compile error * add comment/todo for deleteOnExit * Profiling - Deduplication and cleanup (#4681) * add readme and info about commit of the source repository * delete jfr file on jvm exit * further split into smaller methods * deduplicate frames in order to save bandwidth, add converter tests * remove Platform Enum, use string constants instead for compatibility with cross platform frameworks * implement equals and hashcode for SentryStackFrame to make frame deduplication work * bump api * improve error handling, fix start stop start flow * add new testfile * calculate ticksPerNanosecond in constructor * adapt Ratelimiter to check for both ProfileChunk and ProfileChunkUi ratelimiting * update ratelimiter test to check for both profileChunk and profileChunkUi drops * use string constant instead of string * Format code * add non aggregating event collector to send each event individually, deduplicate stacks * adapt converter tests to new non-aggregated converter * Format code * add logging to loadProfileConverter * Format code * fix duplication of events * catch all exception happening when converting from jfr * add exists and writable info to log message * add method to safely delete file * remove setNative call * fix test * fix reference to commit we vendored from * drop event if it cannot be processed to not lose the whole chunk * make format * fix test * Format code * Profiling - OTEL profiling fix, Stabilization, Logging (#4746) * add skipProfiling flag to TransactionOptions to be able to skip profiling and handle cases where profiling has been started by otel * add profilerId to spanContext so that otel span processor can propagate this to the exporter and SentryTracer * immediately end profiling when stopProfiler is called * bump api, fix android api 24 code * catch all exception happening when converting from jfr * simplify JavaContinuous profiler by catching AsyncProfiler instantiation exceptions in provider * add exists and writable info to log message * add method to safely delete file * remove setNative call * fix test * fix reference to commit we vendored from * drop event if it cannot be processed to not lose the whole chunk * Format code * fix test * Format code * fix test * catch exceptions in startProfiler/stopProfiler * fallback to threadId -1 if it cannot be resolved --------- Co-authored-by: Sentry Github Bot <[email protected]> --------- Co-authored-by: Sentry Github Bot <[email protected]> --------- Co-authored-by: Sentry Github Bot <[email protected]> * change profiler api from Path to String to keep Android Api 21 compatibility * mark all profiling related classes as internal * Format code * mark Sentry profile classes internal * Format code * mark SentryThreadMetadata internal * add changelog entry * Format code * re-add asyncprofiler classes, cr changes * bump api * detect dangerous/invalid chars in profiling directory name * ignore vendored profile conversion code in codecov --------- Co-authored-by: Sentry Github Bot <[email protected]>
📜 Description
Follow-up of: #4576
Fixes comments from previous PRs
Fixes method timing sent to sentry by not aggregating JFR events
Fixes RateLimiter Handling for
profile_chunkandprofile_chunk_ui💡 Motivation and Context
💚 How did you test it?
📝 Checklist
sendDefaultPIIis enabled.🔮 Next steps