Skip to content

Conversation

@qinfchen
Copy link
Contributor

@qinfchen qinfchen commented Sep 6, 2018

Before this PR

People might start adopting our shiny new palantir/tracing-java libraries, while keeping remoting3 ones on the classpath. This would be very bad because there would be two ThreadLocals trying to keep track of the current trace:

  • com.palantir.remoting3.tracing.Tracer <-- one ThreadLocal in here
  • com.palantir.tracing.Tracer <-- another ThreadLocal in here!

Concretely, all the clients in Atlas would use the remoting3 tracers but newer user-code would likely use tracing-java. This would make tracing information far less useful for debugging.

After this PR

com.palantir.remoting3.tracing.Tracer no longer maintains its own ThreadLocal, instead it delegates everything to com.palantir.tracing.Tracer.

Caveats

  • 🚨The return type of Tracer#subscribe(String, SpanObserver) behaves differently now (it used to preserve reference equality and it no longer does).
  • 🚨The Tracer#getAndClearTrace returns an incomplete copy of the trace. I think this is OK because that method was never usable externally because the returned type was not public.
  • Should there be a way of detecting this delegating functionality is on the classpath?

When migrating tritium to the rebranded tracing, @j-baker and @schlosna pointed out that that tracers are threadlocal and could potentially break tracing if remoting-tracers and conjure tracers were not on the same thread.
One proposal suggested by @iamdanfox was to delegate remotings tracing calls to conjure tracers. This pr is to demonstrate how the implementation would look like.
See more details here: palantir/tritium#100.

@qinfchen qinfchen requested a review from a team as a code owner September 6, 2018 00:42
}

@Override
public void doConsume(Span span) {
Copy link
Contributor

@iamdanfox iamdanfox Sep 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was an implementation detail of the abstract AsyncSpanObserver class (which we no longer extend), so I think it's ok to delete.

@iamdanfox iamdanfox force-pushed the qchen/delegateTracing branch from 9cad433 to b009613 Compare September 6, 2018 14:28
@JsonSerialize(as = ImmutableZipkinCompatSpan.class)
@Value.Immutable
@Value.Style(visibility = Value.Style.ImplementationVisibility.PACKAGE)
abstract static class ZipkinCompatSpan {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These classes have been completely deleted because they're an implementation detail that now exists in the new tracing-java class: com.palantir.tracing.AsyncSlf4jSpanObserver

private Tracer() {}

// Thread-safe since thread-local
private static final ThreadLocal<Trace> currentTrace = ThreadLocal.withInitial(() -> {
Copy link
Contributor

@iamdanfox iamdanfox Sep 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the critical line! remoting3 no longer maintains any ThreadLocal state, instead it's all delegated to the shiny new tracing-java.

@iamdanfox iamdanfox changed the title [WIP]delegate remoting tracing calls to java tracing No more ThreadLocals, delegate everything to 'palantir/tracing-java' Sep 6, 2018
@iamdanfox iamdanfox force-pushed the qchen/delegateTracing branch from b009613 to bb252e2 Compare September 6, 2018 14:36
currentTrace.remove();
MDC.remove(Tracers.TRACE_ID_KEY);
return trace;
com.palantir.tracing.Trace trace = com.palantir.tracing.Tracer.getAndClearTrace();
Copy link
Contributor Author

@qinfchen qinfchen Sep 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iamdanfox, is this possible that we do something like this instead of having the ExposedTrace that lives inside the tracing-java package?

-        com.palantir.tracing.Trace trace = com.palantir.tracing.Tracer.getAndClearTrace();
-        return new Trace(ExposedTrace.isObservable(trace), ExposedTrace.getTraceId(trace));
+        boolean isObservable = com.palantir.tracing.Tracer.isTraceObservable();
+        String traceId = com.palantir.tracing.Tracer.getTraceId();
+        com.palantir.tracing.Tracer.getAndClearTrace();
+        return new Trace(isObservable, traceId);

Copy link
Contributor

@iamdanfox iamdanfox Sep 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's a bit of a race condition, where we might get the isTraceObservable info, then someone changes stuff beneath our feet, then we get the traceId, then finally we clear it...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, that is not gonna work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In every case in this impl we should be delegating all state :)

@iamdanfox iamdanfox force-pushed the qchen/delegateTracing branch from f403597 to badb51c Compare September 6, 2018 14:46
*/
public static CloseableTracer startSpan(String operation) {
Tracer.startSpan(operation);
com.palantir.tracing.Tracer.startSpan(operation);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we just be like com.palantir.tracing.CloseableTracer.startSpan?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i kind of prefer using com.palantir.tracing.Tracer.startSpan(operation); to match up against com.palantir.tracing.Tracer.fastCompleteSpan(); few lines below since we can't get the instance of the com.palantir.tracing.CloseableTracer.

/** Utility methods for making {@link ExecutorService} and {@link Runnable} instances tracing-aware. */
public final class Tracers {
/** The key under which trace ids are inserted into SLF4J {@link org.slf4j.MDC MDCs}. */
public static final String TRACE_ID_KEY = "traceId";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should delete public static final fields like this, because if some consumer is depending on com.palantir.remoting3.tracingTracers.TRACE_ID_KEY then they'd suddenly break?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it doesn't seem like anyone is using it, https://github.com/search?p=2&q=Tracers.TRACE_ID_KEY&type=Code
i also did a quick search on the key internally and no internal consumer is using it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't think we should remove things like this when there isn't a clear motivation - what if there was a user that didn't appear in your search? Did you check OSS consumers too (e.g. atlasdb?)

* not thread-safe and is intended to be used in a thread-local context.
*/
final class Trace {
public final class Trace {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this need to become public?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this being public, one of the test utils fails to compile.

screen shot 2018-09-06 at 17 26 13

I find an alternative that involved some objects and casting too!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is also kind of odd that we have a public static method that returns Trace and Trace is private.
https://github.com/palantir/http-remoting/pull/799/files#diff-7ee517b85058e51de91b9bf446ff7585L246

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@qinfchen I don't have the original context, but one possible justification could be that this Trace class is useful for testing but shouldn't be used by consumers??

@qinfchen qinfchen requested a review from schlosna September 6, 2018 16:38
@iamdanfox iamdanfox force-pushed the qchen/delegateTracing branch from 1968b2f to adf0f74 Compare September 7, 2018 12:15
@iamdanfox iamdanfox merged commit 5755d6f into develop Sep 7, 2018
@iamdanfox iamdanfox deleted the qchen/delegateTracing branch September 7, 2018 12:20
dansanduleac pushed a commit that referenced this pull request Sep 13, 2018
* Expose HostMetricsRegistry record methods (#780)

This is so that we can separately implement a HostMetricsSink for it (see #779) such that we can share host metrics when constructing both conjure and remoting3 clients

* publish BOM for jaxrs-client (#789)

* Excavator: Render CircleCI file using template specified in .circleci/template.sh (#791)

* Upgrade OkHttp to 3.11.0. (#794)

* AssertionErrors are converted into service exceptions with type internal (#727)

* No more ThreadLocals, delegate everything to 'palantir/tracing-java' (#799)

* Use BINTRAY_*_REMOTING envs (#802)

The project's default bintray creds are currently set up to publish to `conjure-java-runtime`.
Use these custom env vars to maintain ability to publish http-remoting.

* Better behaviour in the presence of 429s (#786)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants