exploratory: verify session-scoped logs and traces carry SessionId attribute through OpenTelemetry #924

@Aaronontheweb

Background

Now that session-scoped diagnostics route through SessionLogDispatcher and
land in the per-session session.log (#918), we have a strong in-process
contract: every log line emitted under a populated SessionDiagnosticsContext
carries a session id and ends up filed by session.

For deployments that ship traces and logs to an OpenTelemetry collector
(rather than relying on local files), we want the same property: every
session-scoped event arriving at the collector should carry a session.id
(or equivalent) attribute that operators can pivot, filter, and alert on.

This is exploratory because it is not yet clear which layers consistently
propagate the attribute today.

Why this matters

  • Alerting: "ERROR in session X" should be a fan-in operator alert that
    doesn't require pulling local files off the daemon. Today the OTel side
    doesn't have a guaranteed session.id tag on diagnostic logs from the
    provider plugins, so we cannot build that alert reliably.
  • Reporting: test-lab and customer deployments need rollups by session —
    failure rate per session, slowest session, longest tool call within a
    session. These all key on a structured attribute.
  • Cross-cutting: the same property would let us correlate traces from
    the LLM call, the tool execution, and the channel ingress for a single
    Slack thread.

Scope

Verify (or document gaps in) the following propagation paths under
OpenTelemetry export:

  1. SessionDiagnosticsContext.Push(sessionId) is set at the LLM call
    boundary (per the analyzer from #915, "analyzer: flag session-owned
    chat client calls outside session diagnostics context"). Logs written
    via MEL ILogger<T> inside that scope: do they currently carry
    session.id in their exported attributes? If not, what is the correct
    shape: a structured logging scope, an Activity baggage entry, or
    both?

  2. Activities started by the daemon (ActivitySource instrumentation in
    Netclaw.Actors.Telemetry, SessionTelemetry, and channel-side
    instrumentation): do they consistently tag session.id and equivalent
    contextual fields (channel.type, model.id, provider.name)?

  3. Channel ingress (Slack, Discord, SignalR, CLI): when a turn starts
    from a channel, is the session.id attached to the originating
    Activity such that downstream child activities inherit it via
    Activity.Current?

  4. Sidecar paths (compaction, title generation, sub-agents, memory
    distillation): these bypass SessionDiagnosticsContext today (see #920,
    "session log: wrap remaining sidecar IChatClient call sites in
    SessionDiagnosticsContext.Push"). When emitted through OTel, do they
    carry the parent session.id? Likely not until #920 lands.
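
One detail worth pinning down for paths 2 and 3: in System.Diagnostics, tags set on a parent Activity are not copied to child activities, whereas baggage is visible to children through the parent chain. The sketch below uses only the BCL to show the difference. The source name "Sketch.Session", the activity names, and the session id value are hypothetical, and SessionDiagnosticsContext is deliberately not modelled here.

```csharp
using System;
using System.Diagnostics;

// A listener is required, otherwise StartActivity returns null.
using var listener = new ActivityListener
{
    ShouldListenTo = s => s.Name == "Sketch.Session",
    Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
        ActivitySamplingResult.AllData,
};
ActivitySource.AddActivityListener(listener);

// Hypothetical source name, standing in for the daemon's instrumentation.
using var source = new ActivitySource("Sketch.Session");

// Channel ingress: attach the session id both as a tag and as baggage.
using var ingress = source.StartActivity("channel.ingress");
ingress?.SetTag("session.id", "session-123");
ingress?.AddBaggage("session.id", "session-123");

// Downstream activity: parented to Activity.Current (the ingress activity).
using var child = source.StartActivity("llm.call");

// Tags are per-activity, so the child does not see the parent's tag.
Console.WriteLine(child?.GetTagItem("session.id") ?? "(none)");     // prints "(none)"

// Baggage lookups walk the parent chain, so the child sees the value.
Console.WriteLine(child?.GetBaggageItem("session.id") ?? "(none)"); // prints "session-123"
```

If this holds for our instrumentation, tagging only the ingress Activity is not enough for child spans to carry session.id at the collector. Each activity would need the tag set explicitly, or a processor would have to copy it from baggage, since baggage is generally not exported as span attributes on its own.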

Deliverables

  • Inventory of every place we emit logs or activities that should be
    session-scoped, with current state of session.id tagging.
  • Identified gaps with proposed fixes.
  • A small set of operator-facing alert recipes that use the attribute
    (e.g., "session error rate above threshold", "tool calls within a
    session exceeded budget").
  • Documentation of the standard attribute name(s) used (likely
    session.id per OTel semconv-leaning naming, plus any Netclaw-specific
    attributes such as netclaw.session.channel, netclaw.session.model).
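
As a starting point for the attribute documentation, the names could live in one constants class and be attached to logs through a MEL logging scope. This is a minimal sketch, not the current implementation: the SessionAttributes class and the sample values are hypothetical, and it assumes the OpenTelemetry and OpenTelemetry.Exporter.Console packages with IncludeScopes enabled so that scope key/value pairs are exported alongside each log record.

```csharp
using System.Collections.Generic;
using Microsoft.Extensions.Logging;
using OpenTelemetry.Logs;

using var loggerFactory = LoggerFactory.Create(builder =>
{
    builder.AddOpenTelemetry(options =>
    {
        // Without this, scope values are not included in exported log records.
        options.IncludeScopes = true;
        options.AddConsoleExporter();
    });
});

var logger = loggerFactory.CreateLogger("Sketch");

// Everything logged inside the scope should carry the session attributes.
using (logger.BeginScope(new Dictionary<string, object>
{
    [SessionAttributes.SessionId] = "session-123", // hypothetical value
    [SessionAttributes.Channel] = "slack",         // hypothetical value
}))
{
    logger.LogError("Tool call failed");
}

// Hypothetical constants class; names follow the proposal in this issue.
public static class SessionAttributes
{
    public const string SessionId = "session.id";
    public const string Channel = "netclaw.session.channel";
    public const string Model = "netclaw.session.model";
}
```

Keeping the names in one place would let the inventory check each emit site against the same constants that end up in the docs.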

Acceptance criteria

  • A short report enumerating the propagation gaps.
  • Either a follow-up issue per gap, or a single bundled fix PR if the
    gaps are small enough.
  • The standard attribute names land in docs/spec/configuration.md's
    telemetry section so operators can rely on them.
