Conversation

@jshearer (Contributor) commented Aug 19, 2025

Description:

There was an event early this morning that resulted in a lot of connection drops. After investigation, I believe it was caused by waiting indefinitely on requests to /authorize/task and /authorize/dekaf. I was not able to find any logs in agent-api indicating long requests that eventually succeeded, but other evidence includes:

  • Dekaf logs complaining about expired tokens: verifying Authorization: token has invalid claims: token is expired. This means that we're attempting to use a journal client with an expired token.

  • Dekaf logs from Read::next_batch() saying second time provided was later than self
    This is coming from SystemTime::duration_since. The only way I could imagine this happening is if Read::new_stream(), which fetches the latest TaskState from its TaskStateListener, were to get an expired token (a short illustration follows this list). The assumption is that TaskStateListener::get() will either return a valid, non-expired token, or an error.

    If the TaskManager's loop were to get stuck waiting on one of the network requests it makes, this assumption would be broken and we would see the behavior that we saw.
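
For context, the error text comes straight from the standard library: SystemTime::duration_since fails when the time you pass in is later than the time you call it on. A minimal, self-contained illustration (the one-minute-old exp is made up for the example):

```rust
use std::time::{Duration, SystemTime};

fn main() {
    // Pretend a token's `exp` claim resolved to a point one minute in the past.
    let exp = SystemTime::now() - Duration::from_secs(60);

    // `exp.duration_since(now)` errors because `now` is later than `exp`; the
    // error's Display text is exactly "second time provided was later than self".
    match exp.duration_since(SystemTime::now()) {
        Ok(remaining) => println!("token valid for another {remaining:?}"),
        Err(err) => println!("{err}"),
    }
}
```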

So, to fix the problem:

  • I updated TaskManager to:
    • Proactively refresh its tokens earlier, starting 10 minutes before their expiration timestamp.
    • Add timeouts to the network calls and make it resilient in the face of those timeouts: while the token is still valid, keep retrying and keep returning the cached token if refresh requests time out (a sketch of this behavior follows this list).
  • I updated Read to error if, even after attempting to refresh its token, the provided expiration is still in the past.
  • Added TASK_REQUEST_TIMEOUT / --task-request-timeout to make the TaskManager's request timeout configurable. I defaulted it to 20s, which seems long, but agent-api has a backoff mechanism with multi-second delays that are covered by this timeout, and I wanted to avoid false-positive timeouts.
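
A minimal sketch of the refresh loop described above, assuming tokio and a watch channel for publishing the refreshed auth; the constants mirror the values mentioned in these bullets, and every other name (CachedAuth, refresh_loop, fetch) is hypothetical:

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Hypothetical cached authorization, standing in for the real task auth state.
#[derive(Clone)]
struct CachedAuth {
    token: String,
    exp: u64, // unix seconds
}

fn now_unix() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("clock before unix epoch")
        .as_secs()
}

/// Start refreshing this long before the token actually expires.
const REFRESH_MARGIN: Duration = Duration::from_secs(10 * 60);
/// Stand-in for TASK_REQUEST_TIMEOUT / --task-request-timeout (default 20s).
const TASK_REQUEST_TIMEOUT: Duration = Duration::from_secs(20);

async fn refresh_loop<F, Fut>(
    mut cached: CachedAuth,
    fetch: F,
    publish: tokio::sync::watch::Sender<CachedAuth>,
) -> anyhow::Result<()>
where
    F: Fn() -> Fut,
    Fut: std::future::Future<Output = anyhow::Result<CachedAuth>>,
{
    loop {
        // Sleep until REFRESH_MARGIN before the cached token expires.
        let refresh_at = cached.exp.saturating_sub(REFRESH_MARGIN.as_secs());
        tokio::time::sleep(Duration::from_secs(refresh_at.saturating_sub(now_unix()))).await;

        // Bound every refresh attempt with the request timeout.
        match tokio::time::timeout(TASK_REQUEST_TIMEOUT, fetch()).await {
            Ok(Ok(fresh)) => {
                cached = fresh.clone();
                let _ = publish.send(fresh);
            }
            // Timed out or errored, but the cached token is still valid: keep
            // serving the cached token and retry again shortly.
            Ok(Err(_)) | Err(_) if cached.exp > now_unix() => {
                tokio::time::sleep(Duration::from_secs(5)).await;
            }
            // The token has expired and we still can't refresh it: surface an error.
            Ok(Err(err)) => return Err(err),
            Err(_) => anyhow::bail!("token expired and refresh request timed out"),
        }
    }
}
```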


@jshearer force-pushed the dekaf/improve_timeouts branch 5 times, most recently from 29c58cf to 709ecc5 on August 20, 2025 19:21
let current_value = temp_rx.borrow_and_update();
if let Some(ref result) = *current_value {
    return result.clone().map_err(anyhow::Error::from);
match &*current_value {

jshearer (Contributor Author) commented:

We used to happily hand out expired tokens here, just assuming that the TaskManager loop could keep up. That's the real cause of the problem.
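
A minimal sketch of the stricter behavior, assuming a tokio watch channel; only borrow_and_update and the match appear in the real diff, and TaskState / is_expired here are hypothetical stand-ins:

```rust
use std::time::{SystemTime, UNIX_EPOCH};
use tokio::sync::watch;

/// Hypothetical stand-in for the real task state carried on the watch channel.
#[derive(Clone)]
struct TaskState {
    token: String,
    exp: u64, // unix seconds
}

impl TaskState {
    fn is_expired(&self) -> bool {
        let now = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .expect("clock before unix epoch")
            .as_secs();
        self.exp <= now
    }
}

/// Instead of returning whatever happens to be cached, only return a value once
/// it is both present and unexpired; otherwise wait for the TaskManager to
/// publish a newer value (or an error).
async fn get(
    rx: &mut watch::Receiver<Option<anyhow::Result<TaskState>>>,
) -> anyhow::Result<TaskState> {
    loop {
        {
            let current = rx.borrow_and_update();
            match &*current {
                Some(Ok(state)) if !state.is_expired() => return Ok(state.clone()),
                Some(Err(err)) => anyhow::bail!("task state error: {err:#}"),
                // Not fetched yet, or expired: fall through and wait for an update.
                _ => {}
            }
        }
        rx.changed().await?;
    }
}
```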

fn exp(&self) -> u64 {
    match self {
        DekafTaskAuth::Redirect { fetched_at, .. } => {
            // Redirects are valid for 10 minutes

@jshearer (Contributor Author) commented Aug 20, 2025:

I just made this up. Since redirects don't get a token, they don't get an agent-api-specified expiration. 10 minutes seems more than fine, as the only time this would matter is if a task were migrated such that it is no longer a redirect.
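
A self-contained sketch of the synthesized expiration being described; the enum and its second variant are assumptions, and only the Redirect variant's fetched_at field appears in the excerpt above:

```rust
/// Hypothetical mirror of the enum in the excerpt above; only the Redirect
/// variant's `fetched_at` field appears in the real diff, the rest is assumed.
enum DekafTaskAuth {
    Redirect { fetched_at: u64 },
    Token { exp: u64 },
}

/// Redirect responses carry no agent-api token, so synthesize a 10-minute TTL.
const REDIRECT_TTL_SECS: u64 = 10 * 60;

impl DekafTaskAuth {
    fn exp(&self) -> u64 {
        match self {
            // "Expire" a redirect 10 minutes after it was fetched.
            DekafTaskAuth::Redirect { fetched_at } => fetched_at + REDIRECT_TTL_SECS,
            // Token-bearing auth uses the token's own `exp` claim.
            DekafTaskAuth::Token { exp } => *exp,
        }
    }
}
```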

Member commented:

🤔 Does this mean that we'll now start to expire Redirects, where previously we haven't? Just wanting to double check whether that poses any risk for existing tasks that may not have encountered that before.

jshearer (Contributor Author) replied:

We used to re-fetch every 30s no matter what. Now we only fetch when the (made-up) expiration is coming up. I don't think this substantively changes the behavior, just the amount of time a redirect response can be cached for.

@jshearer (Contributor Author) commented Aug 20, 2025

This has been running on dekaf-dev successfully for a few hours now and I've done a couple of passes over it myself, so I'm marking this ready for review.

One thing I want to do before merging is to test the behavior during a migration, since a migration has a similar effect: tokens can't be refreshed for a period of time.

@jshearer jshearer marked this pull request as ready for review August 20, 2025 20:22
@jshearer jshearer self-assigned this Aug 20, 2025
@psFried (Member) left a comment:

Couple of questions on this, but nothing major.

@jshearer force-pushed the dekaf/improve_timeouts branch 2 times, most recently from 0c500bd to 4811675 on August 29, 2025 15:30
@jshearer force-pushed the dekaf/improve_timeouts branch from 4811675 to 4c5da99 on August 29, 2025 15:38
@jshearer jshearer requested a review from psFried August 29, 2025 16:10
@psFried (Member) left a comment:

LGTM

@jshearer jshearer merged commit ce83b4d into master Sep 2, 2025
11 checks passed
jshearer added a commit that referenced this pull request Sep 10, 2025
Since #2348 changed `TaskManager`'s behavior to only call `/authorize/dekaf` ~5 minutes before token expiration, we would only fetch updated task specs that often. This is a bug; we need to refresh task specs much more frequently than that.

So instead, let's set a fairly short upper bound on how stale materialization specs can be and start trying to refresh after that passes.
jshearer added a commit that referenced this pull request Sep 12, 2025
Since #2348 changed `TaskManager`'s behavior to only call `/authorize/dekaf` ~5 minutes before token expiration, we would only fetch updated task specs that often. This is a bug; we need to refresh task specs much more frequently than that.

So instead, let's set a fairly short upper bound on how stale materialization specs can be and start trying to refresh after that passes.
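
A minimal sketch of the scheduling idea in this follow-up: the next refresh is driven by whichever comes first, the token's expiration (minus a margin) or the spec-staleness bound. The constants and names are illustrative, not the real values:

```rust
use std::time::Duration;

/// Illustrative values only: refresh the token this long before it expires, and
/// never let a cached materialization spec get older than this.
const TOKEN_REFRESH_MARGIN: Duration = Duration::from_secs(5 * 60);
const MAX_SPEC_STALENESS: Duration = Duration::from_secs(30);

/// Unix timestamp (seconds) at which the next refresh attempt should start.
fn next_refresh_at(token_exp: u64, spec_fetched_at: u64) -> u64 {
    let token_deadline = token_exp.saturating_sub(TOKEN_REFRESH_MARGIN.as_secs());
    let spec_deadline = spec_fetched_at + MAX_SPEC_STALENESS.as_secs();
    token_deadline.min(spec_deadline)
}
```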