
Conversation

@iamdanfox (Contributor)

Before this PR

In PDS-117063, a user of our internal atlas-replacement switched from c-j-r (conjure-java-runtime) to dialogue and saw server errors.

Looking at the pinuntilerror.nextNode metric, it seemed we switched channels 5 times during a supposedly transactional workflow. This meant that some requests landed on one node and others landed on a different node, which caused the second node to return a hard error.

cc @LucasIME and @jkozlowski

After this PR

==COMMIT_MSG==
PinUntilErrorChannel doesn't switch on 429, to unblock transactional workflows
==COMMIT_MSG==
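
Concretely, a 429 (QoS throttle) no longer counts as an "error" for pinning purposes; only hard server errors move the pin to the next node. A minimal sketch of the changed predicate, assuming a hypothetical shouldSwitchNode helper rather than dialogue's actual internals:

    // Sketch only, not dialogue's real code: a 429 means the pinned node is
    // merely shedding load, so we stay pinned and let the retry layer back
    // off; only 5xx responses advance the pin to the next node.
    private static boolean shouldSwitchNode(int statusCode) {
        if (statusCode == 429) {
            return false; // too many requests: keep the pin, retry same node
        }
        return statusCode / 100 == 5; // server error: rotate to the next node
    }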

Possible downsides?

@changelog-app

changelog-app bot commented Apr 17, 2020

Generate changelog in changelog/@unreleased

Type

  • Feature
  • Improvement
  • Fix
  • Break
  • Deprecation
  • Manual task
  • Migration

Description

PinUntilErrorChannel doesn't switch on 429, to unblock transactional workflows

Check the box to generate changelog(s)

  • Generate changelog entry

@policy-bot policy-bot bot requested a review from fawind April 17, 2020 11:40
@iamdanfox iamdanfox requested review from carterkozak and ferozco and removed request for fawind April 17, 2020 11:40
Simulation report excerpt:

live_reloading[UNLIMITED_ROUND_ROBIN].txt: success=60.2% client_mean=PT2.84698S server_cpu=PT1H58M37.45S client_received=2500/2500 server_resps=2500 codes={200=1504, 500=996}
one_big_spike[CONCURRENCY_LIMITER_BLACKLIST_ROUND_ROBIN].txt: success=79.0% client_mean=PT1.478050977S server_cpu=PT1M59.71393673S client_received=1000/1000 server_resps=790 codes={200=790, Failed to make a request=210}
one_big_spike[CONCURRENCY_LIMITER_PIN_UNTIL_ERROR].txt: success=100.0% client_mean=PT1.286733552S server_cpu=PT2M48.75S client_received=1000/1000 server_resps=1125 codes={200=1000}
one_big_spike[CONCURRENCY_LIMITER_PIN_UNTIL_ERROR].txt: success=100.0% client_mean=PT1.135007332S server_cpu=PT2M49.65S client_received=1000/1000 server_resps=1131 codes={200=1000}

Contributor


We may want to update this simulation to respond 429 instead of 503

@iamdanfox (Contributor Author) commented Apr 17, 2020


It's intended to be representative of this exact workflow, so it responds 429 above some threshold:

public void one_big_spike() {
    int capacity = 100;
    servers = servers(
            SimulationServer.builder()
                    .serverName("node1")
                    .simulation(simulation)
                    // 200s until `capacity` concurrent requests, 429s beyond
                    .handler(h -> h.respond200UntilCapacity(429, capacity).responseTime(Duration.ofMillis(150)))
                    .build(),
            SimulationServer.builder()
                    .serverName("node2")
                    .simulation(simulation)
                    .handler(h -> h.respond200UntilCapacity(429, capacity).responseTime(Duration.ofMillis(150)))
                    .build());
    // rest of the test elided in this excerpt
}

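With respond200UntilCapacity(429, capacity), each node serves 200s until it holds capacity concurrent requests and answers 429 beyond that, which mirrors the production incident. If I read the one_big_spike results above correctly, CONCURRENCY_LIMITER_PIN_UNTIL_ERROR now rides out the spike at 100% success instead of bouncing between nodes on the first 429.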
@bulldozer-bot bulldozer-bot bot merged commit e6ec9b2 into develop Apr 17, 2020
@bulldozer-bot bulldozer-bot bot deleted the dfox/pin-until-error-fix branch April 17, 2020 11:57
@svc-autorelease (Collaborator)

Released 1.23.1

