Timeout in get user with strong option makes subsequent weak get fail [JIRA: RCS-250] #1201

@shino

Symptom: a single slow node may cause the user fetch to fail.

  • Getting a CS user takes two steps. The first step uses PR=all,
    so a single slow Riak node can cause a timeout error at the client.
  • When a timeout occurs in riakc_pb_socket, it disconnects the TCP connection
    and goes into a wait-and-retry loop.
  • The second step of the CS user get then runs with the weak option, but the
    reconnect has likely not happened yet, so it fails with a disconnected error.

If the "slow" node is completely frozen (nothing comes out of it at all),
then after the health check timeout the strong get fails with
"insufficient vnodes" and the weak get should work. In this case, certain
users cannot access Riak CS for a finite period, 60 sec by default.


Reproduction (or simulation)

  • Create a 4-node cluster ({get_user_timeout, 3000} in advanced.config may help)

  • Note the dev2 pid

    DEV2=`ps aux | grep riak_ee | grep dev2 | grep beam.smp | awk '{print $2;}'`
    
  • Freeze it: kill -s SIGSTOP $DEV2 (keep your fingers crossed; if you are unlucky, freeze another node 🙉)

  • Do any access,
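
The freeze step above can be wrapped in small helpers, so the node is easy to resume once the test is done. This is only a sketch: the function names (freeze_node, resume_node, proc_state) are hypothetical and not part of Riak or Riak CS, and it assumes a ps that supports `-o stat=` (procps or BSD style).

```shell
#!/bin/sh
# Hypothetical helpers for the reproduction; not part of Riak or Riak CS.

# Print "frozen" if the process is stopped (ps state starts with T),
# "gone" if no such process exists, and "running" otherwise.
proc_state() {
    state=$(ps -o stat= -p "$1" 2>/dev/null | tr -d ' ')
    case "$state" in
        "") echo gone ;;
        T*) echo frozen ;;
        *)  echo running ;;
    esac
}

# Freeze a process. "STOP" without the SIG prefix is the POSIX spelling
# and is accepted by shells whose builtin kill rejects "SIGSTOP".
freeze_node() {
    kill -s STOP "$1"
}

# Resume a process that was frozen with freeze_node.
resume_node() {
    kill -s CONT "$1"
}
```

Usage would be along the lines of `freeze_node "$DEV2"`, then issuing requests against Riak CS, then `resume_node "$DEV2"` to bring the node back. `proc_state "$DEV2"` helps confirm the node is really stopped before judging the result.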
