Symptom: a single slow node may cause user fetch failure.
- Fetching a CS user takes two steps. The first step uses PR=all,
so a single slow Riak node can cause a timeout error at the client.
- When a timeout occurs in riakc_pb_socket, it disconnects the TCP connection
and goes into a wait-and-retry loop.
- The second phase of the CS user get then runs with the weak option, but the
reconnect has likely not happened yet, so it fails with a disconnected error.

If the "slow" node is completely frozen (it produces no response at all),
then after the health check timeout the strong get fails with "insufficient vnodes"
and the weak get should work well. In this case, certain users cannot
access Riak CS for a finite period, 60 sec by default.
Reproduction (or simulation)
- Create a 4-node cluster ({get_user_timeout, 3000} in advanced.config may help)
- Note the dev2 pid:
DEV2=`ps aux | grep riak_ee | grep dev2 | grep beam.smp | awk '{print $2;}'`
- Freeze it: kill -s SIGSTOP $DEV2 (keep your fingers crossed; if you are unlucky, freeze another node 🙉)
- Do any access,
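
The freeze step above leaves the node process stopped until it is explicitly resumed. Below is a minimal sketch of helpers for freezing, inspecting, and resuming a process, assuming a POSIX shell and a `ps` that supports `-o stat=`. The function names `freeze_node`, `resume_node`, and `node_state` are hypothetical and not part of Riak or Riak CS. SIGCONT is the standard counterpart of SIGSTOP.

```shell
#!/bin/sh
# Hypothetical helpers for the reproduction steps; not part of Riak CS.

# Freeze a process: it stops running but keeps its TCP sockets open,
# which simulates a node that never answers.
freeze_node() {
    kill -s STOP "$1"
}

# Resume a previously frozen process.
resume_node() {
    kill -s CONT "$1"
}

# Print the process state as reported by ps.
# The state starts with "T" while the process is stopped.
node_state() {
    ps -o stat= -p "$1" | tr -d ' '
}
```

With the pid captured in `$DEV2` as in the steps above, `freeze_node "$DEV2"` freezes the node, `node_state "$DEV2"` confirms that it is stopped, and `resume_node "$DEV2"` brings it back once the test is done.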