
Conversation

@jeromemaloberti (Contributor)
In HA, during the recovery process for failed hosts, there were data races due to conflicts between the liveness information of a returning host and its live status as determined by the xha daemon.
These commits fix all the problems found when Xapi_hooks.host_post_declare_dead, which is called for every dead host, took a few minutes to complete, leaving a large window for the dead hosts to come back into the pool.

@mcclurmc (Contributor) commented Aug 7, 2013

Don't pull this yet; @jeromemaloberti will be updating the documentation on the commit messages.

@thomassa (Contributor) commented Sep 3, 2013

I've made a start on reviewing this. The commit-comments all look sensible and so far I've read through the code change of the first commit, which looks good to me.

Three to go...

@johnelse (Contributor) commented Sep 6, 2013

For bb686f3c783e5f790a05ab8ba9489ac9c41803bf the code change looks OK, but is the commit message correct?

plug_unplugged_pbds is called as part of both a slave and a master's startup sequence.

@johnelse (Contributor) commented Sep 6, 2013

Otherwise, this all looks good to me.

@johnelse (Contributor) commented Sep 6, 2013

Confirmed with @jonludlam that the aim of bb686f3c783e5f790a05ab8ba9489ac9c41803bf is to avoid PBD.plug passing through the master, and not to make it master-only. Apart from that commit message I'm happy for this to be merged.

Jerome Maloberti added 4 commits September 10, 2013 16:31
When a slave host that rebooted returns to a pool, the PBD.plug
will fail if the slave is not set as alive. It is in any case an
operation that is performed on the master.

Signed-off-by: Jerome Maloberti <[email protected]>
In HA the host's live value can differ from the liveset determined
by xha, allowing a user to start a VM on a host that has just come
back into a pool but whose recovery process has not yet finished.
This commit fixes the problem by forbidding the live value from
being changed outside of the HA recovery process; the HA liveset
becomes the only source of liveness state for a host in HA.

Signed-off-by: Jerome Maloberti <[email protected]>
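The rule described in the commit above can be illustrated with a minimal Python sketch (xapi itself is OCaml; all names here, including `set_host_live` and its parameters, are hypothetical, not the actual xapi API):

```python
# Hypothetical guard illustrating the rule: with HA enabled, only the
# HA recovery path is allowed to change a host's live flag.
def set_host_live(host, value, ha_enabled, in_ha_recovery):
    if ha_enabled and not in_ha_recovery:
        # Any other caller is rejected; the xha liveset is authoritative.
        raise PermissionError("Host.live may only be changed by HA recovery")
    host["live"] = value
```

With this guard in place, a stray update while the recovery process is still running raises an error instead of silently marking the host live.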
Previously, after a host failure in HA, the function
Xapi_ha_vm_failover.compute_restart_plan, which chooses which hosts
should restart the failed VMs, would pick up hosts that are live and
enabled. In some cases, a host may be live but not in the HA
live_set, for example if it returned to the pool before the HA
recovery process finished.
This situation is bad, since the live host would have new VMs
started on it, while it may later be marked as dead once the
HA recovery process finishes.
This commit adds the live_set parameter to compute_restart_plan
and, by transitivity, to all functions that need it.
Some functions are called during HA recovery, where the
live_set is available, or at startup, in which case the live_set
is created from all live and enabled hosts.

Signed-off-by: Jerome Maloberti <[email protected]>
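The host-selection change above can be sketched in Python (the real code is OCaml; the function bodies and the round-robin placement here are simplified, hypothetical stand-ins, only the filtering criterion reflects the commit):

```python
# Sketch: only hosts that are both enabled and in the HA live_set may
# receive restarted VMs; Host.live alone is no longer trusted.
def candidate_hosts(hosts, live_set):
    return [h for h in hosts if h["enabled"] and h["uuid"] in live_set]

def compute_restart_plan(hosts, failed_vms, live_set):
    eligible = candidate_hosts(hosts, live_set)
    if not eligible:
        return {}  # no host can accept the failed VMs
    # Placeholder placement policy: spread VMs round-robin over hosts.
    return {vm: eligible[i % len(eligible)]["uuid"]
            for i, vm in enumerate(failed_vms)}
```

A host that is live and enabled but absent from `live_set` (e.g. one that returned mid-recovery) is now excluded from the plan.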
When some hosts are considered dead in HA, restart_auto_run_vms
processed them in this way:
 - for each dead host
   - list all resident VMs
   - Host.set_live=false
   - call Xapi_hooks.host_post_declare_dead, which can take very long
   - set all resident VMs to `Halted (including the Control Domain)
This process conflicted with db_sync if a host had the bad taste
of coming back to life while restart_auto_run_vms was stuck in
host_post_declare_dead.
This commit reorders the actions to put the shortest first:
 - for each dead host
   - set all resident VMs, excluding the Control Domain, to `Halted
   - Host.set_live=false
 - for each dead host
   - call Xapi_hooks.host_post_declare_dead

Signed-off-by: Jerome Maloberti <[email protected]>
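The reordering described above can be sketched as two passes in Python (hypothetical names; the real implementation is OCaml inside xapi):

```python
# Sketch of the reordered dead-host processing: all fast database
# updates complete first, then the potentially slow hook runs.
def process_dead_hosts(dead_hosts, set_halted, set_live, post_declare_dead):
    # Pass 1: cheap state updates, shrinking the race window with db_sync.
    for host in dead_hosts:
        for vm in host["resident_vms"]:
            if not vm["is_control_domain"]:
                set_halted(vm)
        set_live(host, False)
    # Pass 2: the slow per-host hook runs only after all state is settled.
    for host in dead_hosts:
        post_declare_dead(host)
```

The key property is that by the time the first `post_declare_dead` call starts, every dead host is already marked not-live and its VMs halted, so a host returning during the hook no longer races with stale state.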
@ghost assigned johnelse Sep 10, 2013
johnelse added a commit that referenced this pull request Sep 10, 2013
Fix data races in HA when slaves return in the pool.
@johnelse merged commit 1ff253f into xapi-project:master Sep 10, 2013