Commit 1849b89

mg12ctx authored and lindig committed
CA-273775: remove race in vgpu_receiver_sync during vm migration
During a VM migration, the new receive_vgpu thread races with the original receive_memory thread on the receiving host.

A new 'Synchronisation point 1-vgpu' was meant to indicate to the sending host that the table vgpu_receiver_sync had been initialised on the receiving host, and that it was therefore safe for both the sending and receiving hosts to go past the original 'Synchronisation point 1' and start streaming the VM state. Going past the original point 1 is a problem because the table vgpu_receiver_sync is used after it: if the table is still uninitialised, the migration proceeds without the vgpu stream information.

However, the original 'Synchronisation point 1' only blocks the sending host, not the receiving host. The new 'Synchronisation point 1-vgpu' signal was therefore only half of the signalling needed to protect the use of the table vgpu_receiver_sync: it asserts only to the sending host that the original 'Synchronisation point 1' comes after 'Synchronisation point 1-vgpu'. The receiving host's receive_memory thread is still free to race past 'Synchronisation point 1' before 'Synchronisation point 1-vgpu' is reached on the receiving host's receive_vgpu thread.

This patch adds a new 1-vgpu ACK signal, sent by the sending host just after it reaches 'Synchronisation point 1-vgpu', which the receiving host's receive_memory thread waits for before it uses the table vgpu_receiver_sync. After this patch, both the sending and the receiving host know that the table vgpu_receiver_sync has been initialised and is ready to be used once they get past the 1-vgpu ACK point.

This is an invasive change because it changes the xenopsd VM migration protocol and is incompatible with the previous protocol: VMs using previous xenopsd versions cannot be migrated with the changed protocol. To limit the impact, the change only affects the VM migration protocol when a VGPU is present. This means that:

- if a VGPU is not present, VM migration from an older xenopsd still works, so it is backwards compatible.
- if a VGPU is present, the new protocol does not work with an older xenopsd, so it is not backwards compatible.

Since VGPU migration is a new feature, this is not a problem: it is not present in a supported manner in older versions of xenopsd.

Signed-off-by: Marcus Granado <[email protected]>
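The ordering argument in the message above can be checked with a toy, single-threaded model. This is only a sketch: the record fields, the queue-based channels, and the step functions are hypothetical stand-ins, not xenopsd code. A boolean plays the role of vgpu_receiver_sync, and each "thread" is a function that takes one step.

```ocaml
(* Toy model of the handshake ordering. Every name here is hypothetical
   and chosen for illustration; none of it is taken from xenopsd. *)
type msg = Success

type world = {
  vgpu_chan : msg Queue.t;          (* receive_vgpu -> sending host *)
  mem_chan : msg Queue.t;           (* sending host -> receive_memory *)
  mutable table_initialised : bool; (* stands in for vgpu_receiver_sync *)
}

let make () =
  { vgpu_chan = Queue.create ();
    mem_chan = Queue.create ();
    table_initialised = false }

(* receive_vgpu: initialise the table, then signal point 1-vgpu. *)
let receive_vgpu_step w =
  w.table_initialised <- true;
  Queue.add Success w.vgpu_chan

(* Sending host: once point 1-vgpu has been signalled, send the
   1-vgpu ACK on the memory channel. Returns false if nothing arrived. *)
let sender_step w =
  match Queue.take_opt w.vgpu_chan with
  | Some Success -> Queue.add Success w.mem_chan; true
  | None -> false

type outcome =
  | Blocked                      (* still waiting for the 1-vgpu ACK *)
  | Proceeded_without_vgpu       (* no VGPU: the table is never consulted *)
  | Proceeded_with_table of bool (* true if the table was initialised *)

(* receive_memory: with a VGPU, wait for the ACK before using the table. *)
let receive_memory_step ~has_vgpu w =
  if not has_vgpu then Proceeded_without_vgpu
  else
    match Queue.take_opt w.mem_chan with
    | None -> Blocked
    | Some Success -> Proceeded_with_table w.table_initialised
```

In this model receive_memory can only reach the table after the sender has forwarded the ACK, and the sender can only forward it after receive_vgpu has initialised the table, which is the guarantee the patch is after. The real threads block on file descriptors rather than returning `Blocked`.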
1 parent a619065 commit 1849b89

1 file changed: +17 -0 lines changed
lib/xenops_server.ml

Lines changed: 17 additions & 0 deletions
@@ -1660,6 +1660,8 @@ and perform_exn ?subtask ?result (op: operation) (t: Xenops_task.task_handle) :
   do_request vgpu_fd [] vgpu_url;
   Handshake.recv_success vgpu_fd;
   debug "VM.migrate: Synchronisation point 1-vgpu";
+  Handshake.send ~verbose:true mem_fd Handshake.Success;
+  debug "VM.migrate: Synchronisation point 1-vgpu ACK";
   first_handshake ();
   save ~vgpu_fd:(FD vgpu_fd) ();
   );
@@ -1682,6 +1684,21 @@ and perform_exn ?subtask ?result (op: operation) (t: Xenops_task.task_handle) :
   debug "VM.receive_memory creating domain and restoring VIFs";
 
   finally (fun ()->
+
+    (* If we have a vGPU, wait for the vgpu-1 ACK, which indicates that the vgpu_receiver_sync entry for
+       this vm id has already been initialised by the parallel receive_vgpu thread in this receiving host
+    *)
+    (match VGPU_DB.ids id with
+     | [] -> ()
+     | _ -> begin
+       Handshake.recv_success s;
+       debug "VM.receive_memory: Synchronisation point 1-vgpu ACK";
+       (* After this point, vgpu_receiver_sync is initialised by the corresponding receive_vgpu thread
+          and therefore can be used by this VM_receive_memory thread
+       *)
+     end
+    );
+
     (try
       perform_atomics (
         simplify [VM_create (id, Some memory_limit);] @
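
The guard added to receive_memory consumes the extra ACK only when the VM has at least one vGPU, which is why migrations without a VGPU see an unchanged message stream. The sketch below shows that pattern in isolation. `stub_vgpu_ids`, `stub_recv_success`, the `msg` type, and the association-list "database" are hypothetical stand-ins for illustration, not the real VGPU_DB or Handshake modules. Unlike a real socket read, the stub raises instead of blocking when no message is available.

```ocaml
(* Hypothetical stand-ins; not the real xenopsd VGPU_DB or Handshake. *)
type msg = Success | Nack of string

exception Handshake_failed of string

(* Stub lookup: the vGPU ids configured for a VM, or [] if none. *)
let stub_vgpu_ids (db : (string * string list) list) (vm : string) =
  match List.assoc_opt vm db with
  | Some ids -> ids
  | None -> []

(* Stub receive: consume one message and fail unless it is Success.
   An empty channel raises here, where a real read would block. *)
let stub_recv_success (chan : msg Queue.t) =
  match Queue.take_opt chan with
  | Some Success -> ()
  | Some (Nack reason) -> raise (Handshake_failed reason)
  | None -> raise (Handshake_failed "no message available")

(* The guard in miniature: wait for the ACK only if the VM has a vGPU. *)
let wait_for_vgpu_ack db vm chan =
  match stub_vgpu_ids db vm with
  | [] -> ()                      (* no vGPU: channel left untouched *)
  | _ -> stub_recv_success chan   (* vGPU present: consume the ACK *)
```

With a database such as `[("vm-a", ["vgpu0"]); ("vm-b", [])]`, calling `wait_for_vgpu_ack` for `"vm-b"` leaves a queued `Success` in place, while calling it for `"vm-a"` consumes it.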
