Commit bf0c42e

Merge pull request #96 from djs55/thin-lvhd
More stuff about queues
2 parents 28e9894 + f27e10d

2 files changed: +252 −11 lines

xapi/futures/thin-lvhd/queue.md

Lines changed: 234 additions & 0 deletions

Queues on the shared disk
=========================

The local allocator communicates with the remote allocator via a pair
of queues on the shared disk. Using the disk rather than the network means
that VMs will continue to run even if the management network is not working.
In particular:

- if the (management) network fails, VMs continue to run on SAN storage
- if a host changes IP address, nothing needs to be reconfigured
- if xapi fails, VMs continue to run.

Logical messages in the queues
------------------------------

The local allocator needs to tell the remote allocator which blocks have
been allocated to which guest LV. The remote allocator needs to tell the
local allocator which blocks have become free. Since we are based on
LVM, a "block" is an extent, and an "allocation" is a segment, i.e. the
placing of a physical extent at a logical extent in the logical volume.

The local allocator needs to send a message with logical contents:

- `volume`: a human-readable name of the LV
- `segments`: a list of LVM segments, each of which says
  "place physical extent x at logical extent y using a linear mapping".

Note this message is idempotent.

The remote allocator needs to send a message with logical contents:

- `extents`: a list of physical extents which are free for the host to use

Although for internal housekeeping the remote allocator will want to assign
these physical extents to logical extents within the host's free LV, the local
allocator doesn't need to know the logical extents. It only needs to know
the set of blocks which it is free to allocate.
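
As a concrete illustration, the two messages could be represented by OCaml
types like the following sketch; the type and field names are invented here
and are not the prototype's actual API:

```
(* Sent by the local allocator: an idempotent description of an
   allocation. Names are illustrative only. *)
type segment = {
  physical_extent : int64;  (* where the data lives in the PV *)
  logical_extent  : int64;  (* where it appears in the LV *)
}

type to_remote = {
  volume   : string;        (* human-readable name of the LV *)
  segments : segment list;  (* linear mappings to install *)
}

(* Sent by the remote allocator: extents the host may now use. *)
type to_local = {
  extents : int64 list;     (* physical extents free for allocation *)
}
```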

Starting up the local allocator
-------------------------------

What happens when a local allocator (re)starts after a

- process crash and respawn, or a
- host crash and reboot?

When the local allocator starts up, there are 2 cases:

1. the host has just rebooted: there are no attached disks and no running VMs
2. the process has just crashed: there are attached disks and running VMs

Case 1 is uninteresting. In case 2 there may have been an allocation in
progress when the process crashed, and this must be completed. Therefore
the operation is journalled in a directory on the local filesystem; this
directory is deliberately deleted on host reboot (Case 1). The allocation
operation consists of:

1. `push`ing the allocation to the master
2. updating the device mapper

Note that both parts of the allocation operation are idempotent, and hence
the whole operation is idempotent. The journalling guarantees that it executes
at-least-once.
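
A minimal sketch of the replay logic, assuming hypothetical helpers
`read_allocation`, `push_to_master` and `update_device_mapper`:

```
(* Replay any journalled allocations left over from a crash. Both
   steps are idempotent, so re-running a partially-completed entry
   is safe; the entry is removed only after both steps succeed. *)
let replay_journal journal_dir =
  Array.iter
    (fun name ->
      let path = Filename.concat journal_dir name in
      let allocation = read_allocation path in  (* assumed deserialiser *)
      push_to_master allocation;                (* step 1: push to master *)
      update_device_mapper allocation;          (* step 2: update dm table *)
      Sys.remove path)
    (Sys.readdir journal_dir)
```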

When the local allocator starts up it needs to discover the list of
free blocks. Rather than have 2 code paths, it's best to treat everything
as a cold start (i.e. no local caches already populated) and to
ask the master to resync the free block list. The resync is performed by
executing a "suspend" and "resume" of the free block queue, and requiring
the remote allocator to:

- `pop` all block allocations and incorporate these updates
- send the complete set of free blocks "now" (i.e. while the queue is
  suspended) to the local allocator.

Starting the remote allocator
-----------------------------

The remote allocator needs to know:

- the device containing the volume group
- the hosts to "connect" to via the shared queues

The device containing the volume group should be written to a config
file when the SR is plugged.

TODO: decide how we should maintain the list of hosts to connect to,
or whether we should simply reconnect to all hosts. We can probably
discover the metadata volumes by querying the VG.

Shutting down the local allocator
---------------------------------

The local allocator should be able to crash at any time and recover
afterwards. If the user requests a `PBD.unplug` we can perform a
clean shutdown by:

- signalling the remote allocator to suspend the block allocation queue
- arranging for the local allocator to acknowledge the suspension and exit
- when the remote allocator sees the acknowledgement, we know that the
  local allocator is offline and it doesn't need to poll the queue any more.

Shutting down the remote allocator
----------------------------------

Shutting down the remote allocator is really a "downgrade": when using
thin provisioning, the remote allocator should be running all the time.
To downgrade, we need to stop all hosts allocating and ensure all updates
are flushed to the global LVM metadata. The remote allocator can shut down
by:

- shutting down all local allocators (see previous section)
- flushing all outstanding block allocations to the LVM redo log
- flushing the LVM redo log to the global LVM metadata

Queues as rings
---------------

We can use a simple ring protocol to represent the queues on the disk.
Each queue will have a single consumer and a single producer, and will
reside within a single logical volume.

To make diagnostics simpler, we can require the ring to support only `push`
and `pop` of *whole* messages, i.e. with no partial reads or partial
writes. This means that the `producer` and `consumer` pointers will always
point to valid message boundaries.

One possible format, used by the [prototype](https://github.com/mirage/shared-block-ring/blob/master/lib/ring.ml), is as follows:

- sector 0: a magic string
- sector 1: producer state
- sector 2: consumer state
- sector 3...: data

Within the producer state sector we can have:

- octets 0-7: producer offset: a little-endian 64-bit integer
- octet 8: 1 means "suspend acknowledged"; 0 otherwise

Within the consumer state sector we can have:

- octets 0-7: consumer offset: a little-endian 64-bit integer
- octet 8: 1 means "suspend requested"; 0 otherwise

The consumer and producer pointers point to message boundaries. Each
message is prefixed with a 4-byte length and padded to the next 4-byte
boundary.
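
For illustration, marshalling the producer state sector could look like this
sketch using the `cstruct` library (the 512-byte sector size and the function
name are assumptions, not the prototype's code):

```
(* Fill a 512-byte sector: little-endian 64-bit producer offset in
   octets 0-7, "suspend acknowledged" flag in octet 8. *)
let marshal_producer_state ~offset ~suspend_ack =
  let sector = Cstruct.create 512 in  (* zero-filled by default *)
  Cstruct.LE.set_uint64 sector 0 offset;
  Cstruct.set_uint8 sector 8 (if suspend_ack then 1 else 0);
  sector
```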

To push a message onto the ring we need to:

- check whether the message is too big to ever fit: this is a permanent
  error
- check whether the message is too big to fit given the current free
  space: this is a transient error
- write the message into the ring
- advance the producer pointer
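
A sketch of `push` under these rules; `capacity`, `free_space`,
`write_message` and `advance_producer` are hypothetical helpers standing in
for the prototype's real I/O code:

```
let round_up4 n = (n + 3) land lnot 3 (* pad to a 4-byte boundary *)

(* Push a whole message or fail; never write a partial message. *)
let push ring msg =
  let needed = 4 + round_up4 (String.length msg) in (* 4-byte length prefix *)
  if needed > capacity ring then Error `Message_too_big     (* permanent *)
  else if needed > free_space ring then Error `Retry_later  (* transient *)
  else begin
    write_message ring msg;       (* write length prefix then payload *)
    advance_producer ring needed; (* publish by moving the pointer last *)
    Ok ()
  end
```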

To pop a message from the ring we need to:

- check whether there is unconsumed data: if not, this is a transient
  error
- read the message from the ring and process it
- advance the consumer pointer
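
And the corresponding `pop` sketch (again with hypothetical helpers
`unconsumed`, `read_message` and `advance_consumer`):

```
(* Pop a whole message; the consumer pointer is advanced only after
   the message has been processed, giving at-least-once semantics. *)
let pop ring ~process =
  if unconsumed ring = 0 then Error `Retry_later (* transient: ring empty *)
  else begin
    let msg = read_message ring in (* read length prefix then payload *)
    process msg;                   (* handle before acknowledging *)
    advance_consumer ring;         (* consume by moving the pointer *)
    Ok ()
  end
```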

Journals as queues
------------------

When we journal an operation we want to guarantee that it executes either
never *or* at-least-once. We can re-use the queue implementation by `push`ing
a description of the work item onto the queue and waiting for the
item to be `pop`ped, processed and finally consumed by advancing the
consumer pointer. The journal code needs to check for unconsumed data
during startup, and to process it before continuing.
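
Reusing the `push`/`pop` sketches above, a journalled operation might look
like this (hypothetical, for illustration only):

```
(* Push the work item first; a crash after the push but before the
   consume leaves the item in the ring to be replayed at startup. *)
let journalled ring item ~perform =
  match push ring item with
  | Error _ as e -> e
  | Ok () -> pop ring ~process:perform (* process, then consume *)
```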

Suspending and resuming queues
------------------------------

During startup (to resync the free blocks) and shutdown (to flush the
allocations) we need to suspend and resume queues. The ring protocol can be
extended to allow the *consumer* to suspend the ring:

- the consumer asserts the "suspend requested" bit
- the producer's `push` function checks the bit and writes "suspend acknowledged"
- the producer also periodically polls the queue state and writes
  "suspend acknowledged" (to catch the case where no items are to be pushed)
- after the producer has acknowledged, it guarantees to `push` no more
  items
- when the consumer polls the producer's state and spots the "suspend
  acknowledged" bit, it concludes that the queue is now suspended.

The key detail is that the handshake on the ring causes the two sides
to synchronise, so that both agree the ring is now suspended/resumed.
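
The consumer side of the handshake, as a sketch (with hypothetical helpers
`set_suspend_requested`, `suspend_acknowledged` and `poll_delay`):

```
(* Request suspension, then wait until the producer acknowledges
   before treating the ring as quiesced. *)
let suspend ring =
  set_suspend_requested ring true;   (* consumer sector, octet 8 *)
  while not (suspend_acknowledged ring) do
    poll_delay ()                    (* re-read the producer sector *)
  done
```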

Modelling the suspend/resume protocol
-------------------------------------

To check that the suspend/resume protocol works well enough to be used
to resynchronise the free block list on a slave, a simple
[promela model](queue.pml) was created. We model the queue state as
2 boolean flags:

```
bool suspend     /* suspend requested */
bool suspend_ack /* suspend acknowledged */
```

and an abstract representation of the data within the ring:

```
/* the queue may have no data (none), a delta or a full sync.
   the full sync is performed immediately on resume. */
mtype = { sync, delta, none }
mtype inflight_data = none
```

There is a "producer" and a "consumer" process, which run forever,
exchanging data and suspending and resuming whenever they want.
The special data item `sync` is only sent immediately after a resume,
and we check that we never desynchronise with asserts:

```
:: (inflight_data != none) ->
    /* In steady state we receive deltas */
    assert (suspend_ack == false);
    assert (inflight_data == delta);
    inflight_data = none
```

i.e. when we are receiving data normally (outside of the suspend/resume
code) we aren't suspended and we expect deltas, not full syncs.

The model-checker [spin](http://spinroot.com/spin/whatispin.html)
verifies that this property holds.
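
(In the usual spin workflow, an assumption here rather than something
recorded in this document, one runs `spin -a queue.pml`, compiles the
generated `pan.c` verifier, and runs `pan` to search the state space
exhaustively.)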

xapi/futures/thin-lvhd/queue.pml

Lines changed: 18 additions & 11 deletions

```
@@ -11,6 +11,20 @@ mtype inflight_data = none
 
 proctype consumer(){
 
+	/* get the channel back to a known state by suspending,
+	   resuming and receiving the initial resync */
+resync:
+	(suspend == suspend_ack)
+	suspend = true;
+	(suspend == suspend_ack)
+resync2:
+	/* drop old data */
+	inflight_data = none;
+	suspend = false;
+	(suspend == suspend_ack)
+	(inflight_data == sync)
+	/* receive initial sync */
+	inflight_data = none;
 	do
 	/* Consumer.pop */
 	:: (inflight_data != none) ->
@@ -19,18 +33,11 @@ proctype consumer(){
 	assert (inflight_data == delta);
 	inflight_data = none
 	/* Consumer.suspend */
-	:: (suspend == false) ->
-		suspend = true;
-		/* ordering important here */
-		(suspend_ack == true);
-		inflight_data = none;
+	:: ((suspend == false)&&(suspend_ack == false)) ->
+		goto resync
 	/* Consumer.resume */
-	:: (suspend == true) ->
-		suspend = false;
-		(suspend_ack == false)
-		/* Wait for initial resync */
-		(inflight_data == sync)
-		inflight_data = none
+	:: ((suspend == true)&&(suspend_ack == true)) ->
+		goto resync2
 	od;
 }
 
```