Queues on the shared disk
=========================

The local allocator communicates with the remote allocator via a pair
of queues on the shared disk. Using the disk rather than the network means
that VMs will continue to run even if the management network is not working.
In particular:

- if the (management) network fails, VMs continue to run on SAN storage
- if a host changes IP address, nothing needs to be reconfigured
- if xapi fails, VMs continue to run.

Logical messages in the queues
------------------------------

The local allocator needs to tell the remote allocator which blocks have
been allocated to which guest LV. The remote allocator needs to tell the
local allocator which blocks have become free. Since we are based on
LVM, a "block" is an extent, and an "allocation" is a segment, i.e. the
placing of a physical extent at a logical extent in the logical volume.

The local allocator needs to send a message with logical contents:

- `volume`: a human-readable name of the LV
- `segments`: a list of LVM segments which says
  "place physical extent x at logical extent y using a linear mapping".

Note this message is idempotent.

The remote allocator needs to send a message with logical contents:

- `extents`: a list of physical extents which are free for the host to use

Although for internal housekeeping the remote allocator will want to assign
these physical extents to logical extents within the host's free LV, the
local allocator doesn't need to know the logical extents. It only needs to
know the set of blocks which it is free to allocate.
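
These two messages could be represented as plain OCaml types, for example
as in the following sketch. The type and field names are illustrative
assumptions, not the actual wire format used by the allocators.

```
(* One LVM segment: place [count] physical extents starting at [pstart]
   at logical extents starting at [lstart], using a linear mapping. *)
type segment = {
  pstart : int64;  (* first physical extent *)
  lstart : int64;  (* first logical extent *)
  count  : int64;  (* number of extents covered by this segment *)
}

(* Local allocator -> remote allocator: extend a guest LV. Re-applying
   the same segments is a no-op, so the message is idempotent. *)
type to_lvm = {
  volume   : string;        (* human-readable name of the LV *)
  segments : segment list;
}

(* Remote allocator -> local allocator: physical extents the host is free
   to allocate, as (first extent, count) runs. The local allocator never
   needs to know where these sit in the host's free LV. *)
type from_lvm = {
  extents : (int64 * int64) list;
}
```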

Starting up the local allocator
-------------------------------

What happens when a local allocator (re)starts, after a

- process crash, respawn
- host crash, reboot?

When the local allocator starts up, there are 2 cases:

1. the host has just rebooted, there are no attached disks and no running VMs
2. the process has just crashed, there are attached disks and running VMs

Case 1 is uninteresting. In Case 2 there may have been an allocation in
progress when the process crashed and this must be completed. Therefore
the operation is journalled in a local filesystem in a directory which
is deliberately deleted on host reboot (Case 1). The allocation operation
consists of:

1. `push`ing the allocation to the master
2. updating the device mapper

Note that both parts of the allocation operation are idempotent and hence
the whole operation is idempotent. The journalling will guarantee it executes
at-least-once.
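
A minimal OCaml sketch of the allocation operation, assuming hypothetical
placeholder helpers for the two steps (the real code would push onto the
shared-disk queue and reload a device-mapper table):

```
(* Placeholder for step 1: push the new segments to the master via the
   shared-disk queue. Re-sending the same segments is harmless. *)
let push_to_master volume _segments =
  Printf.printf "push allocation for %s to the master\n" volume

(* Placeholder for step 2: extend the local device-mapper table.
   Loading an identical table again is also harmless. *)
let update_device_mapper volume _segments =
  Printf.printf "reload device-mapper table for %s\n" volume

(* Both steps are idempotent, so the whole operation is idempotent and
   can safely be replayed after a crash: the journal only needs to
   guarantee at-least-once execution. *)
let perform_allocation volume segments =
  push_to_master volume segments;
  update_device_mapper volume segments
```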

When the local allocator starts up, it needs to discover the list of
free blocks. Rather than have 2 code paths, it's best to treat everything
as if it is a cold start (i.e. no local caches already populated) and to
ask the master to resync the free block list. The resync is performed by
executing a "suspend" and "resume" of the free block queue, and requiring
the remote allocator to:

- `pop` all block allocations and incorporate these updates
- send the complete set of free blocks "now" (i.e. while the queue is
  suspended) to the local allocator.

Starting the remote allocator
-----------------------------

The remote allocator needs to know:

- the device containing the volume group
- the hosts to "connect" to via the shared queues

The device containing the volume group should be written to a config
file when the SR is plugged.
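
The exact format of this config file isn't fixed by this design; as an
illustration only, it might contain little more than the device path, e.g.:

```
# hypothetical example: written at PBD.plug time, read by the remote allocator
device = /dev/disk/by-id/scsi-<LUN id>
```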

TODO: decide how we should maintain the list of hosts to connect to, or
whether we should simply reconnect to all hosts. We can probably discover
the metadata volumes by querying the VG.

Shutting down the local allocator
---------------------------------

The local allocator should be able to crash at any time and recover
afterwards. If the user requests a `PBD.unplug` we can perform a
clean shutdown by:

- signalling the remote allocator to suspend the block allocation queue
- arranging for the local allocator to acknowledge the suspension and exit
- when the remote allocator sees the acknowledgement, we know that the
  local allocator is offline and it doesn't need to poll the queue any more

Shutting down the remote allocator
----------------------------------

Shutting down the remote allocator is really a "downgrade": when using
thin provisioning, the remote allocator should be running all the time.
To downgrade, we need to stop all hosts allocating and ensure all updates
are flushed to the global LVM metadata. The remote allocator can shut down
by:

- shutting down all local allocators (see previous section)
- flushing all outstanding block allocations to the LVM redo log
- flushing the LVM redo log to the global LVM metadata

Queues as rings
---------------

We can use a simple ring protocol to represent the queues on the disk.
Each queue will have a single consumer and a single producer and reside
within a single logical volume.

To make diagnostics simpler, we can require the ring to only support `push`
and `pop` of *whole* messages, i.e. there can be no partial reads or partial
writes. This means that the `producer` and `consumer` pointers will always
point to valid message boundaries.

One possible format used by the [prototype](https://github.com/mirage/shared-block-ring/blob/master/lib/ring.ml) is as follows:

- sector 0: a magic string
- sector 1: producer state
- sector 2: consumer state
- sector 3...: data

Within the producer state sector we can have:

- octets 0-7: producer offset: a little-endian 64-bit integer
- octet 8: 1 means "suspend acknowledged"; 0 otherwise

Within the consumer state sector we can have:

- octets 0-7: consumer offset: a little-endian 64-bit integer
- octet 8: 1 means "suspend requested"; 0 otherwise

The consumer and producer pointers point to message boundaries. Each
message is prefixed with a 4 byte length and padded to the next 4-byte
boundary.
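
As a sketch, one of these state sectors could be encoded and decoded in
OCaml as below. This assumes the layout above and 512-byte sectors; the
function names are illustrative, not the prototype's API.

```
(* Producer (or consumer) state: an offset plus a single flag byte.
   For the producer the flag means "suspend acknowledged"; for the
   consumer it means "suspend requested". *)
type state = { offset : int64; flag : bool }

let sector_size = 512

(* octets 0-7: little-endian 64-bit offset; octet 8: the flag. *)
let marshal { offset; flag } =
  let b = Bytes.make sector_size '\000' in
  Bytes.set_int64_le b 0 offset;
  Bytes.set b 8 (if flag then '\001' else '\000');
  b

let unmarshal b =
  { offset = Bytes.get_int64_le b 0;
    flag = Bytes.get b 8 <> '\000' }
```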

To push a message onto the ring we need to:

- check whether the message is too big to ever fit: this is a permanent
  error
- check whether the message is too big to fit given the current free
  space: this is a transient error
- write the message into the ring
- advance the producer pointer

To pop a message from the ring we need to:

- check whether there is unconsumed space: if not this is a transient
  error
- read the message from the ring and process it
- advance the consumer pointer
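
The size checks and the 4-byte framing are just arithmetic on the two
offsets. A sketch of the calculations (the names and the error type are
illustrative, not the prototype's API):

```
type error = Too_big   (* permanent: the message can never fit *)
           | Retry     (* transient: not enough free space right now *)

(* Each message is prefixed by a 4-byte length and padded to a 4-byte
   boundary. *)
let framed_size body_len = 4 + ((body_len + 3) / 4) * 4

(* [data_size] is the size of the data area (sector 3 onwards);
   [produced] and [consumed] are the producer and consumer offsets,
   which only ever increase. *)
let check_push ~data_size ~produced ~consumed body_len =
  let needed = framed_size body_len in
  let used = Int64.to_int (Int64.sub produced consumed) in
  if needed > data_size then Error Too_big           (* permanent error *)
  else if needed > data_size - used then Error Retry (* transient error *)
  else Ok ()

(* Popping is possible whenever there is unconsumed data. *)
let can_pop ~produced ~consumed = produced > consumed
```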

Journals as queues
------------------

When we journal an operation we want to guarantee to execute it either never
*or* at-least-once. We can re-use the queue implementation by `push`ing
a description of the work item to the queue and waiting for the
item to be `pop`ped, processed and finally consumed by advancing the
consumer pointer. The journal code needs to check for unconsumed data
during startup, and to process it before continuing.
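
A sketch of that pattern in OCaml, with the queue primitives passed in as
functions so the snippet stands alone (the real code would use the ring
described above):

```
(* A journalled executor built from queue primitives. [perform] must be
   idempotent: after a crash the same item may be replayed, giving
   at-least-once semantics. *)
let run_journalled ~unconsumed ~push ~advance_consumer perform =
  (* On startup, replay anything that was pushed but never consumed. *)
  List.iter (fun item -> perform item; advance_consumer item) (unconsumed ());
  (* Normal operation: record the intent, do the work, then consume it. *)
  fun item ->
    push item;              (* the intent is now crash-safe on disk *)
    perform item;           (* execute the (idempotent) operation   *)
    advance_consumer item   (* mark the journal entry as complete   *)
```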

Suspending and resuming queues
------------------------------

During startup (resync the free blocks) and shutdown (flush the allocations)
we need to suspend and resume queues. The ring protocol can be extended
to allow the *consumer* to suspend the ring by:

- the consumer asserts the "suspend requested" bit
- the producer `push` function checks the bit and writes "suspend acknowledged"
- the producer also periodically polls the queue state and writes
  "suspend acknowledged" (to catch the case where no items are to be pushed)
- after the producer has acknowledged it will guarantee to `push` no more
  items
- when the consumer polls the producer's state and spots the "suspend acknowledged",
  it concludes that the queue is now suspended.

The key detail is that the handshake on the ring causes the two sides
to synchronise and both agree that the ring is now suspended/resumed.
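
A sketch of the two sides of the handshake, modelling the two on-disk state
bits as mutable references (in the real ring they live in the consumer and
producer state sectors respectively, and each side re-reads the other's
sector when polling):

```
(* Consumer state bit: "suspend requested".
   Producer state bit: "suspend acknowledged". *)
let suspend_requested = ref false
let suspend_acknowledged = ref false

(* Consumer side: request the suspend, then poll [consumer_suspended]
   until it returns true; from that point the producer guarantees to
   push no further items. *)
let consumer_request_suspend () = suspend_requested := true
let consumer_suspended () = !suspend_acknowledged

(* Producer side: called from [push] and from a periodic poll, so the
   acknowledgement is written even when there is nothing to push.
   Returns true while it is still OK to push. *)
let producer_check () =
  if !suspend_requested then suspend_acknowledged := true;
  not !suspend_acknowledged
```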

Modelling the suspend/resume protocol
-------------------------------------

To check that the suspend/resume protocol works well enough to be used
to resynchronise the free blocks list on a slave, a simple
[promela model](queue.pml) was created. We model the queue state as
2 boolean flags:

```
bool suspend     /* suspend requested */
bool suspend_ack /* suspend acknowledged */
```

and an abstract representation of the data within the ring:

```
/* the queue may have no data (none); a delta or a full sync.
   the full sync is performed immediately on resume. */
mtype = { sync, delta, none }
mtype inflight_data = none
```

There is a "producer" and a "consumer" process which run forever,
exchanging data and suspending and resuming whenever they want.
The special data item `sync` is only sent immediately after a resume
and we check that we never desynchronise with asserts:

```
  :: (inflight_data != none) ->
    /* In steady state we receive deltas */
    assert (suspend_ack == false);
    assert (inflight_data == delta);
    inflight_data = none
```

i.e. when we are receiving data normally (outside of the suspend/resume
code) we aren't suspended and we expect deltas, not full syncs.

The model-checker [spin](http://spinroot.com/spin/whatispin.html)
verifies that this property holds.