Commit 8df6b4d

Merge pull request #182 from jonludlam/thin-lvhd-ring-protocol-and-cap
Thin lvhd ring protocol and cap
2 parents 652b44d + 3c11365 commit 8df6b4d

xapi/futures/thin-lvhd/thin-lvhd.md

Lines changed: 164 additions & 2 deletions
@@ -510,7 +510,10 @@ The `local_allocator` accepts the following configuration (via a config file):
 - `free_pool`: name of the LV used to store the host's free blocks
 - `devices`: list of local block devices containing the PVs
 - `to_LVM`: name of the LV containing the queue of block allocations sent to `xenvmd`
-- `from_LVM`: name of the LV containing the queue of free blocks sent from `xenvmd`
+- `from_LVM`: name of the LV containing the queue of messages sent from `xenvmd`.
+  There are two types of messages:
+  1. Free blocks to put into the free pool
+  2. Cap requests to remove blocks from the free pool.

 When the `local_allocator` process starts up it will read the host local
 journal and
@@ -564,8 +567,10 @@ terminated in `sr_detach`. `xenvmd` has a config file containing:

 - peeks updates from all the `to_LVM` queues
 - calculates how much free space each host still has
-- if the free space for a host drops below some threshold:
+- if the size of a host's free pool drops below some threshold:
   - choose some free blocks
+- if the size of a host's free pool goes above some threshold:
+  - request a cap of the host's free pool
 - writes the change it is going to make to a journal stored in an LV
 - pops the updates from the `to_LVM` queues
 - pushes the updates to the `from_LVM` queues
@@ -714,3 +719,160 @@ Summary of the impact on the admin
provisioning.
- There will be more fragmentation, but the extent size is large (4MiB) so it
  shouldn't be too bad.

Ring protocols
==============

Each ring consists of 3 sectors of metadata followed by the data area. The
contents of the first 3 sectors are:

Sector, Octet offsets | Name        | Type   | Description
----------------------|-------------|--------|------------
0,0-30                | signature   | string | Signature ("mirage shared-block-device 1.0")
1,0-7                 | producer    | uint64 | Pointer to the end of data written by the producer
1,8                   | suspend_ack | uint8  | Suspend acknowledgement byte
2,0-7                 | consumer    | uint64 | Pointer to the end of data read by the consumer
2,8                   | suspend     | uint8  | Suspend request byte
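
As an illustration of the layout above, here is a minimal OCaml sketch that
decodes these fields from an in-memory copy of the first three sectors. The
512-byte sector size and all identifiers are assumptions made for the example,
not part of the protocol definition.

```ocaml
(* Minimal sketch: decode the metadata described in the table above from
   a buffer holding the first three sectors. The 512-byte sector size is
   an assumption made for this example. *)
let sector_size = 512

type metadata = {
  signature   : string; (* sector 0, octets 0-30 *)
  producer    : int64;  (* sector 1, octets 0-7, little endian *)
  suspend_ack : char;   (* sector 1, octet 8 *)
  consumer    : int64;  (* sector 2, octets 0-7, little endian *)
  suspend     : char;   (* sector 2, octet 8 *)
}

let read_metadata (buf : bytes) = {
  signature   = Bytes.sub_string buf 0 31;
  producer    = Bytes.get_int64_le buf sector_size;
  suspend_ack = Bytes.get buf (sector_size + 8);
  consumer    = Bytes.get_int64_le buf (2 * sector_size);
  suspend     = Bytes.get buf ((2 * sector_size) + 8);
}
```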

Note: the producer and consumer pointers are stored in little-endian
format.

The pointers are free-running byte offsets rounded up to the next
4-byte boundary, and the position of the actual data is found by
taking the remainder when dividing by the size of the data area. The
producer pointer points to the first free byte, and the consumer
pointer points to the byte after the last data consumed. The actual
payload is preceded by a 4-byte length field, stored in little-endian
format. When writing a 1-byte payload, the next value of the producer
pointer will therefore be 8 bytes on from the previous: 4 for the
length field (which will contain [0x01,0x00,0x00,0x00]), 1 byte for the
payload, and 3 bytes of padding.
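
A small sketch of this pointer arithmetic, under the assumption that
`data_size` is the size of the ring's data area in bytes; the function names
are illustrative only.

```ocaml
(* Sketch of the pointer arithmetic described above. *)

(* Round a free-running pointer up to the next 4-byte boundary. *)
let round_up_4 x = Int64.logand (Int64.add x 3L) (Int64.lognot 3L)

(* Position of a pointer within the data area: the remainder when
   dividing by the size of the data area. *)
let data_offset ~data_size pointer = Int64.rem pointer data_size

(* Advancing the producer past a record: 4 bytes of length field plus
   the payload, rounded up to the next 4-byte boundary. *)
let next_producer ~producer ~payload_len =
  round_up_4 (Int64.add producer (Int64.of_int (4 + payload_len)))

let () =
  (* A 1-byte payload moves the producer pointer on by 8 bytes. *)
  assert (next_producer ~producer:0L ~payload_len:1 = 8L)
```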

A ring is suspended and resumed by the consumer. To suspend, the
consumer first checks that the producer and consumer agree on the
current suspend status. If they do not, the ring cannot be
suspended. The consumer then writes the byte 0x02 into byte 8 of
sector 2. The consumer must then wait for the producer to acknowledge
the suspend, which it will do by writing 0x02 into byte 8 of sector 1.
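
A minimal sketch of the two sides of this handshake, again operating on an
in-memory copy of the metadata sectors and assuming a 512-byte sector size;
the function names are illustrative only.

```ocaml
(* Sketch of the suspend handshake described above, over an in-memory
   copy of the metadata sectors (512-byte sectors assumed). *)
let sector_size = 512
let suspend_offset = (2 * sector_size) + 8      (* suspend request byte *)
let suspend_ack_offset = (1 * sector_size) + 8  (* suspend ack byte *)

(* Consumer side: only request a suspend if both sides currently agree
   on the suspend status; then write 0x02 and wait for the ack. *)
let request_suspend metadata =
  if Bytes.get metadata suspend_offset = Bytes.get metadata suspend_ack_offset
  then (Bytes.set metadata suspend_offset '\002'; true)
  else false

(* Producer side: acknowledge a pending suspend request. *)
let acknowledge_suspend metadata =
  if Bytes.get metadata suspend_offset = '\002'
  then Bytes.set metadata suspend_ack_offset '\002'
```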

The FromLVM ring
----------------

Two different types of message can be sent on the FromLVM ring.

The FreeAllocation message contains the blocks for the free pool.
Example message:

```
(FreeAllocation((blocks((pv0(12326 12249))(pv0(11 1))))(generation 2)))
```

Pretty-printed:

```
(FreeAllocation
 (
  (blocks
   (
    (pv0(12326 12249))
    (pv0(11 1))
   )
  )
  (generation 2)
 )
)
```

This is a message to add two new sets of extents to the free pool: a
span of length 12249 extents starting at extent 12326, and a span of
length 1 starting at extent 11, both within the physical volume
'pv0'. The generation count of this message is '2'. The semantics of
the generation count are that the local allocator must record the
generation of the last message it has received since the FromLVM ring
was resumed, and ignore any message with a generation less than or
equal to that of the last message received.
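
As a sketch of these semantics, here are hypothetical OCaml types mirroring
the message above (these are not xenvm's actual definitions), together with
the generation check, modelling the free pool as a plain list of spans.

```ocaml
(* Hypothetical types mirroring the FreeAllocation message above; each
   span is (physical volume, start extent, number of extents). *)
type span = { pv : string; start_extent : int64; extent_count : int64 }

(* Apply a FreeAllocation message: a message whose generation is less
   than or equal to the last one seen since the ring was resumed is
   ignored; otherwise its spans are added to the free pool. *)
let handle_free_allocation ~free_pool ~last_generation ~generation spans =
  if generation <= last_generation
  then (free_pool, last_generation)   (* stale duplicate: drop it *)
  else (spans @ free_pool, generation)
```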

The CapRequest message contains a request to cap the free pool at
a maximum size. Example message:

```
(CapRequest((cap 6127)(name host1-freeme)))
```

Pretty-printed:

```
(CapRequest
 (
  (cap 6127)
  (name host1-freeme)
 )
)
```

This is a request to cap the free pool at a maximum size of 6127
extents. The 'name' parameter is the name of the LV into which the
excess extents should be transferred.
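
A sketch of how a local allocator might act on such a request: release
extents from the free pool until at most `cap` extents remain, returning the
spans to be allocated into the named LV. The span type matches the previous
sketch, and the LVM update itself is elided; this is illustrative only.

```ocaml
(* Sketch only: shrink a free pool (a list of spans) down to [cap]
   extents, returning (spans released for the named LV, spans kept). *)
type span = { pv : string; start_extent : int64; extent_count : int64 }

let apply_cap_request ~cap free_pool =
  let total =
    List.fold_left (fun acc s -> Int64.add acc s.extent_count) 0L free_pool in
  let rec go excess released kept = function
    | [] -> (List.rev released, List.rev kept)
    | rest when excess <= 0L -> (List.rev released, List.rev_append kept rest)
    | s :: rest when s.extent_count <= excess ->
        (* give away the whole span *)
        go (Int64.sub excess s.extent_count) (s :: released) kept rest
    | s :: rest ->
        (* split the span: give away [excess] extents, keep the remainder *)
        let give = { s with extent_count = excess } in
        let keep = { s with start_extent = Int64.add s.start_extent excess;
                            extent_count = Int64.sub s.extent_count excess } in
        go 0L (give :: released) (keep :: kept) rest
  in
  go (Int64.max 0L (Int64.sub total cap)) [] [] free_pool
```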

The ToLVM ring
--------------

The ToLVM ring contains only one type of message. Example:

```
((volume test5)(segments(((start_extent 1)(extent_count 32)(cls(Linear((name pv0)(start_extent 12328))))))))
```

Pretty-printed:

```
(
 (volume test5)
 (segments
  (
   (
    (start_extent 1)
    (extent_count 32)
    (cls
     (Linear
      (
       (name pv0)
       (start_extent 12328)
      )
     )
    )
   )
  )
 )
)
```

This message extends an LV named 'test5' by giving it 32 extents
starting at extent 1, coming from PV 'pv0' starting at extent
12328. The 'cls' field should always be 'Linear': it is the only
acceptable value.
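
Hypothetical OCaml types mirroring this message (illustrative only; these are
not xenvm's real definitions), with the example above written out as a value.

```ocaml
(* Illustrative types for the ToLVM message described above. *)
type cls = Linear of { name : string; start_extent : int64 }
type segment = { start_extent : int64; extent_count : int64; cls : cls }
type to_lvm = { volume : string; segments : segment list }

(* The example message: extend LV "test5" with 32 extents starting at
   logical extent 1, backed by PV "pv0" from physical extent 12328. *)
let example : to_lvm =
  { volume = "test5";
    segments =
      [ { start_extent = 1L; extent_count = 32L;
          cls = Linear { name = "pv0"; start_extent = 12328L } } ] }
```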

Cap requests
============

Xenvmd will try to keep the free pools of the hosts within a range
set as a fraction of free space. There are 3 parameters adjustable
via the config file:

- low_water_mark_factor
- medium_water_mark_factor
- high_water_mark_factor

These three factors are all numbers between 0 and 1. Xenvmd will sum
the free space remaining in the VG and the sizes of all hosts' free
pools to find the total effective free size in the VG, `F`. It will
then subtract the space still required by any in-flight create or
resize calls, `s`. The result is divided by the number of hosts
connected, `n`, and multiplied by each of the three factors above to
find the three absolute values for the high, medium and low
watermarks:

    {high, medium, low} * (F - s) / n
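
A sketch of this calculation, with `f` the total effective free size, `s` the
in-flight reservations and `n` the number of connected hosts; treating the
sizes as extent counts is an assumption made for the example.

```ocaml
(* Sketch of the watermark calculation described above; sizes are in
   extents, which is an assumption made for this example. *)
type watermarks = { low : int64; medium : int64; high : int64 }

let watermarks ~low_factor ~medium_factor ~high_factor ~f ~s ~n =
  let per_host = Int64.to_float (Int64.sub f s) /. float_of_int n in
  let scale factor = Int64.of_float (factor *. per_host) in
  { low = scale low_factor; medium = scale medium_factor; high = scale high_factor }
```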

When xenvmd notices that a host's free pool size has dropped below
the low watermark, it will top the pool up so that its size is equal
to the medium watermark. If xenvmd notices that a host's free pool
size is above the high watermark, it will issue a 'cap request' to
the host's local allocator, which will respond by allocating the
excess from its free pool into the fake LV named in the request;
xenvmd will then delete that LV as soon as it sees the update.
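
Putting the two rules together, a sketch of the per-host decision follows.
The exact cap value sent in the request is not pinned down by the text above;
using the high watermark here is an assumption, and the names are
illustrative only.

```ocaml
(* Sketch of xenvmd's per-host decision described above. *)
type action =
  | TopUpTo of int64     (* top the free pool up to this size *)
  | RequestCap of int64  (* ask the local allocator to cap the pool *)
  | Nothing

let decide ~low ~medium ~high ~free_pool_size =
  if free_pool_size < low then TopUpTo medium
  else if free_pool_size > high then
    (* which value the cap is set to is an assumption here *)
    RequestCap high
  else Nothing
```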

Xenvmd keeps track of the last update it has sent to the local
allocator, and will not resend the same request twice, unless it
is restarted.
