@@ -510,7 +510,10 @@ The `local_allocator` accepts the following configuration (via a config file):
- `free_pool`: name of the LV used to store the host's free blocks
- `devices`: list of local block devices containing the PVs
- `to_LVM`: name of the LV containing the queue of block allocations sent to `xenvmd`
- `from_LVM`: name of the LV containing the queue of messages sent from `xenvmd`.
  There are two types of messages:
  1. Free blocks to put into the free pool
  2. Cap requests to remove blocks from the free pool.
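
Since the on-disk format of this file is not shown here, the following is
only a sketch of the shape of the configuration as an OCaml record; the
field names follow the list above, while the types and example values are
assumptions:

    (* Hypothetical view of the local_allocator configuration. *)
    type local_allocator_config = {
      free_pool : string;       (* LV storing this host's free blocks *)
      devices   : string list;  (* local block devices containing the PVs *)
      to_lvm    : string;       (* LV carrying allocations sent to xenvmd *)
      from_lvm  : string;       (* LV carrying messages sent from xenvmd *)
    }

    let _example = {
      free_pool = "host1-free";
      devices   = [ "/dev/sda" ];
      to_lvm    = "host1-toLVM";
      from_lvm  = "host1-fromLVM";
    }
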
When the `local_allocator` process starts up it will read the host local
journal and
@@ -564,8 +567,10 @@ terminated in `sr_detach`. `xenvmd` has a config file containing:

- peeks updates from all the `to_LVM` queues
- calculates how much free space each host still has
- if the size of a host's free pool drops below some threshold:
  - choose some free blocks
- if the size of a host's free pool goes above some threshold:
  - request a cap of the host's free pool
- writes the change it is going to make to a journal stored in an LV
- pops the updates from the `to_LVM` queues
- pushes the updates to the `from_LVM` queues
@@ -714,3 +719,160 @@ Summary of the impact on the admin
  provisioning.
- There will be more fragmentation, but the extent size is large (4MiB) so it
  shouldn't be too bad.

Ring protocols
==============

Each ring consists of 3 sectors of metadata followed by the data area. The
contents of the first 3 sectors are:

Sector, Octet offsets | Name        | Type   | Description
----------------------|-------------|--------|------------
0, 0-30               | signature   | string | Signature ("mirage shared-block-device 1.0")
1, 0-7                | producer    | uint64 | Pointer to the end of data written by the producer
1, 8                  | suspend_ack | uint8  | Suspend acknowledgement byte
2, 0-7                | consumer    | uint64 | Pointer to the end of data read by the consumer
2, 8                  | suspend     | uint8  | Suspend request byte

Note: the producer and consumer pointers are stored in little-endian
format.
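
This layout can be decoded with the OCaml standard library alone. A minimal
sketch, assuming 512-byte sectors and a `bytes` buffer holding the three
metadata sectors (illustrative only, not the mirage implementation):

    let sector_size = 512  (* assumed sector size *)

    type header = {
      signature   : string;  (* sector 0, octets 0-30 *)
      producer    : int64;   (* sector 1, octets 0-7, little endian *)
      suspend_ack : int;     (* sector 1, octet 8 *)
      consumer    : int64;   (* sector 2, octets 0-7, little endian *)
      suspend     : int;     (* sector 2, octet 8 *)
    }

    let read_header (buf : bytes) = {
      signature   = Bytes.sub_string buf 0 31;
      producer    = Bytes.get_int64_le buf sector_size;
      suspend_ack = Bytes.get_uint8 buf (sector_size + 8);
      consumer    = Bytes.get_int64_le buf (2 * sector_size);
      suspend     = Bytes.get_uint8 buf (2 * sector_size + 8);
    }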

The pointers are free-running byte offsets rounded up to the next
4-byte boundary, and the position of the actual data is found by
finding the remainder when dividing by the size of the data area. The
producer pointer points to the first free byte, and the consumer
pointer points to the byte after the last data consumed. The actual
payload is preceded by a 4-byte length field, stored in little-endian
format. When writing a 1-byte payload, the next value of the producer
pointer will therefore be 8 bytes on from the previous: 4 for the
length (which will contain [0x01,0x00,0x00,0x00]), 1 byte for the
payload, and 3 bytes of padding.
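
The pointer arithmetic is easy to get wrong, so here is a small sketch of
it; the function names are illustrative only:

    (* Round a free-running offset up to the next 4-byte boundary. *)
    let round_up_4 x = (x + 3) land (lnot 3)

    (* Given the current producer pointer and a payload length, return the
       offset of the length field, the offset of the payload, and the next
       value of the producer pointer. *)
    let advance ~producer ~payload_len =
      let length_field = producer in
      let payload = producer + 4 in
      let producer' = round_up_4 (payload + payload_len) in
      (length_field, payload, producer')

    (* The physical position within the data area is the free-running
       pointer modulo the size of the data area. *)
    let position ~data_area_size ptr = ptr mod data_area_size

    let () =
      let _, _, p' = advance ~producer:0 ~payload_len:1 in
      assert (p' = 8)  (* the 1-byte payload example above *)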

A ring is suspended and resumed by the consumer. To suspend, the
consumer first checks that the producer and consumer agree on the
current suspend status. If they do not, the ring cannot be
suspended. The consumer then writes the byte 0x02 into byte 8 of
sector 2. The consumer must then wait for the producer to acknowledge
the suspend, which it will do by writing 0x02 into byte 8 of sector 1.
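
A sketch of that handshake from the consumer's side; `read_byte` and
`write_byte` stand in for the real sector I/O and are placeholders, not an
existing API:

    (* Byte 8 of sector 2 carries the suspend request; byte 8 of sector 1
       carries the producer's acknowledgement (512-byte sectors assumed). *)
    let suspend_offset = 2 * 512 + 8
    let suspend_ack_offset = 512 + 8

    (* Busy-wait for the producer's acknowledgement; a real implementation
       would poll with a delay or wait for an event. *)
    let rec wait_for_ack ~read_byte =
      if read_byte suspend_ack_offset <> 0x02 then wait_for_ack ~read_byte

    let suspend ~read_byte ~write_byte =
      (* Only proceed if producer and consumer agree on the current state. *)
      if read_byte suspend_offset = read_byte suspend_ack_offset then begin
        write_byte suspend_offset 0x02;
        wait_for_ack ~read_byte;
        Ok ()
      end else
        Error "producer and consumer disagree about the suspend state"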

The FromLVM ring
----------------

Two different types of message can be sent on the FromLVM ring.

The FreeAllocation message contains the blocks for the free pool.
Example message:

    (FreeAllocation((blocks((pv0(12326 12249))(pv0(11 1))))(generation 2)))

Pretty-printed:

    (FreeAllocation
      (
        (blocks
          (
            (pv0(12326 12249))
            (pv0(11 1))
          )
        )
        (generation 2)
      )
    )

This is a message to add two new sets of extents to the free pool: a
span of length 12249 extents starting at extent 12326, and a span of
length 1 starting from extent 11, both within the physical volume
'pv0'. The generation count of this message is '2'. The semantics of
the generation is that the local allocator must record the generation
of the last message it received since the FromLVM ring was resumed,
and ignore any message with a generation less than or equal to that
of the last message received.
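
A sketch of that filtering rule, assuming the allocator keeps the last
accepted generation in a mutable reference (the names and the initial
value are illustrative):

    (* Reset when the FromLVM ring is resumed. *)
    let last_generation = ref 0

    (* Accept a FreeAllocation only if its generation is newer than the
       last one accepted since the ring was resumed. *)
    let accept ~generation =
      if generation <= !last_generation then
        false   (* stale or duplicate: ignore it *)
      else begin
        last_generation := generation;
        true
      end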

The CapRequest message contains a request to cap the free pool at
a maximum size.
Example message:

    (CapRequest((cap 6127)(name host1-freeme)))

Pretty-printed:

    (CapRequest
      (
        (cap 6127)
        (name host1-freeme)
      )
    )

This is a request to cap the free pool at a maximum size of 6127
extents. The 'name' parameter gives the name of the LV into which
the extents should be transferred.
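
On the receiving side both messages could be represented by a single
variant type. This is a sketch of one possible shape: the constructor and
field names mirror the s-expressions above, but the concrete types are
assumptions:

    (* One entry in a FreeAllocation: a span of extents within a PV. *)
    type span = {
      pv_name      : string;  (* e.g. "pv0" *)
      start_extent : int64;
      extent_count : int64;
    }

    type from_lvm_message =
      | FreeAllocation of { blocks : span list; generation : int }
      | CapRequest of { cap : int64; name : string }

    (* The first example above, written as a value of this type. *)
    let _example =
      FreeAllocation {
        blocks = [
          { pv_name = "pv0"; start_extent = 12326L; extent_count = 12249L };
          { pv_name = "pv0"; start_extent = 11L;    extent_count = 1L };
        ];
        generation = 2;
      }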

The ToLVM ring
--------------

The ToLVM ring contains only one type of message. Example:

    ((volume test5)(segments(((start_extent 1)(extent_count 32)(cls(Linear((name pv0)(start_extent 12328))))))))

Pretty-printed:

    (
      (volume test5)
      (segments
        (
          (
            (start_extent 1)
            (extent_count 32)
            (cls
              (Linear
                (
                  (name pv0)
                  (start_extent 12328)
                )
              )
            )
          )
        )
      )
    )

This message extends an LV named 'test5' by giving it 32 extents
starting at extent 1, coming from PV 'pv0' starting at extent 12328.
The 'cls' field should always be 'Linear'; this is the only
acceptable value.
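
A sketch of a corresponding type on the receiving side, in the same
assumed style as the FromLVM sketch above:

    (* A segment maps a run of extents in the LV onto a run of extents in
       a PV. 'Linear' is the only segment class used. *)
    type segment = {
      start_extent : int64;   (* first extent within the LV *)
      extent_count : int64;
      cls          : [ `Linear of string * int64 ];  (* PV name, PV start extent *)
    }

    type to_lvm_message = {
      volume   : string;
      segments : segment list;
    }

    (* The example above, written as a value of this type. *)
    let _extend_test5 = {
      volume = "test5";
      segments = [
        { start_extent = 1L;
          extent_count = 32L;
          cls = `Linear ("pv0", 12328L) };
      ];
    }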

Cap requests
============

Xenvmd will try to keep the free pools of the hosts within a range
set as a fraction of free space. There are 3 parameters adjustable
via the config file:

- `low_water_mark_factor`
- `medium_water_mark_factor`
- `high_water_mark_factor`

These three are all numbers between 0 and 1. Xenvmd will sum the free
space in the VG and the sizes of all hosts' free pools to find the
total effective free size in the VG, `F`. It will then subtract the
space still wanted by any in-flight create or resize calls, `s`. The
result is divided by the number of hosts connected, `n`, and multiplied
by each of the three factors above to give the absolute values of the
high, medium and low watermarks:

    {high, medium, low} * (F - s) / n

When xenvmd notices that a host's free pool size has dropped below
the low watermark, it will top the pool up so that its size is equal
to the medium watermark. If xenvmd notices that a host's free pool
size is above the high watermark, it will issue a 'cap request' to
the host's local allocator, which will then respond by allocating
from its free pool into the fake LV; xenvmd will then delete that LV
as soon as it sees the update.
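
A sketch of the watermark calculation and the resulting decision; the
function and constructor names are illustrative, sizes are in extents,
and the exact value a cap request asks for is an assumption:

    (* Absolute watermarks from the configured factors, the effective free
       size F, the in-flight demand s and the number of connected hosts n. *)
    let watermarks ~low_factor ~medium_factor ~high_factor ~f ~s ~n =
      let per_host = (f -. s) /. float_of_int n in
      (low_factor *. per_host, medium_factor *. per_host, high_factor *. per_host)

    type action =
      | Top_up of float   (* send this many extents to the host's free pool *)
      | Cap_at of float   (* ask the local allocator to cap its free pool *)
      | Nothing

    let decide ~free_pool_size (low, medium, high) =
      if free_pool_size < low then Top_up (medium -. free_pool_size)
      else if free_pool_size > high then Cap_at high
      else Nothing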

Xenvmd keeps track of the last update it has sent to the local
allocator, and will not resend the same request twice, unless it
is restarted.