136 changes: 108 additions & 28 deletions docs/bpftune-tcp-conn.rst
@@ -1,5 +1,5 @@
================
BPFTUNE-TCP-CONG
BPFTUNE-TCP-CONN
================
--------------------------------------------------------------------------------
TCP connection bpftune plugin for auto-selection of congestion control algorithm
@@ -12,31 +12,111 @@ DESCRIPTION
===========
The TCP connection algorithm tuner sets congestion control algorithm on
TCP sockets. Linux uses cubic by default, and it works well in a wide
range of settings, however it can under-perform in lossy networks.

If we observe retransmits to a remote host, we anticipate more drops
to that host may occur; these can lead the default congestion algorithm
(cubic) to assume such drops imply congestion, and we end up with a
pessimistic congestion algorithm that greatly underperforms with respect
to potential bandwidth.

With the above in mind, we count retransmission events by remote host,
if we see >1% socket retransmits to the host in the last hour, we use
BBR as the congestion algorithm instead, anticipating these sorts of
losses may result in us under-estimating bandwidth potential.

Note that BBR retransmits more than other algorithms, so if we switch
to it we will likely see more retransmits, and potentially stay with
it for a length of time until such losses shake out.

We use the tracepoint tcp_retransmit_skb to count retransmits by
remote host, and a BPF iterator program to set congestion control
algorithm, since it allows us to update congestion control for
existing connections such as an iSCSI connection, which may exist
prior to bpftune starting. For legacy bpftune - where iterators
are not present - we fall back to using tcpbpf, but at a price;
only connections that are created after bpftune starts are supported
since we need to enable the retransmit sock op.

Reference: https://blog.apnic.net/2020/01/10/when-to-use-and-not-use-bbr
range of settings.

However, in situations where losses are observed, it can underestimate
network capacity and as a result throughput can drop excessively. In
such cases, BBR is a good fit since it continuously estimates bottleneck
bandwidth and attempts to fit the congestion algorithm to it.

In selecting the appropriate congestion control algorithm, a
reinforcement learning-based method is used whereby we choose the
congestion control algorithm that best fits the optimal bandwidth
delay product (BDP)::

BDP = BottleneckBandwidth * MinRoundTripTime
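
As a worked example (with made-up link numbers, not values taken from
bpftune), the BDP computation is just:

```c
#include <assert.h>

/* BDP in bytes, given bottleneck bandwidth in bytes/sec and minimum
 * RTT in microseconds; a hypothetical helper for illustration only. */
static unsigned long long bdp_bytes(unsigned long long bw_bytes_per_sec,
				    unsigned long long min_rtt_usec)
{
	return bw_bytes_per_sec * min_rtt_usec / 1000000ULL;
}
```

A 100 Mbit/s (12.5 MB/s) link with a 10 ms minimum RTT gives a BDP of
125000 bytes; keeping roughly that much data in flight fills the pipe
without building queues.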

The algorithm works as follows: BPF maintains a map of metrics keyed
by remote IP address. For each remote IP address, we track the
minimum RTT observed across all TCP connections and the max bandwidth
observed. The former tells us - as closely as we can determine -
what the true RTT of the link is. The latter estimates the
bandwidth limit of the link. Knowing both of these allows us to
determine the optimum operating state for a congestion control
algorithm, where we feed the pipe enough to reach bandwidth limits but
do not overwhelm it.
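
A plain C sketch of the kind of per-remote-host record involved (the
struct and function names here are assumptions for illustration, not
the tuner's actual definitions, which live in a BPF hash map keyed by
remote IP address):

```c
#include <assert.h>

/* Hypothetical per-remote-host metrics record */
struct remote_metrics {
	unsigned long long min_rtt;	 /* lowest RTT seen, usec */
	unsigned long long max_delivery; /* highest delivery rate seen */
};

/* Fold one connection sample into the per-host record */
static void metrics_update(struct remote_metrics *m,
			   unsigned long long rtt,
			   unsigned long long delivery_rate)
{
	if (!m->min_rtt || rtt < m->min_rtt)
		m->min_rtt = rtt;
	if (delivery_rate > m->max_delivery)
		m->max_delivery = delivery_rate;
}
```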

Tracking both of these allows us to determine that optimum BDP, so any
loss function we use for optimization should drive us towards congestion
control algorithms that realize that optimal BDP by being as close
as possible to the minimum RTT and as close as possible to the maximum
packet delivery rate. We cannot use raw BDP alone because it is
composed of the delivery rate and the RTT, so instead the metric used
is::

(current_min_rtt - overall_min_rtt)*S/overall_min_rtt +
(overall_max_delivery_rate - cong_alg_max_delivery_rate)*S/overall_max_delivery_rate

Both denominators are scaled by a scaling factor S to ensure integer
division yields nonzero values. See ../src/tcp_conn_tuner.h for the
metric computation.
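
The metric can be sketched in C as follows (the value of S and the
function name are assumptions for illustration; the real computation
is in ../src/tcp_conn_tuner.h):

```c
#include <assert.h>

#define S 1000ULL	/* assumed scaling factor for integer division */

/* Cost of a congestion control algorithm: 0 when the connection hits
 * both the overall minimum RTT and the overall max delivery rate;
 * positive otherwise. */
static unsigned long long cong_metric(unsigned long long curr_min_rtt,
				      unsigned long long overall_min_rtt,
				      unsigned long long alg_max_rate,
				      unsigned long long overall_max_rate)
{
	return (curr_min_rtt - overall_min_rtt) * S / overall_min_rtt +
	       (overall_max_rate - alg_max_rate) * S / overall_max_rate;
}
```

A connection matching the overall bests (say RTT 3us, delivery rate
10024) costs 0; doubling the RTT and halving the delivery rate costs
1000 + 500 = 1500.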

Note that while we use the current RTT for the connection, we use the
maximum delivery rate observed for the congestion algorithm to compare
with the overall maximum. The reasoning here is that because the
delivery rate fluctuates so much for different connections (depending
on service type etc), it is too unstable to use it on a per-connection
basis. RTT is less variable across connections so we can use the
current RTT in metric calculation.

For a TCP connection with optimal BDP (minimum RTT + max delivery rate),
the loss function yields 0. Otherwise it yields a positive cost. This
is used to update the cost for that congestion control algorithm via
the usual reinforcement learning algorithm, i.e.::

cong_alg_cost = cong_alg_cost +
learning_rate*(curr_cost - cong_alg_cost)
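
The update above, with the learning rate expressed as a right bitshift
(rate = 1/2^bitshift) as in this PR's include/bpftune/rl.h, can be
sketched as:

```c
#include <assert.h>

/* Exponentially-weighted cost update; learning rate is 1/2^bitshift,
 * so bitshift == 2 moves the stored cost 25% of the way toward each
 * newly observed cost. */
static unsigned long long rl_cost_update(unsigned long long cost,
					 unsigned long long curr_cost,
					 unsigned char bitshift)
{
	if (!cost)
		return curr_cost;	/* first observation seeds the cost */
	if (curr_cost > cost)
		return cost + ((curr_cost - cost) >> bitshift);
	if (curr_cost < cost)
		return cost - ((cost - curr_cost) >> bitshift);
	return cost;
}
```

For example, a stored cost of 100 and a new observation of 200 with
bitshift 2 yields 125: the estimate moves a quarter of the gap.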

We use an epsilon-greedy approach, whereby the vast majority of the time
the lowest-cost algorithm is used, but 5% of the time we randomly select
an algorithm. This ensures that if network conditions change we can
adapt accordingly - without this, we can get stuck and never discover
that another algorithm is doing better.
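
A deterministic sketch of that selection policy (the in-tree helper in
include/bpftune/rl.h draws fresh random numbers with
bpf_get_prandom_u32(); here the draws r1 and r2 are parameters so the
behavior is visible):

```c
#include <assert.h>

/* With epsilon == 20, roughly 1 draw in 20 (5%) explores a random
 * algorithm; otherwise we exploit the current lowest-cost one. */
static unsigned int pick_state(unsigned int greedy_state,
			       unsigned int num_states,
			       unsigned int epsilon,
			       unsigned int r1, unsigned int r2)
{
	if (r1 % epsilon)
		return greedy_state;	/* exploit */
	return r2 % num_states;		/* explore */
}
```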

How does this work in practice? To benchmark this we tested iperf3
performance between network namespaces on the same system, with a 10%
loss rate imposed via netem. What we see is that bpftune converges
to using BBR::

IPAddress CongAlg Metric Count Greedy MinRtt MaxRtDlvr
192.168.168.1 cubic 2338876 9 9 3 1737
192.168.168.1 bbr 923173 61 59 3 10024
192.168.168.1 htcp 2318283 5 4 3 620
192.168.168.1 dctcp 3506360 3 1 9 160

Note that we selected the BBR congestion control algorithm 61 out of 78
times and its associated cost was less than half of that of other
algorithms. This is due to it exhibiting the maximum delivery rate and
lowest RTT.

iperf3 performance also improved as a result of selecting BBR, from a
baseline of 58MBit/Sec (running the Linux default cubic algorithm) to
490MBit/Sec running bpftune and auto-selecting BBR.

So this approach appears to find the right answer and converge quickly
under loss conditions; what about normal network conditions?

We might worry that grounding our model in assumptions closely tied to
BBR's design might unduly favour BBR in all circumstances; do we see
this in practice outside of conditions where BBR is optimal?

Thankfully no; we see a convergence to dctcp as the optimal congestion
control algorithm; again it has the maximum delivery rate and minimum
RTT::

IPAddress CongAlg Metric Count Greedy MinRtt MaxRtDlvr
192.168.168.1 cubic 1710535 6 4 3 8951
192.168.168.1 bbr 2309881 1 1 7 206
192.168.168.1 htcp 3333333 3 3 3 8784
192.168.168.1 dctcp 1466296 71 70 3 9377

Note however that it is a close-run thing: cubic's metric is close, it
matches dctcp for minimum RTT (3us), and its maximum delivery rate is
nearly as high (8951 for cubic versus 9377 for dctcp).

References:

BBR: Congestion-Based Congestion Control

https://queue.acm.org/detail.cfm?id=3022184

11 changes: 8 additions & 3 deletions include/bpftune/bpftune.bpf.h
@@ -124,12 +124,13 @@ static __always_inline typeof(name(0)) ____##name(struct pt_regs *ctx, ##args)
#endif /* BPFTUNE_LEGACY */

#if LIBBPF_DEPRECATED_APIS
#define BPF_MAP_DEF(_name, _type, _key_size, _value, _max_entries) \
#define BPF_MAP_DEF(_name, _type, _key, _value, _max_entries, _flags)\
struct bpf_map_def SEC("maps") _name = { \
.type = _type, \
.key_size = sizeof(_key), \
.value_size = sizeof(_value), \
.max_entries = _max_entries, \
.map_flags = _flags, \
}

#define BPF_RINGBUF(_name, _max_entries) \
@@ -138,12 +139,13 @@ static __always_inline typeof(name(0)) ____##name(struct pt_regs *ctx, ##args)
.max_entries = _max_entries, \
}
#else
#define BPF_MAP_DEF(_name, _type, _key, _value, _max_entries) \
#define BPF_MAP_DEF(_name, _type, _key, _value, _max_entries, _flags) \
struct { \
__uint(type, _type); \
__type(key, _key); \
__type(value, _value); \
__uint(max_entries, _max_entries); \
__uint(map_flags, _flags); \
} _name SEC(".maps")

#define BPF_RINGBUF(_name, _max_entries) \
@@ -191,10 +193,11 @@ unsigned short bpftune_learning_rate;

#include <bpftune/bpftune.h>
#include <bpftune/corr.h>
#include <bpftune/rl.h>

BPF_RINGBUF(ring_buffer_map, 128 * 1024);

BPF_MAP_DEF(netns_map, BPF_MAP_TYPE_HASH, __u64, __u64, 65536);
BPF_MAP_DEF(netns_map, BPF_MAP_TYPE_HASH, __u64, __u64, 65536, 0);

unsigned int tuner_id;
unsigned int strategy_id;
@@ -268,6 +271,8 @@ unsigned long bpftune_init_net;

bool debug;

#define __barrier asm volatile("" ::: "memory")

#define bpftune_log(...) __bpf_printk(__VA_ARGS__)
#define bpftune_debug(...) if (debug) __bpf_printk(__VA_ARGS__)

2 changes: 2 additions & 0 deletions include/bpftune/bpftune.h
@@ -59,6 +59,8 @@ extern unsigned short bpftune_learning_rate;
#define MINUTE (60 * SECOND)
#define HOUR (3600 * SECOND)

#define USEC_PER_SEC 1000000

#define NEARLY_FULL(val, limit) \
((val) >= (limit) || (val) + ((limit) >> BPFTUNE_BITSHIFT) >= (limit))

15 changes: 10 additions & 5 deletions include/bpftune/libbpftune.h
@@ -236,6 +236,12 @@ void bpftuner_tunables_fini(struct bpftuner *tuner);
__err; \
})

#define bpftuner_bpf_skel_val(tuner_name, tuner, val) \
(tuner->bpf_support == BPFTUNE_SUPPORT_NORMAL ? \
((struct tuner_name##_tuner_bpf *)tuner->skel)->val : \
tuner->bpf_support == BPFTUNE_SUPPORT_LEGACY ? \
((struct tuner_name##_tuner_bpf_legacy *)tuner->skel)->val : \
((struct tuner_name##_tuner_bpf_nobtf *)tuner->skel)->val)

#define bpftuner_bpf_var_set(tuner_name, tuner, var, val) \
do { \
@@ -259,11 +265,10 @@ void bpftuner_tunables_fini(struct bpftuner *tuner);
} while (0)

#define bpftuner_bpf_var_get(tuner_name, tuner, var) \
(tuner->bpf_support == BPFTUNE_SUPPORT_NORMAL ? \
((struct tuner_name##_tuner_bpf *)tuner->skel)->bss->var : \
tuner->bpf_support == BPFTUNE_SUPPORT_LEGACY ? \
((struct tuner_name##_tuner_bpf_legacy *)tuner->skel)->bss->var : \
((struct tuner_name##_tuner_bpf_nobtf *)tuner->skel)->bss->var)
bpftuner_bpf_skel_val(tuner_name, tuner, bss->var)

#define bpftuner_bpf_map_get(tuner_name, tuner, map) \
bpftuner_bpf_skel_val(tuner_name, tuner, maps.map)

enum bpftune_support_level bpftune_bpf_support(void);
bool bpftune_have_vmlinux_btf(void);
55 changes: 55 additions & 0 deletions include/bpftune/rl.h
@@ -0,0 +1,55 @@
/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
/*
* Copyright (c) 2023, Oracle and/or its affiliates.
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public
* License v2 as published by the Free Software Foundation.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* General Public License for more details.
*
* You should have received a copy of the GNU General Public
* License along with this program; if not, write to the
* Free Software Foundation, Inc., 59 Temple Place - Suite 330,
* Boston, MA 021110-1307, USA.
*/

#ifndef _RL_H
#define _RL_H

#ifdef __KERNEL__

/* choose a random state roughly once in every epsilon selections */
static __always_inline int epsilon_greedy(__u32 greedy_state, __u32 num_states,
__u32 epsilon)
{
__u32 r = bpf_get_prandom_u32();

if (r % epsilon)
return greedy_state;
/* need a fresh random number, since we already know r % epsilon == 0. */
r = bpf_get_prandom_u32();
return r % num_states;
}

#endif /* __KERNEL__ */

/* simple RL update for a value function; use gain to update value function
* using bitshift scaling for learning rate.
*/
static __always_inline __u64 rl_update(__u64 value, __u64 gain, __u8 bitshift)
{
if (!value)
return gain;
if (gain > value)
return value + ((gain - value) >> bitshift);
else if (gain < value)
return value - ((value - gain) >> bitshift);
else
return value;
}

#endif /* _RL_H */
5 changes: 4 additions & 1 deletion src/libbpftune.c
@@ -393,7 +393,10 @@ void bpftuner_cgroup_detach(struct bpftuner *tuner, const char *prog_name,
bpftune_log(LOG_ERR, "error detaching prog fd %d, cgroup fd %d: %s\n",
prog_fd, cgroup_fd, strerror(-err));
}
}
} else {
bpftune_log(LOG_ERR, "bpftuner_cgroup_detach: could not find prog '%s'\n",
prog_name);
}
bpftune_cap_drop();
}

2 changes: 1 addition & 1 deletion src/neigh_table_tuner.bpf.c
@@ -20,7 +20,7 @@
#include <bpftune/bpftune.bpf.h>
#include "neigh_table_tuner.h"

BPF_MAP_DEF(tbl_map, BPF_MAP_TYPE_HASH, __u64, struct tbl_stats, 1024);
BPF_MAP_DEF(tbl_map, BPF_MAP_TYPE_HASH, __u64, struct tbl_stats, 1024, 0);

#ifdef BPFTUNE_LEGACY
SEC("raw_tracepoint/neigh_create")
2 changes: 1 addition & 1 deletion src/netns_tuner.bpf.c
@@ -26,7 +26,7 @@ struct setup_net {
struct net *net;
};

BPF_MAP_DEF(setup_net_map, BPF_MAP_TYPE_HASH, __u64, __u64, 65536);
BPF_MAP_DEF(setup_net_map, BPF_MAP_TYPE_HASH, __u64, __u64, 65536, 0);

SEC("kprobe/setup_net")
int BPF_KPROBE(bpftune_setup_net, struct net *net, struct user_namespace *user_ns)
2 changes: 1 addition & 1 deletion src/probe.bpf.c
@@ -22,7 +22,7 @@
#include "tcp_conn_tuner.h"

/* probe hash map */
BPF_MAP_DEF(probe_hash_map, BPF_MAP_TYPE_HASH, __u64, __u64, 65536);
BPF_MAP_DEF(probe_hash_map, BPF_MAP_TYPE_HASH, __u64, __u64, 65536, 0);

/* probe kprobe/fentry */
BPF_FENTRY(setup_net, struct net *net, struct user_namespace *user_ns)
2 changes: 1 addition & 1 deletion src/route_table_tuner.bpf.c
@@ -25,7 +25,7 @@ struct dst_net {
int entries;
};

BPF_MAP_DEF(dst_net_map, BPF_MAP_TYPE_HASH, __u64, struct dst_net, 65536);
BPF_MAP_DEF(dst_net_map, BPF_MAP_TYPE_HASH, __u64, struct dst_net, 65536, 0);

SEC("kprobe/fib6_run_gc")
int BPF_KPROBE(bpftune_fib6_run_gc_entry, unsigned long expires,
2 changes: 1 addition & 1 deletion src/tcp_buffer_tuner.bpf.c
@@ -21,7 +21,7 @@
#include "tcp_buffer_tuner.h"
#include <bpftune/corr.h>

BPF_MAP_DEF(corr_map, BPF_MAP_TYPE_HASH, struct corr_key, struct corr, 1024);
BPF_MAP_DEF(corr_map, BPF_MAP_TYPE_HASH, struct corr_key, struct corr, 1024, 0);

bool under_memory_pressure = false;
bool near_memory_pressure = false;