136 changes: 108 additions & 28 deletions docs/bpftune-tcp-conn.rst
@@ -1,5 +1,5 @@
================
BPFTUNE-TCP-CONG
BPFTUNE-TCP-CONN
================
--------------------------------------------------------------------------------
TCP connection bpftune plugin for auto-selection of congestion control algorithm
@@ -12,31 +12,111 @@ DESCRIPTION
===========
The TCP connection algorithm tuner sets congestion control algorithm on
TCP sockets. Linux uses cubic by default, and it works well in a wide
range of settings, however it can under-perform in lossy networks.

If we observe retransmits to a remote host, we anticipate more drops
to that host may occur; these can lead the default congestion algorithm
(cubic) to assume such drops imply congestion, and we end up with a
pessimistic congestion algorithm that greatly underperforms with respect
to potential bandwidth.

With the above in mind, we count retransmission events by remote host,
if we see >1% socket retransmits to the host in the last hour, we use
BBR as the congestion algorithm instead, anticipating these sorts of
losses may result in us under-estimating bandwidth potential.

Note that BBR retransmits more than other algorithms, so if we switch
to it we will likely see more retransmits, and potentially stay with
it for a length of time until such losses shake out.

We use the tracepoint tcp_retransmit_skb to count retransmits by
remote host, and a BPF iterator program to set congestion control
algorithm, since it allows us to update congestion control for
existing connections such as an iSCSI connection, which may exist
prior to bpftune starting. For legacy bpftune - where iterators
are not present - we fall back to using tcpbpf, but at a price;
only connections that are created after bpftune starts are supported
since we need to enable the retransmit sock op.

Reference: https://blog.apnic.net/2020/01/10/when-to-use-and-not-use-bbr
range of settings.

However, in situations where losses are observed, it can underestimate
network capacity and as a result throughput can drop excessively. In
such cases, BBR is a good fit since it continuously estimates bottleneck
bandwidth and attempts to fit the congestion algorithm to it.

In selecting the appropriate congestion control algorithm, a
reinforcement learning-based method is used whereby we choose the
congestion control algorithm that best fits the optimal bandwidth
delay product (BDP)::

BDP = BottleneckBandwidth * MinRoundTripTime
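
As a worked example (with made-up link numbers, not values taken from
bpftune), the BDP computation is just:

```c
#include <assert.h>

/* BDP in bytes, given bottleneck bandwidth in bytes/sec and minimum
 * RTT in microseconds; a hypothetical helper for illustration only. */
static unsigned long long bdp_bytes(unsigned long long bw_bytes_per_sec,
				    unsigned long long min_rtt_usec)
{
	return bw_bytes_per_sec * min_rtt_usec / 1000000ULL;
}
```

A 100 Mbit/s (12.5 MB/s) link with a 10 ms minimum RTT gives a BDP of
125000 bytes; keeping roughly that much data in flight fills the pipe
without building queues.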

The algorithm works as follows: BPF maintains a map of metrics keyed
by remote IP address. For each remote IP address, we track the
minimum RTT observed across all TCP connections and the max bandwidth
observed. The former tells us - as closely as we can determine -
what the true RTT of the link is. The latter estimates the
bandwidth limit of the link. Knowing both of these allows us to
determine the optimum operating state for a congestion control
algorithm, where we feed the pipe enough to reach bandwidth limits but
do not overwhelm it.
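
A plain C sketch of the kind of per-remote-host record involved (the
struct and function names here are assumptions for illustration, not
the tuner's actual definitions, which live in a BPF hash map keyed by
remote IP address):

```c
#include <assert.h>

/* Hypothetical per-remote-host metrics record */
struct remote_metrics {
	unsigned long long min_rtt;	 /* lowest RTT seen, usec */
	unsigned long long max_delivery; /* highest delivery rate seen */
};

/* Fold one connection sample into the per-host record */
static void metrics_update(struct remote_metrics *m,
			   unsigned long long rtt,
			   unsigned long long delivery_rate)
{
	if (!m->min_rtt || rtt < m->min_rtt)
		m->min_rtt = rtt;
	if (delivery_rate > m->max_delivery)
		m->max_delivery = delivery_rate;
}
```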

Tracking both of these allows us to determine that optimum BDP, so any
loss function we use for optimization should drive us towards congestion
control algorithms that realize that optimal BDP by being as close
as possible to the minimum RTT and as close as possible to the maximum
packet delivery rate. We cannot use raw BDP alone because it is
composed of the delivery rate and the RTT, so instead the metric used
is::

(current_min_rtt - overall_min_rtt)*S/overall_min_rtt +
(overall_max_delivery_rate - cong_alg_max_delivery_rate)*S/overall_max_delivery_rate

Both denominators are scaled by a scaling factor S to ensure integer
division yields nonzero values. See ../src/tcp_conn_tuner.h for the
metric computation.
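
The metric can be sketched in C as follows (the value of S and the
function name are assumptions for illustration; the real computation
is in ../src/tcp_conn_tuner.h):

```c
#include <assert.h>

#define S 1000ULL	/* assumed scaling factor for integer division */

/* Cost of a congestion control algorithm: 0 when the connection hits
 * both the overall minimum RTT and the overall max delivery rate;
 * positive otherwise. */
static unsigned long long cong_metric(unsigned long long curr_min_rtt,
				      unsigned long long overall_min_rtt,
				      unsigned long long alg_max_rate,
				      unsigned long long overall_max_rate)
{
	return (curr_min_rtt - overall_min_rtt) * S / overall_min_rtt +
	       (overall_max_rate - alg_max_rate) * S / overall_max_rate;
}
```

A connection matching the overall bests (say RTT 3us, delivery rate
10024) costs 0; doubling the RTT and halving the delivery rate costs
1000 + 500 = 1500.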

Note that while we use the current RTT for the connection, we use the
maximum delivery rate observed for the congestion algorithm to compare
with the overall maximum. The reasoning here is that because the
delivery rate fluctuates so much for different connections (depending
on service type etc), it is too unstable to use it on a per-connection
basis. RTT is less variable across connections so we can use the
current RTT in metric calculation.

For a TCP connection with optimal BDP (minimum RTT + max delivery rate),
the loss function yields 0. Otherwise it yields a positive cost. This
is used to update the cost for that congestion control algorithm via
the usual reinforcement learning algorithm, i.e.::

cong_alg_cost = cong_alg_cost +
learning_rate*(curr_cost - cong_alg_cost)
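
The update above, with the learning rate expressed as a right bitshift
(rate = 1/2^bitshift) as in this PR's include/bpftune/rl.h, can be
sketched as:

```c
#include <assert.h>

/* Exponentially-weighted cost update; learning rate is 1/2^bitshift,
 * so bitshift == 2 moves the stored cost 25% of the way toward each
 * newly observed cost. */
static unsigned long long rl_cost_update(unsigned long long cost,
					 unsigned long long curr_cost,
					 unsigned char bitshift)
{
	if (!cost)
		return curr_cost;	/* first observation seeds the cost */
	if (curr_cost > cost)
		return cost + ((curr_cost - cost) >> bitshift);
	if (curr_cost < cost)
		return cost - ((cost - curr_cost) >> bitshift);
	return cost;
}
```

For example, a stored cost of 100 and a new observation of 200 with
bitshift 2 yields 125: the estimate moves a quarter of the gap.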

We use an epsilon-greedy approach, whereby the vast majority of the time
the lowest-cost algorithm is used, but 5% of the time we randomly select
an algorithm. This ensures that if network conditions change we can
adapt accordingly - without this, we can get stuck and never discover
that another algorithm is doing better.
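
A deterministic sketch of that selection policy (the in-tree helper in
include/bpftune/rl.h draws fresh random numbers with
bpf_get_prandom_u32(); here the draws r1 and r2 are parameters so the
behavior is visible):

```c
#include <assert.h>

/* With epsilon == 20, roughly 1 draw in 20 (5%) explores a random
 * algorithm; otherwise we exploit the current lowest-cost one. */
static unsigned int pick_state(unsigned int greedy_state,
			       unsigned int num_states,
			       unsigned int epsilon,
			       unsigned int r1, unsigned int r2)
{
	if (r1 % epsilon)
		return greedy_state;	/* exploit */
	return r2 % num_states;		/* explore */
}
```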

How does this work in practice? To benchmark this we tested iperf3
performance between network namespaces on the same system, with a 10%
loss rate imposed via netem. What we see is that bpftune converges
to using BBR::

IPAddress CongAlg Metric Count Greedy MinRtt MaxRtDlvr
192.168.168.1 cubic 2338876 9 9 3 1737
192.168.168.1 bbr 923173 61 59 3 10024
192.168.168.1 htcp 2318283 5 4 3 620
192.168.168.1 dctcp 3506360 3 1 9 160

Note that we selected the BBR congestion control algorithm 61 out of 78
times and its associated cost was less than half of that of other
algorithms. This is due to it exhibiting the maximum delivery rate and
lowest RTT.

iperf3 performance also improved as a result of selecting BBR, from a
baseline of 58MBit/Sec (running the Linux default cubic algorithm) to
490MBit/Sec running bpftune and auto-selecting BBR.

So this approach appears to find the right answer and converge quickly
under loss conditions; what about normal network conditions?

We might worry that grounding our model in assumptions closely tied to
BBR's design might unduly favour BBR in all circumstances; do we see
this in practice outside of conditions where BBR is optimal?

Thankfully no; we see a convergence to dctcp as the optimal congestion
control algorithm; again it has the maximum delivery rate and minimum
RTT::

IPAddress CongAlg Metric Count Greedy MinRtt MaxRtDlvr
192.168.168.1 cubic 1710535 6 4 3 8951
192.168.168.1 bbr 2309881 1 1 7 206
192.168.168.1 htcp 3333333 3 3 3 8784
192.168.168.1 dctcp 1466296 71 70 3 9377

Note however that it is a close-run thing: cubic's metric is close, it
matches dctcp for minimum RTT (3us), and its maximum delivery rate is
nearly as high (8951 for cubic versus 9377 for dctcp).

References:

BBR: Congestion-Based Congestion Control

https://queue.acm.org/detail.cfm?id=3022184

11 changes: 8 additions & 3 deletions include/bpftune/bpftune.bpf.h
@@ -124,12 +124,13 @@ static __always_inline typeof(name(0)) ____##name(struct pt_regs *ctx, ##args)
#endif /* BPFTUNE_LEGACY */

#if LIBBPF_DEPRECATED_APIS
#define BPF_MAP_DEF(_name, _type, _key_size, _value, _max_entries) \
#define BPF_MAP_DEF(_name, _type, _key, _value, _max_entries, _flags)\
struct bpf_map_def SEC("maps") _name = { \
.type = _type, \
.key_size = sizeof(_key), \
.value_size = sizeof(_value), \
.max_entries = _max_entries, \
.map_flags = _flags, \
}

#define BPF_RINGBUF(_name, _max_entries) \
@@ -138,12 +139,13 @@ static __always_inline typeof(name(0)) ____##name(struct pt_regs *ctx, ##args)
.max_entries = _max_entries, \
}
#else
#define BPF_MAP_DEF(_name, _type, _key, _value, _max_entries) \
#define BPF_MAP_DEF(_name, _type, _key, _value, _max_entries, _flags) \
struct { \
__uint(type, _type); \
__type(key, _key); \
__type(value, _value); \
__uint(max_entries, _max_entries); \
__uint(map_flags, _flags); \
} _name SEC(".maps")

#define BPF_RINGBUF(_name, _max_entries) \
@@ -191,10 +193,11 @@ unsigned short bpftune_learning_rate;

#include <bpftune/bpftune.h>
#include <bpftune/corr.h>
#include <bpftune/rl.h>

BPF_RINGBUF(ring_buffer_map, 128 * 1024);

BPF_MAP_DEF(netns_map, BPF_MAP_TYPE_HASH, __u64, __u64, 65536);
BPF_MAP_DEF(netns_map, BPF_MAP_TYPE_HASH, __u64, __u64, 65536, 0);

unsigned int tuner_id;
unsigned int strategy_id;
@@ -268,6 +271,8 @@ unsigned long bpftune_init_net;

bool debug;

#define __barrier asm volatile("" ::: "memory")

#define bpftune_log(...) __bpf_printk(__VA_ARGS__)
#define bpftune_debug(...) if (debug) __bpf_printk(__VA_ARGS__)

2 changes: 2 additions & 0 deletions include/bpftune/bpftune.h
@@ -59,6 +59,8 @@ extern unsigned short bpftune_learning_rate;
#define MINUTE (60 * SECOND)
#define HOUR (3600 * SECOND)

#define USEC_PER_SEC 1000000

#define NEARLY_FULL(val, limit) \
((val) >= (limit) || (val) + ((limit) >> BPFTUNE_BITSHIFT) >= (limit))

15 changes: 10 additions & 5 deletions include/bpftune/libbpftune.h
@@ -236,6 +236,12 @@ void bpftuner_tunables_fini(struct bpftuner *tuner);
__err; \
})

#define bpftuner_bpf_skel_val(tuner_name, tuner, val) \
(tuner->bpf_support == BPFTUNE_SUPPORT_NORMAL ? \
((struct tuner_name##_tuner_bpf *)tuner->skel)->val : \
tuner->bpf_support == BPFTUNE_SUPPORT_LEGACY ? \
((struct tuner_name##_tuner_bpf_legacy *)tuner->skel)->val : \
((struct tuner_name##_tuner_bpf_nobtf *)tuner->skel)->val)

#define bpftuner_bpf_var_set(tuner_name, tuner, var, val) \
do { \
@@ -259,11 +265,10 @@ void bpftuner_tunables_fini(struct bpftuner *tuner);
} while (0)

#define bpftuner_bpf_var_get(tuner_name, tuner, var) \
(tuner->bpf_support == BPFTUNE_SUPPORT_NORMAL ? \
((struct tuner_name##_tuner_bpf *)tuner->skel)->bss->var : \
tuner->bpf_support == BPFTUNE_SUPPORT_LEGACY ? \
((struct tuner_name##_tuner_bpf_legacy *)tuner->skel)->bss->var : \
((struct tuner_name##_tuner_bpf_nobtf *)tuner->skel)->bss->var)
bpftuner_bpf_skel_val(tuner_name, tuner, bss->var)

#define bpftuner_bpf_map_get(tuner_name, tuner, map) \
bpftuner_bpf_skel_val(tuner_name, tuner, maps.map)

enum bpftune_support_level bpftune_bpf_support(void);
bool bpftune_have_vmlinux_btf(void);
55 changes: 55 additions & 0 deletions include/bpftune/rl.h
@@ -0,0 +1,55 @@
/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
/*
* Copyright (c) 2023, Oracle and/or its affiliates.
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public
* License v2 as published by the Free Software Foundation.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* General Public License for more details.
*
* You should have received a copy of the GNU General Public
* License along with this program; if not, write to the
* Free Software Foundation, Inc., 59 Temple Place - Suite 330,
* Boston, MA 021110-1307, USA.
*/

#ifndef _RL_H
#define _RL_H

#ifdef __KERNEL__

/* choose a random state roughly once in every epsilon selections */
static __always_inline int epsilon_greedy(__u32 greedy_state, __u32 num_states,
__u32 epsilon)
{
__u32 r = bpf_get_prandom_u32();

if (r % epsilon)
return greedy_state;
/* need a fresh random number, since we already know r % epsilon == 0. */
r = bpf_get_prandom_u32();
return r % num_states;
}

#endif /* __KERNEL__ */

/* simple RL update for a value function; use gain to update value function
* using bitshift scaling for learning rate.
*/
static __always_inline __u64 rl_update(__u64 value, __u64 gain, __u8 bitshift)
{
if (!value)
return gain;
if (gain > value)
return value + ((gain - value) >> bitshift);
else if (gain < value)
return value - ((value - gain) >> bitshift);
else
return value;
}

#endif /* _RL_H */
5 changes: 4 additions & 1 deletion src/libbpftune.c
@@ -393,7 +393,10 @@ void bpftuner_cgroup_detach(struct bpftuner *tuner, const char *prog_name,
bpftune_log(LOG_ERR, "error detaching prog fd %d, cgroup fd %d: %s\n",
prog_fd, cgroup_fd, strerror(-err));
}
}
} else {
bpftune_log(LOG_ERR, "bpftuner_cgroup_detach: could not find prog '%s'\n",
prog_name);
}
bpftune_cap_drop();
}

2 changes: 1 addition & 1 deletion src/neigh_table_tuner.bpf.c
@@ -20,7 +20,7 @@
#include <bpftune/bpftune.bpf.h>
#include "neigh_table_tuner.h"

BPF_MAP_DEF(tbl_map, BPF_MAP_TYPE_HASH, __u64, struct tbl_stats, 1024);
BPF_MAP_DEF(tbl_map, BPF_MAP_TYPE_HASH, __u64, struct tbl_stats, 1024, 0);

#ifdef BPFTUNE_LEGACY
SEC("raw_tracepoint/neigh_create")
2 changes: 1 addition & 1 deletion src/netns_tuner.bpf.c
@@ -26,7 +26,7 @@ struct setup_net {
struct net *net;
};

BPF_MAP_DEF(setup_net_map, BPF_MAP_TYPE_HASH, __u64, __u64, 65536);
BPF_MAP_DEF(setup_net_map, BPF_MAP_TYPE_HASH, __u64, __u64, 65536, 0);

SEC("kprobe/setup_net")
int BPF_KPROBE(bpftune_setup_net, struct net *net, struct user_namespace *user_ns)
2 changes: 1 addition & 1 deletion src/probe.bpf.c
@@ -22,7 +22,7 @@
#include "tcp_conn_tuner.h"

/* probe hash map */
BPF_MAP_DEF(probe_hash_map, BPF_MAP_TYPE_HASH, __u64, __u64, 65536);
BPF_MAP_DEF(probe_hash_map, BPF_MAP_TYPE_HASH, __u64, __u64, 65536, 0);

/* probe kprobe/fentry */
BPF_FENTRY(setup_net, struct net *net, struct user_namespace *user_ns)
2 changes: 1 addition & 1 deletion src/route_table_tuner.bpf.c
@@ -25,7 +25,7 @@ struct dst_net {
int entries;
};

BPF_MAP_DEF(dst_net_map, BPF_MAP_TYPE_HASH, __u64, struct dst_net, 65536);
BPF_MAP_DEF(dst_net_map, BPF_MAP_TYPE_HASH, __u64, struct dst_net, 65536, 0);

SEC("kprobe/fib6_run_gc")
int BPF_KPROBE(bpftune_fib6_run_gc_entry, unsigned long expires,
2 changes: 1 addition & 1 deletion src/tcp_buffer_tuner.bpf.c
@@ -21,7 +21,7 @@
#include "tcp_buffer_tuner.h"
#include <bpftune/corr.h>

BPF_MAP_DEF(corr_map, BPF_MAP_TYPE_HASH, struct corr_key, struct corr, 1024);
BPF_MAP_DEF(corr_map, BPF_MAP_TYPE_HASH, struct corr_key, struct corr, 1024, 0);

bool under_memory_pressure = false;
bool near_memory_pressure = false;