Shire is a new approach to designing FPGA-accelerated middleboxes that simplifies development, debugging, and performance tuning by decoupling the tasks of hardware accelerator implementation and software application programming. Shire is a framework that links hardware accelerators to a high-performance packet processing pipeline through a standardized hardware/software interface. This separation of concerns lets hardware developers focus on optimizing custom accelerators while freeing software programmers to reuse, configure, and debug accelerators in a fashion akin to software libraries. We demonstrate the benefits of the Shire framework through two examples, built together in less than a month: a firewall based on a large blacklist, and a port of the Pigasus IDS pattern-matching accelerator. Our experiments show that Shire delivers high performance, serving ∼200 Gbps of traffic while adding only 0.7–7 microseconds of latency.
More information can be found in our paper: https://arxiv.org/abs/2201.08978
To build FPGA images we used Vivado 2021.1*. Licenses for the pcie_ultra_plus and CMAC hard IPs are also required. To compile programs for RISC-V we use riscv-gcc; on Arch Linux you can install it with pacman, and on Ubuntu you can build it from source:
sudo apt-get install autoconf automake autotools-dev curl python3 libmpc-dev libmpfr-dev libgmp-dev gawk build-essential bison flex texinfo gperf libtool patchutils bc zlib1g-dev libexpat-dev
git clone [email protected]:riscv/riscv-gnu-toolchain.git
cd riscv-gnu-toolchain/
./configure --prefix=/opt/riscv --enable-multilib
sudo make -j 32
Then add /opt/riscv/bin to the PATH. You can change the install location by changing the --prefix option at configure time. (The sudo in the last step is only needed if you don't have write permission to /opt.)
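For example, for the current shell (add the same line to your shell profile to make it permanent):
export PATH=/opt/riscv/bin:$PATH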
To do Partial Reconfiguration from Linux, we need the MCAP driver in addition to the driver provided in the repo. It can be acquired from:
https://github.com/ucsdsysnet/Shire/tree/master/host_utils/runtime/mcap
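Building it should just be a make in that directory, assuming it follows the usual standalone-tool flow (an assumption; check the README in that directory for the exact steps and any prerequisites such as the pciutils development headers):
cd host_utils/runtime/mcap
make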
For running the Python-based simulation infrastructure, in addition to Python 3 we need two additional pieces of software. To connect Python to the RTL simulator we use cocotb, which can be installed with:
pip install cocotb
For RTL simulation, Synopsys VCS, Questa, and Icarus Verilog are supported by cocotb. Icarus Verilog is free and can be obtained from
https://github.com/steveicarus/iverilog
To install Icarus Verilog, follow the instructions from the git repository, or simply:
git clone https://github.com/steveicarus/iverilog
cd iverilog
sh ./autoconf.sh
./configure
make
sudo make install
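To sanity-check the simulation setup, both tools can report their versions from the command line (flag names as of recent releases):
iverilog -V
cocotb-config --version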
* Vivado 2021.2 and 2022.1 use new placement and routing algorithms that break timing.
To generate an image, go to fpga_src/boards/ and then to the directory with the desired number of reconfigurable packet processors (RSUs). In the current implementation we want to buffer 256 packets in slots, so the 16-RSU variant has 16 slots per RSU and the 8-RSU variant has 32. In each of these directories there are separate make rules for the base design and for swapping the PR regions with the desired accelerator.
For example, if you run make in fpga_src/boards/VCU1525_200g_8G, it first builds the base image and then performs the Partial Reconfiguration (PR) runs in this order:
make base_0
make PIG_Hash_1
make PIG_base_2
make PIG_RR_3
The numbers at the end indicate the order. The Makefile rules run tcl scripts located in the fpga directory. base_0 is the base image with the static regions, PIG_Hash_1 (run_PIG_Hash.tcl) adds the Pigasus string-matching accelerator to the RSUs, PIG_base_2 (run_base_RR.tcl) only updates the load balancer to round robin without changing the RSUs from the base design, and PIG_RR_3 (run_PIG_RR_merge.tcl) merges the first two, taking the RSUs from PIG_Hash_1 and the load balancer from PIG_base_2.
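Putting it together for this board:
cd fpga_src/boards/VCU1525_200g_8G
make    # runs base_0, PIG_Hash_1, PIG_base_2, and PIG_RR_3 in order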
Similarly, VCU1525_200g_16G/ and AU200_200g_16G/ only have RSUs with the firewall IP, and running make executes the following rules:
make base_0
make FW_RR_1
FW_RR_1 runs run_FW_RR.tcl.
Note that in any of these directories you can remove all the generated files using
make clean
This is generally useful because undesired reuse of stale files by Vivado can cause inconsistent results.
- make base_0 runs the tcl scripts in the fpga directory for the base design: create_project.tcl generates the project and adds the required files, run_synth.tcl defines the reconfigurable regions and runs synthesis, run_impl_1.tcl performs place and route, and fpga/generate_bit.tcl generates the full FPGA image.
- add_wrapper_rect.tcl and hide_rect.tcl are used to visualize the pblocks for figures. generate_reports.tcl generates resource-utilization reports for the PR runs. force_phys_opt.tcl is rarely needed, for cases where Vivado decides the design does not need any optimization, skips it, and eventually fails; this script forces Vivado to run the optimizations anyway.
- We can go directly from the base design to the round-robin load balancer and Pigasus RSUs, but it takes longer and might fail, as it can get too challenging for the heuristic algorithms. run_PIG_RR.tcl uses this method; during development iterations it sometimes met timing and sometimes failed.
- Vivado does not support using child runs (PR runs) inside another child run; you can only reuse PR modules from the parent run (here the base run with the static regions). If all we need is to merge the PR regions from child runs, we can use non-project mode with an in_memory project to get around this issue. For example, run_PIG_RR_merge.tcl does this, picking the RSUs from the PIG_Hash_1 run and the load balancer from the PIG_base_2 run. However, if we want to change only some of the PR regions relative to another child run and then let place and route proceed, things get more complicated. Using a hacky method that edits some file contents from the Linux shell in the middle of the run, run_PIG_RR_inc.tcl can reuse the RSUs from PIG_Hash_1 and then build and attach the round-robin load balancer. That said, using an extra child run with only the load balancer changed and then merging is faster and not hacky, so that script is kept only as an archive.
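The board Makefiles drive these tcl scripts for you; if you ever need to run one by hand, the usual Vivado batch-mode invocation is a reasonable sketch (the working directory and any arguments the make rules pass are assumptions, so check the Makefile first):
cd fpga_src/boards/VCU1525_200g_8G
vivado -mode batch -source fpga/run_PIG_RR_merge.tcl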
Some example accelerators can be found in fpga_src/accel:
- pigasus_sme: Ported Pigasus string matcher accelerator
- hash: a hash accelerator for TCP/UDP headers
- ip_matcher: from-scratch firewall accelerator
- The archive directory has our older accelerators, such as a string matcher based on the Aho–Corasick algorithm.
Note that each of these directories has rtl for the Verilog code, c for the program code, tb for simulations, and python if there are scripts to generate the Verilog code.
To connect an accelerator to an RSU, use the Verilog interface provided in the accelerator wrapper (for example fpga_src/accel/ip_matcher/rtl/accel_wrap_firewall.v): simply instantiate your accelerators and hook up the MMIO registers.
To synthesize and then place and route them for the FPGA, you can use the tcl scripts in fpga_src/boards/*/fpga as examples and replace the accelerator files. As mentioned above, there are examples of how to change only the RSUs, how to change the load balancer, how to change all the PR regions together, how to merge the results of these runs, and even how to reuse another PR run in the next one.
The default load balancer is round robin for the 16-RSU designs and hash-based for the 8-RSU designs. (Currently the 16-RSU design is used for the firewall, where there is no difference between the RSUs, and the 8-RSU design is used for intrusion detection, where we need flow state.) If a different load balancer is desired, you can change it or add a new one under the board's rtl directory (e.g., fpga_src/boards/VCU1525_200g_16G/rtl) and simply replace the load balancer Verilog file in the create_project.tcl script; for example, change the ../rtl/RR_LU_scheduler_PR.v entry in fpga_src/boards/VCU1525_200g_16G/fpga/create_project.tcl.
- Unfortunately, the top-level Verilog file of a reconfigurable module cannot be parameterized in Vivado, so the load balancer examples are placed in the per-board rtl directory. If you add a new load balancer, follow suit by putting the parameters right after the module declaration, so that only the ports require updates, or use macros.
There is a micro-USB header on the supported cards that provides the JTAG interface for programming the bitfile. You can fire up Vivado, make sure the connection is detected, and select the top bitfile for programming. Another method is to use host_utils/runtime/loadbit.sh with the device ID. After programming the FPGA, a host restart is required for the PCIe IP to be properly recognized.
Shire also provides a driver to talk to the card: it appears as a normal NIC, and all further communication, even reconfiguration of RSUs, happens over PCIe, which is much faster than JTAG. We use the corundum module to provide the NIC interface. Note that the corundum hardware used here is older than the current version in the corundum repo, and the newer driver is not compatible. We added some ioctl memory ranges to corundum's driver to directly access RPU memory.
To build the driver, go to host_utils/driver and run
make
Then you can load the driver with
sudo insmod mqnic.ko
Now we need to reset the PCIe card so the driver loads properly. To do so, run pcie_hot_reset.sh from host_utils/runtime with the proper device ID. For example:
sudo ./pcie_hot_reset.sh 81:00.0
If you need to remove the driver, you can do so with:
sudo rmmod mqnic.ko
Files to compile a C program can be found in the riscv_code directory:
- riscv_encoding.h has the defines for the VexRiscv.
- core.h is the header file for functions to talk to the wrapper.
- int_handler is a default interrupt handler for when the user does not want to specify their own.
- startup.S has the boot code required for the core to initialize the stack, prepare the interrupts, and jump to the start of the code.
- link_option.ld provides the mapping of segments based on the Shire addressing.
- hex_gen.py converts the output binary files to the format required by the RISC-V cores.
- Makefile generates the proper outputs for the desired C code (set by NAME), with separate files for instruction memory and data memory that can be directly loaded.
Note that if you want to initialize part of the dmem (the small memory local to each core) or pmem (the large memory shared by the core and the accelerators), for example with tables or a data segment, you should add a .map file. For example, for pkt_gen we want to initialize the memories to zero and then load the dmem contents, so pkt_gen.map contains:
empty_dmem.bin 0x00800000
empty_pmem.bin 0x01000000
pkt_gen_data.bin 0x00800000
This file is used by the rvfw code in host_utils/runtime to load the sections in the given order, placing each binary at the provided address. In the current implementation, dmem starts at 0x00800000 and pmem starts at 0x01000000. The imem binary is read automatically by the rvfw code and does not need to be in the map file.
The empty_*.bin files can be generated using the table_gen.py script. Note that pmem is built from UltraRAMs on the currently supported boards, which are not updated by writing the bitfile to the FPGA. So if some state is stored in the pmem of an RSU, after reconfiguration it must either be zeroed out, or saved before eviction and restored after reconfiguration.
The other *.c/*.h files are used for the tests. The runtime scripts can call this Makefile directly and use the output binaries for loading the RISC-V cores.
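As a concrete example, building the pkt_gen program used above could look like the following, assuming NAME can be overridden on the make command line rather than edited inside the Makefile (an assumption worth checking):
cd riscv_code
make NAME=pkt_gen    # should emit separate instruction- and data-memory binaries ready for loading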
Files to load RISC-V programs and example C code to monitor the state are in host_utils/runtime. The main files are:
- mqnic.c/h talks to the corundum driver.
- rvfw.c/h is used to program memory of RPUs (similar to a firmware loader)
- gousheh.c/h has functions to talk to each RPU during runtime
- pr_reload.c has the functionality to use MCAP and reload an RPU.
- timespec.c/h is for Linux's timespec structure
- Makefile generates the binaries for these files. (Just run make.)
perf.c monitors the state of the RPUs during a run, and dump.c dumps the state of the RPUs. For example, they print how many packets and bytes were handled per core and in the scheduler, and whether any debug bits were set or a core sent debug messages. Note that a full-fledged debugging infrastructure between the cores, the scheduler, and the host is baked into Shire's design. For example, you can interrupt the cores in case of a hang and send them 64-bit messages in case the data channel is stuck. The pr_reload code also uses the evict interrupt as an example.
The Makefile can also be used to run other tests. It takes parameters for which cores to enable and program (ENABLE) and which cores to receive packets (RECV); ENABLE and RECV are one-hot encoded. DEBUG selects which debug register to monitor, DEV selects the desired card (e.g., if more than one is used), and TEST selects the program to be loaded. OUT_FILE sets the name of the output csv log file. As an example, and to run our tests, make do compiles the program, loads the firmware, and starts the monitoring process, in that order. Finally, the run_latency script is used for our latency measurements; it uses tcpdump and runs the code for different packet sizes.
You can run make do with the default code, which is a forwarder between the two ports, so if you feed the ports from the 100G NICs you should see bytes/packets in the output.
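A hypothetical invocation for the 16-RSU firewall board could look like the following; the values are placeholders (ENABLE and RECV are one-hot masks), so check the Makefile for the supported TEST/DEV settings and the defaults:
cd host_utils/runtime
sudo make do ENABLE=0xffff RECV=0xffff OUT_FILE=fw_run.csv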
The generated RISC-V Verilog code is placed at fpga_src/lib/Shire/rtl/VexRiscv.v. If you want to configure it differently, go to fpga_src/VexRiscv and run make edit to open the tailored configuration file. After updating the configuration, running make in fpga_src/VexRiscv rebuilds the core, and make copy copies it to the proper place (fpga_src/lib/Shire/rtl/VexRiscv.v).
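In command form:
cd fpga_src/VexRiscv
make edit    # open the tailored configuration file
make         # rebuild the core
make copy    # copy the generated VexRiscv.v to fpga_src/lib/Shire/rtl/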
Note that the connection to memory is optimized for single-cycle reads, and complicating the memory path can lower the maximum frequency below 250 MHz. The next version of the VexRiscv code did not meet timing at 250 MHz either. Using a larger RISC-V core, especially one that is 64-bit or not designed for FPGAs, might make it even harder to meet timing.
The ultimate goal of Shire is to have these RISC-V cores as hard logic to avoid this challenge altogether. Even though that limits adding new instructions, a more capable RISC-V core running at a higher frequency, alongside accelerators that can be accessed within the same cycle, can improve overall system performance even without customized instructions.
Alongside the code for each board there is a simulation framework for testing the Verilog and the C code together. Scripts and examples for single-RSU and full-Shire tests are available. As an example, fpga_src/boards/VCU1525_200g_16G/tb contains these directories:
- common has the top-level test module for full Shire, as well as common.py, which provides the same functions the host uses to communicate with the FPGA (just in Python instead of C).
- test_firewall_sg is a testbench for the firewall accelerator integrated within an RSU, where the C code can also be tested. test_gousheh.v is the top-level test module for a single RSU, and test_gousheh.py is the Python testbench; it loads the RSU memories, similar to the scripts in host_utils, and runs the desired tests.
- test_ins_load tests loading of the instruction memories and communication with host DRAM, alongside a C program that simply forwards packets as well as writing to and reading from DRAM.
- test_corundum tests the functionality of corundum, alongside a C program that simply forwards packets to the host.
- test_inter_core tests the inter-core messaging system, test_latency tests the latency code, and test_pkt_gen tests the packet generation code.
- archive holds older tests that are deprecated.
Note that the Python scripts look for the program binary, so it should be compiled beforehand (whether in riscv_code or per accelerator, e.g., fpga_src/accel/pigasus_sme/c). The accelerators can also have their own testbenches without the RISC-V core; such examples can be found in fpga_src/accel/pigasus_sme/tb. The tests can be run by simply running make.
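For instance, to run the firewall testbench mentioned above (after its C binary has been built), something like the following should work; cocotb Makefiles typically select the simulator through the SIM variable, with Icarus Verilog as a common default:
cd fpga_src/boards/VCU1525_200g_16G/tb/test_firewall_sg
make    # optionally set SIM=<simulator> to use VCS or Questa instead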
.
├── fpga_src
│   ├── VexRiscv: Copy of the VexRiscv repo at a specific commit, plus the added configuration for the tailored RISC-V core we used.
│   ├── accel: Accelerators used for Shire. The archive directory has our older, deprecated accelerators.
│   ├── boards: Currently supporting the VCU1525 (8 or 16 RPUs) and AU200 (16 RPUs).
│   └── lib: The libraries used in our design: Shire is specific to this project, while axi/axis/corundum/pcie/ethernet are imported libraries developed by Alex Forencich.
├── host_utils
│   ├── driver: mqnic is based on the corundum driver. The bump driver is for internal use and deprecated.
│   └── runtime: Contains the host-side libraries and scripts to talk to the FPGA and run tests at runtime.
└── riscv_code: Required libraries and files to build RISC-V programs.