Code repository for the SIGMOD 25 paper: "Apt-Serve: Adaptive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serving".
Apt-Serve is a serving framework prototype implemented on top of vLLM (release version: 0.5.0.post1). All add-ons introduced by the framework are located in the folder additional_designs.
Note that Apt-Serve is a research prototype and does not support the complete feature set of the latest vLLM; we adopted only the key parts of the codebase to enable faster research iteration.
- Install the backbone system (vLLM 0.5.0.post1) first, following the guidelines from https://github.com/vllm-project/vllm.
- Insert the additional designs:
bash additional_designs/insert_designs.sh
- Build and install the customized CUDA kernels that support the hybrid cache:
python additional_designs/mixed_cache_kernels/mixed_cache_setup.py build_ext --inplace
Once these steps are completed, the new designs are integrated into vLLM and ready for use.
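As an optional sanity check (not part of the original setup steps), the sketch below verifies that the patched vLLM imports and that the kernel extension built. The module name mixed_cache_kernels is a hypothetical placeholder; substitute whatever extension name mixed_cache_setup.py actually produces.

```python
# Optional sanity check. "mixed_cache_kernels" is a hypothetical module name;
# replace it with the extension name built by mixed_cache_setup.py.
import importlib

import vllm

print("vLLM version:", vllm.__version__)  # expected: 0.5.0.post1

try:
    importlib.import_module("mixed_cache_kernels")
    print("hybrid cache kernels: OK")
except ImportError as err:
    print("hybrid cache kernels not importable:", err)
```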
Follow the readme.md in the folder sample_requests_from_datasets to sample requests and create a serving trace.
The sampled requests are automatically saved into the ./sampled_datasets/ folder.
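The exact file name and schema of the saved trace are defined by the sampling script, so the snippet below is only a hypothetical illustration (it assumes the trace is stored as a JSON file):

```python
# Hypothetical illustration: the real file name and record schema are
# determined by the scripts in sample_requests_from_datasets.
import json
from pathlib import Path

trace_files = sorted(Path("./sampled_datasets").glob("*.json"))
print("found traces:", [p.name for p in trace_files])

if trace_files:
    with trace_files[0].open() as f:
        records = json.load(f)
    print("number of sampled requests:", len(records))
    print("first record:", records[0])
```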
We use OPT-13B as an example.
Start the server side with:
python -m vllm.entrypoints.openai.api_server --model facebook/opt-13b --enforce-eager --disable-log-requests
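To confirm the server is reachable before launching the workload, you can send a single completion request to the OpenAI-compatible endpoint exposed by the api_server (port 8000 by default); this is just a quick connectivity check, not part of the benchmark itself:

```python
# Quick connectivity check against vLLM's OpenAI-compatible completions API.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "facebook/opt-13b",
        "prompt": "San Francisco is a",
        "max_tokens": 32,
        "temperature": 0.0,
    },
)
print(resp.json()["choices"][0]["text"])
```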
After the server is up, start the client-side code to simulate request arrivals:
python gen_client_requests.py --model facebook/opt-13b --request-rate 3 --cv 1 --dataset sharegpt
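Here --request-rate controls the average arrival rate (presumably requests per second) and --cv the coefficient of variation of the inter-arrival gaps (cv = 1 corresponds to Poisson arrivals; larger values give burstier traffic). The sketch below shows one common way to generate such an arrival process with gamma-distributed gaps; gen_client_requests.py may implement it differently.

```python
# Sketch: gamma-distributed inter-arrival gaps parameterized by request rate
# and coefficient of variation (CV). Illustrative only; the actual client
# script may generate arrivals differently.
import numpy as np


def inter_arrival_times(num_requests: int, request_rate: float, cv: float,
                        seed: int = 0) -> np.ndarray:
    """Gamma gaps with mean 1/request_rate and the given CV.

    cv = 1 reduces to exponential gaps (a Poisson process); cv > 1 is
    burstier, cv < 1 is more regular.
    """
    rng = np.random.default_rng(seed)
    shape = 1.0 / (cv ** 2)               # gamma shape k, so CV = 1/sqrt(k)
    scale = 1.0 / (request_rate * shape)  # gamma scale, so mean = 1/request_rate
    return rng.gamma(shape, scale, size=num_requests)


gaps = inter_arrival_times(num_requests=1000, request_rate=3.0, cv=1.0)
print(f"mean gap: {gaps.mean():.3f}s (target {1/3.0:.3f}s)")
```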

