Simple scripts for stress testing LLM API endpoints by measuring throughput under varying concurrent loads.
This was made for testing my Apple Intelligence Web API, which conforms to the same standards as OpenRouter and OpenAI.
Major files:
- ./src/main.py: Measure response times and throughput under varying concurrent loads.
- ./src/plotting.py: Run data analysis and visualization.
- ./results/data.csv: Data from experiments.
- ./results/throughput_analysis.pdf: Data visualized with a curve fit.
This project uses uv for dependency management. Dependencies will be automatically installed if you run the scripts with uv run.
Alternatively, install manually with pip:
    pip install aiohttp numpy scipy matplotlib
Run the load test against your API endpoint:
    uv run ./src/main.py > ./results/data.csv &
This will send batches of concurrent requests at varying load levels and output CSV data.
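As a rough illustration of what "batches of concurrent requests at varying load levels" can look like, here is a minimal sketch. It is not the code in main.py: `fake_request` is a hypothetical stand-in for a real API call, and the CSV column names are assumptions made for this example only.

```python
import asyncio
import time


async def fake_request(delay: float = 0.01) -> str:
    """Hypothetical stand-in for one API call; returns a fixed 40-character reply."""
    await asyncio.sleep(delay)
    return "x" * 40


async def run_batch(concurrency: int) -> tuple[int, float, int]:
    """Send `concurrency` requests at once; return (concurrency, seconds, chars)."""
    start = time.perf_counter()
    replies = await asyncio.gather(*(fake_request() for _ in range(concurrency)))
    elapsed = time.perf_counter() - start
    return concurrency, elapsed, sum(len(r) for r in replies)


async def sweep(levels: list[int]) -> list[tuple[int, float, int]]:
    """Run one batch per load level, one level after another."""
    return [await run_batch(n) for n in levels]


def to_csv(rows: list[tuple[int, float, int]]) -> str:
    """Format the rows as CSV text (column names are assumed, not the repo's)."""
    lines = ["concurrency,elapsed_s,chars"]
    lines += [f"{n},{t:.4f},{c}" for n, t, c in rows]
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_csv(asyncio.run(sweep([1, 2, 4, 8]))))
```

Swapping `fake_request` for a real HTTP call (for example with aiohttp, which is already a dependency) would turn this into an actual load test.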
Visualize the throughput degradation:
    uv run ./src/plotting.py
This generates a plot showing:
- Raw throughput measurements
- Binned means with error bars
- Fitted exponential decay curve
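The binning and curve-fitting steps can be sketched as follows. This is a stdlib-only stand-in, not what plotting.py does: the function names are hypothetical, and the fit uses least squares on ln(y) rather than a nonlinear fit (such as one done with scipy), so the two approaches can give slightly different parameters on noisy data.

```python
import math
from collections import defaultdict
from statistics import mean, stdev


def binned_means(xs, ys, bin_width):
    """Group points by floor(x / bin_width); return (bin_center, mean, std) per bin."""
    bins = defaultdict(list)
    for x, y in zip(xs, ys):
        bins[math.floor(x / bin_width)].append(y)
    out = []
    for k in sorted(bins):
        vals = bins[k]
        # A single-point bin has no spread to report, so use 0.0 for its error bar.
        spread = stdev(vals) if len(vals) > 1 else 0.0
        out.append(((k + 0.5) * bin_width, mean(vals), spread))
    return out


def fit_exponential(xs, ys):
    """Fit y = a * exp(-b * x) by linear least squares on ln(y); requires all y > 0."""
    logs = [math.log(y) for y in ys]
    x_bar, l_bar = mean(xs), mean(logs)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (l - l_bar) for x, l in zip(xs, logs))
    slope = sxy / sxx
    a = math.exp(l_bar - slope * x_bar)
    return a, -slope
```

Here `xs` would be the concurrency levels and `ys` the measured throughputs; the returned `b` is the decay rate per unit of concurrency.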
Currently, token counts for the throughput figures are estimated by dividing the number of response characters by 4, the approximate average number of characters per token.
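A minimal sketch of that estimate, reading the factor of 4 as characters per token. The function names are hypothetical and the factor is only a heuristic; the real ratio depends on the tokenizer and the text.

```python
CHARS_PER_TOKEN = 4  # assumed heuristic, not a measured value


def estimate_tokens(text: str) -> float:
    """Approximate the token count from the character count."""
    return len(text) / CHARS_PER_TOKEN


def throughput_tokens_per_s(text: str, elapsed_s: float) -> float:
    """Estimated tokens per second for one response."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return estimate_tokens(text) / elapsed_s
```

For example, a 40-character response that took 2 seconds is counted as 10 tokens, or 5 tokens per second.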
Though I'm done with the project, I would welcome any PRs!
Here are some things that I think need changing:
- Rather than running experiments by sending a set number of requests in batches, set a 'request rate' at which new requests will be sent to the API.
- Use streaming completions to separate out latency and throughput for each request.
- Count tokens using, e.g., tiktoken, or using the response.usage.completion_tokens field if the API supports it. Or, if streaming, can we assume that each chunk is one token?
- Use a combination of argparse and configuration files (where appropriate) to specify run configuration, rather than having that information stored in the code.
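The first suggestion, sending at a fixed request rate instead of in batches, could be sketched like this. It is one possible shape for such a PR under stated assumptions: `fake_request` is a hypothetical stand-in for a real API call, and `run_at_rate` is an illustrative name, not something that exists in the repository.

```python
import asyncio
import time


async def fake_request(delay: float = 0.05) -> float:
    """Hypothetical stand-in for one API call; returns its own latency in seconds."""
    start = time.perf_counter()
    await asyncio.sleep(delay)
    return time.perf_counter() - start


async def run_at_rate(rate_per_s: float, total: int) -> list[float]:
    """Launch `total` requests, one every 1/rate_per_s seconds, without waiting for replies."""
    interval = 1.0 / rate_per_s
    loop = asyncio.get_running_loop()
    start = loop.time()
    tasks = []
    for i in range(total):
        # Sleep until the scheduled send time so timing drift does not accumulate.
        wait = start + i * interval - loop.time()
        if wait > 0:
            await asyncio.sleep(wait)
        tasks.append(asyncio.create_task(fake_request()))
    # Collect per-request latencies once every request has finished.
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    latencies = asyncio.run(run_at_rate(rate_per_s=20, total=10))
    print(f"mean latency: {sum(latencies) / len(latencies):.3f} s")
```

Because new requests are launched on a schedule rather than after the previous batch returns, slow responses do not reduce the offered load, which makes saturation easier to see.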