gouwsxander/stress-test

LLM API Load Testing

Simple scripts for stress testing LLM API endpoints by measuring throughput under varying concurrent loads.

This was made for testing my Apple Intelligence Web API, which conforms to the same standards as OpenRouter and OpenAI.

Overview

Major files:

  • ./src/main.py: Measure response times and throughput under varying concurrent loads
  • ./src/plotting.py: Run data analysis and visualization
  • ./results/data.csv: Data from experiments
  • ./results/throughput_analysis.pdf: Data visualized with curve fit

Usage

0. Requirements

This project uses uv for dependency management. Dependencies will be automatically installed if you run the scripts with uv run.

Alternatively, install manually with pip:

pip install aiohttp numpy scipy matplotlib

1. Generate test data

Run the load test against your API endpoint:

uv run ./src/main.py > ./results/data.csv &

This will send batches of concurrent requests with varying load levels and output CSV data.
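As a rough illustration of what "batches of concurrent requests with varying load levels" can look like, here is a minimal sketch. It is not the code in ./src/main.py: `fake_request` is a hypothetical stand-in for the real HTTP call, and the CSV column names are assumptions chosen for the example.

```python
import asyncio
import time


async def fake_request(delay: float = 0.01) -> str:
    """Hypothetical stand-in for an HTTP call to the API; returns a dummy completion."""
    await asyncio.sleep(delay)
    return "x" * 40


async def run_batch(concurrency: int) -> tuple[int, float, int]:
    """Send `concurrency` requests at once; return (concurrency, elapsed seconds, total characters)."""
    start = time.perf_counter()
    responses = await asyncio.gather(*(fake_request() for _ in range(concurrency)))
    elapsed = time.perf_counter() - start
    return concurrency, elapsed, sum(len(r) for r in responses)


async def sweep(levels: list[int]) -> list[tuple[int, float, int]]:
    """Run one batch per load level, one after another."""
    return [await run_batch(n) for n in levels]


if __name__ == "__main__":
    # Column names are assumed for illustration only.
    print("concurrency,elapsed_s,total_chars")
    for n, elapsed, chars in asyncio.run(sweep([1, 2, 4, 8])):
        print(f"{n},{elapsed:.4f},{chars}")
```

Replacing `fake_request` with a real aiohttp call would turn this into an actual load test; the batching logic stays the same.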

2. Analyze results

Visualize the throughput degradation:

uv run ./src/plotting.py

This generates a plot showing:

  • Raw throughput measurements
  • Binned means with error bars
  • Fitted exponential decay curve
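
The two analysis steps listed above can be sketched with the standard library alone. This is an illustration, not the method used in ./src/plotting.py: it assumes a decay model of the form throughput = a * exp(-b * load) with no offset, and it fits that model by linear least squares on the logarithm of the throughput, which is a simpler substitute for a nonlinear fit such as scipy.optimize.curve_fit. The function names are hypothetical.

```python
import math
from collections import defaultdict
from statistics import mean, stdev


def binned_means(points, bin_width):
    """Group (load, throughput) points into load bins.

    Returns {bin_start: (mean throughput, standard error of the mean)}.
    """
    bins = defaultdict(list)
    for load, throughput in points:
        bins[(load // bin_width) * bin_width].append(throughput)
    result = {}
    for start, values in sorted(bins.items()):
        # A single-point bin has no spread estimate, so report 0.0.
        err = stdev(values) / math.sqrt(len(values)) if len(values) > 1 else 0.0
        result[start] = (mean(values), err)
    return result


def fit_exponential(points):
    """Fit throughput = a * exp(-b * load) via least squares on log(throughput).

    Assumes all throughput values are positive. Returns (a, b).
    """
    xs = [x for x, _ in points]
    ys = [math.log(y) for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return math.exp(y_bar - slope * x_bar), -slope
```

Note that fitting in log space weights low-throughput points more heavily than a direct nonlinear fit would, so the two approaches can give somewhat different parameters on noisy data.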

Currently, throughput is estimated by dividing the number of characters in the responses by an assumed average of 4 characters per token.
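
A minimal sketch of that estimate, assuming throughput is reported in tokens per second over the wall-clock time of a batch (the function names and the per-batch timing are assumptions for the example):

```python
CHARS_PER_TOKEN = 4  # rough average; real tokenizers vary by model and language


def estimate_tokens(text: str) -> float:
    """Approximate the token count from the character count."""
    return len(text) / CHARS_PER_TOKEN


def estimate_throughput(responses: list[str], elapsed_s: float) -> float:
    """Approximate tokens per second for a batch that took `elapsed_s` seconds."""
    total_tokens = sum(estimate_tokens(r) for r in responses)
    return total_tokens / elapsed_s
```

For example, two responses of 40 and 80 characters received over 2 seconds come to 120 / 4 = 30 estimated tokens, or 15 tokens per second.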

Contribution

Though I'm done with the project, I would welcome any PRs!

Here are some things that I think need changing:

  1. Rather than running experiments by sending a set number of requests in batches, set a 'request rate' at which new requests will be sent to the API.
  2. Use streaming completions to separate out latency and throughput for each request.
  3. Count tokens using, e.g., tiktoken, or using the response.usage.completion_tokens field if the API supports it. Or, if streaming, can we assume that each chunk is one token?
  4. Use a combination of argparse and configuration files (where appropriate) to specify run configuration, rather than having that information stored in the code.
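
To make item 1 concrete, here is one way a fixed request rate could be sketched with asyncio. It is only a starting point under stated assumptions: `fake_request` is a hypothetical placeholder for the real API call, and the pacing uses a plain sleep between launches, so the achieved rate will drift slightly below the target.

```python
import asyncio
import time


async def fake_request(delay: float = 0.05) -> float:
    """Hypothetical stand-in for an API call; returns its own latency in seconds."""
    start = time.perf_counter()
    await asyncio.sleep(delay)
    return time.perf_counter() - start


async def run_at_rate(rate_per_s: float, total: int) -> list[float]:
    """Start one request every 1/rate_per_s seconds without waiting for earlier ones."""
    interval = 1.0 / rate_per_s
    tasks = []
    for _ in range(total):
        # create_task schedules the request immediately; we do not await it here.
        tasks.append(asyncio.create_task(fake_request()))
        await asyncio.sleep(interval)
    # Collect per-request latencies once everything has been launched.
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    latencies = asyncio.run(run_at_rate(rate_per_s=20.0, total=10))
    print(f"mean latency: {sum(latencies) / len(latencies):.3f} s")
```

Unlike fixed-size batches, this keeps sending new requests even when earlier ones are slow, so queueing on the server shows up as rising latency rather than as a slower test.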
