v0.8.0
⚠️ Important Changes
Please read before upgrading.
What’s new:
New dependency: the tokenizer is now a standalone application that should run as a sidecar process.
For details, see README.md.
Deprecated command-line parameters:
- `--tokenizers-cache-dir`
- `--zmq-max-connect-attempts`
New Features
- New endpoint `/v1/embeddings`
- gRPC support - details
- `/chat/completions` works with `--enable-kvcache`
- Added support for `--mm-encoder-only`
- Support `--no-` prefix for boolean vLLM config parameters:
  - `no-enable-sleep-mode`
  - `no-mm-encoder-only`
  - `no-enforce-eager`
  - `no-enable-prefix-caching`
- Fake metrics support functions for gauges
- Dataset structure updated; the dataset tool is updated accordingly
- All requests are tokenized using the model defined in the configuration. Important: to avoid the time and network overhead of HuggingFace tokenization, use a "fake" or non-existent model name (e.g., `--model fake-model`).
- Extended kv events: tokens added
- New metrics:
  - `vllm:prefix_cache_hits`
  - `vllm:prefix_cache_queries`
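The two new counters above can be combined into a prefix-cache hit rate (hits divided by queries). The sketch below shows one way to pull them out of a standard Prometheus text-format `/metrics` scrape; the parsing is a minimal illustration for this release's metric names, not a full Prometheus client.

```python
# Minimal sketch: extract the new prefix-cache counters from Prometheus
# text-exposition output and compute the hit rate. Assumes the plain
# "name value" sample format; labels and exotic cases are not handled.

def parse_counter(metrics_text: str, name: str) -> float:
    """Return the value of the first sample whose line starts with `name`."""
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if line.startswith(name):
            return float(line.split()[-1])
    raise KeyError(name)

def prefix_cache_hit_rate(metrics_text: str) -> float:
    hits = parse_counter(metrics_text, "vllm:prefix_cache_hits")
    queries = parse_counter(metrics_text, "vllm:prefix_cache_queries")
    return hits / queries if queries else 0.0

sample = """
# TYPE vllm:prefix_cache_queries counter
vllm:prefix_cache_queries 200
# TYPE vllm:prefix_cache_hits counter
vllm:prefix_cache_hits 150
"""
print(prefix_cache_hit_rate(sample))  # 0.75
```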
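The `--no-` convention listed above means each boolean parameter gains a negated twin: `--enforce-eager` sets the option to true, `--no-enforce-eager` to false. The following is an illustrative parser showing how that negation scheme works in principle; it is not the simulator's actual flag-handling code.

```python
# Illustrative sketch of the `--no-<flag>` negation convention for
# boolean CLI parameters. `known` is the set of recognized flag names.

def parse_bool_flags(argv: list[str], known: set[str]) -> dict[str, bool]:
    opts: dict[str, bool] = {}
    for arg in argv:
        name = arg.lstrip("-")          # drop the leading dashes
        if name.startswith("no-") and name[3:] in known:
            opts[name[3:]] = False      # --no-<flag> disables the option
        elif name in known:
            opts[name] = True           # --<flag> enables the option
    return opts

known = {"enable-prefix-caching", "enforce-eager",
         "enable-sleep-mode", "mm-encoder-only"}
print(parse_bool_flags(["--enable-prefix-caching", "--no-enforce-eager"], known))
# {'enable-prefix-caching': True, 'enforce-eager': False}
```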
What's Changed
- Introduce Tokenizer interface by @mayabar in #314
- fix hf models url by @mayabar in #316
- Set default value of --tokenizers-cache-dir to hf_cache by @mayabar in #317
- Tokenize all requests by @irar2 in #318
- Use real tokenization in echo mode by @irar2 in #319
- Echo Dataset by @irar2 in #322
- fix python error on hf tokenizer initialization by @mayabar in #321
- Return tokenized response in GetTokens by @irar2 in #323
- Use Tokenized in response by @irar2 in #324
- Handle gRPC requests by @irar2 in #326
- Metrics tpot channel size fix and new tests for errors by @irar2 in #328
- Dataset tool by @mayabar in #325
- Generation request and response types by @irar2 in #330
- update documentation by @mayabar in #329
- 🌱 Standardize governance workflows, tooling, and Dependabot by @clubanderson in #333
- 🌱 Remove legacy typo and link checker workflows by @clubanderson in #340
- docs(example): Fix indentation for POD_IP valueFrom field by @tarilabs in #348
- Update example of running simulator in the documentation by @mayabar in #351
- 🌱 Remove orphaned .lychee.toml by @clubanderson in #352
- Refactor: separate token generation from response sending by @irar2 in #353
- Add tokens to kv events by @mayabar in #354
- Fix /chat/completion response in echo mode by @mayabar in #362
- Fix PR #362 by @mayabar in #365
- Add vllm:prefix_cache_hits and vllm:prefix_cache_queries counters by @InfraWhisperer in #358
- Add /v1/embeddings endpoint by @sbekkerm in #364
- Response builder by @irar2 in #372
- Read configuration in main by @irar2 in #373
- Separate simulator creation and start. Communication layer by @irar2 in #375
- 🌱 Remove per-repo gh-aw typo/link/upstream workflows by @clubanderson in #381
- Ignore data-parallel-size if data-parallel-rank is set by @irar2 in #376
- feat(http): add pod/namespace/request-id response headers to /embeddings by @sbekkerm in #374
- Separate communication (HTTP and gRPC) from the simulator code by @irar2 in #382
- Support functions for generating fake gauge metrics by @irar2 in #389
- Bug fix: fake metrics init by @irar2 in #391
- Refactoring: store channels along their names in a struct by @irar2 in #390
- Use kv cache 0.6.0 - tokenizer is stand alone + remove all python dependencies by @mayabar in #386
- fixes in makefile by @mayabar in #395
- Chat completion with kvcache by @mayabar in #396
- Support mm-encoder-only mode by @irar2 in #398
- Update readme by @irar2 in #401
- Add --no option for vLLM boolean command line parameters by @irar2 in #400
- Remove CGO dependency by migrating to pure-Go ZMQ+change in ci_pr_checks by @mayabar in #406
New Contributors
- @tarilabs made their first contribution in #348
- @InfraWhisperer made their first contribution in #358
- @sbekkerm made their first contribution in #364
Full Changelog: v0.7.0...v0.8.0