v0.8.0
⚠️ Important Changes
Please read before upgrading.
What’s new:
New dependency: the tokenizer is now a standalone application that should run as a sidecar process.
For details, see README.md.
Deprecated command-line parameters:
- `--tokenizers-cache-dir`
- `--zmq-max-connect-attempts`
New Features
- New endpoint `/v1/embeddings`
- gRPC support - details
- `/chat/completions` works with `--enable-kvcache`
- Added support for `--mm-encoder-only`
- Support `--no-` prefix for boolean vLLM config parameters:
  - `no-enable-sleep-mode`
  - `no-mm-encoder-only`
  - `no-enforce-eager`
  - `no-enable-prefix-caching`
- Fake metrics support functions for gauges
- Dataset structure updated; the dataset tool is updated accordingly
- All requests are tokenized using the model defined in the configuration. Important: to avoid the time and network overhead of HuggingFace tokenization, use a "fake" or non-existent model name (e.g., `--model fake-model`).
- Extended kv events: tokens added
- New metrics:
  - `vllm:prefix_cache_hits`
  - `vllm:prefix_cache_queries`
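The two new counters above can be combined into a prefix-cache hit rate (hits divided by queries). The sketch below shows one way to pull them out of a standard Prometheus text-format `/metrics` scrape; the parsing is a minimal illustration for this release's metric names, not a full Prometheus client.

```python
# Minimal sketch: extract the new prefix-cache counters from Prometheus
# text-exposition output and compute the hit rate. Assumes the plain
# "name value" sample format; labels and exotic cases are not handled.

def parse_counter(metrics_text: str, name: str) -> float:
    """Return the value of the first sample whose line starts with `name`."""
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if line.startswith(name):
            return float(line.split()[-1])
    raise KeyError(name)

def prefix_cache_hit_rate(metrics_text: str) -> float:
    hits = parse_counter(metrics_text, "vllm:prefix_cache_hits")
    queries = parse_counter(metrics_text, "vllm:prefix_cache_queries")
    return hits / queries if queries else 0.0

sample = """
# TYPE vllm:prefix_cache_queries counter
vllm:prefix_cache_queries 200
# TYPE vllm:prefix_cache_hits counter
vllm:prefix_cache_hits 150
"""
print(prefix_cache_hit_rate(sample))  # 0.75
```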
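The `--no-` convention listed above means each boolean parameter gains a negated twin: `--enforce-eager` sets the option to true, `--no-enforce-eager` to false. The following is an illustrative parser showing how that negation scheme works in principle; it is not the simulator's actual flag-handling code.

```python
# Illustrative sketch of the `--no-<flag>` negation convention for
# boolean CLI parameters. `known` is the set of recognized flag names.

def parse_bool_flags(argv: list[str], known: set[str]) -> dict[str, bool]:
    opts: dict[str, bool] = {}
    for arg in argv:
        name = arg.lstrip("-")          # drop the leading dashes
        if name.startswith("no-") and name[3:] in known:
            opts[name[3:]] = False      # --no-<flag> disables the option
        elif name in known:
            opts[name] = True           # --<flag> enables the option
    return opts

known = {"enable-prefix-caching", "enforce-eager",
         "enable-sleep-mode", "mm-encoder-only"}
print(parse_bool_flags(["--enable-prefix-caching", "--no-enforce-eager"], known))
# {'enable-prefix-caching': True, 'enforce-eager': False}
```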
What's Changed
- Introduce Tokenizer interface by @mayabar in #314
- fix hf models url by @mayabar in #316
- Set default value of --tokenizers-cache-dir to hf_cache by @mayabar in #317
- Tokenize all requests by @irar2 in #318
- Use real tokenization in echo mode by @irar2 in #319
- Echo Dataset by @irar2 in #322
- fix python error on hf tokenizer initialization by @mayabar in #321
- Return tokenized response in GetTokens by @irar2 in #323
- Use Tokenized in response by @irar2 in #324
- Handle gRPC requests by @irar2 in #326
- Metrics tpot channel size fix and new tests for errors by @irar2 in #328
- Dataset tool by @mayabar in #325
- Generation request and response types by @irar2 in #330
- update documentation by @mayabar in #329
- 🌱 Standardize governance workflows, tooling, and Dependabot by @clubanderson in #333
- 🌱 Remove legacy typo and link checker workflows by @clubanderson in #340
- docs(example): Fix indentation for POD_IP valueFrom field by @tarilabs in #348
- Update example of running simulator in the documentation by @mayabar in #351
- 🌱 Remove orphaned .lychee.toml by @clubanderson in #352
- Refactor: separate token generation from response sending by @irar2 in #353
- Add tokens to kv events by @mayabar in #354
- Fix /chat/completion response in echo mode by @mayabar in #362
- Fix PR #362 by @mayabar in #365
- Add vllm:prefix_cache_hits and vllm:prefix_cache_queries counters by @InfraWhisperer in #358
- Add /v1/embeddings endpoint by @sbekkerm in #364
- Response builder by @irar2 in #372
- Read configuration in main by @irar2 in #373
- Separate simulator creation and start. Communication layer by @irar2 in #375
- 🌱 Remove per-repo gh-aw typo/link/upstream workflows by @clubanderson in #381
- Ignore data-parallel-size if data-parallel-rank is set by @irar2 in #376
- feat(http): add pod/namespace/request-id response headers to /embeddings by @sbekkerm in #374
- Separate communication (HTTP and gRPC) from the simulator code by @irar2 in #382
- Support functions for generating fake gauge metrics by @irar2 in #389
- Bug fix: fake metrics init by @irar2 in #391
- Refactoring: store channels along their names in a struct by @irar2 in #390
- Use kv cache 0.6.0 - tokenizer is stand alone + remove all python dependencies by @mayabar in #386
- fixes in makefile by @mayabar in #395
- Chat completion with kvcache by @mayabar in #396
- Support mm-encoder-only mode by @irar2 in #398
- Update readme by @irar2 in #401
- Add --no option for vLLM boolean command line parameters by @irar2 in #400
- Remove CGO dependency by migrating to pure-Go ZMQ+change in ci_pr_checks by @mayabar in #406
New Contributors
- @tarilabs made their first contribution in #348
- @InfraWhisperer made their first contribution in #358
- @sbekkerm made their first contribution in #364
Full Changelog: v0.7.0...v0.8.0