
v0.8.0

Released by @mayabar on 26 Mar 07:03 (commit eedfce4)

⚠️ Important Changes

Please read before upgrading.

What’s new:

New dependency: the tokenizer is now a standalone application that should run as a sidecar process.

For details, see README.md.

Deprecated command-line parameters:

  • --tokenizers-cache-dir
  • --zmq-max-connect-attempts

New Features

  • New endpoint: /v1/embeddings (see the request sketch after this list)
  • gRPC support (see the documentation for details)
  • /chat/completions works with --enable-kvcache
  • Added support for --mm-encoder-only
  • Support for the --no- prefix on boolean vLLM config parameters (see the launch sketch after this list):
    • --no-enable-sleep-mode
    • --no-mm-encoder-only
    • --no-enforce-eager
    • --no-enable-prefix-caching
  • Support functions for generating fake gauge metrics
  • Dataset structure updated; the dataset tool has been updated accordingly
  • All requests are tokenized using the model defined in the configuration. Important: if you want to avoid the time and network overhead of HuggingFace tokenization, use a "fake" or non-existent model name (e.g., --model fake-model).
  • KV events extended to include tokens
  • New metrics (see the scraping sketch after this list):
    • vllm:prefix_cache_hits
    • vllm:prefix_cache_queries
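
The new /v1/embeddings endpoint follows the OpenAI-compatible shape used by the simulator's other endpoints. Below is a minimal request sketch; the host, port (localhost:8000), and model name are assumptions, so adjust them for your setup:

```python
# Minimal sketch: call the new /v1/embeddings endpoint.
# localhost:8000 and the model name are assumptions; adjust for your setup.
import requests

resp = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={
        "model": "fake-model",   # per the tokenization note above, a fake name avoids HF overhead
        "input": "Hello, world!",
    },
    timeout=10,
)
resp.raise_for_status()
embedding = resp.json()["data"][0]["embedding"]
print(len(embedding), embedding[:4])
```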
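
The --no- prefix negates a boolean flag at launch. A sketch of starting the simulator with prefix caching disabled; the binary name below is a hypothetical placeholder, not confirmed by these notes:

```python
# Sketch: launch the simulator with a negated boolean flag.
# The binary path is hypothetical; substitute your actual build artifact.
import subprocess

subprocess.run(
    [
        "./llm-d-inference-sim",        # hypothetical binary name
        "--model", "fake-model",
        "--no-enable-prefix-caching",   # boolean vLLM flags now accept a --no- prefix
    ],
    check=True,
)
```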
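
The two new prefix-cache counters can be scraped from the Prometheus metrics endpoint. A sketch that computes a hit ratio, assuming metrics are exposed at /metrics on the same port:

```python
# Sketch: read the new prefix-cache counters and compute a hit ratio.
# The /metrics location is an assumption; adjust for your deployment.
import requests

body = requests.get("http://localhost:8000/metrics", timeout=10).text

def read_counter(name: str) -> float:
    # Sum all samples whose name starts with `name`, ignoring label sets.
    return sum(
        float(line.rsplit(" ", 1)[-1])
        for line in body.splitlines()
        if line.startswith(name)
    )

queries = read_counter("vllm:prefix_cache_queries")
hits = read_counter("vllm:prefix_cache_hits")
print(f"hit ratio: {hits / queries:.2%}" if queries else "no prefix-cache queries yet")
```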

What's Changed

  • Introduce Tokenizer interface by @mayabar in #314
  • fix hf models url by @mayabar in #316
  • Set default value of --tokenizers-cache-dir to hf_cache by @mayabar in #317
  • Tokenize all requests by @irar2 in #318
  • Use real tokenization in echo mode by @irar2 in #319
  • Echo Dataset by @irar2 in #322
  • fix python error on hf tokenizer initialization by @mayabar in #321
  • Return tokenized response in GetTokens by @irar2 in #323
  • Use Tokenized in response by @irar2 in #324
  • Handle gRPC requests by @irar2 in #326
  • Metrics tpot channel size fix and new tests for errors by @irar2 in #328
  • Dataset tool by @mayabar in #325
  • Generation request and response types by @irar2 in #330
  • update documentation by @mayabar in #329
  • 🌱 Standardize governance workflows, tooling, and Dependabot by @clubanderson in #333
  • 🌱 Remove legacy typo and link checker workflows by @clubanderson in #340
  • docs(example): Fix indentation for POD_IP valueFrom field by @tarilabs in #348
  • Update example of running simulator in the documentation by @mayabar in #351
  • 🌱 Remove orphaned .lychee.toml by @clubanderson in #352
  • Refactor: separate token generation from response sending by @irar2 in #353
  • Add tokens to kv events by @mayabar in #354
  • Fix /chat/completion response in echo mode by @mayabar in #362
  • Fix PR #362 by @mayabar in #365
  • Add vllm:prefix_cache_hits and vllm:prefix_cache_queries counters by @InfraWhisperer in #358
  • Add /v1/embeddings endpoint by @sbekkerm in #364
  • Response builder by @irar2 in #372
  • Read configuration in main by @irar2 in #373
  • Separate simulator creation and start. Communication layer by @irar2 in #375
  • 🌱 Remove per-repo gh-aw typo/link/upstream workflows by @clubanderson in #381
  • Ignore data-parallel-size if data-parallel-rank is set by @irar2 in #376
  • feat(http): add pod/namespace/request-id response headers to /embeddings by @sbekkerm in #374
  • Separate communication (HTTP and gRPC) from the simulator code by @irar2 in #382
  • Support functions for generating fake gauge metrics by @irar2 in #389
  • Bug fix: fake metrics init by @irar2 in #391
  • Refactoring: store channels along their names in a struct by @irar2 in #390
  • Use kv cache 0.6.0 - tokenizer is stand alone + remove all python dependencies by @mayabar in #386
  • fixes in makefile by @mayabar in #395
  • Chat completion with kvcache by @mayabar in #396
  • Support mm-encoder-only mode by @irar2 in #398
  • Update readme by @irar2 in #401
  • Add --no option for vLLM boolean command line parameters by @irar2 in #400
  • Remove CGO dependency by migrating to pure-Go ZMQ+change in ci_pr_checks by @mayabar in #406

New Contributors

Full Changelog: v0.7.0...v0.8.0