This release comes with some major improvements:
- Trace-file-based load generation and testing
- Support for benchmarking multi-turn chat scenarios
- Shared client sessions for improved load generation performance
- Improved Helm chart configurations for Kubernetes deployment
- End-to-end tests in the CI/CD pipeline
## What's Changed
- Improve efficiency and readability of data generators by @pancak3 in #210
- Make selected request rates accurate to two decimal places (formerly zero) when using linear sweep type by @Bslabe123 in #237
- Add debug log for saturation sampling by @jjk-g in #236
- ci: push helm chart to OCI registry when release by @ExplorerRay in #240
- chore: add inter_token_latency in ModelServerMetrics for sglang metrics by @jlcoo in #242
- use achieved_rate in the report graph. by @zetxqx in #232
- Improve docker image building by @pancak3 in #228
- feat: Enhance Helm chart flexibility for job by @LukeAVanDrie in #248
- Catch saturation detection failure by @jjk-g in #251
- Adding time per output tokens prometheus metrics for sglang server by @SachinVarghese in #254
- feat: loadgen SIGINT handler by @changminbark in #244
- Feat: Add request timeouts and circuit breakers (#148) by @huaxig in #227
- Added `PrometheusMetricImplementations` by @Bslabe123 in #221
- Workflow that currently pushes Docker image now also pushes Helm chart by @Bslabe123 in #259
- Fix for Invalid Chart Version by @Bslabe123 in #261
- Add jjk-g to maintainers by @achandrasekar in #267
- Update helm chart to pass in gcs bucket to download datasets. by @rlakhtakia in #260
- Fixing test and validate workflows by @SachinVarghese in #272
- `publish-on-change` workflow should use helm client login instead of docker login by @Bslabe123 in #264
- Add Kubecon Demo results by @Bslabe123 in #224
- docs: clarify authentication needed for querying metrics from GMP by @Bslabe123 in #276
- Update vLLM kv cache metric from `vllm:gpu_cache_usage_perc` to `vllm:kv_cache_usage_perc` by @Bslabe123 in #277
- Update helm to add service account name by @rlakhtakia in #270
- Update helm chart to pull datasets from s3 bucket. by @rlakhtakia in #278
- Trace load gen by @aish1331 in #198
- fix: stabilize streaming responses for large chunk using iter_any() by @zetxqx in #284
- fix: custom tokenizer truncates inputs to model max input length by @changminbark in #266
- [Testing / CI/CD] Ability to automate scale testing with a mock server and test different datasets, loadgen, etc. and run it as a part of CI/CD (#274) by @huaxig in #274
- Update helm to pass in existing kubernetes secret. by @rlakhtakia in #281
- Loadgen concurrent load type by @changminbark in #263
- Improve MultiprocessRequestDataCollector async by @diamondburned in #280
- update gcs bucket to pass in bucket name only for consistency by @rlakhtakia in #285
- Feat: Add user session to support Multi-turn chat (#179) by @huaxig in #257
- fix pyproject dependency groups and TOML parsing issue by @diamondburned in #291
- Fix overflow on tokenizer truncation by @jjk-g in #290
- chore: improve openai client error handling by including status code and reason by @hhk7734 in #289
- Fix: requests get duplicated using shared_prefix datagen when multi-turn chat disabled by @huaxig in #293
- Share aiohttp.ClientSessions per worker by @diamondburned in #282
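The change in #282 above reuses one `aiohttp.ClientSession` per worker instead of creating a new one for every request, avoiding repeated connection-pool setup. A minimal stdlib sketch of that pattern, using a hypothetical `Session` class as a stand-in for `aiohttp.ClientSession` (all names here are illustrative, not from the codebase):

```python
import asyncio


class Session:
    """Stand-in for aiohttp.ClientSession: constructing one is costly
    (connection pool, TLS contexts), so it should be shared and reused."""

    created = 0  # counts how many sessions have been constructed

    def __init__(self) -> None:
        Session.created += 1

    async def get(self, url: str) -> str:
        return f"response from {url}"


async def worker_per_request(urls: list[str]) -> list[str]:
    # Anti-pattern: a fresh session (and connection pool) per request.
    results = []
    for url in urls:
        session = Session()
        results.append(await session.get(url))
    return results


async def worker_shared(urls: list[str]) -> list[str]:
    # Shared-session pattern: one session serves every request in the worker.
    session = Session()
    return [await session.get(url) for url in urls]


urls = [f"http://example.invalid/{i}" for i in range(5)]

asyncio.run(worker_per_request(urls))
per_request_sessions = Session.created  # one session per request

Session.created = 0
asyncio.run(worker_shared(urls))
shared_sessions = Session.created  # a single shared session

print(per_request_sessions, shared_sessions)  # 5 1
```

With real aiohttp the same idea applies: create the `ClientSession` once per worker and pass it to request coroutines, closing it only when the worker shuts down.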
## New Contributors
- @jlcoo made their first contribution in #242
- @zetxqx made their first contribution in #232
- @LukeAVanDrie made their first contribution in #248
- @changminbark made their first contribution in #244
- @diamondburned made their first contribution in #280
- @hhk7734 made their first contribution in #289
**Full Changelog**: v0.2.0...v0.3.0
## Docker Image
`quay.io/inference-perf/inference-perf:v0.3.0`
## Python Package
`pip install inference-perf==0.3.0`