This release comes with some major improvements:
- Trace-file-based load generation and testing
- Support for benchmarking multi-turn chat scenarios
- Shared client sessions for improved load generation performance
- Improved Helm chart configurations for Kubernetes deployment
- End-to-end tests in the CI/CD pipeline
## What's Changed
- Improve efficiency and readability of data generators by @pancak3 in #210
- Make selected request rates accurate to two decimal places (formerly zero) when using linear sweep type by @Bslabe123 in #237
- Add debug log for saturation sampling by @jjk-g in #236
- ci: push helm chart to OCI registry when release by @ExplorerRay in #240
- chore: add inter_token_latency in ModelServerMetrics for sglang metrics by @jlcoo in #242
- use achieved_rate in the report graph. by @zetxqx in #232
- Improve docker image building by @pancak3 in #228
- feat: Enhance Helm chart flexibility for job by @LukeAVanDrie in #248
- Catch saturation detection failure by @jjk-g in #251
- Adding time per output tokens prometheus metrics for sglang server by @SachinVarghese in #254
- feat: loadgen SIGINT handler by @changminbark in #244
- Feat: Add request timeouts and circuit breakers (#148) by @huaxig in #227
- Added `PrometheusMetricImplementations` by @Bslabe123 in #221
- Workflow that currently pushes Docker image now also pushes Helm chart by @Bslabe123 in #259
- Fix for Invalid Chart Version by @Bslabe123 in #261
- Add jjk-g to maintainers by @achandrasekar in #267
- Update helm chart to pass in gcs bucket to download datasets. by @rlakhtakia in #260
- Fixing test and validate workflows by @SachinVarghese in #272
- `publish-on-change` workflow should use helm client login instead of docker login by @Bslabe123 in #264
- Add Kubecon Demo results by @Bslabe123 in #224
- docs: clarify authentication needed for querying metrics from GMP by @Bslabe123 in #276
- Update vLLM kv cache metric from `vllm:gpu_cache_usage_perc` to `vllm:kv_cache_usage_perc` by @Bslabe123 in #277
- Update helm to add service account name by @rlakhtakia in #270
- Update helm chart to pull datasets from s3 bucket. by @rlakhtakia in #278
- Trace load gen by @aish1331 in #198
- fix: stabilize streaming responses for large chunk using iter_any() by @zetxqx in #284
- fix: custom tokenizer truncates inputs to model max input length by @changminbark in #266
- [Testing / CI/CD] Ability to automate scale testing with a mock server and test different datasets, loadgen, etc. and run it as a part of CI/CD (#274) by @huaxig in #274
- Update helm to pass in existing kubernetes secret. by @rlakhtakia in #281
- Loadgen concurrent load type by @changminbark in #263
- Improve MultiprocessRequestDataCollector async by @diamondburned in #280
- update gcs bucket to pass in bucket name only for consistency by @rlakhtakia in #285
- Feat: Add user session to support Multi-turn chat (#179) by @huaxig in #257
- fix pyproject dependency groups and TOML parsing issue by @diamondburned in #291
- Fix overflow on tokenizer truncation by @jjk-g in #290
- chore: improve openai client error handling by including status code and reason by @hhk7734 in #289
- Fix: requests get duplicated using shared_prefix datagen when multi-turn chat disabled by @huaxig in #293
- Share aiohttp.ClientSessions per worker by @diamondburned in #282
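The change in #282 above reuses one `aiohttp.ClientSession` per worker instead of creating a new one for every request, avoiding repeated connection-pool setup. A minimal stdlib sketch of that pattern, using a hypothetical `Session` class as a stand-in for `aiohttp.ClientSession` (all names here are illustrative, not from the codebase):

```python
import asyncio


class Session:
    """Stand-in for aiohttp.ClientSession: constructing one is costly
    (connection pool, TLS contexts), so it should be shared and reused."""

    created = 0  # counts how many sessions have been constructed

    def __init__(self) -> None:
        Session.created += 1

    async def get(self, url: str) -> str:
        return f"response from {url}"


async def worker_per_request(urls: list[str]) -> list[str]:
    # Anti-pattern: a fresh session (and connection pool) per request.
    results = []
    for url in urls:
        session = Session()
        results.append(await session.get(url))
    return results


async def worker_shared(urls: list[str]) -> list[str]:
    # Shared-session pattern: one session serves every request in the worker.
    session = Session()
    return [await session.get(url) for url in urls]


urls = [f"http://example.invalid/{i}" for i in range(5)]

asyncio.run(worker_per_request(urls))
per_request_sessions = Session.created  # one session per request

Session.created = 0
asyncio.run(worker_shared(urls))
shared_sessions = Session.created  # a single shared session

print(per_request_sessions, shared_sessions)  # 5 1
```

With real aiohttp the same idea applies: create the `ClientSession` once per worker and pass it to request coroutines, closing it only when the worker shuts down.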
## New Contributors
- @jlcoo made their first contribution in #242
- @zetxqx made their first contribution in #232
- @LukeAVanDrie made their first contribution in #248
- @changminbark made their first contribution in #244
- @diamondburned made their first contribution in #280
- @hhk7734 made their first contribution in #289
**Full Changelog**: v0.2.0...v0.3.0
## Docker Image
`quay.io/inference-perf/inference-perf:v0.3.0`
## Python Package
`pip install inference-perf==0.3.0`