v0.3.0

Released by @SachinVarghese on 26 Nov · commit a85b31b

This release comes with some major improvements:

  • Trace-file-based load generation and testing
  • Support for benchmarking multi-turn chat scenarios
  • Shared client sessions per worker for better load-generation performance (see the sketch below this list)
  • Improved Helm chart configuration for Kubernetes deployments
  • End-to-end tests in the CI/CD pipeline
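
The shared-session change (#282) keeps one HTTP connection pool alive per worker instead of opening a new session for every request. Below is a minimal sketch of that pattern using aiohttp; the worker structure and all names are illustrative assumptions, not the project's actual internals:

    import asyncio
    import aiohttp

    async def worker(payloads: list[dict], url: str) -> None:
        # One ClientSession, and therefore one connection pool, for the
        # worker's whole lifetime rather than one per request.
        async with aiohttp.ClientSession() as session:
            for payload in payloads:
                async with session.post(url, json=payload) as resp:
                    await resp.read()  # drain the body so the connection can be reused

    asyncio.run(worker([{"prompt": "hello"}], "http://localhost:8000/v1/completions"))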

What's Changed

  • Improve efficiency and readability of data generators by @pancak3 in #210
  • Make selected request rates accurate to two decimal places (formerly zero) when using linear sweep type by @Bslabe123 in #237
  • Add debug log for saturation sampling by @jjk-g in #236
  • ci: push helm chart to OCI registry on release by @ExplorerRay in #240
  • chore: add inter_token_latency to ModelServerMetrics for sglang metrics by @jlcoo in #242
  • Use achieved_rate in the report graph by @zetxqx in #232
  • Improve docker image building by @pancak3 in #228
  • feat: Enhance Helm chart flexibility for job by @LukeAVanDrie in #248
  • Catch saturation detection failure by @jjk-g in #251
  • Add time-per-output-token Prometheus metrics for the sglang server by @SachinVarghese in #254
  • feat: loadgen SIGINT handler by @changminbark in #244
  • Feat: Add request timeouts and circuit breakers (#148) by @huaxig in #227
  • Added PrometheusMetric implementations by @Bslabe123 in #221
  • Workflow that currently pushes Docker image now also pushes Helm chart by @Bslabe123 in #259
  • Fix for Invalid Chart Version by @Bslabe123 in #261
  • Add jjk-g to maintainers by @achandrasekar in #267
  • Update Helm chart to pass in a GCS bucket for downloading datasets by @rlakhtakia in #260
  • Fixing test and validate workflows by @SachinVarghese in #272
  • publish-on-change workflow should use helm client login instead of docker login by @Bslabe123 in #264
  • Add Kubecon Demo results by @Bslabe123 in #224
  • docs: clarify authentication needed for querying metrics from GMP by @Bslabe123 in #276
  • Update vLLM kv cache metric from vllm:gpu_cache_usage_perc to vllm:kv_cache_usage_perc by @Bslabe123 in #277
  • Update Helm chart to add a service account name by @rlakhtakia in #270
  • Update Helm chart to pull datasets from an S3 bucket by @rlakhtakia in #278
  • Trace-file-based load generation by @aish1331 in #198
  • fix: stabilize streaming responses for large chunks using iter_any() by @zetxqx in #284
  • fix: custom tokenizer truncates inputs to model max input length by @changminbark in #266
  • [Testing / CI/CD] Automate scale testing with a mock server across different datasets and loadgen configurations, run as part of CI/CD, by @huaxig in #274
  • Update Helm chart to pass in an existing Kubernetes secret by @rlakhtakia in #281
  • Add concurrent load type to loadgen by @changminbark in #263
  • Improve MultiprocessRequestDataCollector async handling by @diamondburned in #280
  • Update GCS bucket handling to pass in the bucket name only, for consistency, by @rlakhtakia in #285
  • Feat: Add user sessions to support multi-turn chat (#179) by @huaxig in #257 (request shape sketched after this list)
  • fix pyproject dependency groups and TOML parsing issue by @diamondburned in #291
  • Fix overflow on tokenizer truncation by @jjk-g in #290
  • chore: improve openai client error handling by including status code and reason by @hhk7734 in #289
  • Fix: requests duplicated by the shared_prefix datagen when multi-turn chat is disabled by @huaxig in #293
  • Share aiohttp.ClientSessions per worker by @diamondburned in #282
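
With multi-turn chat enabled (#257), each simulated user session replays a growing conversation instead of independent one-shot prompts. Here is a hypothetical request body in the widely used OpenAI-style chat format, shown only to illustrate the shape of a later turn; the model name and message contents are made up:

    # Second turn of a benchmarked chat session: the request carries the
    # full history, so the prompt grows with every turn.
    request_body = {
        "model": "example-model",  # placeholder
        "messages": [
            {"role": "user", "content": "Summarize this log file."},
            {"role": "assistant", "content": "It records three failed health checks."},
            {"role": "user", "content": "Which component failed first?"},  # new turn
        ],
    }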

Full Changelog: v0.2.0...v0.3.0

Docker Image

quay.io/inference-perf/inference-perf:v0.3.0
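
To run the benchmark from the container image, pull it with Docker:

    docker pull quay.io/inference-perf/inference-perf:v0.3.0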

Python Package

pip install inference-perf==0.3.0