This guide explains how to deploy inference-perf to a Kubernetes cluster as a job.
Refer to the guide in /deploy/inference-perf.
inference-perf requires all of its configuration to be provided in a single YAML file passed via the `-c` flag. When deploying as a Job, the most straightforward way to provide this file is to create a ConfigMap and mount it into the Job (see the sketch after the command below). Update `config.yml` as needed, then create the ConfigMap by running the following at the root of this repo:
```bash
kubectl create configmap inference-perf-config --from-file=config.yml
```
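The Job consumes this ConfigMap as a mounted file. The repo's `manifests.yaml` already wires this up, so the snippet below is only an illustrative sketch of the pattern; the image name and mount path are assumptions for the example, not the values used by the actual manifest.

```yaml
# Sketch: how a Job can mount the ConfigMap and pass it to inference-perf
apiVersion: batch/v1
kind: Job
metadata:
  name: inference-perf
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: inference-perf
          image: inference-perf:latest          # placeholder; use the image pinned in manifests.yaml
          # Point the -c flag at the config file mounted from the ConfigMap
          args: ["-c", "/etc/inference-perf/config.yml"]
          volumeMounts:
            - name: config
              mountPath: /etc/inference-perf
      volumes:
        - name: config
          configMap:
            name: inference-perf-config         # the ConfigMap created above
```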
Optional: Create a Kubernetes Secret that contains the Hugging Face token (this step is required for gated models only):
```bash
kubectl create secret generic hf-secret \
  --from-literal=hf_api_token=${HF_TOKEN} \
  --dry-run=client -o yaml | kubectl apply -f -
```

* For Hugging Face authentication, refer to “Hugging Face Authentication” in the Run locally section.
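If you created the Secret, the Job's container can pick the token up as an environment variable. The fragment below is only a sketch of that pattern and slots into the container spec; it assumes the container reads `HF_TOKEN` (the variable honored by `huggingface_hub` by default), while the actual wiring is defined in `manifests.yaml`.

```yaml
# Sketch: expose the hf-secret created above to the inference-perf container
env:
  - name: HF_TOKEN                  # assumed variable name; check manifests.yaml for the real one
    valueFrom:
      secretKeyRef:
        name: hf-secret             # the Secret created above
        key: hf_api_token           # the key set via --from-literal
```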
Apply the job by running the following:
```bash
kubectl apply -f manifests.yaml
```

Currently, inference-perf outputs benchmark results to standard output only. To view the results after the job completes, run:
```bash
kubectl wait --for=condition=complete job/inference-perf && kubectl logs jobs/inference-perf
```
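Because the results go only to stdout, it can be handy to capture them to a local file for later analysis, for example:

```bash
# Wait for the Job to finish, then save the benchmark output locally
kubectl wait --for=condition=complete job/inference-perf
kubectl logs job/inference-perf > inference-perf-results.txt   # filename is arbitrary
```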