[`vLLM`](https://docs.vllm.ai/en/latest/) is an easy-to-use library for LLM inference and serving. It supports a wide variety of models with optimized kernels that ensure efficient GPU utilization.
## Why `vLLM`?
We tested `vLLM` and `llama-cpp` (the inference framework behind `ollama`) on Torch, and found that `vLLM` performs better for `Qwen2.5-7B-Instruct` with 512 input and 256 output tokens.
Create a `vLLM` directory in your `/scratch` directory, then pull the `vLLM` container image:
```sh
apptainer pull docker://vllm/vllm-openai:latest
```
### Avoid filling up your `$HOME` directory
To avoid exceeding your `$HOME` quota (50GB) and inode limits (30,000 files), you should redirect `vLLM`'s cache and Hugging Face's model downloads to your scratch space:
```sh
export HF_HOME=/scratch/$USER/hf_cache
export VLLM_CACHE_ROOT=/scratch/$USER/vllm_cache
```
To make this configuration persist across sessions, add these `export` lines to your shell startup file (for example, `~/.bashrc`) so `vLLM` always uses your `$SCRATCH` storage.
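A minimal way to do this (a sketch, assuming a standard bash setup; adapt it if your cluster uses a different shell profile):

```sh
# Append the cache locations to your shell profile so every new login
# session picks them up automatically. Single quotes keep $USER from
# being expanded until login.
echo 'export HF_HOME=/scratch/$USER/hf_cache' >> ~/.bashrc
echo 'export VLLM_CACHE_ROOT=/scratch/$USER/vllm_cache' >> ~/.bashrc

# Apply the change to the current session.
source ~/.bashrc
```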
:::note
Files on `$SCRATCH` are not backed up and will be deleted after 60 days of inactivity. Always keep your source code and `.slurm` scripts in `$HOME`!
:::
## Run `vLLM`
### Online Serving (OpenAI-Compatible API)
`vLLM` implements the OpenAI API protocol, allowing it to be a drop-in replacement for applications using OpenAI's services. By default, it starts the server at `http://localhost:8000`. You can specify the address with `--host` and `--port` arguments.
**In Terminal 1:**
Start the `vLLM` server (in this example, we use a Qwen model):
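One way to launch it with the Apptainer image pulled earlier (a sketch; the exact flags and model are assumptions, so adjust them to your job script and GPU allocation):

```sh
# Serve Qwen2.5-7B-Instruct with GPU access (--nv) and the caches
# redirected to scratch; arguments after the image name are passed
# through to vLLM's OpenAI-compatible server.
apptainer run --nv \
    --env HF_HOME=/scratch/$USER/hf_cache \
    --env VLLM_CACHE_ROOT=/scratch/$USER/vllm_cache \
    vllm-openai_latest.sif \
    --model Qwen/Qwen2.5-7B-Instruct --port 8000
```

Once the server reports it is listening, you can query it from a second terminal with `curl` or any OpenAI-compatible client.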
If you need to process a large dataset at once without setting up a server, you can use `vLLM`'s `LLM` class.
For example, the following code downloads the `facebook/opt-125m` model from Hugging Face and runs it in `vLLM` using the default configuration.
```python
from vllm import LLM
# Initialize the vLLM engine.
llm = LLM(model="facebook/opt-125m")
```
After initializing the `LLM` instance, use the available APIs to perform model inference.
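For example, `generate` takes a list of prompts plus `SamplingParams` and returns one result per prompt (a short sketch; the prompts and sampling values here are illustrative):

```python
from vllm import LLM, SamplingParams

# Initialize the engine with a small model.
llm = LLM(model="facebook/opt-125m")

# Sampling settings for the completions; tune these for your workload.
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

prompts = ["Hello, my name is", "The capital of France is"]
outputs = llm.generate(prompts, params)

for out in outputs:
    # Each RequestOutput carries the prompt and its generated text.
    print(out.prompt, "->", out.outputs[0].text)
```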
### SGLang: A Simple Option for Offline Batch Inference
For cases where users only want to run batch inference and do not need an HTTP endpoint, SGLang provides a much simpler offline engine API than running a full `vLLM` server. It is particularly well suited to dataset processing, evaluation pipelines, and one-off large-scale inference jobs.
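As a sketch of what that looks like (based on SGLang's offline engine API; the model name and sampling values are placeholders):

```python
import sglang as sgl

# Create an offline engine; no HTTP server is started.
llm = sgl.Engine(model_path="Qwen/Qwen2.5-7B-Instruct")

prompts = ["Summarize vLLM in one sentence.", "What is Apptainer?"]
sampling_params = {"temperature": 0.8, "max_new_tokens": 64}

# Batch generation over the whole prompt list.
outputs = llm.generate(prompts, sampling_params)
for prompt, out in zip(prompts, outputs):
    print(prompt, "->", out["text"])

llm.shutdown()
```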
For more details and examples, see the official SGLang offline engine documentation: