
Commit 62074df

Merge pull request #296 from NYU-RTS/minor-cleanup
minor polish
2 parents a0659a3 + 4ff2a80 commit 62074df

3 files changed

Lines changed: 41 additions & 34 deletions

File tree

docs/hpc/08_ml_ai_hpc/LLM Inference/01_llm_inferenceoverview.md renamed to docs/hpc/08_ml_ai_hpc/08_LLM inference/01_llm_inferenceoverview.md

File renamed without changes.

docs/hpc/08_ml_ai_hpc/LLM Inference/02_run_hf_model.md renamed to docs/hpc/08_ml_ai_hpc/08_LLM inference/02_run_hf_model.md

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-# Run a Hugging Face model
+# Basic LLM Inference with Hugging Face transformers

Here we provide an example of how one can run a Hugging Face large language model (LLM) on the NYU Torch cluster.

docs/hpc/08_ml_ai_hpc/LLM Inference/03_vLLM.md renamed to docs/hpc/08_ml_ai_hpc/08_LLM inference/03_vLLM.md

Lines changed: 40 additions & 33 deletions
@@ -1,42 +1,50 @@
-# vLLM - A Command Line LLM Tool
+# High-performance LLM inference with `vLLM`
+
## What is vLLM?
-[vLLM](https://docs.vllm.ai/en/latest/) is a fast and easy-to-use library for LLM inference and serving.
-
-## Why vLLM?
-We tested vLLM and llama-cpp on Torch, and found vLLM performs better on Torch:
-Model: Qwen2.5-7B-Instruct
-Prompt Tokens:512
-Output Tokens: 256
-|Backend|Peak Throughput|Median Latency(ms)|Recommendation
+[`vLLM`](https://docs.vllm.ai/en/latest/) is an easy-to-use library for LLM inference and serving which supports a wide variety of models, with optimized kernels for efficient GPU utilization.
+
+## Why `vLLM`?
+We tested `vLLM` and `llama-cpp` (the inference framework behind `ollama`) on Torch and found that `vLLM` performs better for `Qwen2.5-7B-Instruct` with `512` input and `256` output tokens.
+
+|Inference Server|Peak Throughput|Median Latency (ms)|Recommendation|
|-----|-----|-----|-----|
-|vLLM|~4689.6|~48.0|Best for Batch/Research|
-|llama-cpp|~115.0|~280.0|Best for Single User|
+|`vLLM`|~4689.6|~48.0|Best for Batch/Research|
+|`llama-cpp`|~115.0|~280.0|Best for Single User|
+
+### Test Environment
+GPU: NVIDIA L40S
+
+`vLLM`: 0.13.0
+
+`Ollama` (llama-cpp backend): 0.14.2

## vLLM Installation Instructions
-Create a vLLM directory in your /scratch directory, then install the vLLM image:
+Create a `vLLM` directory in your `/scratch` directory, then pull the vLLM image:
```
apptainer pull docker://vllm/vllm-openai:latest
```
-### Use High-Performance SCRATCH Storage
-LLMs require very fast storage. On Torch, the SCRATCH filesystem is an all-flash system designed for AI workloads, providing excellent performance.To avoid exceeding your $HOME quota (50GB) and inode limits (30,000 files), you should redirect vLLM's cache and Hugging Face's model downloads to your scratch space:
-```
+### Avoid filling up your `$HOME` directory
+To avoid exceeding your `$HOME` quota (50GB) and inode limits (30,000 files), you should redirect `vLLM`'s cache and Hugging Face's model downloads to your scratch space:
+```sh
export HF_HOME=/scratch/$USER/hf_cache
export VLLM_CACHE_ROOT=/scratch/$USER/vllm_cache
```
-You should run this to configure vLLM to always use your SCRATCH storage for consistent use:
-```
+Run the following once so that `vLLM` always uses your `$SCRATCH` storage in future sessions:
+```sh
echo "export HF_HOME=/scratch/\$USER/hf_cache" >> ~/.bashrc
echo "export VLLM_CACHE_ROOT=/scratch/\$USER/vllm_cache" >> ~/.bashrc
```

-Note: Files on $SCRATCH are not backed up and will be deleted after 60 days of inactivity. Always keep your source code and .slurm scripts in $HOME!
+:::note
+Files on `$SCRATCH` are not backed up and will be deleted after 60 days of inactivity. Always keep your source code and .slurm scripts in `$HOME`!
+:::

## Run vLLM
### Online Serving (OpenAI-Compatible API)
-vLLM implements the OpenAI API protocol, allowing it to be a drop-in replacement for applications using OpenAI's services. By default, it starts the server at http://localhost:8000. You can specify the address with --host and --port arguments.
+`vLLM` implements the OpenAI API protocol, allowing it to be a drop-in replacement for applications using OpenAI's services. By default, it starts the server at `http://localhost:8000`. You can specify the address with the `--host` and `--port` arguments.
**In Terminal 1:**
Start the vLLM server (in this example we use a Qwen model):
-```
+```sh
apptainer exec --nv vllm-openai_latest.sif vllm serve "Qwen/Qwen2.5-0.5B-Instruct"
```
When you see:
@@ -46,7 +54,7 @@ Application startup complete.
Open another terminal and log in to the same compute node as in terminal 1.

**In Terminal 2:**
-```
+```sh
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
@@ -58,33 +66,32 @@ curl http://localhost:8000/v1/chat/completions \
```

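Because the endpoint is OpenAI-compatible, the same request can also be made from Python. A minimal sketch, assuming the `openai` Python package is installed and the server from Terminal 1 is still running on `localhost:8000`:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
# vLLM does not check the API key by default, but the client requires a value.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```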
### Offline Inference
-If you need to process a large dataset at once without setting up a server, you can use vLLM's LLM class.
-For example, the following code downloads the facebook/opt-125m model from HuggingFace and runs it in vLLM using the default configuration.
-```
+If you need to process a large dataset at once without setting up a server, you can use `vLLM`'s `LLM` class.
+For example, the following code downloads the `facebook/opt-125m` model from Hugging Face and runs it in `vLLM` using the default configuration.
+```python
from vllm import LLM

# Initialize the vLLM engine.
llm = LLM(model="facebook/opt-125m")
```
After initializing the LLM instance, use the available APIs to perform model inference.

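For instance, here is a minimal sketch of batched generation with `llm.generate` and `SamplingParams` (the prompts and sampling values are only illustrative):

```python
from vllm import LLM, SamplingParams

# Initialize the engine and define sampling settings.
llm = LLM(model="facebook/opt-125m")
sampling_params = SamplingParams(temperature=0.8, max_tokens=64)

# Generate completions for a batch of prompts in a single call.
prompts = ["Hello, my name is", "The capital of France is"]
for output in llm.generate(prompts, sampling_params):
    print(output.prompt, "->", output.outputs[0].text)
```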
-### SGLang: A Simple Option for Offline Batch Inference (Supplement Material)
+### SGLang: A Simple Option for Offline Batch Inference
For cases where users only want to run batch inference and do not need an HTTP endpoint, SGLang provides a much simpler offline engine API than running a full vLLM server. It is particularly suitable for dataset processing, evaluation pipelines, and one-off large-scale inference jobs.
-For more details and examples, see the official SGLang offline engine documentation:
-https://docs.sglang.io/basic_usage/offline_engine_api.html
+For more details and examples, see the official SGLang offline engine documentation: https://docs.sglang.io/basic_usage/offline_engine_api.html


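As a rough sketch of what that offline engine API looks like (based on the linked SGLang documentation; the model name and sampling values are just examples):

```python
import sglang as sgl

# Load the model into an offline engine; no HTTP server is started.
llm = sgl.Engine(model_path="Qwen/Qwen2.5-0.5B-Instruct")

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = {"temperature": 0.8, "max_new_tokens": 64}

# Batched generation returns one result dict per prompt.
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print(prompt, "->", output["text"])

llm.shutdown()
```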
-## vLLM CLI
-The vllm command-line tool is used to run and manage vLLM models. You can start by viewing the help message with:
-```
+## `vLLM` CLI
+The `vllm` command-line tool is used to run and manage `vLLM` models. You can start by viewing the help message with:
+```sh
vllm --help
```
`serve` - Starts the vLLM OpenAI-compatible API server.
-```
+```sh
vllm serve meta-llama/Llama-2-7b-hf
```
`chat` - Generates chat completions via the running API server.
-```
+```sh
# Directly connect to localhost API without arguments
vllm chat

@@ -95,7 +102,7 @@ vllm chat --url http://{vllm-serve-host}:{vllm-serve-port}/v1
vllm chat --quick "hi"
```
`complete` - Generates text completions for a given prompt via the running API server.
-```
+```sh
# Directly connect to localhost API without arguments
vllm complete

0 commit comments
