
wanadzhar913/multi-application-model-serving-ray-serve


Introduction

Our aim is to serve Qwen/Qwen3-Embedding-0.6B and Qwen/Qwen3-Reranker-0.6B using Ray Serve; specifically, we're following the Multi-application design pattern.
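Under the multi-application pattern, a single Serve config file lists several independent applications, each with its own name, route prefix, and import path. As a rough sketch of what such a config.yaml could look like (the reranker import path, app names, and route prefixes here are assumptions — `serve build` generates the real file):

```yaml
# Hypothetical multi-application Serve config. Only app.text_embedding:app
# appears in this repo; the reranker entry is an assumed placeholder.
proxy_location: EveryNode

http_options:
  host: 0.0.0.0
  port: 8000

applications:
  - name: embedding
    route_prefix: /embed
    import_path: app.text_embedding:app
  - name: reranker
    route_prefix: /rerank
    import_path: app.text_reranker:app
```

Each application is deployed and upgraded independently, but all of them share the same Ray cluster and HTTP proxy.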

How to setup your environment for testing & development

OPTIONAL (if your CUDA drivers aren't updated, etc.): In VSCode, set up the devcontainer.json by pressing CTRL + SHIFT + P > Reopen in Container.

Set up uv. It's a really goated package manager, and blazing fast! Other installation methods here.

pip install --upgrade pip
pip install uv
# uv self update  # run later to update uv itself

Once everything is set up, run the below:

uv venv
source .venv/bin/activate
uv pip install -r pyproject.toml --group dev # to add dev dependencies

How to deploy the project (Locally)

Run the below and you should see the dashboard pop up at http://localhost:8265/#/serve.

serve build app.text_embedding:app -o config.yaml # generate `config.yaml` (if you haven't)
ray start --head --dashboard-port=8265
serve run config.yaml

ray stop # shut down the Ray cluster once you're done testing

(Screenshot: Ray Serve dashboard)
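Once `serve run config.yaml` is up, you can hit the embedding app over plain HTTP. Here's a minimal client sketch — the `/embed` route prefix and the `{"texts": [...]}` request schema are assumptions, so check them against the route_prefix and request handler in your generated config and deployment code:

```python
# Hypothetical client for the embedding app. The URL path and JSON schema
# are assumptions -- verify against your config.yaml and deployment code.
import json
import urllib.request


def build_request(texts, url="http://localhost:8000/embed"):
    """Build a JSON POST request carrying the texts to embed."""
    payload = json.dumps({"texts": texts}).encode("utf-8")
    return urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )


def embed(texts):
    """Send the request to the running Serve app and parse the JSON reply."""
    with urllib.request.urlopen(build_request(texts)) as resp:
        return json.loads(resp.read())
```

With the cluster running, `embed(["What is Ray Serve?"])` should return the app's JSON response (e.g. a list of embedding vectors, depending on how the handler is written).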

How to deploy the project (with Docker)

docker build -t ray-embedding-service .
docker run -it --rm --gpus all -p 8000:8000 -p 8265:8265 -p 6379:6379 ray-embedding-service
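If you'd rather not retype the `docker run` flags, the same setup could be expressed as a compose file — a sketch only, since the service name is an assumption and the NVIDIA device reservation syntax requires a recent Docker Compose:

```yaml
# Hypothetical docker-compose.yml mirroring the `docker run` command above.
services:
  ray-embedding-service:
    build: .
    ports:
      - "8000:8000"   # Serve HTTP endpoint
      - "8265:8265"   # Ray dashboard
      - "6379:6379"   # Ray GCS port
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

Then `docker compose up --build` replaces the two commands above.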

To-dos

  • Figure out why Ray Dashboard isn't showing up at port 8265
  • Serve Reranker model
  • Add dynamic check to see if Ray Cluster is up in scripts/entrypoint.sh
  • Use smaller Docker Image for Dockerfile
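For the entrypoint to-do, the dynamic "is the cluster up" check could be sketched as a small bash function that polls `ray status` until it succeeds. The probed command is parameterized (so the function can be exercised without a cluster), and the retry count and sleep interval are assumptions:

```shell
#!/usr/bin/env bash
# Hypothetical readiness check for scripts/entrypoint.sh.
# $1 = command to probe (default: ray), $2 = max attempts (default: 30).
wait_for_ray() {
  local cmd="${1:-ray}" tries="${2:-30}" i
  for i in $(seq 1 "$tries"); do
    # `ray status` exits 0 once the cluster's GCS is reachable
    if "$cmd" status >/dev/null 2>&1; then
      echo "Ray cluster is up"
      return 0
    fi
    sleep 2
  done
  echo "Ray cluster did not come up after $tries attempts" >&2
  return 1
}
```

In the entrypoint this would gate the Serve startup, e.g. `wait_for_ray && serve run config.yaml`.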

Resources

