Our aim is to serve Qwen/Qwen3-Embedding-0.6B and Qwen/Qwen3-Reranker-0.6B using Ray Serve. Specifically, we're aiming to follow the multi-application design pattern.
OPTIONAL (if your CUDA drivers aren't updated, etc.): In VS Code, open the project in the dev container defined by devcontainer.json by pressing CTRL + SHIFT + P > Reopen in Container.
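
As a rough sketch of what the multi-application pattern looks like in a Serve config, each model gets its own entry under `applications`, with its own name, route prefix, and import path. The snippet below builds that layout as a plain Python dict and checks the two uniqueness rules Serve enforces (application names and route prefixes). `app.text_embedding:app` is the import path used later in this README; `app.reranker:app`, the application names, and both route prefixes are assumptions made up for illustration, not values from this repo.

```python
import json


def build_multi_app_config() -> dict:
    """Return a dict mirroring the layout of a multi-application Serve config."""
    return {
        "applications": [
            {
                "name": "text_embedding",  # assumed application name
                "route_prefix": "/embed",  # assumed route prefix
                "import_path": "app.text_embedding:app",  # from this README
            },
            {
                "name": "reranker",  # hypothetical: reranker is not served yet
                "route_prefix": "/rerank",  # assumed route prefix
                "import_path": "app.reranker:app",  # hypothetical module path
            },
        ]
    }


def validate_config(config: dict) -> None:
    """Raise ValueError if application names or route prefixes collide."""
    apps = config["applications"]
    for key in ("name", "route_prefix"):
        values = [app[key] for app in apps]
        if len(set(values)) != len(values):
            raise ValueError(f"duplicate {key} in applications: {values}")


config = build_multi_app_config()
validate_config(config)
print(json.dumps(config, indent=2))
```

Keeping one entry per model means each application can be updated or redeployed without touching the other, which is the main appeal of the pattern.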
Set up uv. Really goated package manager. It's blazing fast! Other methods here.
pip install --upgrade pip
pip install uv
# uv self update

Once everything is set up, run the below:
uv venv
source .venv/bin/activate
uv pip install -r pyproject.toml --group dev # to add dev dependencies

Run the below and you should see the dashboard pop up at http://localhost:8265/#/serve.*
serve build app.text_embedding:app -o config.yaml # generate `config.yaml` (if you haven't)
ray start --head --dashboard-port=8265
serve run config.yaml
ray stop # shut down the Ray cluster once you're done testing

To run in Docker instead:
docker build -t ray-embedding-service .
docker run -it --rm --gpus all -p 8000:8000 -p 8265:8265 -p 6379:6379 ray-embedding-service
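
Once the container is up, the service should be reachable on port 8000, the HTTP port published in the `docker run` command above. Below is a minimal client sketch using only the standard library. The `/embed` route and the `{"texts": [...]}` payload shape are assumptions for illustration; check the deployment in `app.text_embedding` for the real route and request schema.

```python
import json
import urllib.request


def build_embedding_request(texts, base_url="http://localhost:8000", route="/embed"):
    """Build (but do not send) a JSON POST request carrying a list of texts.

    The route and payload keys are hypothetical placeholders.
    """
    payload = json.dumps({"texts": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + route,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout_s=30.0):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read())
```

With the service running, `send(build_embedding_request(["hello world"]))` would return whatever JSON the deployment responds with.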
- Figure out why Ray Dashboard isn't showing up at port 8265
- Serve Reranker model
- Add dynamic check to see if Ray Cluster is up in `scripts/entrypoint.sh`
- Use smaller Docker Image for `Dockerfile`
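
For the dynamic cluster check, one possible approach is to poll a port the Ray head node listens on before starting Serve. The sketch below is a minimal version of that idea, assuming the head node's port 6379 (the one published in the `docker run` command) is a good enough signal. `wait_for_port` is a hypothetical helper, not something that exists in this repo, and an open port only shows that something is listening, not that the cluster is healthy.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 60.0, interval_s: float = 1.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or timeout_s elapses.

    Returns True once a connection is accepted, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means a process is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

An entrypoint script could call `wait_for_port("127.0.0.1", 6379)` after `ray start --head` and exit with an error if it returns `False`, rather than sleeping for a fixed number of seconds.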
- Multi-application for Ray Serve (main project inspiration): https://docs.ray.io/en/latest/serve/multi-app.html
- For FastAPI integration: https://github.com/ray-project/ray/blob/cfcc68f13798eb5c2c9888a089d4b9c95d21b7fc/python/ray/serve/tests/test_fastapi.py#L153-L325
- How to install `flash-attn` with `--no-build-isolation` using `uv`: astral-sh/uv#6437 (comment) & https://docs.astral.sh/uv/concepts/projects/config/#build-isolation
- Devcontainer: Diff between Remote & Container Users: https://stackoverflow.com/questions/67468439/vs-code-devcontainers-what-is-the-difference-between-remoteuser-and-containeru
- Ray Dashboard is empty: https://discuss.ray.io/t/ray-dashboard-is-empty/12883/6 *(we solved this by bumping up Ray version from 2.8.2 to 2.9.0)
