Commit 70a44f9

dminnear-rh authored and gaurav-nelson committed
add docs for rag-llm-cpu pattern
1 parent cce2b26 commit 70a44f9

8 files changed

Lines changed: 457 additions & 0 deletions

Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
---
title: RAG LLM Chatbot on CPU
date: 2025-10-24
tier: sandbox
summary: This pattern deploys a CPU-based LLM, your choice of several RAG DB providers, and a simple chatbot UI which exposes the configuration and results of the RAG queries.
rh_products:
- Red Hat OpenShift Container Platform
- Red Hat OpenShift GitOps
- Red Hat OpenShift AI
partners:
- Microsoft
- IBM Fusion
industries:
- General
aliases: /rag-llm-cpu/
links:
  github: https://github.com/validatedpatterns-sandbox/rag-llm-cpu
  install: getting-started
  bugs: https://github.com/validatedpatterns-sandbox/rag-llm-cpu/issues
  feedback: https://docs.google.com/forms/d/e/1FAIpQLScI76b6tD1WyPu2-d_9CCVDr3Fu5jYERthqLKJDUGwqBg7Vcg/viewform
---

# **CPU-based RAG LLM chatbot**

## **Introduction**

The CPU-based RAG LLM chatbot Validated Pattern deploys a retrieval-augmented generation (RAG) chatbot on Red Hat OpenShift by using Red Hat OpenShift AI.
The pattern runs entirely on CPU nodes without requiring GPU hardware, which provides a cost-effective and accessible solution for environments where GPU resources are limited or unavailable.
This pattern provides a secure, flexible, and production-ready starting point for building and deploying on-premise generative AI applications.

## **Target audience**

This pattern is intended for the following users:

- **Developers & Data Scientists** who want to build and experiment with RAG-based large language model (LLM) applications.
- **MLOps & DevOps Engineers** who are responsible for deploying and managing AI/ML workloads on OpenShift.
- **Architects** who evaluate cost-effective methods for delivering generative AI capabilities on-premise.

## **Why Use This Pattern?**

- **Cost-Effective**: The pattern runs entirely on CPU nodes, which removes the need for expensive and scarce GPU resources.
- **Flexible**: The pattern supports multiple vector database backends, such as Elasticsearch, PGVector, and Microsoft SQL Server, to integrate with existing data infrastructure.
- **Transparent**: The Gradio frontend exposes the internals of the RAG query and LLM prompts, which provides insight into the generation process.
- **Extensible**: The pattern uses open-source standards, such as KServe and OpenAI-compatible APIs, to serve as a foundation for complex applications.
## **Architecture Overview**

At a high level, the components work together in the following sequence (a sample request follows the list):

1. A user enters a query into the **Gradio UI**.
2. The backend application, using **LangChain**, queries a configured **vector database** to retrieve relevant documents.
3. The retrieved documents are combined with the user's original query into a prompt.
4. The prompt is sent to the **KServe-deployed LLM**, which runs via **llama.cpp** on a CPU node.
5. The LLM generates a response, which is streamed back to the **Gradio UI**.
6. **Vault** provides the necessary credentials for the vector database and HuggingFace token at runtime.
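
The served model speaks the OpenAI-compatible chat completions API, so you can exercise step 4 directly. The following is a minimal sketch, assuming the default in-cluster service name `cpu-inference-service-predictor` (used in the frontend configuration later in this pattern) and the default model file; run it from a pod inside the cluster:

```sh
# Send an OpenAI-compatible chat completion request to the llama.cpp
# runtime behind the KServe predictor service.
curl -s http://cpu-inference-service-predictor/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistral-7b-instruct-v0.2.Q5_0.gguf",
        "messages": [{"role": "user", "content": "What is a Validated Pattern?"}]
      }'
```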

![Overview](/images/rag-llm-cpu/rag-augmented-query.png)

_Figure 1. Overview of a RAG query from the user's perspective._

## **Prerequisites**

Before you begin, ensure that you have access to the following resources (a quick check follows the list):

- A Red Hat OpenShift cluster version 4.x. (The recommended size is at least two `m5.4xlarge` nodes.)
- A HuggingFace API token.
- The `podman` command-line tool.
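
A minimal sketch for confirming the client-side prerequisites, assuming you authenticate with `oc` and export your HuggingFace token as `HF_TOKEN` (the variable name here is illustrative):

```sh
# Confirm cluster access and local tooling before installing the pattern
oc version
oc get nodes       # at least two m5.4xlarge-class (16 vCPU / 64 GiB) workers recommended
podman --version
[ -n "$HF_TOKEN" ] && echo "HuggingFace token is set" || echo "HF_TOKEN is not set"
```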

## **What This Pattern Provides**

- A [KServe](https://github.com/kserve/kserve)-based LLM deployed to [Red Hat OpenShift AI](https://www.redhat.com/en/products/ai/openshift-ai) that runs entirely on a CPU node with a [llama.cpp](https://github.com/ggml-org/llama.cpp) runtime.
- A choice of one or more vector database providers to serve as a RAG backend with configurable web-based or Git repository-based sources. Vector embedding and document retrieval are implemented with [LangChain](https://docs.langchain.com/oss/python/langchain/overview).
- [Vault](https://developer.hashicorp.com/vault)-based secret management for a HuggingFace API token and credentials for the supported databases: [Elasticsearch](https://www.elastic.co/docs/solutions/search/vector), [PGVector](https://github.com/pgvector/pgvector), and [Microsoft SQL Server](https://learn.microsoft.com/en-us/sql/sql-server/ai/vectors?view=sql-server-ver17).
- A [Gradio](https://www.gradio.app/)-based frontend for connecting to multiple [OpenAI API-compatible](https://github.com/openai/openai-openapi) LLMs. This frontend exposes the internals of the RAG query and LLM prompts so that users have insight into the running processes.
Lines changed: 286 additions & 0 deletions
@@ -0,0 +1,286 @@
---
title: Configuring this pattern
weight: 20
aliases: /rag-llm-cpu/configure/
---

# **Configuring this pattern**

This guide covers common customizations, such as changing the default large language model (LLM), adding new models, and configuring retrieval-augmented generation (RAG) data sources. This guide assumes that you have already completed the [Getting started](/rag-llm-cpu/getting-started/) guide.

## **Configuration overview**

ArgoCD manages this pattern by using GitOps. All application configurations are defined in the `values-prod.yaml` file. To customize a component, complete the following steps (a sketch of the workflow follows the list):

1. **Enable an override:** In the `values-prod.yaml` file, locate the application that you want to change, such as `llm-inference-service`, and add an `extraValueFiles:` entry that points to a new override file, such as `$patternref/overrides/llm-inference-service.yaml`.
2. **Create the override file:** Create the new `.yaml` file in the `/overrides` directory.
3. **Add settings:** Add the specific values that you want to change to the new file.
4. **Commit and synchronize:** Commit your changes and allow ArgoCD to synchronize the application.
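
In practice, the workflow is plain Git plus a watch on ArgoCD. A sketch, assuming ArgoCD applications live in the default `openshift-gitops` namespace and using the override file from the next task as an example:

```sh
# Steps 1-3: create the override file and reference it in values-prod.yaml
vim values-prod.yaml overrides/llm-inference-service.yaml

# Step 4: commit and push; ArgoCD reconciles from Git
git add values-prod.yaml overrides/llm-inference-service.yaml
git commit -m "Override llm-inference-service values"
git push

# Watch the applications synchronize
oc get applications -n openshift-gitops -w
```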

## **Task: Changing the default LLM**

By default, the pattern deploys the `mistral-7b-instruct-v0.2.Q5_0.gguf` model. You can change this to a different model, such as a different quantization, or adjust the resource usage. To change the default LLM, create an override file for the existing `llm-inference-service` application.

1. **Enable the override:**
   In the `values-prod.yaml` file, update the `llm-inference-service` application to use an override file:

   ```yaml
   clusterGroup:
     # ...
     applications:
       # ...
       llm-inference-service:
         name: llm-inference-service
         namespace: rag-llm-cpu
         chart: llm-inference-service
         chartVersion: 0.3.*
         extraValueFiles: # <-- ADD THIS BLOCK
           - $patternref/overrides/llm-inference-service.yaml
   ```

2. **Create the override file:**
   Create a new file named `overrides/llm-inference-service.yaml`. The following example switches to a different model file (Q8_0) and increases the CPU and memory requests:

   ```yaml
   inferenceService:
     resources: # <-- Increased allocated resources
       requests:
         cpu: "8"
         memory: 12Gi
       limits:
         cpu: "12"
         memory: 24Gi

   servingRuntime:
     args:
       - --model
       - /models/mistral-7b-instruct-v0.2.Q8_0.gguf # <-- Changed model file

   model:
     repository: TheBloke/Mistral-7B-Instruct-v0.2-GGUF
     files:
       - mistral-7b-instruct-v0.2.Q8_0.gguf # <-- Changed file to download
   ```
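
After ArgoCD synchronizes, KServe rolls out a new predictor with the updated model and resources. A quick verification sketch, assuming the default `rag-llm-cpu` namespace:

```sh
# Watch the InferenceService and its predictor pod roll out
oc get inferenceservice -n rag-llm-cpu
oc get pods -n rag-llm-cpu -w
```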

## **Task: Adding a second LLM**

You can deploy an additional LLM and add it to the demonstration user interface (UI). The following example deploys the HuggingFace TGI runtime instead of `llama.cpp`. This process requires two steps: deploying the new LLM and configuring the frontend UI.

### **Step 1: Deploying the new LLM service**

1. **Define the new application:**
   In the `values-prod.yaml` file, add a new application named `another-llm-inference-service` to the applications list.

   ```yaml
   clusterGroup:
     # ...
     applications:
       # ...
       another-llm-inference-service: # <-- ADD THIS NEW APPLICATION
         name: another-llm-inference-service
         namespace: rag-llm-cpu
         chart: llm-inference-service
         chartVersion: 0.3.*
         extraValueFiles:
           - $patternref/overrides/another-llm-inference-service.yaml
   ```

2. **Create the override file:**
   Create a new file named `overrides/another-llm-inference-service.yaml`. This file defines the new model and disables the creation of resources, such as secrets, that the first LLM already created.

   ```yaml
   dsc:
     initialize: false
   externalSecret:
     create: false

   # Define the new InferenceService
   inferenceService:
     name: hf-inference-service # <-- New service name
     minReplicas: 1
     maxReplicas: 1
     resources:
       requests:
         cpu: "8"
         memory: 32Gi
       limits:
         cpu: "12"
         memory: 32Gi

   # Define the new runtime (HuggingFace TGI)
   servingRuntime:
     name: hf-runtime
     port: 8080
     image: docker.io/kserve/huggingfaceserver:latest
     modelFormat: huggingface
     args:
       - --model_dir
       - /models
       - --model_name
       - /models/Mistral-7B-Instruct-v0.3
       - --http_port
       - "8080"

   # Define the new model to download
   model:
     repository: mistralai/Mistral-7B-Instruct-v0.3
     files:
       - generation_config.json
       - config.json
       - model.safetensors.index.json
       - model-00001-of-00003.safetensors
       - model-00002-of-00003.safetensors
       - model-00003-of-00003.safetensors
       - tokenizer.model
       - tokenizer.json
       - tokenizer_config.json
   ```

> **IMPORTANT:** A known issue in the model-downloading container requires that you explicitly list all files that you want to download from the HuggingFace repository. Ensure that you list every file required for the model to run.
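
To assemble that file list, you can query the HuggingFace Hub API for the repository contents. A sketch, assuming the public `/api/models/<repository>` endpoint and the `jq` tool; gated repositories additionally require an `Authorization: Bearer <token>` header:

```sh
# List all files in the HuggingFace repository to copy into `model.files`
curl -s https://huggingface.co/api/models/mistralai/Mistral-7B-Instruct-v0.3 \
  | jq -r '.siblings[].rfilename'
```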

### **Step 2: Adding the new LLM to the demonstration UI**

Configure the frontend to recognize the new LLM.

1. **Edit the frontend overrides:**
   Open the `overrides/rag-llm-frontend-values.yaml` file.
2. **Update LLM_URLS:**
   Add the URL of the new service to the `LLM_URLS` environment variable. The URL uses the `http://<service-name>-predictor/v1` format, or `http://<service-name>-predictor/openai/v1` for the HuggingFace runtime.
   In the `overrides/rag-llm-frontend-values.yaml` file:

   ```yaml
   env:
     # ...
     - name: LLM_URLS
       value: '["http://cpu-inference-service-predictor/v1","http://hf-inference-service-predictor/openai/v1"]'
   ```
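
Each entry in `LLM_URLS` should answer the OpenAI-compatible model listing endpoint once its service is ready. A quick in-cluster check, using the service names configured above:

```sh
# Both URLs should return a JSON list of available models
curl -s http://cpu-inference-service-predictor/v1/models
curl -s http://hf-inference-service-predictor/openai/v1/models
```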

## **Task: Customizing RAG data sources**

By default, the pattern ingests data from the Validated Patterns documentation. You can change this to point to public Git repositories or web pages.

1. **Edit the vector database overrides:**
   Open the `overrides/vector-db-values.yaml` file.
2. **Update sources:**
   Modify the `repoSources` and `webSources` keys. You can add any publicly available Git repository or public web URL. The job also processes PDF files from `webSources`.
   In the `overrides/vector-db-values.yaml` file:

   ```yaml
   providers:
     qdrant:
       enabled: true
     mssql:
       enabled: true

   vectorEmbedJob:
     repoSources:
       - repo: https://github.com/your-org/your-docs.git # <-- Your repo
         globs:
           - "**/*.md"
     webSources:
       - https://your-company.com/product-manual.pdf # <-- Your PDF
     chunking:
       size: 4096
   ```
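
After the change synchronizes, the embedding job re-ingests the configured sources. To confirm ingestion, inspect the job logs; the job name below is a placeholder, so look it up with the first command:

```sh
# Find the vector-embedding job, then follow its logs
oc get jobs -n rag-llm-cpu
oc logs -n rag-llm-cpu job/<vector-embed-job-name> -f
```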

## **Task: Adding a new RAG database provider**

By default, the pattern enables `qdrant` and `mssql`. You can also enable `redis`, `pgvector`, or `elastic`. This process requires three steps: adding secrets, enabling the database, and configuring the UI.

### **Step 1: Updating the secrets file**

1. If the new database requires credentials, add them to the main secrets file:

   ```sh
   vim ~/values-secret-rag-llm-cpu.yaml
   ```

2. Add the necessary credentials. For example:

   ```yaml
   secrets:
     # ...
     - name: pgvector
       fields:
         - name: user
           value: user # <-- Update the user
         - name: password
           value: password # <-- Update the password
         - name: db
           value: db # <-- Update the db
   ```

> **NOTE:** For information about the expected values, see the [`values-secret.yaml.template`](https://github.com/validatedpatterns-sandbox/rag-llm-cpu/blob/main/values-secret.yaml.template) file.
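
Editing the local secrets file does not update the cluster by itself; the values must be pushed into Vault. Assuming the standard Validated Patterns workflow from the Getting started guide:

```sh
# Reload the updated secrets into Vault
./pattern.sh make load-secrets
```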

### **Step 2: Enabling the provider in the vector database chart**

Edit the `overrides/vector-db-values.yaml` file and set `enabled: true` for the providers that you want to add.

In the `overrides/vector-db-values.yaml` file:

```yaml
providers:
  qdrant:
    enabled: true
  mssql:
    enabled: true
  pgvector: # <-- ADD THIS
    enabled: true
  elastic: # <-- OR THIS
    enabled: true
```

### **Step 3: Adding the provider to the demonstration UI**

Edit the `overrides/rag-llm-frontend-values.yaml` file to configure the UI:

1. Add the secrets for the new provider to the `dbProvidersSecret.vault` list.
2. Add the connection details for the new provider to the `dbProvidersSecret.providers` list.

The following example shows the configuration for non-default RAG database providers:

In the `overrides/rag-llm-frontend-values.yaml` file:

```yaml
dbProvidersSecret:
  vault:
    - key: mssql
      field: sapassword
    - key: pgvector # <-- Add this block
      field: user
    - key: pgvector
      field: password
    - key: pgvector
      field: db
    - key: elastic # <-- Add this block
      field: user
    - key: elastic
      field: password
  providers:
    - type: qdrant # <-- Example for Qdrant
      collection: docs
      url: http://qdrant-service:6333
      embedding_model: sentence-transformers/all-mpnet-base-v2
    - type: mssql # <-- Example for MSSQL
      table: docs
      connection_string: >-
        Driver={ODBC Driver 18 for SQL Server};
        Server=mssql-service,1433;
        Database=embeddings;
        UID=sa;
        PWD={{ .mssql_sapassword }};
        TrustServerCertificate=yes;
        Encrypt=no;
      embedding_model: sentence-transformers/all-mpnet-base-v2
    - type: redis # <-- Example for Redis
      index: docs
      url: redis://redis-service:6379
      embedding_model: sentence-transformers/all-mpnet-base-v2
    - type: elastic # <-- Example for Elastic
      index: docs
      url: http://elastic-service:9200
      user: "{{ .elastic_user }}"
      password: "{{ .elastic_password }}"
      embedding_model: sentence-transformers/all-mpnet-base-v2
    - type: pgvector # <-- Example for PGVector
      collection: docs
      url: >-
        postgresql+psycopg://{{ .pgvector_user }}:{{ .pgvector_password }}@pgvector-service:5432/{{ .pgvector_db }}
      embedding_model: sentence-transformers/all-mpnet-base-v2
```
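
After you commit these changes, verify that the new provider pods are running and that the frontend secret was regenerated. A sketch, assuming the default `rag-llm-cpu` namespace and the External Secrets Operator used by the pattern:

```sh
# Confirm the new database pods and the synced frontend secret
oc get pods -n rag-llm-cpu
oc get externalsecrets -n rag-llm-cpu
```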
