backport 1.10 - fix(lightspeed): pre-create /data/vector_db/notebooks in init container #3042

Open
JslYoon wants to merge 1 commit into redhat-developer:release-1.10 from JslYoon:fix/RHDHBUGS-3371-lightspeed-notebooks-permissions

@JslYoon JslYoon commented Jun 22, 2026

On EKS/AKS, the RAG init container copies /rag/. to /data/ but never creates the notebooks subdirectory. At runtime, llama-stack tries to write /rag-content/vector_db/notebooks/faiss_store.db (same volume, mounted at /rag-content in the sidecar) and fails with PermissionError because it cannot create the directory. OCP avoids this via fsGroup defaults; EKS/AKS do not.

The fix pre-creates /data/vector_db/notebooks before the existing chmod so the directory exists and is writable when the sidecar starts.
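
The init step described above can be tried outside a cluster. The following is a minimal sketch, assuming scratch directories stand in for `/rag` and the shared volume at `/data`; the `WORK`, `RAG_SRC`, and `DATA_DIR` names and the placeholder `faiss_store.db` file are illustrative and not part of the PR.

```shell
#!/bin/sh
# Sketch of the init step, run against scratch directories instead of /rag and /data.
set -eu

WORK="$(mktemp -d)"     # scratch root (assumption: stands in for the pod filesystem)
RAG_SRC="$WORK/rag"     # stands in for /rag in the init image
DATA_DIR="$WORK/data"   # stands in for the shared volume mounted at /data

# Fake image content: a vector_db tree that has no notebooks subdirectory.
mkdir -p "$RAG_SRC/vector_db/rhdh_product_docs" "$DATA_DIR"
: > "$RAG_SRC/vector_db/rhdh_product_docs/faiss_store.db"

# Same command chain as the init container: copy, pre-create notebooks, then chmod.
echo 'Copying RAG data...'
cp -r "$RAG_SRC/." "$DATA_DIR/" \
  && mkdir -p "$DATA_DIR/vector_db/notebooks" \
  && chmod -R 777 "$DATA_DIR/vector_db" \
  && echo 'Copy complete.'

# Remove "$WORK" by hand when finished inspecting the result.
```

Because `mkdir -p` runs before `chmod -R`, the new `notebooks` directory is covered by the recursive permission change.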

Fixes: RHDHBUGS-3371

Description

Which issue(s) does this PR fix or relate to

  • Fixes #issue_number

PR acceptance criteria

  • Tests
  • Documentation

How to test changes / Special notes to the reviewer

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@JslYoon JslYoon requested a review from a team as a code owner June 22, 2026 21:14
@openshift-ci openshift-ci Bot requested review from gazarenkov and rm3l June 22, 2026 21:14
@JslYoon JslYoon deployed to external June 22, 2026 21:14 — with GitHub Actions Active

@rhdh-qodo-merge


PR Summary by Qodo

Fix Lightspeed RAG permissions by creating vector_db/notebooks in init container
🐞 Bug fix ⚙️ Configuration changes 🕐 10-20 Minutes

Description

• Create /data/vector_db/notebooks during RAG init copy to prevent runtime PermissionError.
• Ensure directory exists before chmod so sidecar can write FAISS store on EKS/AKS.
• Apply the same init-command fix across source manifests and generated install output.

Diagram

graph TD
  A["RAG init container"] --> B["Copy /rag -> /data"] --> C["mkdir -p notebooks"] --> D[("Shared PVC /data")] --> E["llama-stack sidecar"] --> F["Write faiss_store.db"]
  subgraph Legend
    direction LR
    _step["Container step"] ~~~ _vol[("Persistent volume")]
  end

High-Level Assessment

The following are alternative approaches to this PR:

1. Set fsGroup / runAsGroup on the pod securityContext
  • ➕ Aligns with Kubernetes-native permission management; avoids 777 chmod.
  • ➕ Applies broadly to all paths on the volume without per-directory handling.
  • ➖ Depends on cluster/storage behavior and security policies; may not work uniformly across CSI drivers.
  • ➖ May require broader manifest/security review and could have unintended permission effects.
2. Use chown/chmod with a non-world-writable mode (e.g., 775)
  • ➕ More secure than chmod -R 777 while still solving write access.
  • ➕ Keeps the fix localized to init container behavior.
  • ➖ Requires knowing/standardizing the runtime UID/GID of the sidecar.
  • ➖ More brittle if images/users change.
3. Make llama-stack create the directory on startup
  • ➕ Application-level robustness regardless of init container behavior.
  • ➖ Still fails if the parent path is not writable; doesn’t address underlying volume permission mismatch.
  • ➖ Requires app change/release rather than a manifest-only fix.

Recommendation: The PR’s approach (mkdir -p before existing chmod) is the fastest, lowest-risk fix for EKS/AKS because it directly addresses the missing directory that triggers the PermissionError. If security hardening is a follow-up goal, consider replacing chmod -R 777 with a tighter mode coupled with an explicit fsGroup/runAsGroup strategy once runtime UID/GID assumptions are validated.

Files changed (3) +3 / -3

Bug fix (3) +3 / -3
bundle/rhdh/manifests/rhdh-flavour-lightspeed-config_v1_configmap.yaml +1/-1

Create vector_db/notebooks during RAG init copy in ConfigMap manifest

• Updates the init container shell command to mkdir -p /data/vector_db/notebooks after copying /rag into /data. Ensures the notebooks directory exists before chmod -R 777 is applied.

config/profile/rhdh/default-config/flavours/lightspeed/deployment.yaml +1/-1

Create vector_db/notebooks in Lightspeed deployment init container command

• Adds mkdir -p /data/vector_db/notebooks to the init container command sequence. Prevents runtime failures when the sidecar tries to create/write the FAISS store under notebooks.

dist/rhdh/install.yaml +1/-1

Propagate init container notebooks directory creation into generated install manifest

• Mirrors the same init container command change in the distribution install YAML. Keeps the generated installation output consistent with the source manifests.

@rhdh-qodo-merge


Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (0) 📎 Requirement gaps (0) 📜 Skill insights (0)

Context used
✅ Tickets: RHDHBUGS-3371

Remediation recommended

1. World-writable vector DB 🐞 Bug ⛨ Security
Description
The init container runs chmod -R 777 /data/vector_db, making the RAG vector database writable by
any UID inside the pod, which allows unintended tampering/corruption of the SQLite/FAISS stores used
by llama-stack. This is a least-privilege violation on a runtime-critical data path mounted into the
main container.
Code

config/profile/rhdh/default-config/flavours/lightspeed/deployment.yaml[12]

+            - "echo 'Copying RAG data...'; cp -r /rag/. /data/ && mkdir -p /data/vector_db/notebooks && chmod -R 777 /data/vector_db && echo 'Copy complete.'"
Evidence
The deployment mounts the same rag-data-volume at /data in the init container and at
/rag-content in the main container, and the init command applies chmod -R 777 to that shared
path. The lightspeed/llama-stack config stores its SQLite DB at
/rag-content/vector_db/notebooks/faiss_store.db, so broad write permissions apply to
runtime-critical DB files.

config/profile/rhdh/default-config/flavours/lightspeed/deployment.yaml[6-52]
config/profile/rhdh/default-config/flavours/lightspeed/configmap-files.yaml[163-176]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The init container recursively applies world-writable permissions (`chmod -R 777`) to `/data/vector_db`, which is mounted into the runtime container at `/rag-content`. This broad permission can allow any process UID in the pod to modify or corrupt the vector DB files.

### Issue Context
The runtime configuration uses SQLite DB files under `/rag-content/vector_db/...`, so integrity of this directory matters.

### Fix Focus Areas
- config/profile/rhdh/default-config/flavours/lightspeed/deployment.yaml[6-52]
- bundle/rhdh/manifests/rhdh-flavour-lightspeed-config_v1_configmap.yaml[267-313]
- dist/rhdh/install.yaml[3288-3335]

### Suggested fix
Replace `chmod -R 777 /data/vector_db` with a least-privilege approach, e.g.:
- set pod/container `securityContext` (`runAsUser` + `fsGroup`) to the UID/GID the app runs as, and
- use `chmod -R g+rwX` (or `ug+rwX`) on the directory tree, and/or `chown -R <runtime_uid>:<runtime_gid> /data/vector_db` in the init container.
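
A tighter permission step along these lines might look like the sketch below. It is an illustration only: `tighten_vector_db` is a hypothetical helper, and the runtime UID/GID of the sidecar is not stated in this PR, so the demo passes the current user's IDs as placeholders. Group write access helps only if the sidecar actually runs with that GID, for example through `fsGroup` or `runAsGroup`.

```shell
#!/bin/sh
# Hypothetical helper: give the tree to one UID/GID and drop world access.
# $1 = directory, $2 = runtime UID, $3 = runtime GID (both assumed, not from the PR).
tighten_vector_db() {
  dir="$1"; uid="$2"; gid="$3"
  chown -R "$uid:$gid" "$dir" \
    && chmod -R ug+rwX,o-rwx "$dir"   # X adds execute on directories only here
}

# Demo on a scratch tree; the current user's IDs stand in for the sidecar's.
SCRATCH="$(mktemp -d)"
mkdir -p "$SCRATCH/vector_db/notebooks"
: > "$SCRATCH/vector_db/notebooks/faiss_store.db"

tighten_vector_db "$SCRATCH/vector_db" "$(id -u)" "$(id -g)"
ls -lR "$SCRATCH/vector_db"
```

In the init container, changing ownership to a different UID would require the container to run with enough privilege to call `chown`, which should be checked against the cluster's security policy.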


2. No fail-fast on missing data 🐞 Bug ☼ Reliability
Description
Because the init command now runs mkdir -p /data/vector_db/notebooks, /data/vector_db will exist
even if the copied image content did not include the required /rag/vector_db tree, so init may
succeed and defer the failure to runtime. This can make incorrectly-packaged/missing RAG content
harder to diagnose since llama-stack expects DB files under /rag-content/vector_db/....
Code

config/profile/rhdh/default-config/flavours/lightspeed/deployment.yaml[12]

+            - "echo 'Copying RAG data...'; cp -r /rag/. /data/ && mkdir -p /data/vector_db/notebooks && chmod -R 777 /data/vector_db && echo 'Copy complete.'"
Evidence
The init container performs a broad cp -r /rag/. /data/ and then unconditionally ensures
/data/vector_db/notebooks exists, which guarantees /data/vector_db exists even if the copy did
not include it. The runtime configuration explicitly points SQLite DB paths at
/rag-content/vector_db/..., so missing packaged content will surface as runtime DB-open errors
instead of an init-time failure.

config/profile/rhdh/default-config/flavours/lightspeed/deployment.yaml[6-46]
config/profile/rhdh/default-config/flavours/lightspeed/configmap-files.yaml[163-176]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The new `mkdir -p /data/vector_db/notebooks` changes behavior so the init container can succeed even if `cp -r /rag/. /data/` did not bring in the expected `vector_db` content. This shifts detection from init-time to runtime, where failures are harder to attribute.

### Issue Context
Runtime config references DB paths under `/rag-content/vector_db/...`, so missing packaged content should ideally fail the init container with a clear error.

### Fix Focus Areas
- config/profile/rhdh/default-config/flavours/lightspeed/deployment.yaml[6-52]
- bundle/rhdh/manifests/rhdh-flavour-lightspeed-config_v1_configmap.yaml[267-313]
- dist/rhdh/install.yaml[3288-3335]

### Suggested fix
After the copy, add explicit checks for required directories/files (example):
- `test -d /data/vector_db/rhdh_product_docs || { echo 'ERROR: missing vector_db content'; exit 1; }`
- optionally `test -f /data/vector_db/rhdh_product_docs/1.10/faiss_store.db || ...`
Then run the `mkdir -p /data/vector_db/notebooks` and permission adjustments.
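
The ordering suggested above (copy, verify, then create `notebooks` and adjust permissions) could be sketched as follows. The function names `check_rag_content` and `init_rag_volume` are hypothetical, the `rhdh_product_docs` directory name is taken from the example check above, and the `chmod -R 777` is kept only to mirror the PR's current behavior.

```shell
#!/bin/sh
# Hypothetical check: fail if the copied volume lacks the packaged vector_db content.
check_rag_content() {
  data_dir="$1"
  if ! test -d "$data_dir/vector_db/rhdh_product_docs"; then
    echo "ERROR: missing vector_db content under $data_dir" >&2
    return 1
  fi
}

# Hypothetical init sequence: copy, verify, then pre-create notebooks and chmod.
# $1 = source tree (stands in for /rag), $2 = target volume (stands in for /data).
init_rag_volume() {
  src="$1"; dst="$2"
  cp -r "$src/." "$dst/" || return 1
  check_rag_content "$dst" || return 1   # stop before mkdir -p can mask missing content
  mkdir -p "$dst/vector_db/notebooks" \
    && chmod -R 777 "$dst/vector_db"
}
```

Called as `init_rag_volume /rag /data || exit 1`, the init container would exit non-zero with a clear message when the image content is missing, instead of leaving the failure to llama-stack at runtime.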


@JslYoon JslYoon changed the title fix(lightspeed): pre-create /data/vector_db/notebooks in init container backport 1.10 - fix(lightspeed): pre-create /data/vector_db/notebooks in init container Jun 23, 2026