Commit 34eb7bb

Merge pull request #23 from royisme/codex/refactor-neo4j-knowledge-service-pipeline
Refactor Neo4j ingestion through configurable pipelines
2 parents 588bb37 + 67a2f34 commit 34eb7bb

5 files changed

Lines changed: 647 additions & 179 deletions

docs/getting-started/configuration.md

Lines changed: 38 additions & 0 deletions
@@ -184,6 +184,44 @@ OPERATION_TIMEOUT=300
 LARGE_DOCUMENT_TIMEOUT=600
 ```

+### Ingestion Pipelines
+
+Document ingestion now uses [LlamaIndex ingestion pipelines](https://docs.llamaindex.ai/) with pluggable connectors, transformations, and writers. The service ships with three pipelines (`manual_input`, `file`, `directory`), and you can override or extend them from configuration by providing a JSON-style mapping in your `.env` file:
+
+```bash
+INGESTION_PIPELINES='{
+  "file": {
+    "transformations": [
+      {
+        "class_path": "llama_index.core.node_parser.SimpleNodeParser",
+        "kwargs": {"chunk_size": 256, "chunk_overlap": 20}
+      },
+      {
+        "class_path": "codebase_rag.services.knowledge.pipeline_components.MetadataEnrichmentTransformation",
+        "kwargs": {"metadata": {"language": "python"}}
+      }
+    ]
+  },
+  "git": {
+    "connector": {
+      "class_path": "my_project.pipeline.GitRepositoryConnector",
+      "kwargs": {"branch": "main"}
+    },
+    "transformations": [
+      {
+        "class_path": "my_project.pipeline.CodeBlockParser",
+        "kwargs": {"max_tokens": 400}
+      }
+    ],
+    "writer": {
+      "class_path": "codebase_rag.services.knowledge.pipeline_components.Neo4jKnowledgeGraphWriter"
+    }
+  }
+}'
+```
+
+Each entry is merged with the defaults. This means you can change chunking behaviour, add metadata enrichment steps, or register new data sources by publishing your own connector class. At runtime the knowledge service builds and reuses the configured pipeline instances so changes only require a service restart.
+
 ### Neo4j Performance Tuning

 For large repositories:
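
The documentation hunk above says each `INGESTION_PIPELINES` entry is merged with the defaults and that components are named by `class_path` plus `kwargs`. The diff does not show how the service does this, so the following is only a minimal sketch under assumptions: a shallow per-pipeline merge, and an `importlib`-based loader. The helper names `merge_pipelines` and `load_component`, the placeholder defaults, and the use of `datetime.timedelta` as a stand-in component are all hypothetical.

```python
import importlib
import json
from typing import Any, Dict

PipelineConfig = Dict[str, Dict[str, Any]]


def merge_pipelines(defaults: PipelineConfig, overrides: PipelineConfig) -> PipelineConfig:
    """Shallow-merge each override entry onto the default of the same name.

    Assumption: keys present in an override replace the default's keys,
    and unknown pipeline names are registered as new pipelines.
    """
    merged = {name: dict(cfg) for name, cfg in defaults.items()}
    for name, cfg in overrides.items():
        merged.setdefault(name, {}).update(cfg)
    return merged


def load_component(spec: Dict[str, Any]) -> Any:
    """Import the class named by spec["class_path"] and build it with spec["kwargs"]."""
    module_name, _, class_name = spec["class_path"].rpartition(".")
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls(**spec.get("kwargs", {}))


# Placeholder defaults; the service's real defaults are not part of this diff.
defaults: PipelineConfig = {
    "file": {
        "connector": {"class_path": "example.FileConnector"},
        "transformations": [],
        "writer": {"class_path": "example.Writer"},
    }
}

# Same shape as the INGESTION_PIPELINES value. A stdlib class stands in for a
# real transformation so the sketch runs without LlamaIndex installed.
overrides: PipelineConfig = json.loads(
    """
    {
      "file": {
        "transformations": [
          {"class_path": "datetime.timedelta", "kwargs": {"seconds": 30}}
        ]
      },
      "git": {
        "connector": {
          "class_path": "my_project.pipeline.GitRepositoryConnector",
          "kwargs": {"branch": "main"}
        }
      }
    }
    """
)

merged = merge_pipelines(defaults, overrides)
step = load_component(merged["file"]["transformations"][0])
```

In this sketch the `file` pipeline keeps its default connector and writer while its transformation list is replaced, and `git` is added as a new pipeline. A deep merge would behave differently for nested keys, so the actual service may not match this.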

src/codebase_rag/config/settings.py

Lines changed: 5 additions & 1 deletion
@@ -7,7 +7,7 @@

 from pydantic_settings import BaseSettings
 from pydantic import Field
-from typing import Optional, Literal
+from typing import Optional, Literal, Dict, Any


 class Settings(BaseSettings):
@@ -96,6 +96,10 @@ class Settings(BaseSettings):
     # Document Processing Settings
     max_document_size: int = Field(default=10 * 1024 * 1024, description="Maximum document size in bytes (10MB)")
     max_payload_size: int = Field(default=50 * 1024 * 1024, description="Maximum task payload size for storage (50MB)")
+    ingestion_pipelines: Dict[str, Dict[str, Any]] = Field(
+        default_factory=dict,
+        description="Optional ingestion pipeline overrides",
+    )

     # API Settings
     cors_origins: list = Field(default=["*"], description="CORS allowed origins")
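
The new `ingestion_pipelines` field is typed `Dict[str, Dict[str, Any]]`, so the `INGESTION_PIPELINES` environment value has to be a JSON object whose values are themselves objects. pydantic-settings generally decodes dict-typed fields from a JSON string. The sketch below imitates that step with the standard library only, to show which shapes would be accepted. The function `parse_ingestion_pipelines` is a hypothetical helper, not part of the commit.

```python
import json
import os
from typing import Any, Dict


def parse_ingestion_pipelines(raw: str) -> Dict[str, Dict[str, Any]]:
    """Decode a JSON string into the shape the settings field expects."""
    if not raw.strip():
        # Mirrors default_factory=dict: no overrides configured.
        return {}
    value = json.loads(raw)
    if not isinstance(value, dict):
        raise ValueError("INGESTION_PIPELINES must be a JSON object")
    for name, cfg in value.items():
        if not isinstance(cfg, dict):
            raise ValueError(f"pipeline {name!r} must map to a JSON object")
    return value


# Simulate the value a .env loader would place in the environment.
os.environ["INGESTION_PIPELINES"] = '{"file": {"transformations": []}}'
pipelines = parse_ingestion_pipelines(os.environ.get("INGESTION_PIPELINES", ""))
```

An unset or empty variable yields an empty mapping here, which corresponds to the field's `default_factory=dict`. A top-level JSON array or a non-object pipeline entry is rejected with `ValueError`.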
