RAG System Implementation for OpenProject Haystack

This document describes the Retrieval-Augmented Generation (RAG) system implementation that enhances project status reports with PMFlex methodology context and templates.

Overview

The RAG system integrates PMFlex (German Federal Government Project Management) documents into the report generation process, providing context-aware, methodology-compliant project status reports.

Architecture

Core Components

Document Processor (src/services/document_processor.py)
- Processes PDF, DOCX, PPTX, and TXT files
- Extracts text with metadata preservation
- Chunks documents with configurable overlap
- Supports multiple document formats

Vector Store (src/services/vector_store.py)
- FAISS-based vector storage for embeddings
- Sentence-transformers for text embeddings
- Similarity search with configurable thresholds
- Persistent storage with metadata

Document Manager (src/services/document_manager.py)
- Manages document lifecycle and indexing
- Handles document updates and refresh
- Provides statistics and validation
- Maintains document metadata index

RAG Pipeline (src/pipelines/rag_pipeline.py)
- High-level RAG orchestration
- Context enhancement for reports
- Template and methodology retrieval
- Integration with generation pipeline

Integration Points

Generation Pipeline: Enhanced with RAG context retrieval
Report Templates: PMFlex-aware templates with context integration
API Endpoints: RAG management and status endpoints
Startup Process: Automatic RAG system initialization

Document Structure

documents/
├── pmflex/
│   ├── templates/
│   │   ├── pmflex_status_report_template.txt
│   │   └── [other templates]
│   ├── handbooks/
│   │   ├── pmflex_methodology_guide.txt
│   │   └── [other handbooks]
│   └── metadata/
│       └── document_index.json

Supported Document Formats

PDF: Full text extraction with page metadata
DOCX: Paragraph-based text extraction
PPTX: Slide-based text extraction
TXT: Plain text processing

Configuration

Environment Variables

# RAG Configuration
DOCUMENTS_PATH=documents
VECTOR_STORE_PATH=vector_store
EMBEDDING_MODEL=all-MiniLM-L6-v2
CHUNK_SIZE=800
CHUNK_OVERLAP=100
MAX_RETRIEVED_DOCS=5

Settings (`config/settings.py`)

import os

# RAG Configuration
DOCUMENTS_PATH: str = os.getenv("DOCUMENTS_PATH", "documents")
VECTOR_STORE_PATH: str = os.getenv("VECTOR_STORE_PATH", "vector_store")
EMBEDDING_MODEL: str = os.getenv("EMBEDDING_MODEL", "all-MiniLM-L6-v2")
CHUNK_SIZE: int = int(os.getenv("CHUNK_SIZE", "800"))
CHUNK_OVERLAP: int = int(os.getenv("CHUNK_OVERLAP", "100"))
MAX_RETRIEVED_DOCS: int = int(os.getenv("MAX_RETRIEVED_DOCS", "5"))
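
The chunking values interact: an overlap equal to or larger than the chunk size would keep a sliding window from advancing. A minimal sketch of load-time validation follows. It is not the project's actual settings module; `RagSettings` and `load_rag_settings` are hypothetical names, and only the variable names and defaults are taken from the configuration above.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class RagSettings:
    """Hypothetical container for the RAG settings listed above."""
    documents_path: str
    vector_store_path: str
    embedding_model: str
    chunk_size: int
    chunk_overlap: int
    max_retrieved_docs: int


def load_rag_settings(env: Mapping[str, str] = os.environ) -> RagSettings:
    """Read the RAG variables from `env`, using the documented defaults."""
    settings = RagSettings(
        documents_path=env.get("DOCUMENTS_PATH", "documents"),
        vector_store_path=env.get("VECTOR_STORE_PATH", "vector_store"),
        embedding_model=env.get("EMBEDDING_MODEL", "all-MiniLM-L6-v2"),
        chunk_size=int(env.get("CHUNK_SIZE", "800")),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", "100")),
        max_retrieved_docs=int(env.get("MAX_RETRIEVED_DOCS", "5")),
    )
    # Reject combinations that cannot produce a usable chunking window.
    if settings.chunk_size <= 0:
        raise ValueError("CHUNK_SIZE must be positive")
    if not 0 <= settings.chunk_overlap < settings.chunk_size:
        raise ValueError("CHUNK_OVERLAP must be >= 0 and smaller than CHUNK_SIZE")
    if settings.max_retrieved_docs <= 0:
        raise ValueError("MAX_RETRIEVED_DOCS must be positive")
    return settings
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests without touching the process environment.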

API Endpoints

RAG Management

Initialize RAG System

POST /rag/initialize

Initializes the RAG system by loading and processing all PMFlex documents.

Get RAG Status

GET /rag/status

Returns current RAG system status, statistics, and validation results.

Refresh Documents

POST /rag/refresh

Refreshes the document index by checking for updates and reprocessing changed files.

Search Documents

POST /rag/search?query=<search_query>&max_results=5

Searches RAG documents for specific information.
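
Because the search parameters travel in the query string, they need URL encoding, which matters for German queries with umlauts or special characters. The sketch below builds the request with the standard library only. The helper names are hypothetical, and the response format is not specified in this document, so the sketch stops at constructing the request.

```python
import urllib.request
from urllib.parse import urlencode


def build_search_url(base_url: str, query: str, max_results: int = 5) -> str:
    """Return the /rag/search URL with properly encoded query parameters."""
    params = urlencode({"query": query, "max_results": max_results})
    return f"{base_url.rstrip('/')}/rag/search?{params}"


def build_search_request(base_url: str, query: str, max_results: int = 5) -> urllib.request.Request:
    """Wrap the URL in a POST request, matching the endpoint's documented method."""
    url = build_search_url(base_url, query, max_results)
    return urllib.request.Request(url, method="POST")
```

The resulting request can be sent with `urllib.request.urlopen(request)` against a running instance.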

Enhanced Report Generation

The existing /generate-project-status-report endpoint now automatically includes PMFlex context when available.

Usage Examples

Basic RAG System Setup

Add PMFlex Documents

# Place documents in the appropriate directories
documents/pmflex/templates/
documents/pmflex/handbooks/

Initialize System

curl -X POST http://localhost:8000/rag/initialize

Check Status

```
curl http://localhost:8000/rag/status
```

Enhanced Report Generation

curl -X POST http://localhost:8000/generate-project-status-report \
  -H "Content-Type: application/json" \
  -d '{
    "project": {
      "id": 123,
      "type": "project"
    },
    "openproject": {
      "base_url": "https://openproject.example.com",
      "user_token": "your-api-token"
    }
  }'

The generated report will now include:

PMFlex methodology guidance
Relevant templates and best practices
Compliance requirements
German federal government standards
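
For callers that prefer Python over curl, the same report request can be assembled with the standard library. This is a sketch under the assumption that the endpoint accepts exactly the JSON body shown in the curl example; `build_report_request` is a hypothetical helper, not part of the project.

```python
import json
import urllib.request


def build_report_request(api_base: str, project_id: int,
                         openproject_url: str, user_token: str) -> urllib.request.Request:
    """Build the POST request for /generate-project-status-report."""
    payload = {
        "project": {"id": project_id, "type": "project"},
        "openproject": {"base_url": openproject_url, "user_token": user_token},
    }
    return urllib.request.Request(
        f"{api_base.rstrip('/')}/generate-project-status-report",
        data=json.dumps(payload).encode("utf-8"),  # JSON body as bytes
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending it is then a matter of calling `urllib.request.urlopen(request)`; the token should come from a secret store or environment variable rather than source code.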

Document Processing Workflow

Document Discovery: Scans documents/pmflex/ for supported files
Text Extraction: Extracts text based on file format
Chunking: Splits text into overlapping chunks
Embedding: Generates vector embeddings using sentence-transformers
Indexing: Stores embeddings and metadata in FAISS index
Persistence: Saves index and metadata to disk
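
The discovery step at the start of this workflow can be sketched as a recursive scan that keeps only the supported extensions. This is an illustration, not the code in `document_processor.py`; `discover_documents` and `SUPPORTED_SUFFIXES` are hypothetical names, and the case-insensitive suffix match is an assumption.

```python
from pathlib import Path

# Extensions taken from the "Supported Document Formats" list.
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".pptx", ".txt"}


def discover_documents(root) -> list:
    """Return all supported files below `root`, sorted for stable ordering."""
    root = Path(root)
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES
    )
```

With this filter, `metadata/document_index.json` is skipped automatically because `.json` is not a supported document format.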

Context Enhancement Process

Query Analysis: Analyzes project type and work package data
Template Retrieval: Searches for relevant PMFlex templates
Methodology Context: Retrieves applicable methodology guidance
Governance Context: Finds compliance and governance requirements
Context Combination: Merges contexts for report generation
Report Enhancement: Integrates context into LLM prompt
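
The combination and enhancement steps can be sketched as merging the retrieved snippets into labeled sections and appending them to the prompt. Everything here is an assumption for illustration: the function names, the section titles, and the dictionary keys (`templates`, `methodology`, `governance`) are hypothetical and may differ from `rag_pipeline.py`.

```python
# Hypothetical mapping from context category to the heading used in the prompt.
SECTION_TITLES = {
    "templates": "PMFlex templates",
    "methodology": "PMFlex methodology guidance",
    "governance": "Compliance and governance requirements",
}


def combine_contexts(contexts: dict) -> str:
    """Merge retrieved snippets into one text block, dropping blanks and duplicates."""
    sections = []
    for key, title in SECTION_TITLES.items():
        seen = set()
        unique = []
        for snippet in contexts.get(key, []):
            snippet = snippet.strip()
            if snippet and snippet not in seen:
                seen.add(snippet)
                unique.append(snippet)
        if unique:
            bullet_lines = "\n".join(f"- {s}" for s in unique)
            sections.append(f"{title}:\n{bullet_lines}")
    return "\n\n".join(sections)


def enhance_prompt(base_prompt: str, contexts: dict) -> str:
    """Append the combined context; return the prompt unchanged if none was found."""
    combined = combine_contexts(contexts)
    if not combined:
        return base_prompt
    return f"{base_prompt}\n\nRelevant PMFlex context:\n{combined}"
```

Returning the prompt unchanged when nothing was retrieved mirrors the "when available" behavior described for the report endpoint.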

Performance Considerations

Embedding Model

Default: all-MiniLM-L6-v2 (384 dimensions)
Fast inference with good quality
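Trained mainly on English data; validate retrieval quality on German PMFlex text, or consider a multilingual sentence-transformers model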

Vector Store

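FAISS IndexFlatIP (exact inner-product search); scores equal cosine similarity only when embeddings are L2-normalized before indexing and querying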
In-memory index with disk persistence
Efficient for small to medium document collections
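
An inner-product index returns cosine similarity only for unit-length vectors. The pure-Python stand-in below shows why. It does not use FAISS; it reproduces the arithmetic of an exhaustive inner-product search over normalized vectors, and the function names and `threshold` parameter are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def inner_product(a, b):
    """Plain dot product, the score an inner-product index computes."""
    return sum(x * y for x, y in zip(a, b))


def search(query, vectors, top_k=5, threshold=0.0):
    """Exhaustive search: return (score, index) pairs, best first, above threshold."""
    q = l2_normalize(query)
    scored = [(inner_product(q, l2_normalize(v)), i) for i, v in enumerate(vectors)]
    hits = [hit for hit in scored if hit[0] >= threshold]
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return hits[:top_k]
```

Without normalization, a long vector pointing in a different direction can outscore a short vector pointing the same way as the query. In a real setup, `faiss.normalize_L2` or the `normalize_embeddings=True` option of sentence-transformers' `encode` can provide the normalization.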

Chunking Strategy

800 characters per chunk with 100 character overlap
Preserves context across chunk boundaries
Balances retrieval granularity and context
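
A character-based sliding window with these defaults can be sketched as follows. This is an assumed implementation for illustration; the project's chunker may split on sentence or paragraph boundaries instead, and `chunk_text` is a hypothetical name.

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list:
    """Split text into chunks of up to `chunk_size` characters sharing `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With the defaults the window advances 700 characters per step, so the last 100 characters of each chunk reappear at the start of the next one.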

Monitoring and Maintenance

Health Checks

RAG system status endpoint
Document index validation
Vector store statistics
Embedding model availability

Document Management

Automatic change detection
Incremental updates
Document metadata tracking
Error handling and logging
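
One common way to implement change detection is to compare content hashes against the values stored in the metadata index. The sketch below assumes that approach and a simple path-to-digest mapping; the actual format of `document_index.json` and the detection method used by the Document Manager are not specified here, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path) -> str:
    """Return the SHA-256 hex digest of a file, read in blocks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def detect_changes(paths, index: dict):
    """Compare current files with a stored {path: digest} index.

    Returns the change report and the updated index to persist.
    """
    current = {str(Path(p)): file_digest(p) for p in paths}
    changes = {
        "added": sorted(p for p in current if p not in index),
        "modified": sorted(p for p in current if p in index and index[p] != current[p]),
        "removed": sorted(p for p in index if p not in current),
    }
    return changes, current
```

Only files reported as added or modified need re-chunking and re-embedding, which is what makes incremental updates cheaper than a full rebuild.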

Performance Metrics

Document processing times
Embedding generation speed
Search response times
Context retrieval accuracy

Troubleshooting

Common Issues

No Documents Found
- Check document directory structure
- Verify file formats are supported
- Review file permissions
Embedding Model Loading
- Ensure internet connectivity for model download
- Check available disk space
- Verify sentence-transformers installation
Vector Store Errors
- Check FAISS installation
- Verify write permissions for vector_store directory
- Review memory availability
Context Quality Issues
- Adjust chunk size and overlap settings
- Review document content quality
- Tune similarity thresholds

Logging

RAG system activities are logged with appropriate levels:

INFO: Normal operations and status
WARNING: Non-critical issues
ERROR: System errors requiring attention
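
A minimal logging setup matching these levels might look like the sketch below. The logger name `rag` and the message format are assumptions; the project may configure logging differently.

```python
import logging


def configure_rag_logging(level: int = logging.INFO) -> logging.Logger:
    """Return a logger named 'rag' (hypothetical name) with a single stream handler."""
    logger = logging.getLogger("rag")
    logger.setLevel(level)
    # Attach the handler only once so repeated calls do not duplicate output.
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

Typical calls would then be `logger.info(...)` for normal operations, `logger.warning(...)` for non-critical issues such as an unreadable file, and `logger.error(...)` for failures that need attention.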

Future Enhancements

Planned Features

Multi-language support for German/English documents
Advanced retrieval strategies (hybrid search)
Document versioning and change tracking
Real-time document updates
Custom embedding fine-tuning

Scalability Improvements

Distributed vector storage
Async document processing
Caching layer for frequent queries
Batch processing capabilities

Dependencies

Required Packages

PyMuPDF>=1.23.0          # PDF processing
python-docx>=0.8.11      # Word document processing
python-pptx>=0.6.21      # PowerPoint processing
faiss-cpu>=1.7.4         # Vector similarity search
sentence-transformers>=2.2.2  # Text embeddings
nltk>=3.8.1              # Text processing utilities

System Requirements

Python 3.9+
4GB+ RAM (for embedding model)
1GB+ disk space (for vector store)
Internet connectivity (initial model download)

Security Considerations

Document access controls
Sensitive information handling
API endpoint security
Data encryption at rest
Audit logging capabilities

Contributing

When adding new documents or modifying the RAG system:

Follow the established directory structure
Test document processing with sample files
Validate embedding quality and retrieval accuracy
Update documentation and examples
Consider performance impact of changes

For questions or support, contact the development team or refer to the main project documentation.
