This document describes the Retrieval-Augmented Generation (RAG) system implementation that enhances project status reports with PMFlex methodology context and templates.
The RAG system integrates PMFlex (German Federal Government Project Management) documents into the report generation process, providing context-aware, methodology-compliant project status reports.
- Document Processor (src/services/document_processor.py)
  - Processes PDF, DOCX, PPTX, and TXT files
  - Extracts text with metadata preservation
  - Chunks documents with configurable overlap
  - Supports multiple document formats
- Vector Store (src/services/vector_store.py)
  - FAISS-based vector storage for embeddings
  - Sentence-transformers for text embeddings
  - Similarity search with configurable thresholds
  - Persistent storage with metadata
- Document Manager (src/services/document_manager.py)
  - Manages document lifecycle and indexing
  - Handles document updates and refresh
  - Provides statistics and validation
  - Maintains document metadata index
- RAG Pipeline (src/pipelines/rag_pipeline.py)
  - High-level RAG orchestration
  - Context enhancement for reports
  - Template and methodology retrieval
  - Integration with generation pipeline
- Generation Pipeline: Enhanced with RAG context retrieval
- Report Templates: PMFlex-aware templates with context integration
- API Endpoints: RAG management and status endpoints
- Startup Process: Automatic RAG system initialization
documents/
├── pmflex/
│ ├── templates/
│ │ ├── pmflex_status_report_template.txt
│ │ └── [other templates]
│ ├── handbooks/
│ │ ├── pmflex_methodology_guide.txt
│ │ └── [other handbooks]
│ └── metadata/
│ └── document_index.json
- PDF: Full text extraction with page metadata
- DOCX: Paragraph-based text extraction
- PPTX: Slide-based text extraction
- TXT: Plain text processing
# RAG Configuration
DOCUMENTS_PATH=documents
VECTOR_STORE_PATH=vector_store
EMBEDDING_MODEL=all-MiniLM-L6-v2
CHUNK_SIZE=800
CHUNK_OVERLAP=100
MAX_RETRIEVED_DOCS=5

import os

# RAG Configuration
DOCUMENTS_PATH: str = os.getenv("DOCUMENTS_PATH", "documents")
VECTOR_STORE_PATH: str = os.getenv("VECTOR_STORE_PATH", "vector_store")
EMBEDDING_MODEL: str = os.getenv("EMBEDDING_MODEL", "all-MiniLM-L6-v2")
CHUNK_SIZE: int = int(os.getenv("CHUNK_SIZE", "800"))
CHUNK_OVERLAP: int = int(os.getenv("CHUNK_OVERLAP", "100"))
MAX_RETRIEVED_DOCS: int = int(os.getenv("MAX_RETRIEVED_DOCS", "5"))

- POST /rag/initialize: Initializes the RAG system by loading and processing all PMFlex documents.
- GET /rag/status: Returns the current RAG system status, statistics, and validation results.
- POST /rag/refresh: Refreshes the document index by checking for updates and reprocessing changed files.
- POST /rag/search?query=<search_query>&max_results=5: Searches the indexed RAG documents for specific information.
The existing /generate-project-status-report endpoint now automatically includes PMFlex context when available.
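As a rough illustration of calling the search endpoint from Python, the sketch below uses only the standard library. The base URL is taken from the curl examples in this document; the helper names `build_search_url` and `search_rag` are hypothetical, and the response body is assumed to be JSON, which this document does not specify.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: same host as the curl examples


def build_search_url(query: str, max_results: int = 5, base_url: str = BASE_URL) -> str:
    """Return the /rag/search URL with a percent-encoded query string."""
    params = urllib.parse.urlencode({"query": query, "max_results": max_results})
    return f"{base_url}/rag/search?{params}"


def search_rag(query: str, max_results: int = 5) -> object:
    """POST to /rag/search and decode the body (assumed to be JSON)."""
    request = urllib.request.Request(build_search_url(query, max_results), method="POST")
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

Encoding the query with `urlencode` avoids broken requests when the search text contains spaces, umlauts, or `&`.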
- Add PMFlex Documents
  # Place documents in the appropriate directories
  documents/pmflex/templates/
  documents/pmflex/handbooks/
- Initialize System
  curl -X POST http://localhost:8000/rag/initialize
- Check Status
  curl http://localhost:8000/rag/status
curl -X POST http://localhost:8000/generate-project-status-report \
-H "Content-Type: application/json" \
-d '{
"project": {
"id": 123,
"type": "project"
},
"openproject": {
"base_url": "https://openproject.example.com",
"user_token": "your-api-token"
}
}'
The generated report will now include:
- PMFlex methodology guidance
- Relevant templates and best practices
- Compliance requirements
- German federal government standards
- Document Discovery: Scans documents/pmflex/ for supported files
- Text Extraction: Extracts text based on file format
- Chunking: Splits text into overlapping chunks
- Embedding: Generates vector embeddings using sentence-transformers
- Indexing: Stores embeddings and metadata in FAISS index
- Persistence: Saves index and metadata to disk
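The discovery step at the start of this flow could look roughly like the sketch below. It is not the code in `document_processor.py`; the function name `discover_documents` and the constant `SUPPORTED_EXTENSIONS` are hypothetical, and the extension list simply mirrors the four formats named in this document.

```python
from pathlib import Path

# Assumption: the four formats listed in this document, matched case-insensitively.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".pptx", ".txt"}


def discover_documents(root: str) -> list[Path]:
    """Recursively list supported files under root, sorted for stable ordering."""
    root_path = Path(root)
    if not root_path.is_dir():
        return []
    return sorted(
        path
        for path in root_path.rglob("*")
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

With this filter, files such as `metadata/document_index.json` are skipped because `.json` is not a supported document format.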
- Query Analysis: Analyzes project type and work package data
- Template Retrieval: Searches for relevant PMFlex templates
- Methodology Context: Retrieves applicable methodology guidance
- Governance Context: Finds compliance and governance requirements
- Context Combination: Merges contexts for report generation
- Report Enhancement: Integrates context into LLM prompt
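One possible shape for the last two steps (context combination and report enhancement) is sketched below. The function names, the section labels, and the prompt layout are all assumptions made for illustration; the actual pipeline in `rag_pipeline.py` may merge and format contexts differently.

```python
def combine_contexts(sections: dict[str, list[str]], max_docs: int = 5) -> str:
    """Merge retrieved chunks into one labelled context block.

    Duplicate chunks are dropped and at most max_docs chunks are kept overall,
    in the order the sections are given.
    """
    seen: set[str] = set()
    parts: list[str] = []
    kept = 0
    for label, chunks in sections.items():
        unique = []
        for chunk in chunks:
            text = chunk.strip()
            if not text or text in seen or kept >= max_docs:
                continue
            seen.add(text)
            unique.append(text)
            kept += 1
        if unique:
            parts.append(f"[{label}]\n" + "\n".join(unique))
    return "\n\n".join(parts)


def enhance_prompt(base_prompt: str, context: str) -> str:
    """Prepend the PMFlex context to the LLM prompt; leave it unchanged if empty."""
    if not context:
        return base_prompt
    return f"PMFlex context:\n{context}\n\n{base_prompt}"
```

Returning the base prompt unchanged when no context was retrieved matches the behavior described above, where PMFlex context is included only when available.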
- Default: all-MiniLM-L6-v2 (384 dimensions)
- Fast inference with good quality
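- Trained primarily on English text; validate retrieval quality on German documents, and set EMBEDDING_MODEL to a multilingual model if results are weak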
- FAISS IndexFlatIP (exact inner-product search), which yields cosine similarity when embeddings are L2-normalized
- In-memory index with disk persistence
- Efficient for small to medium document collections
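To show why inner-product search gives cosine similarity only for normalized vectors, here is a pure-Python stand-in for what a flat inner-product index computes. It does not use FAISS; the names `normalize` and `search` and the `threshold` default are hypothetical.

```python
import math


def normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit L2 length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return vector if norm == 0 else [x / norm for x in vector]


def search(query, vectors, top_k=5, threshold=0.0):
    """Return (index, score) pairs ordered by descending cosine similarity.

    Both sides are normalized first, so the inner product equals the cosine.
    Results scoring below threshold are dropped.
    """
    q = normalize(query)
    scored = []
    for i, vec in enumerate(vectors):
        score = sum(a * b for a, b in zip(q, normalize(vec)))
        if score >= threshold:
            scored.append((i, score))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]
```

Without the normalization step, longer vectors would score higher regardless of direction, which is why embeddings are usually normalized before being added to an inner-product index.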
- 800 characters per chunk with 100 character overlap
- Preserves context across chunk boundaries
- Balances retrieval granularity and context
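A minimal character-based chunker with the defaults above might look like the following. This is a sketch under the assumption that chunks are cut purely by character count; the real processor may split on sentence or paragraph boundaries instead, and the name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into character chunks; each chunk starts with the last
    `overlap` characters of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # distance between consecutive chunk starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current chunk already reaches the end of the text
    return chunks
```

With the default settings, each new chunk begins 700 characters after the previous one, so a 1,600-character document yields three chunks.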
- RAG system status endpoint
- Document index validation
- Vector store statistics
- Embedding model availability
- Automatic change detection
- Incremental updates
- Document metadata tracking
- Error handling and logging
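Change detection of this kind is often implemented by comparing content hashes against the stored metadata index. The sketch below assumes an index that maps file paths to SHA-256 digests; the actual layout of `document_index.json` is not described here, and the names `file_digest` and `detect_changes` are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents, read in blocks."""
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            hasher.update(block)
    return hasher.hexdigest()


def detect_changes(paths: list[Path], index: dict[str, str]) -> dict[str, list[str]]:
    """Compare files on disk with a {path: digest} index (assumed format)."""
    current = {str(p): file_digest(p) for p in paths}
    return {
        "added": sorted(p for p in current if p not in index),
        "modified": sorted(p for p in current if p in index and index[p] != current[p]),
        "removed": sorted(p for p in index if p not in current),
    }
```

Only files reported as added or modified need to be re-chunked and re-embedded, which is what makes incremental updates cheaper than a full rebuild.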
- Document processing times
- Embedding generation speed
- Search response times
- Context retrieval accuracy
- No Documents Found
  - Check document directory structure
  - Verify file formats are supported
  - Review file permissions
- Embedding Model Loading
  - Ensure internet connectivity for model download
  - Check available disk space
  - Verify sentence-transformers installation
- Vector Store Errors
  - Check FAISS installation
  - Verify write permissions for vector_store directory
  - Review memory availability
- Context Quality Issues
  - Adjust chunk size and overlap settings
  - Review document content quality
  - Tune similarity thresholds
RAG system activities are logged with appropriate levels:
- INFO: Normal operations and status
- WARNING: Non-critical issues
- ERROR: System errors requiring attention
- Multi-language support for German/English documents
- Advanced retrieval strategies (hybrid search)
- Document versioning and change tracking
- Real-time document updates
- Custom embedding fine-tuning
- Distributed vector storage
- Async document processing
- Caching layer for frequent queries
- Batch processing capabilities
PyMuPDF>=1.23.0 # PDF processing
python-docx>=0.8.11 # Word document processing
python-pptx>=0.6.21 # PowerPoint processing
faiss-cpu>=1.7.4 # Vector similarity search
sentence-transformers>=2.2.2 # Text embeddings
nltk>=3.8.1 # Text processing utilities
- Python 3.9+
- 4GB+ RAM (for embedding model)
- 1GB+ disk space (for vector store)
- Internet connectivity (initial model download)
- Document access controls
- Sensitive information handling
- API endpoint security
- Data encryption at rest
- Audit logging capabilities
When adding new documents or modifying the RAG system:
- Follow the established directory structure
- Test document processing with sample files
- Validate embedding quality and retrieval accuracy
- Update documentation and examples
- Consider performance impact of changes
For questions or support, contact the development team or refer to the main project documentation.