Nyay Mitra is a full-stack, production-grade legal AI platform built for the Indian legal system. It uses a LangGraph-based agentic execution pipeline that replaces traditional single-LLM chatbot patterns with a controlled, parallel, role-aware multi-agent reasoning system.
Unlike conventional chatbots, Nyay Mitra:
- Routes queries through a deterministic + semantic intent classification system
- Executes multiple agents in parallel (RAG, Chat, Document Analysis)
- Aggregates and synthesizes results with context-aware memory
- Supports lawyer workspace mode with persistent matter context
- Handles multilingual Indian legal queries with translation pipelines
- Normal Mode — Direct lightweight chat for quick legal queries
- Agent Mode — Full LangGraph execution with RAG, Document Analysis, and Chat
- Document Upload & Chat — Upload PDFs/text files and ask questions
- Multilingual Support — Query in Hindi, Bengali, Urdu, Tamil, Telugu, and more
- Contextual Memory — Rolling conversation summary for follow-on queries
- Matter-Based Workspace — Persistent legal memory per matter (case)
- Workspace Memory Injection — Party summaries, legal issues, chronology injected into every AI response
- Multi-Document RAG — Vector-indexed matter documents with semantic retrieval
- Document Analysis — Structured legal analysis with clause-by-clause breakdown
- Contract Review — Professional (lawyer) and client-friendly review modes
- Client Communications — Auto-generate WhatsApp messages, emails, and voice note scripts
- Team Collaboration — Multiple lawyers on a matter with role-based access
- Adaptive RAG — Direct, map-reduce, and section-based retrieval strategies based on document size
- Parallel Execution — LangGraph fan-out with asyncio.gather for concurrent agent execution
- Intelligent Synthesis — Single LLM call produces both final response and updated memory summary
- Document Deduplication — SHA-256 content hashing prevents duplicate ingestion
- HuggingFace Hub Storage — Documents stored on HF Hub with background reindexing on restart
- Legal Document Generation — Jinja2 template-based legal draft generation
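The SHA-256 deduplication mentioned above can be sketched in a few lines. This is an illustrative stand-in, not the project's actual ingestion code: hash the raw document bytes and skip ingestion when the digest has been seen before.

```python
import hashlib

# Illustrative sketch of SHA-256 content deduplication: hash the raw
# document bytes and skip ingestion when the digest was seen before.
_seen_hashes: set[str] = set()

def should_ingest(content: bytes) -> bool:
    """Return True only the first time this exact content is seen."""
    digest = hashlib.sha256(content).hexdigest()
    if digest in _seen_hashes:
        return False
    _seen_hashes.add(digest)
    return True
```

Because the hash is computed over content rather than filename, re-uploading the same PDF under a different name (or after a restart, once hashes are restored) is still detected as a duplicate.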
┌─────────────────────────────────────────────────────────────────┐
│ Frontend (React) │
└─────────────────────────┬───────────────────────────────────────┘
│
┌─────────────────────────▼───────────────────────────────────────┐
│ Node.js BFF (Express + Prisma) │
│ │
│ ┌─────────────┐ ┌──────────────┐ ┌────────────────────────┐ │
│ │ Citizen API │ │ Lawyer API │ │ Workspace Memory API │ │
│ │ /citizen │ │ /lawyer │ │ Matter · Documents │ │
│ └─────────────┘ └──────────────┘ └────────────────────────┘ │
│ │
│ PostgreSQL (via Prisma) — Conversations, Matters, Memory │
└─────────────────────────┬───────────────────────────────────────┘
│ HTTP
┌─────────────────────────▼───────────────────────────────────────┐
│ Python FastAPI Backend (v3) │
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ LangGraph Orchestrator │ │
│ │ │ │
│ │workspace_context → language_detection → translation_query│ │
│ │ ↓ ↓ │ │
│ │ intent_router <- document_ingestion │ │
│ │ ↓ │ │
│ │ ┌──────────────┬──────────┐ │ │
│ │ ▼ ▼ ▼ │ │
│ │ chat_node rag_node doc_analysis │ │
│ │ └──────────────┴──────────┘ │ │
│ │ ↓ │ │
│ │ aggregator │ │
│ │ ↓ │ │
│ │ synthesizer │ │
│ │ ↓ │ │
│ │ translation_response │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │
│ │ RAG Service │ │ Chat Service │ │ Translation Service │ │
│ │ ChromaDB │ │ Gemini LLM │ │ Gemini (Argos │ │
│ │ Gemini Emb │ │ Rolling Mem │ │ fallback planned) │ │
│ └──────────────┘ └──────────────┘ └──────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│
┌───────────▼────────────┐
│ HuggingFace Hub │
│ Document Storage │
│ (PDF/TXT uploads) │
└────────────────────────┘
[START]
│
▼
workspace_context_node ← Injects matter/workspace context for lawyer mode
│
▼
language_detection_node ← Detects language, sets translation flags
│ Early exit if input_language = "en"
├──[needs translation]──► translation_query_node
│
▼
document_ingestion_node ← SHA-256 dedup, chunking, Gemini embeddings → ChromaDB
│ Skips if no storage_url
▼
intent_router_node ← 3-layer routing:
│ L1: Deterministic (has_document, is_summary, is_web)
│ L2: LLM semantic classifier (7 intent classes)
│ L3: Safety net (chat fallback)
│
├──[use_chat]────────────► chat_node
├──[use_rag]─────────────► rag_node ← Parallel fan-out
├──[use_document_analysis]► doc_analysis_node ← via asyncio.gather
└──[use_web]─────────────► web_node (planned)
│
▼
aggregator_node ← Collects outputs from all parallel nodes
│
▼
synthesizer_node ← Context-aware LLM synthesis
│ previous_summary + all outputs → final_response + updated_summary
│ Single LLM call for both outputs
├──[chat only]──► passthrough (no LLM call)
│
▼
translation_response_node ← Translates response if output_language ≠ en
│
▼
[END] → final_response + updated_summary → BFF
| Layer | Trigger | Method | Latency |
|---|---|---|---|
| L1 Deterministic | has_document, is_summary, is_web | State-based rules | ~0ms |
| L2 LLM Semantic | General queries, no document | Gemini classifier (7 intents) | ~1-2s |
| L3 Safety Net | LLM timeout/failure | Fallback to chat | ~0ms |
7 Intent Classes:
legal_explanation → chat only
document_question → rag only
document_summary → document_analysis only
comparison → rag + chat
legal_research → web + chat
greeting → chat only
followup → chat only
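The intent-to-agent mapping above can be expressed as a small routing table. This is a hypothetical sketch; the real router sets flags such as `use_chat`/`use_rag` on the graph state rather than returning a set.

```python
# Sketch of the 7-class intent routing table; unknown intents fall
# through to the chat agent (the L3 safety net).
INTENT_ROUTES = {
    "legal_explanation": {"chat"},
    "document_question": {"rag"},
    "document_summary": {"document_analysis"},
    "comparison": {"rag", "chat"},
    "legal_research": {"web", "chat"},
    "greeting": {"chat"},
    "followup": {"chat"},
}

def route(intent: str) -> set:
    # Safety net: any unrecognized or failed classification → chat only
    return INTENT_ROUTES.get(intent, {"chat"})
```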
Document chunks:
≤ 6 chunks → Direct analysis (single LLM call)
≤ 20 chunks → Summarization analysis (parallel batch LLM calls)
> 20 chunks → Section-based analysis (concurrent section LLM calls)
Query retrieval:
≤ 4 chunks → Direct RAG (single LLM call)
> 4 chunks → Map-Reduce RAG (parallel batches → reduce)
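The threshold logic above reduces to two small selector functions. The code is illustrative, mirroring the documented chunk counts rather than quoting the actual implementation:

```python
# Sketch of the adaptive strategy selection based on chunk counts.
def analysis_strategy(num_chunks: int) -> str:
    if num_chunks <= 6:
        return "direct"          # single LLM call
    if num_chunks <= 20:
        return "summarization"   # parallel batch LLM calls
    return "section"             # concurrent section LLM calls

def rag_strategy(num_chunks: int) -> str:
    return "direct" if num_chunks <= 4 else "map_reduce"
```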
Query → Semantic Embedding → ChromaDB Vector Search
→ Top-K Retrieval (semantic / full mode)
→ Context Construction (token budget: 8000)
→ Direct / Map-Reduce LLM Execution
→ Answer + Sources
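The map-reduce path above can be sketched as: split retrieved chunks into batches, "map" each batch to a partial answer concurrently, then "reduce" the partials into one answer. `llm()` here is a stand-in for the real Gemini call, and the batch size mirrors the documented `RAG_BATCH_SIZE`.

```python
import asyncio

RAG_BATCH_SIZE = 4  # chunks per map-reduce batch (from the documented config)

async def llm(prompt: str) -> str:
    # Stand-in for the real Gemini call
    return f"partial({prompt[:20]})"

async def map_reduce(query: str, chunks: list[str]) -> str:
    batches = [chunks[i:i + RAG_BATCH_SIZE]
               for i in range(0, len(chunks), RAG_BATCH_SIZE)]
    # Map: answer against each batch concurrently
    partials = await asyncio.gather(
        *(llm(f"{query}\n" + "\n".join(b)) for b in batches)
    )
    # Reduce: combine partial answers in one final call
    return await llm("reduce:\n" + "\n".join(partials))

answer = asyncio.run(map_reduce("define bail", [f"chunk{i}" for i in range(10)]))
```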
| Scenario | Before Optimization | After Optimization |
|---|---|---|
| Document Analysis (13 chunks) | ~46s sequential | ~5-8s parallel |
| RAG Query (large doc) | ~36s sequential | ~6-8s map-reduce |
| English query routing | ~5s (LLM detection) | ~0ms (early exit) |
| First doc upload | — | ~5-10s |
| Follow-on doc query | — | ~3-5s |
- Wraps `chat_service.get_chatbot_response()`
- Memory: `previous_summary` (rolling, 10-sentence LLM summary) + `history[-6:]` (exact window)
- Two concurrent LLM calls: chat response + summary update (`asyncio.gather`)
- Workspace context injection for lawyer mode
- Selects retrieval strategy based on document size and query type
- Meta-query detection (avoids vector search for "what document is this?" type queries)
- Two prompt variants: semantic precision and comprehensive retrieval
- Three analysis tiers (direct, summarization, section-based)
- Analysis cache: keyed by `document_id:mode:top_k`, max 100 entries FIFO
- All batch/section LLM calls parallelized
- Wraps `translate_service.translate_text()`
- Gemini-based translation with legal terminology preservation
- Prefix stripping for clean output
- `langdetect` library with confidence scoring
- User-provided language early exit (skips agent entirely)
- Falls back to English for unsupported languages
previous_summary (rolling, 10 sentences, LLM-generated)
+ history[-6:] (exact last 6 messages, 800 char/msg cap)
→ chat_service.main_chat_prompt
→ response + updated_summary (parallel)
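The memory block above might be assembled as follows; the constants mirror the documented config, but the prompt layout and function name are illustrative assumptions:

```python
HISTORY_WINDOW = 6       # exact recent-message window
MESSAGE_CHAR_CAP = 800   # per-message character cap

def build_memory_block(previous_summary: str, history: list) -> str:
    # Keep the last 6 messages, truncating each to 800 characters
    recent = [
        f"{m['role']}: {m['content'][:MESSAGE_CHAR_CAP]}"
        for m in history[-HISTORY_WINDOW:]
    ]
    return (
        f"Summary so far:\n{previous_summary}\n\n"
        "Recent messages:\n" + "\n".join(recent)
    )
```

The rolling summary bounds total prompt size: no matter how long the conversation runs, the model sees at most 10 summary sentences plus six capped messages.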
previous_summary (conversation-level rolling memory)
+ workspace_context (matter-level persistent memory)
- partySummary, legalIssues, factChronology
- lawyerNotes, aiSummary, keyDates, documentIndex
+ history[-10:] (larger window for lawyer workflows)
→ injected into chat prompt + synthesizer prompt
For all non-chat paths (RAG, Analysis, multi-source):
previous_summary + query + all node outputs
→ Single LLM call → final_response + updated_summary
No extra LLM call — both outputs from same inference
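The single-call synthesis above can be sketched by asking the LLM for one JSON object carrying both outputs and parsing it. `fake_llm()` is a stand-in for the real Gemini call, and the prompt wording is a hypothetical example:

```python
import json

def fake_llm(prompt: str) -> str:
    # Stand-in: a real call would send `prompt` to Gemini
    return json.dumps({"final_response": "Bail is...",
                       "updated_summary": "Asked about bail."})

def synthesize(previous_summary: str, query: str, outputs: dict):
    prompt = (
        f"Summary: {previous_summary}\nQuery: {query}\n"
        f"Agent outputs: {json.dumps(outputs)}\n"
        'Reply as JSON: {"final_response": ..., "updated_summary": ...}'
    )
    data = json.loads(fake_llm(prompt))
    # Both the user-facing answer and the refreshed memory come from
    # the same inference — no second LLM round-trip.
    return data["final_response"], data["updated_summary"]
```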
python_backend/
├── main.py # FastAPI app, lifespan, routers
├── orchestrator/
│ ├── graph.py # LangGraph StateGraph definition
│ ├── state.py # GraphState with reducers
│ └── service_registry.py # Agent singletons
├── nodes/
│ ├── workspace_context_node.py # Lawyer workspace injection
│ ├── language_detection_node.py # Language detection + early exit
│ ├── translation_query_node.py # Query translation
│ ├── document_ingestion_node.py # Upload, chunk, embed, store
│ ├── intent_router_node.py # 3-layer intent classification
│ ├── chat_node.py # General chat execution
│ ├── rag_node.py # RAG execution
│ ├── document_analysis_node.py # Document analysis execution
│ ├── aggregator_node.py # Parallel output collection
│ ├── synthesizer_node.py # Context-aware synthesis
│ └── translation_response_node.py # Response translation
├── agents/
│ ├── general_chat_agent.py # Chat agent wrapper
│ ├── rag_agent.py # RAG agent (no inner LangGraph)
│ ├── document_analysis_agent.py # Analysis agent (no inner LangGraph)
│ ├── translation_agent.py # Translation agent
│ └── language_detection_agent.py # Language detection agent
├── services/
│ ├── rag_service.py # ChromaDB, embeddings, retrieval, analysis
│ ├── chat_service.py # Chat LLM, rolling memory, summary
│ ├── docgen_service.py # Document generation service
│ ├── translate_service.py # Gemini translation
│ └── language_detection_service.py # langdetect wrapper
├── routers/
│ ├── agent_router.py # /agent/chat, /agent/upload-and-chat
│ ├── chatbot_router.py # Legacy direct chat
│ ├── translate_router.py # Translation API
│ ├── language_detection_router.py # Language detection API
│ └── docgen_router.py # Document generation API
└── utils/
├── llm_service.py # llm_generate with fence stripping + JSON extraction
├── graph_utils.py # is_executed() guard
├── logger.py # Structured logging
├── config.py # Settings from env
└── constants.py # Node name constants
node_backend/
├── src/
│ ├── module/
│ │ ├── lawyer/
│ │ │ ├── chat/ # Lawyer chat service + controller
│ │ │ ├── comms/ # Client communication generation
│ │ │ ├── contract/ # Contract review
│ │ │ ├── document/ # Matter document management
│ │ │ ├── matter/ # Matter CRUD
│ │ │ └── workspace/ # WorkspaceMemory service
│ │ ├── citizen/ # Citizen chat + agent mode
│ │ └── auth/ # JWT auth
│ ├── services/
│ │ └── python-backend.service.ts # Python API client
│ └── config/
│ └── database.ts # Prisma client
└── prisma/
└── schema.prisma # Matter, WorkspaceMemory, LawyerConversation...
Full LangGraph agent chat for text queries.
{
"query": "what is anticipatory bail under CrPC",
"session_id": "uuid",
"document_id": "chromadb-uuid",
"history": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}],
"previous_summary": "Prior conversation summary...",
"conversation_id": "conv-uuid",
"conversation_type": "matter_workspace | standalone | client_comms_generation",
"user_role": "LAWYER | CITIZEN",
"matter_id": "matter-uuid",
"workspace_memory": {
"partySummary": "...",
"factChronology": "...",
"legalIssues": "...",
"lawyerNotes": "...",
"aiSummary": "...",
"keyDates": [{"label": "Next hearing", "date": "2024-08-01"}],
"documentIndex": [{"id": "...", "title": "...", "vectorDocId": "..."}]
},
"input_language": "en",
"output_language": "hi"
}
Response:
{
"response": "...",
"session_id": "uuid",
"updated_summary": "...",
"agents_used": ["intent_router", "rag_agent", "synthesizer"],
"execution_trace": [...],
"agent_outputs": {...},
"language_info": {"detected_language": "en", "translated_query": "..."}
}
Upload a document and ask a question in one request (multipart/form-data).
| Field | Type | Required |
|---|---|---|
| `file` | binary | ✅ |
| `query` | string | ✅ |
| `session_id` | string | — |
| `previous_summary` | string | — |
| `input_language` | string | — |
| `output_language` | string | — |
| `user_role` | string | — |
Response adds document_id and storage_url to the standard agent response.
Translate text using Gemini with legal terminology preservation.
{
"text": "...",
"source_lang": "en",
"target_lang": "hi"
}
System health check — RAG service, translation service, language detection service.
The Python backend is designed for HuggingFace Spaces deployment:
# In your HF Space app.py / main.py
# ChromaDB uses /tmp/chroma_db (ephemeral but restored on restart via _reindex_from_hub)
# Background reindexing runs on startup — server is immediately available
On startup the system:
- Initializes all services (translation, language detection, RAG)
- Starts background re-indexing from HF Hub (non-blocking)
- Restores all documents from HF Hub storage within 30-60s
- `/tmp/chroma_db` — ephemeral, restored on restart via `_reindex_from_hub()`
- Documents uploaded to HF Hub dataset survive restarts
- Translation service is Gemini-based — no package downloads needed
- Language detection uses `langdetect` (lightweight, no model downloads)
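The non-blocking startup described above can be sketched with a background task: reindexing is scheduled but not awaited, so the server can answer requests immediately while documents are restored. `reindex_from_hub()` here is a stand-in for the real restore routine:

```python
import asyncio

async def reindex_from_hub(restored: list) -> None:
    # Stand-in for downloading documents from HF Hub and re-embedding
    await asyncio.sleep(0)
    restored.append("doc-1")

async def startup() -> list:
    restored: list = []
    # Schedule reindexing without awaiting it — startup is not blocked
    asyncio.create_task(reindex_from_hub(restored))
    # The server is available from this point; restoration continues
    # in the background (30-60s in practice for real documents)
    await asyncio.sleep(0.01)  # let the background task run in this demo
    return restored

docs = asyncio.run(startup())
```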
- JWT authentication for all lawyer endpoints
- `authenticateLawyer` middleware on all lawyer routes
- Matter ownership verified on every request
- Document access scoped to matter membership
- CORS configured (restrict `allow_origins` in production)
- API keys via environment variables only
| Decision | Rationale |
|---|---|
| LangGraph over custom orchestrator | Reliable parallel execution, built-in state management, conditional routing |
| Gemini 2.5 Flash over GPT-4 | Cost efficiency, strong multilingual Indian language support |
| ChromaDB over Pinecone | Self-hosted, no per-query cost, sufficient for current scale |
| Rolling summary over full history | Bounded token cost regardless of conversation length |
| Synthesizer owns `updated_summary` | Single source of truth — no inconsistency across agents |
| SHA-256 deduplication | Prevents re-ingestion of same document across uploads/restarts |
| Background reindexing | Server available immediately — documents restored async |
| Gemini translation over Argos | Quality for Indian legal content — no model downloads, correct terminology |
CHUNK_SIZE = 800
CHUNK_OVERLAP = 150
MAX_CHUNKS_PER_QUERY = 16
MAX_CONTEXT_TOKENS = 8000
MAX_ANALYSIS_TOKENS = 20000
RAG_DIRECT_THRESHOLD = 4 # chunks — below this: single LLM call
RAG_BATCH_SIZE = 4 # chunks per map-reduce batch
_llm_semaphore = Semaphore(5) # max concurrent Gemini calls
_analysis_cache_max = 100    # FIFO eviction
HISTORY_WINDOW = 6           # recent messages (exact)
SUMMARY_MAX_SENTENCES = 10 # rolling summary cap
MESSAGE_CHAR_CAP = 800 # per message in recent_block
TEMPERATURE_CHAT = 0.5 # balanced for legal accuracy
TEMPERATURE_SUMMARY = 0.0    # deterministic summaries
LLM_CLASSIFY_TIMEOUT = 10.0  # seconds before fallback
GET /docs → Swagger UI
POST /api/v3/agent/chat
curl http://localhost:8080/health
curl -X POST http://localhost:8080/api/v3/translate \
  -H "Content-Type: application/json" \
  -d '{"text": "What is bail in Indian law", "source_lang": "en", "target_lang": "hi"}'
- Web search agent integration (Serper/Tavily)
- Workspace context injection for lawyer mode (in progress)
- Hinglish (Roman script Hindi) detection and handling
- Multi-document RAG (query across multiple documents simultaneously)
- Persistent ChromaDB storage (upgrade to paid HF Spaces)
- Query rewriting for ambiguous pronoun resolution
- Streaming responses
- Rate limiting and request queuing
- Gemini paid tier upgrade (removes 20 req/day limit)
Nyay Mitra is an AI-powered tool for legal information purposes only. It does not constitute legal advice. Always consult a qualified legal professional for advice specific to your situation.
Tejasvi Aryan — AI/ML Architecture, LangGraph Pipeline, RAG System, Python Backend
Anubrata Guin — Node.js BFF, Prisma Schema, Lawyer Dashboard, Workspace System
⭐ Star this repository if you find it useful
Made with ❤️ for Indian legal accessibility