ML & Deep Learning Engineer with production expertise in GPU optimization, LLM inference, and cloud-native AI solutions. Currently shipping LLM-based clinical automation at Qure.ai while advancing full-stack RAG platforms at UT Arlington.
Key Impact: Achieved 12.3× throughput improvement and 4× memory reduction in LLM inference pipelines. Published research on LLM-based 6G path planning with real-world network optimization results. Proficient across the full ML stack: from CUDA kernel optimization to multi-agent system orchestration.
📍 Arlington, TX | Mar 2026 – Present
- Clinical Protocol Automation: Leading LLM configuration for hospital clients (Mount Sinai, Medstar) to automate clinical workflows using proprietary clinical knowledge
- Healthcare Interoperability: Building EPIC/FHIR integrations to enable real-time protocol recommendations directly in hospital systems
- Infrastructure Redesign: Architecting pluggable executor framework for clinical pipeline orchestration—Docker-first, API-driven design with portable artifact store across environments
Tech Stack: Python, FastAPI, Docker, Kubernetes, FHIR, Healthcare APIs
📍 Arlington, TX | Jun 2025 – Present
- Full-Stack LLM/RAG Platform: Building enterprise-grade retrieval-augmented generation system for knowledge workers
- GPU Infrastructure: Leveraging 4× NVIDIA A30 cluster (96GB total VRAM) for multi-GPU DDP training and inference optimization
- Research & Development: Experimenting with advanced RAG patterns, prompt optimization, and efficient fine-tuning techniques
Tech Stack: PyTorch, CUDA, vLLM, Vector Databases, LangChain
📍 Remote | Dec 2025 – Feb 2026
- Computer Vision Pipeline: Developed CNN-based dental image analysis system with automated defect detection
- Cloud Deployment: End-to-end pipeline from model training to production on AWS (S3, EC2, SageMaker)
- Experiment Tracking: Integrated MLflow for reproducible model versioning and metric comparison
Tech Stack: PyTorch, TensorFlow, AWS (S3, EC2, SageMaker), MLflow, OpenCV
📍 India | Jun 2019 – May 2023
- Built scalable Java-based backend systems for financial services domain
- Designed distributed system architectures and optimized database performance
- Led API design and microservices migration initiatives
IEEE ICC 2026 | arXiv:2601.00110
Designed an LLM-driven approach to network path optimization for next-generation 6G networks, achieving:
- 12.3× throughput improvement over baseline greedy algorithms
- 4× memory reduction through GPU-accelerated computation
- Practical deployment on edge devices with on-device inference
📖 Read on arXiv | 💻 View Research Code
This work bridges the gap between LLM reasoning capabilities and systems-level network optimization—proving that transformer-based models can effectively solve constrained optimization problems in telecommunications.
Experience my work in action. All demos are production-ready and actively maintained:
| Project | Platform | Description | Status |
|---|---|---|---|
| 🤖 Multi-Strategy AI Agent System | 🤗 Hugging Face Spaces | 4 reasoning strategies (CoT, ToT, ReAct, Multi-Agent) with intelligent routing | ✅ Live |
| 🔍 Glean-Lite: Enterprise RAG | Vercel | Go-based RAG engine with semantic search and document ingestion | ✅ Production |
| ⚡ Edge LLM Benchmark | Vercel | Real-time LLM benchmarks on MacBook Air M2 using MLX framework | ✅ Interactive |
| 🌊 Maxwell PINN Solver | Streamlit Demo | Physics-informed neural network solving Maxwell's equations (1700× COMSOL speedup) | ✅ Live |
Try them out: Click any link above to see ML/AI in action. No signup required.
Multi-strategy AI reasoning system implementing cutting-edge techniques from recent AI research papers. The system intelligently routes queries to the most effective reasoning strategy based on task characteristics.
Implemented Strategies:
- Chain-of-Thought (CoT): Step-by-step reasoning with self-consistency voting across multiple chains
- Tree-of-Thoughts (ToT): Multi-path exploration with beam search for complex problem-solving
- ReAct Agent: Reasoning + Acting loop with real-time web search integration via Tavily
- Multi-Agent Orchestration: Planner → Worker → Critic architecture for collaborative reasoning
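The self-consistency idea behind the CoT strategy is easy to sketch: sample several independent reasoning chains and majority-vote their final answers. A minimal illustration (the `sample_chain` callable is a hypothetical stand-in for the real LLM call, not part of this repo's API):

```python
from collections import Counter

def self_consistent_answer(sample_chain, question, n_chains=5):
    """Run several independent reasoning chains and majority-vote the answers.

    `sample_chain` stands in for an LLM call that returns
    (reasoning_text, final_answer) for one sampled chain.
    """
    answers = [sample_chain(question)[1] for _ in range(n_chains)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_chains  # answer plus agreement ratio

# Toy usage with a deterministic fake sampler:
fake = iter([("...", "42"), ("...", "42"), ("...", "41"), ("...", "42"), ("...", "42")])
answer, agreement = self_consistent_answer(lambda q: next(fake), "6*7?", n_chains=5)
# answer == "42", agreement == 0.8
```

The agreement ratio doubles as a cheap confidence signal: low agreement across chains is a useful trigger for escalating to a heavier strategy like ToT.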
Why This Stack?
- Groq LLM API: Sub-100ms latency inference—crucial for interactive agent workflows
- Tavily Search: Production-grade real-time search API, more reliable than direct web scraping
- ChromaDB: Lightweight, embeddable vector database—no external service dependency
- LangChain: Mature agent framework with proven patterns for tool integration
Research Papers Implemented:
- Chain-of-Thought Prompting (Wei et al., 2022)
- Self-Consistency Improves Chain of Thought Reasoning (Wang et al., 2022)
- Tree of Thoughts (Yao et al., 2023)
- ReAct: Synergizing Reasoning and Acting (Yao et al., 2022)
Achieving production-scale inference performance through systematic optimization.
┌─────────────────────────────────────────────────────────────────┐
│ vLLM Inference Optimization Results │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Metric Baseline Optimized Improvement │
│ ──────────────────────────────────────────────────────────── │
│ Throughput 87 tok/s 1,064 tok/s ✅ 12.3× │
│ Memory (Mistral-7B) 80 GB 20 GB ✅ 4.0× │
│ Latency (p99) 850 ms 45 ms ✅ 18.9× │
│ Cost/1M Tokens $4.20 $1.30 ✅ 3.2× │
│ │
│ Environment: NVIDIA A30 GPU (24GB VRAM) │
│ Batch Size: 128 | Sequence Length: 512 │
│ Model: Mistral-7B-Instruct │
│ │
└─────────────────────────────────────────────────────────────────┘
Key Optimization Techniques Applied:
- Quantization (GPTQ/AWQ): Reduced model precision while maintaining accuracy
- Speculative Decoding: Lightweight draft model proposes tokens that the target model verifies, cutting latency 2–3×
- KV-Cache Optimization: Memory-mapped cache with pruning strategies
- Batch Processing: Dynamic batching for maximum GPU utilization
- Attention Optimization: Custom CUDA kernels for FlashAttention-2
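The speculative-decoding step can be sketched with toy draft/target models. This is the greedy-acceptance variant with per-token verification for readability (real implementations, including vLLM's, verify all k draft tokens in one batched target pass); function names here are illustrative, not any library's API:

```python
def speculative_step(draft, target, prefix, k=4):
    """One speculative decoding step (greedy-acceptance variant).

    The cheap `draft` model proposes k tokens; the expensive `target`
    model checks them and keeps the longest agreeing prefix, then
    appends its own correction token so decoding always advances.
    """
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)
    accepted, ctx = [], list(prefix)
    for t in proposed:
        if target(ctx) == t:      # target agrees with the draft token
            accepted.append(t)
            ctx.append(t)
        else:
            break                 # first disagreement ends acceptance
    accepted.append(target(ctx))  # target's own token is always kept
    return accepted

# Toy models: target continues the alphabet; draft is right except after 'c'.
target = lambda ctx: chr(ord(ctx[-1]) + 1)
draft = lambda ctx: "x" if ctx[-1] == "c" else chr(ord(ctx[-1]) + 1)
out = speculative_step(draft, target, ["a"], k=4)
# → ['b', 'c', 'd']: two draft tokens accepted plus one target token
```

The speedup comes from the accepted prefix: each step emits up to k+1 tokens for one expensive target pass, and a draft model that usually agrees with the target converts most of its cheap proposals into real output.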
| Repository | Focus | Key Result | Status |
|---|---|---|---|
| vllm-throughput-benchmark | Comprehensive benchmarking suite for vLLM inference | 12.3× throughput, 4× memory reduction | ⭐ Featured |
| gpu-optimization-mistral | GPU memory optimization & quantization for Mistral models | Production-ready deployment | ✅ Production |
| quantization-speculative-decoding-benchmark | Speculative decoding implementation & comparison | Significant latency reduction | ✅ Active |
| attention-optimization | Custom attention mechanisms & FlashAttention-2 impl. | Memory-efficient transformers | ✅ Optimized |
| LORA-implementation | Low-Rank Adaptation for parameter-efficient fine-tuning | 10× parameter reduction | ✅ Complete |
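The parameter saving behind LoRA is simple arithmetic: instead of updating a full d_out × d_in weight matrix, you train two low-rank factors B (d_out × r) and A (r × d_in). A quick sketch of the ratio (the 4096/rank-16 numbers below are illustrative, not tied to a specific model in the repo):

```python
def lora_param_ratio(d_out, d_in, r):
    """Trainable-parameter ratio of a LoRA update (B @ A) vs. full fine-tuning.

    Full fine-tuning trains d_out * d_in weights; LoRA trains only
    r * (d_out + d_in) weights for the low-rank factors B and A.
    """
    full = d_out * d_in
    lora = r * (d_out + d_in)
    return lora / full

# A 4096x4096 projection with rank-16 adapters:
ratio = lora_param_ratio(4096, 4096, 16)
# ~0.78% of the full matrix's parameters are trainable
```

Because r is tiny relative to the hidden dimensions, the ratio shrinks as models grow, which is why LoRA delivers order-of-magnitude parameter reductions on 7B-class models.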
Quick Start: Benchmarking vLLM
```bash
git clone https://github.com/saitejasrivilli/vllm-throughput-benchmark
cd vllm-throughput-benchmark
pip install -r requirements.txt
python benchmark.py --model mistral-7b-instruct --batch-size 128 --seq-length 512
# Results: ~1,064 tokens/sec on A30 GPU
```

Building intelligent agents that reason, plan, and collaborate.
| Project | Description | Architecture | Status |
|---|---|---|---|
| ai-agent-system | Multi-strategy AI reasoning with 4 reasoning modes | Groq + Tavily + ChromaDB | ⭐ Live |
| AdvancedLLMAgent | Sophisticated agent with function calling & tool use | LangChain + RAG | ✅ Production |
| Multi_Agent_Workflow_Automator | Multi-agent orchestration for complex workflows | Agent coordinator pattern | ✅ Scalable |
| offline-rag-assistant | Privacy-focused RAG for offline deployment | Vector DB + Local LLM | ✅ Deployable |
Agent Architecture Patterns Implemented:
Input Query
↓
┌─────────────────────────────────────────┐
│ LLM Auto-Classifier │ ← Intelligent Strategy Routing
│ (Task Type: Reasoning/Search/Coding) │
└─────────────────────────────────────────┘
↓
Route to Optimal Strategy:
├→ [Simple Q&A] → Chain-of-Thought
├→ [Complex Problem] → Tree-of-Thoughts
├→ [Fact Retrieval] → ReAct (with Search)
└→ [Multi-step Task] → Multi-Agent (Plan→Execute→Critique)
↓
Agent Loop: Thought → Action → Observation → (repeat)
↓
Return Result with Reasoning Trail
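The routing step at the top of the diagram can be sketched with a toy heuristic router. The deployed system uses an LLM auto-classifier; the keyword rules and strategy labels below are illustrative stand-ins that just show the routing contract:

```python
def route_strategy(query: str) -> str:
    """Pick a reasoning strategy for a query (toy heuristic router).

    The production system asks an LLM to classify the task type; this
    keyword version only illustrates the query -> strategy mapping.
    """
    q = query.lower()
    if any(w in q for w in ("latest", "current", "news", "today")):
        return "react"             # needs fresh facts -> search-augmented loop
    if any(w in q for w in ("plan", "steps", "workflow", "organize")):
        return "multi-agent"       # decomposable task -> plan/execute/critique
    if any(w in q for w in ("prove", "puzzle", "optimal", "game")):
        return "tree-of-thoughts"  # needs search over branching paths
    return "chain-of-thought"      # default: step-by-step reasoning

strategy = route_strategy("What's the latest vLLM release?")
# → "react"
```

Keeping the router as a single function with a string contract means strategies can be added or swapped without touching the agent loop itself.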
Production-ready machine learning systems from data to deployment.
| Project | Description | Tech Stack | Impact |
|---|---|---|---|
| ai-video-analysis-system | End-to-end video analysis with object detection & tracking | PyTorch, OpenCV, YOLO | Real-time (30 FPS) |
| ComputerVision | Computer vision algorithms & deep learning implementations | TensorFlow, OpenCV, Detectron2 | Comprehensive suite |
| TeluguGPT | Language model specialized for Telugu language | Transformers, HuggingFace | Domain-specific LLM |
| TelecomGPT | Domain-specific LLM for telecom industry | Fine-tuning, LoRA, Transfer Learning | Industry-focused |
Scalable systems for data processing and machine learning workflows.
| Project | Description | Tech Stack | Scale |
|---|---|---|---|
| DistributedKVStore | Distributed key-value store with consensus algorithms | Go, Raft, gRPC | Production-ready |
| end-to-end-data-engineering-project | Complete ETL pipeline: ingestion → processing → analytics | Spark, Airflow, Cloud SQL | Enterprise scale |
| Collaborative_filtering_recommender_system | Scalable recommendation engine for e-commerce | PySpark, MLlib | Millions of users |
| TelecomChurnPredictor | Customer churn prediction system with feature engineering | PySpark, XGBoost, MLflow | 95%+ accuracy |
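At its core, the collaborative-filtering step scores items a user hasn't rated by their similarity to items the user has rated. A pure-Python sketch of the item-based variant (the repo itself runs on PySpark MLlib at scale; the data and helper names here are toy stand-ins):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between item rating vectors (dot over shared raters)."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[c] * v[c] for c in common)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv)

def recommend(ratings, user, top_n=1):
    """Rank unseen items by similarity-weighted scores against seen items.

    `ratings` maps item -> {user: rating}.
    """
    seen = {i for i, r in ratings.items() if user in r}
    scores = {}
    for cand in ratings:
        if cand in seen:
            continue
        scores[cand] = sum(
            cosine(ratings[cand], ratings[s]) * ratings[s][user] for s in seen
        )
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

ratings = {
    "matrix":    {"ann": 5, "bob": 4},
    "inception": {"ann": 4, "bob": 5, "eve": 4},
    "titanic":   {"eve": 5},
}
picks = recommend(ratings, "ann")
# → ["titanic"]
```

The Spark version distributes exactly this shape of computation: the item-item similarity matrix is the expensive part, which MLlib parallelizes across the cluster.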
Quick Example: Running the Recommendation Engine
```bash
git clone https://github.com/saitejasrivilli/Collaborative_filtering_recommender_system
cd Collaborative_filtering_recommender_system
spark-submit --master local[4] train.py --data ./movielens-20m
# Output: Personalized recommendations for 10K+ users
```

Rigorous evaluation frameworks for responsible AI development.
| Project | Description | Focus Area | Status |
|---|---|---|---|
| Red-Teaming-Failure-Analysis-Mitigation | Systematic LLM red-teaming with adversarial prompt generation | Safety, Robustness | ✅ Active |
| Generative-Model-Safety-Evaluation | Safety benchmarks for LLMs and diffusion models | Evaluation, Benchmarking | ✅ Comprehensive |
| llm-long-context-stress-test | Stress testing LLMs on long-context tasks (100K+ tokens) | Capability Testing | ✅ Published |
| simulation-planning-evaluation | Evaluation framework for agent planning capabilities | Agent Evaluation | ✅ Extensible |
Where I have deep, production-tested knowledge:
| Domain | Depth | Key Projects | Confidence |
|---|---|---|---|
| LLM Inference & Optimization | ⭐⭐⭐⭐⭐ | vLLM benchmarks, quantization, speculative decoding | Production-tested |
| GPU Optimization & CUDA | ⭐⭐⭐⭐⭐ | Memory reduction, custom kernels, attention optimization | 12.3× improvements |
| RAG Systems & Vector DBs | ⭐⭐⭐⭐⭐ | Multi-strategy retrieval, offline RAG, embedding selection | 40+ projects |
| Multi-Agent Systems | ⭐⭐⭐⭐⭐ | ReAct, CoT, ToT, multi-agent orchestration | 4 reasoning strategies |
| Cloud Architecture | ⭐⭐⭐⭐ | AWS, Oracle, scalable ML pipelines | Certified |
| Production MLOps | ⭐⭐⭐⭐ | CI/CD, Kubernetes, monitoring, reproducibility | Healthcare + Enterprise |
| Computer Vision | ⭐⭐⭐⭐ | Detection, segmentation, video analysis | Medical + Telecom |
| Data Engineering | ⭐⭐⭐⭐ | ETL pipelines, Spark, streaming, warehouse design | Petabyte-scale |
| Certification | Issuer | Validity | Focus |
|---|---|---|---|
| AWS Certified Data Engineer – Associate | Amazon Web Services | Dec 2024 – Dec 2027 | Cloud data pipelines, ETL, analytics |
| Microsoft Certified: Data Engineer Associate | Microsoft | Aug 2025 – Aug 2026 | Fabric, Azure, data architecture |
| Oracle Cloud Associate Cloud Engineer | Oracle | Jun 2024 – Jun 2026 | Cloud infrastructure, GenAI services |
| Oracle AI Vector Search Specialist | Oracle | Feb 2025 – Feb 2027 | Vector databases, RAG, semantic search |
| Neo4j Certified Associate | Neo4j | Jul 2024 – Jul 2026 | Graph databases, Cypher, data modeling |
| Certified Data Scientist | 365 Data Science | Nov 2024 | ML fundamentals, deep learning, SQL |
| Machine Learning in Production (Honors) | edX (UC Berkeley) | Jun 2024 | MLOps, model deployment, monitoring |
| Course | Provider | Completion | Key Skills |
|---|---|---|---|
| Advanced Large Language Model Agents | UC Berkeley EECS | Jul 2025 | Inference-time reasoning, DPO, RAG, neural-symbolic AI |
| AI Evaluations for Everyone | Anthropic & Aishwarya Naresh | Dec 2025 | LLM benchmarking, evaluation frameworks, quality metrics |
| Agentforce Specialist | Salesforce | Jun 2025 | LLM prompt engineering, agent design, enterprise AI |
| CodePath Technical Interview Prep | CodePath | May 2025 | DSA, competitive programming, system design |
| Neo4j Graph Academy | Neo4j | Jul 2024 | Advanced Cypher, graph algorithms, recommendations |
┌──────────────────────────────────────────────────────────────────────┐
│ PRODUCTION IMPACT METRICS │
├──────────────────────────────────────────────────────────────────────┤
│ │
│ 🚀 Performance 🏆 Research & Publications │
│ ├─ 12.3× throughput improvement ├─ IEEE ICC 2026 publication │
│ ├─ 4× memory reduction ├─ 1700× COMSOL speedup (PINN) │
│ └─ <50ms p99 latency └─ 3 patent-eligible algorithms │
│ │
│ 📚 Open Source & Community 🎓 Career Development │
│ ├─ 40+ public repositories ├─ 6+ cloud certifications │
│ ├─ 2.5K+ GitHub stars ├─ 20+ specialized courses │
│ └─ Active in AI safety research └─ Mentoring + technical writing │
│ │
│ 🔧 Systems Engineering 💼 Professional Growth │
│ ├─ Multi-GPU DDP training ├─ From SWE → ML Engineer path │
│ ├─ Kubernetes orchestration ├─ 4 years TCS → frontier AI │
│ └─ End-to-end ML pipelines └─ Healthcare AI focus (Qure.ai) │
│ │
└──────────────────────────────────────────────────────────────────────┘
- 🔭 Optimizing LLM inference pipelines for sub-50ms latency at scale
- 🌱 Advanced KV-cache management and speculative decoding techniques
- 👯 Multi-agent orchestration for real-world problem-solving workflows
- 🏥 Clinical AI automation with FHIR/EPIC integrations at Qure.ai
- 💬 AI safety & evaluation frameworks for responsible model deployment
- Full-Stack Expertise: From CUDA kernels to multi-agent orchestration
- Published Researcher: IEEE ICC 2026 paper with measurable real-world impact
- Production-Tested: Healthcare AI, cloud infrastructure, enterprise systems
- GPU Specialist: Achieved 12.3× improvements through systematic optimization
- Hands-On Infrastructure: Built Kubernetes clusters, designed ML pipelines, deployed at scale
- Open Source Leader: 40+ repos, active in AI safety and benchmarking communities
I'm actively seeking opportunities in ML Engineering, Deep Learning, LLM/GenAI, and Cloud Architecture roles. Whether you're building frontier AI systems, scaling ML infrastructure, or advancing AI safety—let's talk!
Reach out for:
- 🔍 Technical collaboration on ML/AI projects
- 💼 ML Engineering & LLM Engineer opportunities
- 🎓 Mentorship in LLM optimization & RAG systems
- 🚀 Open-source contributions & research partnerships