ML & Deep Learning Engineer with production expertise in GPU optimization, LLM inference, and cloud-native AI solutions. Currently shipping LLM-based clinical automation at Qure.ai while advancing full-stack RAG platforms at UT Arlington.
Key Impact: Achieved 12.3× throughput improvement and 4× memory reduction in LLM inference pipelines. Published research on LLM-based 6G path planning with real-world network optimization results. Proficient across the full ML stack: from CUDA kernel optimization to multi-agent system orchestration.
📍 Arlington, TX | Mar 2026 – Present
- Clinical Protocol Automation: Leading LLM configuration for hospital clients (Mount Sinai, Medstar) to automate clinical workflows using proprietary clinical knowledge
- Healthcare Interoperability: Building EPIC/FHIR integrations to enable real-time protocol recommendations directly in hospital systems
- Infrastructure Redesign: Architecting pluggable executor framework for clinical pipeline orchestration—Docker-first, API-driven design with portable artifact store across environments
Tech Stack: Python, FastAPI, Docker, Kubernetes, FHIR, Healthcare APIs
📍 Arlington, TX | Jun 2025 – Present
- Full-Stack LLM/RAG Platform: Building enterprise-grade retrieval-augmented generation system for knowledge workers
- GPU Infrastructure: Leveraging 4× NVIDIA A30 cluster (96GB total VRAM) for multi-GPU DDP training and inference optimization
- Research & Development: Experimenting with advanced RAG patterns, prompt optimization, and efficient fine-tuning techniques
Tech Stack: PyTorch, CUDA, vLLM, Vector Databases, LangChain
📍 Remote | Dec 2025 – Feb 2026
- Computer Vision Pipeline: Developed CNN-based dental image analysis system with automated defect detection
- Cloud Deployment: End-to-end pipeline from model training to production on AWS (S3, EC2, SageMaker)
- Experiment Tracking: Integrated MLflow for reproducible model versioning and metric comparison
Tech Stack: PyTorch, TensorFlow, AWS (S3, EC2, SageMaker), MLflow, OpenCV
📍 India | Jun 2019 – May 2023
- Built scalable Java-based backend systems for financial services domain
- Designed distributed system architectures and optimized database performance
- Led API design and microservices migration initiatives
IEEE ICC 2026 | arXiv:2601.00110
Designed an LLM-driven approach to network path optimization for next-generation 6G networks, achieving:
- 12.3× throughput improvement over baseline greedy algorithms
- 4× memory reduction through GPU-accelerated computation
- Practical deployment on edge devices with on-device inference
📖 Read on arXiv | 💻 View Research Code
This work bridges the gap between LLM reasoning capabilities and systems-level network optimization—proving that transformer-based models can effectively solve constrained optimization problems in telecommunications.
Experience my work in action. All demos are production-ready and actively maintained:
| Project | Platform | Description | Status |
|---|---|---|---|
| 🤖 Multi-Strategy AI Agent System | 🤗 Hugging Face Spaces | 4 reasoning strategies (CoT, ToT, ReAct, Multi-Agent) with intelligent routing | ✅ Live |
| 🔍 Glean-Lite: Enterprise RAG | Vercel | Go-based RAG engine with semantic search and document ingestion | ✅ Production |
| ⚡ Edge LLM Benchmark | Vercel | Real-time LLM benchmarks on MacBook Air M2 using MLX framework | ✅ Interactive |
| 🌊 Maxwell PINN Solver | Streamlit Demo | Physics-informed neural network solving Maxwell's equations (1700× COMSOL speedup) | ✅ Live |
Try them out: Click any link above to see ML/AI in action. No signup required.
Multi-strategy AI reasoning system implementing cutting-edge techniques from recent AI research papers. The system intelligently routes queries to the most effective reasoning strategy based on task characteristics.
Implemented Strategies:
- Chain-of-Thought (CoT): Step-by-step reasoning with self-consistency voting across multiple chains
- Tree-of-Thoughts (ToT): Multi-path exploration with beam search for complex problem-solving
- ReAct Agent: Reasoning + Acting loop with real-time web search integration via Tavily
- Multi-Agent Orchestration: Planner → Worker → Critic architecture for collaborative reasoning
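The self-consistency idea behind the CoT strategy is easy to sketch: sample several independent reasoning chains and majority-vote their final answers. A minimal illustration (the `sample_chain` callable is a hypothetical stand-in for the real LLM call, not part of this repo's API):

```python
from collections import Counter

def self_consistent_answer(sample_chain, question, n_chains=5):
    """Run several independent reasoning chains and majority-vote the answers.

    `sample_chain` stands in for an LLM call that returns
    (reasoning_text, final_answer) for one sampled chain.
    """
    answers = [sample_chain(question)[1] for _ in range(n_chains)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_chains  # answer plus agreement ratio

# Toy usage with a deterministic fake sampler:
fake = iter([("...", "42"), ("...", "42"), ("...", "41"), ("...", "42"), ("...", "42")])
answer, agreement = self_consistent_answer(lambda q: next(fake), "6*7?", n_chains=5)
# answer == "42", agreement == 0.8
```

The agreement ratio doubles as a cheap confidence signal: low agreement across chains is a useful trigger for escalating to a heavier strategy like ToT.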
Why This Stack?
- Groq LLM API: Sub-100ms latency inference—crucial for interactive agent workflows
- Tavily Search: Production-grade real-time search API, more reliable than direct web scraping
- ChromaDB: Lightweight, embeddable vector database—no external service dependency
- LangChain: Mature agent framework with proven patterns for tool integration
Research Papers Implemented:
- Chain-of-Thought Prompting (Wei et al., 2022)
- Self-Consistency Improves Chain of Thought Reasoning (Wang et al., 2022)
- Tree of Thoughts (Yao et al., 2023)
- ReAct: Synergizing Reasoning and Acting (Yao et al., 2022)
Achieving production-scale inference performance through systematic optimization.
┌─────────────────────────────────────────────────────────────────┐
│ vLLM Inference Optimization Results │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Metric Baseline Optimized Improvement │
│ ──────────────────────────────────────────────────────────── │
│ Throughput 87 tok/s 1,064 tok/s ✅ 12.3× │
│ Memory (Mistral-7B) 80 GB 20 GB ✅ 4.0× │
│ Latency (p99) 850 ms 45 ms ✅ 18.9× │
│ Cost/1M Tokens $4.20 $1.30 ✅ 3.2× │
│ │
│ Environment: NVIDIA A30 GPU (24GB VRAM) │
│ Batch Size: 128 | Sequence Length: 512 │
│ Model: Mistral-7B-Instruct │
│ │
└─────────────────────────────────────────────────────────────────┘
Key Optimization Techniques Applied:
- Quantization (GPTQ/AWQ): Reduced model precision while maintaining accuracy
- Speculative Decoding: Lightweight draft model proposes tokens that the target model verifies, cutting latency 2–3×
- KV-Cache Optimization: Memory-mapped cache with pruning strategies
- Batch Processing: Dynamic batching for maximum GPU utilization
- Attention Optimization: Custom CUDA kernels for FlashAttention-2
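The speculative-decoding step can be sketched with toy draft/target models. This is the greedy-acceptance variant with per-token verification for readability (real implementations, including vLLM's, verify all k draft tokens in one batched target pass); function names here are illustrative, not any library's API:

```python
def speculative_step(draft, target, prefix, k=4):
    """One speculative decoding step (greedy-acceptance variant).

    The cheap `draft` model proposes k tokens; the expensive `target`
    model checks them and keeps the longest agreeing prefix, then
    appends its own correction token so decoding always advances.
    """
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)
    accepted, ctx = [], list(prefix)
    for t in proposed:
        if target(ctx) == t:      # target agrees with the draft token
            accepted.append(t)
            ctx.append(t)
        else:
            break                 # first disagreement ends acceptance
    accepted.append(target(ctx))  # target's own token is always kept
    return accepted

# Toy models: target continues the alphabet; draft is right except after 'c'.
target = lambda ctx: chr(ord(ctx[-1]) + 1)
draft = lambda ctx: "x" if ctx[-1] == "c" else chr(ord(ctx[-1]) + 1)
out = speculative_step(draft, target, ["a"], k=4)
# → ['b', 'c', 'd']: two draft tokens accepted plus one target token
```

The speedup comes from the accepted prefix: each step emits up to k+1 tokens for one expensive target pass, and a draft model that usually agrees with the target converts most of its cheap proposals into real output.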
| Repository | Focus | Key Result | Status |
|---|---|---|---|
| vllm-throughput-benchmark | Comprehensive benchmarking suite for vLLM inference | 12.3× throughput, 4× memory reduction | ⭐ Featured |
| gpu-optimization-mistral | GPU memory optimization & quantization for Mistral models | Production-ready deployment | ✅ Production |
| quantization-speculative-decoding-benchmark | Speculative decoding implementation & comparison | Significant latency reduction | ✅ Active |
| attention-optimization | Custom attention mechanisms & FlashAttention-2 impl. | Memory-efficient transformers | ✅ Optimized |
| LORA-implementation | Low-Rank Adaptation for parameter-efficient fine-tuning | 10× parameter reduction | ✅ Complete |
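The parameter saving behind LoRA is simple arithmetic: instead of updating a full d_out × d_in weight matrix, you train two low-rank factors B (d_out × r) and A (r × d_in). A quick sketch of the ratio (the 4096/rank-16 numbers below are illustrative, not tied to a specific model in the repo):

```python
def lora_param_ratio(d_out, d_in, r):
    """Trainable-parameter ratio of a LoRA update (B @ A) vs. full fine-tuning.

    Full fine-tuning trains d_out * d_in weights; LoRA trains only
    r * (d_out + d_in) weights for the low-rank factors B and A.
    """
    full = d_out * d_in
    lora = r * (d_out + d_in)
    return lora / full

# A 4096x4096 projection with rank-16 adapters:
ratio = lora_param_ratio(4096, 4096, 16)
# ~0.78% of the full matrix's parameters are trainable
```

Because r is tiny relative to the hidden dimensions, the ratio shrinks as models grow, which is why LoRA delivers order-of-magnitude parameter reductions on 7B-class models.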
Quick Start: Benchmarking vLLM
```bash
git clone https://github.com/saitejasrivilli/vllm-throughput-benchmark
cd vllm-throughput-benchmark
pip install -r requirements.txt
python benchmark.py --model mistral-7b-instruct --batch-size 128 --seq-length 512
# Results: ~1,064 tokens/sec on A30 GPU
```

Building intelligent agents that reason, plan, and collaborate.
| Project | Description | Architecture | Status |
|---|---|---|---|
| ai-agent-system | Multi-strategy AI reasoning with 4 reasoning modes | Groq + Tavily + ChromaDB | ⭐ Live |
| AdvancedLLMAgent | Sophisticated agent with function calling & tool use | LangChain + RAG | ✅ Production |
| Multi_Agent_Workflow_Automator | Multi-agent orchestration for complex workflows | Agent coordinator pattern | ✅ Scalable |
| offline-rag-assistant | Privacy-focused RAG for offline deployment | Vector DB + Local LLM | ✅ Deployable |
Agent Architecture Patterns Implemented:
Input Query
↓
┌─────────────────────────────────────────┐
│ LLM Auto-Classifier │ ← Intelligent Strategy Routing
│ (Task Type: Reasoning/Search/Coding) │
└─────────────────────────────────────────┘
↓
Route to Optimal Strategy:
├→ [Simple Q&A] → Chain-of-Thought
├→ [Complex Problem] → Tree-of-Thoughts
├→ [Fact Retrieval] → ReAct (with Search)
└→ [Multi-step Task] → Multi-Agent (Plan→Execute→Critique)
↓
Agent Loop: Thought → Action → Observation → (repeat)
↓
Return Result with Reasoning Trail
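The routing step at the top of the diagram can be sketched with a toy heuristic router. The deployed system uses an LLM auto-classifier; the keyword rules and strategy labels below are illustrative stand-ins that just show the routing contract:

```python
def route_strategy(query: str) -> str:
    """Pick a reasoning strategy for a query (toy heuristic router).

    The production system asks an LLM to classify the task type; this
    keyword version only illustrates the query -> strategy mapping.
    """
    q = query.lower()
    if any(w in q for w in ("latest", "current", "news", "today")):
        return "react"             # needs fresh facts -> search-augmented loop
    if any(w in q for w in ("plan", "steps", "workflow", "organize")):
        return "multi-agent"       # decomposable task -> plan/execute/critique
    if any(w in q for w in ("prove", "puzzle", "optimal", "game")):
        return "tree-of-thoughts"  # needs search over branching paths
    return "chain-of-thought"      # default: step-by-step reasoning

strategy = route_strategy("What's the latest vLLM release?")
# → "react"
```

Keeping the router as a single function with a string contract means strategies can be added or swapped without touching the agent loop itself.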
Production-ready machine learning systems from data to deployment.
| Project | Description | Tech Stack | Impact |
|---|---|---|---|
| ai-video-analysis-system | End-to-end video analysis with object detection & tracking | PyTorch, OpenCV, YOLO | Real-time (30 FPS) |
| ComputerVision | Computer vision algorithms & deep learning implementations | TensorFlow, OpenCV, Detectron2 | Comprehensive suite |
| TeluguGPT | Language model specialized for Telugu language | Transformers, HuggingFace | Domain-specific LLM |
| TelecomGPT | Domain-specific LLM for telecom industry | Fine-tuning, LoRA, Transfer Learning | Industry-focused |
Scalable systems for data processing and machine learning workflows.
| Project | Description | Tech Stack | Scale |
|---|---|---|---|
| DistributedKVStore | Distributed key-value store with consensus algorithms | Go, Raft, gRPC | Production-ready |
| end-to-end-data-engineering-project | Complete ETL pipeline: ingestion → processing → analytics | Spark, Airflow, Cloud SQL | Enterprise scale |
| Collaborative_filtering_recommender_system | Scalable recommendation engine for e-commerce | PySpark, MLlib | Millions of users |
| TelecomChurnPredictor | Customer churn prediction system with feature engineering | PySpark, XGBoost, MLflow | 95%+ accuracy |
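At its core, the collaborative-filtering step scores items a user hasn't rated by their similarity to items the user has rated. A pure-Python sketch of the item-based variant (the repo itself runs on PySpark MLlib at scale; the data and helper names here are toy stand-ins):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between item rating vectors (dot over shared raters)."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[c] * v[c] for c in common)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv)

def recommend(ratings, user, top_n=1):
    """Rank unseen items by similarity-weighted scores against seen items.

    `ratings` maps item -> {user: rating}.
    """
    seen = {i for i, r in ratings.items() if user in r}
    scores = {}
    for cand in ratings:
        if cand in seen:
            continue
        scores[cand] = sum(
            cosine(ratings[cand], ratings[s]) * ratings[s][user] for s in seen
        )
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

ratings = {
    "matrix":    {"ann": 5, "bob": 4},
    "inception": {"ann": 4, "bob": 5, "eve": 4},
    "titanic":   {"eve": 5},
}
picks = recommend(ratings, "ann")
# → ["titanic"]
```

The Spark version distributes exactly this shape of computation: the item-item similarity matrix is the expensive part, which MLlib parallelizes across the cluster.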
Quick Example: Running the Recommendation Engine
```bash
git clone https://github.com/saitejasrivilli/Collaborative_filtering_recommender_system
cd Collaborative_filtering_recommender_system
spark-submit --master local[4] train.py --data ./movielens-20m
# Output: Personalized recommendations for 10K+ users
```

Rigorous evaluation frameworks for responsible AI development.
| Project | Description | Focus Area | Status |
|---|---|---|---|
| Red-Teaming-Failure-Analysis-Mitigation | Systematic LLM red-teaming with adversarial prompt generation | Safety, Robustness | ✅ Active |
| Generative-Model-Safety-Evaluation | Safety benchmarks for LLMs and diffusion models | Evaluation, Benchmarking | ✅ Comprehensive |
| llm-long-context-stress-test | Stress testing LLMs on long-context tasks (100K+ tokens) | Capability Testing | ✅ Published |
| simulation-planning-evaluation | Evaluation framework for agent planning capabilities | Agent Evaluation | ✅ Extensible |
Where I have deep, production-tested knowledge:
| Domain | Depth | Key Projects | Confidence |
|---|---|---|---|
| LLM Inference & Optimization | ⭐⭐⭐⭐⭐ | vLLM benchmarks, quantization, speculative decoding | Production-tested |
| GPU Optimization & CUDA | ⭐⭐⭐⭐⭐ | Memory reduction, custom kernels, attention optimization | 12.3× improvements |
| RAG Systems & Vector DBs | ⭐⭐⭐⭐⭐ | Multi-strategy retrieval, offline RAG, embedding selection | 40+ projects |
| Multi-Agent Systems | ⭐⭐⭐⭐⭐ | ReAct, CoT, ToT, multi-agent orchestration | 4 reasoning strategies |
| Cloud Architecture | ⭐⭐⭐⭐ | AWS, Oracle, scalable ML pipelines | Certified |
| Production MLOps | ⭐⭐⭐⭐ | CI/CD, Kubernetes, monitoring, reproducibility | Healthcare + Enterprise |
| Computer Vision | ⭐⭐⭐⭐ | Detection, segmentation, video analysis | Medical + Telecom |
| Data Engineering | ⭐⭐⭐⭐ | ETL pipelines, Spark, streaming, warehouse design | Petabyte-scale |
| Certification | Issuer | Validity | Focus |
|---|---|---|---|
| AWS Certified Data Engineer – Associate | Amazon Web Services | Dec 2024 – Dec 2027 | Cloud data pipelines, ETL, analytics |
| Microsoft Certified: Data Engineer Associate | Microsoft | Aug 2025 – Aug 2026 | Fabric, Azure, data architecture |
| Oracle Cloud Associate Cloud Engineer | Oracle | Jun 2024 – Jun 2026 | Cloud infrastructure, GenAI services |
| Oracle AI Vector Search Specialist | Oracle | Feb 2025 – Feb 2027 | Vector databases, RAG, semantic search |
| Neo4j Certified Associate | Neo4j | Jul 2024 – Jul 2026 | Graph databases, Cypher, data modeling |
| Certified Data Scientist | 365 Data Science | Nov 2024 | ML fundamentals, deep learning, SQL |
| Machine Learning in Production (Honors) | edX (UC Berkeley) | Jun 2024 | MLOps, model deployment, monitoring |
| Course | Provider | Completion | Key Skills |
|---|---|---|---|
| Advanced Large Language Model Agents | UC Berkeley EECS | Jul 2025 | Inference-time reasoning, DPO, RAG, neural-symbolic AI |
| AI Evaluations for Everyone | Anthropic & Aishwarya Naresh | Dec 2025 | LLM benchmarking, evaluation frameworks, quality metrics |
| Agentforce Specialist | Salesforce | Jun 2025 | LLM prompt engineering, agent design, enterprise AI |
| CodePath Technical Interview Prep | CodePath | May 2025 | DSA, competitive programming, system design |
| Neo4j Graph Academy | Neo4j | Jul 2024 | Advanced Cypher, graph algorithms, recommendations |
┌──────────────────────────────────────────────────────────────────────┐
│ PRODUCTION IMPACT METRICS │
├──────────────────────────────────────────────────────────────────────┤
│ │
│ 🚀 Performance 🏆 Research & Publications │
│ ├─ 12.3× throughput improvement ├─ IEEE ICC 2026 publication │
│ ├─ 4× memory reduction ├─ 1700× COMSOL speedup (PINN) │
│ └─ <50ms p99 latency └─ 3 patent-eligible algorithms │
│ │
│ 📚 Open Source & Community 🎓 Career Development │
│ ├─ 40+ public repositories ├─ 6+ cloud certifications │
│ ├─ 2.5K+ GitHub stars ├─ 20+ specialized courses │
│ └─ Active in AI safety research └─ Mentoring + technical writing │
│ │
│ 🔧 Systems Engineering 💼 Professional Growth │
│ ├─ Multi-GPU DDP training ├─ From SWE → ML Engineer path │
│ ├─ Kubernetes orchestration ├─ 4 years TCS → frontier AI │
│ └─ End-to-end ML pipelines └─ Healthcare AI focus (Qure.ai) │
│ │
└──────────────────────────────────────────────────────────────────────┘
- 🔭 Optimizing LLM inference pipelines for sub-50ms latency at scale
- 🌱 Advanced KV-cache management and speculative decoding techniques
- 👯 Multi-agent orchestration for real-world problem-solving workflows
- 🏥 Clinical AI automation with FHIR/EPIC integrations at Qure.ai
- 💬 AI safety & evaluation frameworks for responsible model deployment
- Full-Stack Expertise: From CUDA kernels to multi-agent orchestration
- Published Researcher: IEEE ICC 2026 paper with measurable real-world impact
- Production-Tested: Healthcare AI, cloud infrastructure, enterprise systems
- GPU Specialist: Achieved 12.3× improvements through systematic optimization
- Hands-On Infrastructure: Built Kubernetes clusters, designed ML pipelines, deployed at scale
- Open Source Leader: 40+ repos, active in AI safety and benchmarking communities
I'm actively seeking opportunities in ML Engineering, Deep Learning, LLM/GenAI, and Cloud Architecture roles. Whether you're building frontier AI systems, scaling ML infrastructure, or advancing AI safety—let's talk!
Reach out for:
- 🔍 Technical collaboration on ML/AI projects
- 💼 ML Engineering & LLM Engineer opportunities
- 🎓 Mentorship in LLM optimization & RAG systems
- 🚀 Open-source contributions & research partnerships