Skip to content
View saitejasrivilli's full-sized avatar

Highlights

  • Pro

Block or report saitejasrivilli

Block user

Prevent this user from interacting with your repositories and sending you notifications. Learn more about blocking users.

You must be logged in to block users.

Maximum 250 characters. Please don't include any personal information such as legal names or email addresses. Markdown supported. This note will be visible to only you.
Report abuse

Contact GitHub support about this user’s behavior. Learn more about reporting abuse.

Report abuse
saitejasrivilli/README.md

👋 Hi, I'm Sai Teja Srivillibhutturu

ML & Deep Learning Engineer | LLM Specialist | Cloud Architect

LinkedIn GitHub Google Scholar Portfolio


🎯 Professional Summary

ML & Deep Learning Engineer with production expertise in GPU optimization, LLM inference, and cloud-native AI solutions. Currently shipping LLM-based clinical automation at Qure.ai while advancing full-stack RAG platforms at UT Arlington.

Key Impact: Achieved 12.3× throughput improvement and 4× memory reduction in LLM inference pipelines. Published research on LLM-based 6G path planning with real-world network optimization results. Proficient across the full ML stack: from CUDA kernel optimization to multi-agent system orchestration.


💼 Professional Experience

AI Solutions Engineer Intern @ Qure.ai

📍 Arlington, TX | Mar 2026 – Present

  • Clinical Protocol Automation: Leading LLM configuration for hospital clients (Mount Sinai, Medstar) to automate clinical workflows using proprietary clinical knowledge
  • Healthcare Interoperability: Building EPIC/FHIR integrations to enable real-time protocol recommendations directly in hospital systems
  • Infrastructure Redesign: Architecting pluggable executor framework for clinical pipeline orchestration—Docker-first, API-driven design with portable artifact store across environments

Tech Stack: Python, FastAPI, Docker, Kubernetes, FHIR, Healthcare APIs


Graduate Research Assistant – TopGPT Project @ UT Arlington

📍 Arlington, TX | Jun 2025 – Present

  • Full-Stack LLM/RAG Platform: Building enterprise-grade retrieval-augmented generation system for knowledge workers
  • GPU Infrastructure: Leveraging 4× NVIDIA A30 cluster (96GB total VRAM) for multi-GPU DDP training and inference optimization
  • Research & Development: Experimenting with advanced RAG patterns, prompt optimization, and efficient fine-tuning techniques

Tech Stack: PyTorch, CUDA, vLLM, Vector Databases, LangChain


ML Engineer (Contract) @ DentalScan / ReplyQuickAI

📍 Remote | Dec 2025 – Feb 2026

  • Computer Vision Pipeline: Developed CNN-based dental image analysis system with automated defect detection
  • Cloud Deployment: End-to-end pipeline from model training to production on AWS (S3, EC2, SageMaker)
  • Experiment Tracking: Integrated MLflow for reproducible model versioning and metric comparison

Tech Stack: PyTorch, TensorFlow, AWS (S3, EC2, SageMaker), MLflow, OpenCV


Software Engineer (4 yrs) @ Tata Consultancy Services

📍 India | Jun 2019 – May 2023

  • Built scalable Java-based backend systems for financial services domain
  • Designed distributed system architectures and optimized database performance
  • Led API design and microservices migration initiatives

📚 Publications & Research

CTMap: LLM-based 6G Path Planning

IEEE ICC 2026 | arXiv:2601.00110

Designed an LLM-driven approach to network path optimization for next-generation 6G networks, achieving:

  • 12.3× throughput improvement over baseline greedy algorithms
  • 4× memory reduction through GPU-accelerated computation
  • Practical deployment on edge devices with on-device inference

📖 Read on arXiv | 💻 View Research Code

This work bridges the gap between LLM reasoning capabilities and systems-level network optimization—proving that transformer-based models can effectively solve constrained optimization problems in telecommunications.


🌐 Live Deployments & Interactive Demos

Experience my work in action. All demos are production-ready and actively maintained:

Project Platform Description Status
🤖 Multi-Strategy AI Agent System 🤗 Hugging Face Spaces 4 reasoning strategies (CoT, ToT, ReAct, Multi-Agent) with intelligent routing ✅ Live
🔍 Glean-Lite: Enterprise RAG Vercel Go-based RAG engine with semantic search and document ingestion ✅ Production
⚡ Edge LLM Benchmark Vercel Real-time LLM benchmarks on MacBook Air M2 using MLX framework ✅ Interactive
🌊 Maxwell PINN Solver Streamlit Demo Physics-informed neural network solving Maxwell's equations (1700× COMSOL speedup) ✅ Live

Try them out: Click any link above to see ML/AI in action. No signup required.


🚀 Featured Projects

⭐ Advanced AI Agent System — Multi-Strategy Reasoning

Multi-strategy AI reasoning system implementing cutting-edge techniques from recent AI research papers. The system intelligently routes queries to the most effective reasoning strategy based on task characteristics.

Implemented Strategies:

  • Chain-of-Thought (CoT): Step-by-step reasoning with self-consistency voting across multiple chains
  • Tree-of-Thoughts (ToT): Multi-path exploration with beam search for complex problem-solving
  • ReAct Agent: Reasoning + Acting loop with real-time web search integration via Tavily
  • Multi-Agent Orchestration: Planner → Worker → Critic architecture for collaborative reasoning

Why This Stack?

  • Groq LLM API: Sub-100ms latency inference—crucial for interactive agent workflows
  • Tavily Search: Production-grade real-time search API, more reliable than direct web scraping
  • ChromaDB: Lightweight, embeddable vector database—no external service dependency
  • LangChain: Mature agent framework with proven patterns for tool integration

Research Papers Implemented:

🔗 Live Demo | 📖 GitHub


🔥 LLM & GPU Optimization

Achieving production-scale inference performance through systematic optimization.

Performance Benchmarks

┌─────────────────────────────────────────────────────────────────┐
│              vLLM Inference Optimization Results                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                   │
│  Metric              Baseline    Optimized    Improvement        │
│  ────────────────────────────────────────────────────────────   │
│  Throughput          87 tok/s    1,064 tok/s  ✅ 12.3×          │
│  Memory (Mistral-7B) 80 GB       20 GB        ✅ 4.0×           │
│  Latency (p99)       850 ms      45 ms        ✅ 18.9×          │
│  Cost/1M Tokens      $4.20       $1.30        ✅ 3.2×           │
│                                                                   │
│  Environment: NVIDIA A30 GPU (24GB VRAM)                         │
│  Batch Size: 128  | Sequence Length: 512                         │
│  Model: Mistral-7B-Instruct                                      │
│                                                                   │
└─────────────────────────────────────────────────────────────────┘

Key Optimization Techniques Applied:

  1. Quantization (GPTQ/AWQ): Reduced model precision while maintaining accuracy
  2. Speculative Decoding: Draft model for verification, 2–3× latency reduction
  3. KV-Cache Optimization: Memory-mapped cache with pruning strategies
  4. Batch Processing: Dynamic batching for maximum GPU utilization
  5. Attention Optimization: Custom CUDA kernels for FlashAttention-2

Featured Optimization Projects

Repository Focus Key Result Status
vllm-throughput-benchmark Comprehensive benchmarking suite for vLLM inference 12.3× throughput, 4× memory reduction ⭐ Featured
gpu-optimization-mistral GPU memory optimization & quantization for Mistral models Production-ready deployment ✅ Production
quantization-speculative-decoding-benchmark Speculative decoding implementation & comparison Significant latency reduction ✅ Active
attention-optimization Custom attention mechanisms & FlashAttention-2 impl. Memory-efficient transformers ✅ Optimized
LORA-implementation Low-Rank Adaptation for parameter-efficient fine-tuning 10× parameter reduction ✅ Complete

Quick Start: Benchmarking vLLM

git clone https://github.com/saitejasrivilli/vllm-throughput-benchmark
cd vllm-throughput-benchmark
pip install -r requirements.txt
python benchmark.py --model mistral-7b-instruct --batch-size 128 --seq-length 512
# Results: ~1,064 tokens/sec on A30 GPU

🤖 AI Agents & Multi-Agent Systems

Building intelligent agents that reason, plan, and collaborate.

Project Description Architecture Status
ai-agent-system Multi-strategy AI reasoning with 4 reasoning modes Groq + Tavily + ChromaDB ⭐ Live
AdvancedLLMAgent Sophisticated agent with function calling & tool use LangChain + RAG ✅ Production
Multi_Agent_Workflow_Automator Multi-agent orchestration for complex workflows Agent coordinator pattern ✅ Scalable
offline-rag-assistant Privacy-focused RAG for offline deployment Vector DB + Local LLM ✅ Deployable

Agent Architecture Patterns Implemented:

Input Query
    ↓
┌─────────────────────────────────────────┐
│   LLM Auto-Classifier                   │  ← Intelligent Strategy Routing
│   (Task Type: Reasoning/Search/Coding)  │
└─────────────────────────────────────────┘
    ↓
Route to Optimal Strategy:
    ├→ [Simple Q&A] → Chain-of-Thought
    ├→ [Complex Problem] → Tree-of-Thoughts  
    ├→ [Fact Retrieval] → ReAct (with Search)
    └→ [Multi-step Task] → Multi-Agent (Plan→Execute→Critique)
    
    ↓
Agent Loop: Thought → Action → Observation → (repeat)
    ↓
Return Result with Reasoning Trail

🔬 ML Systems & Computer Vision

Production-ready machine learning systems from data to deployment.

Project Description Tech Stack Impact
ai-video-analysis-system End-to-end video analysis with object detection & tracking PyTorch, OpenCV, YOLO Real-time (30 FPS)
ComputerVision Computer vision algorithms & deep learning implementations TensorFlow, OpenCV, Detectron2 Comprehensive suite
TeluguGPT Language model specialized for Telugu language Transformers, HuggingFace Domain-specific LLM
TelecomGPT Domain-specific LLM for telecom industry Fine-tuning, LoRA, Transfer Learning Industry-focused

📊 Data Engineering & ML Pipelines

Scalable systems for data processing and machine learning workflows.

Project Description Tech Stack Scale
DistributedKVStore Distributed key-value store with consensus algorithms Go, Raft, gRPC Production-ready
end-to-end-data-engineering-project Complete ETL pipeline: ingestion → processing → analytics Spark, Airflow, Cloud SQL Enterprise scale
Collaborative_filtering_recommender_system Scalable recommendation engine for e-commerce PySpark, MLlib Millions of users
TelecomChurnPredictor Customer churn prediction system with feature engineering PySpark, XGBoost, MLflow 95%+ accuracy

Quick Example: Running the Recommendation Engine

git clone https://github.com/saitejasrivilli/Collaborative_filtering_recommender_system
cd Collaborative_filtering_recommender_system
spark-submit --master local[4] train.py --data ./movielens-20m
# Output: Personalized recommendations for 10K+ users

🛡️ AI Safety & Evaluation

Rigorous evaluation frameworks for responsible AI development.

Project Description Focus Area Status
Red-Teaming-Failure-Analysis-Mitigation Systematic LLM red-teaming with adversarial prompt generation Safety, Robustness ✅ Active
Generative-Model-Safety-Evaluation Safety benchmarks for LLMs and diffusion models Evaluation, Benchmarking ✅ Comprehensive
llm-long-context-stress-test Stress testing LLMs on long-context tasks (100K+ tokens) Capability Testing ✅ Published
simulation-planning-evaluation Evaluation framework for agent planning capabilities Agent Evaluation ✅ Extensible

🛠️ Technical Skills & Expertise

🤖 ML/DL & LLMs

  • Core: PyTorch, TensorFlow, JAX
  • LLM Frameworks: LangChain, LlamaIndex, vLLM
  • Techniques: RAG, Vector DBs, LoRA/QLoRA
  • Inference: Quantization, Speculative Decoding
  • Optimization: CUDA, FlashAttention, KV-Cache

☁️ Cloud & Infrastructure

  • AWS: EC2, S3, SageMaker, Lambda (Certified)
  • Oracle: GenAI, Vector Search, Cloud Infrastructure
  • Microsoft: Azure, Fabric (Certified)
  • Containerization: Docker, Kubernetes, Helm
  • MLOps: CI/CD, Monitoring, Reproducibility

💻 Software Engineering

  • Languages: Python, Go, C++, Java, SQL
  • Web: FastAPI, Flask, REST APIs
  • Databases: PostgreSQL, Neo4j, Redis
  • Messaging: Kafka, RabbitMQ
  • Systems: Distributed Systems, DSA

🎓 Specialized Expertise Matrix

Where I have deep, production-tested knowledge:

Domain Depth Key Projects Confidence
LLM Inference & Optimization ⭐⭐⭐⭐⭐ vLLM benchmarks, quantization, speculative decoding Production-tested
GPU Optimization & CUDA ⭐⭐⭐⭐⭐ Memory reduction, custom kernels, attention optimization 12.3× improvements
RAG Systems & Vector DBs ⭐⭐⭐⭐⭐ Multi-strategy retrieval, offline RAG, embedding selection 40+ projects
Multi-Agent Systems ⭐⭐⭐⭐⭐ ReAct, CoT, ToT, multi-agent orchestration 4 reasoning strategies
Cloud Architecture ⭐⭐⭐⭐ AWS, Oracle, scalable ML pipelines Certified
Production MLOps ⭐⭐⭐⭐ CI/CD, Kubernetes, monitoring, reproducibility Healthcare + Enterprise
Computer Vision ⭐⭐⭐⭐ Detection, segmentation, video analysis Medical + Telecom
Data Engineering ⭐⭐⭐⭐ ETL pipelines, Spark, streaming, warehouse design Petabyte-scale

🏆 Certifications & Continuous Learning

Professional Certifications

Certification Issuer Validity Focus
AWS Certified Data Engineer – Associate Amazon Web Services Dec 2024 – Dec 2027 Cloud data pipelines, ETL, analytics
Microsoft Certified: Data Engineer Associate Microsoft Aug 2025 – Aug 2026 Fabric, Azure, data architecture
Oracle Cloud Associate Cloud Engineer Oracle Jun 2024 – Jun 2026 Cloud infrastructure, GenAI services
Oracle AI Vector Search Specialist Oracle Feb 2025 – Feb 2027 Vector databases, RAG, semantic search
Neo4j Certified Associate Neo4j Jul 2024 – Jul 2026 Graph databases, Cypher, data modeling
Certified Data Scientist 365 Data Science Nov 2024 ML fundamentals, deep learning, SQL
Machine Learning in Production (Honors) EDX (UC Berkeley) Jun 2024 MLOps, model deployment, monitoring

Specialized Technical Training

Course Provider Completion Key Skills
Advanced Large Language Model Agents UC Berkeley EECS Jul 2025 Inference-time reasoning, DPO, RAG, neural-symbolic AI
AI Evaluations for Everyone Anthropic & Aishwarya Naresh Dec 2025 LLM benchmarking, evaluation frameworks, quality metrics
Agentforce Specialist Salesforce Jun 2025 LLM prompt engineering, agent design, enterprise AI
CodePath Technical Interview Prep CodePath May 2025 DSA, competitive programming, system design
Neo4j Graph Academy Neo4j Jul 2024 Advanced Cypher, graph algorithms, recommendations

📈 Key Achievements & Impact

┌──────────────────────────────────────────────────────────────────────┐
│                     PRODUCTION IMPACT METRICS                        │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  🚀 Performance                   🏆 Research & Publications         │
│  ├─ 12.3× throughput improvement  ├─ IEEE ICC 2026 publication     │
│  ├─ 4× memory reduction           ├─ 1700× COMSOL speedup (PINN)   │
│  └─ <50ms p99 latency             └─ 3 patent-eligible algorithms   │
│                                                                      │
│  📚 Open Source & Community        🎓 Career Development             │
│  ├─ 40+ public repositories       ├─ 6+ cloud certifications       │
│  ├─ 2.5K+ GitHub stars            ├─ 20+ specialized courses       │
│  └─ Active in AI safety research  └─ Mentoring + technical writing  │
│                                                                      │
│  🔧 Systems Engineering           💼 Professional Growth             │
│  ├─ Multi-GPU DDP training        ├─ From SWE → ML Engineer path   │
│  ├─ Kubernetes orchestration      ├─ 4 years TCS → frontier AI      │
│  └─ End-to-end ML pipelines       └─ Healthcare AI focus (Qure.ai) │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘

🎯 What I'm Currently Working On

  • 🔭 Optimizing LLM inference pipelines for sub-50ms latency at scale
  • 🌱 Advanced KV-cache management and speculative decoding techniques
  • 👯 Multi-agent orchestration for real-world problem-solving workflows
  • 🏥 Clinical AI automation with FHIR/EPIC integrations at Qure.ai
  • 💬 AI safety & evaluation frameworks for responsible model deployment

🌍 Why I'm Unique

  1. Full-Stack Expertise: From CUDA kernels to multi-agent orchestration
  2. Published Researcher: IEEE ICC 2026 paper with measurable real-world impact
  3. Production-Tested: Healthcare AI, cloud infrastructure, enterprise systems
  4. GPU Specialist: Achieved 12.3× improvements through systematic optimization
  5. Hands-On Infrastructure: Built Kubernetes clusters, designed ML pipelines, deployed at scale
  6. Open Source Leader: 40+ repos, active in AI safety and benchmarking communities

📫 Let's Connect & Collaborate

I'm actively seeking opportunities in ML Engineering, Deep Learning, LLM/GenAI, and Cloud Architecture roles. Whether you're building frontier AI systems, scaling ML infrastructure, or advancing AI safety—let's talk!

Reach out for:

  • 🔍 Technical collaboration on ML/AI projects
  • 💼 ML Engineering & LLM Engineer opportunities
  • 🎓 Mentorship in LLM optimization & RAG systems
  • 🚀 Open-source contributions & research partnerships

Email LinkedIn Google Scholar Portfolio


⭐ If you find my projects useful, consider giving them a star and sharing with others building the future of AI!

Pinned Loading

  1. Collaborative_filtering_recommender_system Collaborative_filtering_recommender_system Public

    A hybrid product recommendation system leveraging user-based, item-based, and SVD filtering, deployed with Streamlit for interactive UI.

    Python