LLM-powered pipeline for automated GHG Protocol emissions factor classification, retrieval, and matching: turning unstructured procurement data into audit-ready carbon accounting.
emissions-factor-llm is a production-grade NLP pipeline that automates the most labor-intensive step in Scope 3 carbon accounting: matching raw procurement line items to the correct GHG Protocol emission factors.
Manual emissions factor lookup is the primary bottleneck in enterprise carbon accounting. Sustainability analysts must interpret free-text purchase descriptions, map them to industry classification codes, query multiple emission factor databases, and select the most contextually appropriate factor. This takes 3–8 minutes per line item and must be repeated across millions of transactions.
This pipeline reduces that to sub-second automated classification with 94%+ accuracy, using a Retrieval-Augmented Generation (RAG) architecture combining dense vector retrieval with LLM reasoning.
Key capabilities:
- Zero-shot and few-shot classification of procurement line items to GHG Protocol Scope 3 categories
- Multi-database retrieval across EXIOBASE, ecoinvent, EPA EEIO, and GHG Protocol factor libraries
- Confidence scoring with human-in-the-loop escalation for low-confidence matches
- Audit trail generation with source citation and factor selection rationale
- REST API for integration with ERP, procurement, and sustainability platforms
Emissions Factor LLM: RAG pipeline architecture

- Input: free-form procurement text, for example "500 units Phosphoric acid, 85%, industrial grade, China"
- Preprocessing: NER, unit extraction and normalization, country tagging, UNSPSC inference
- Query Encoder: BGE-Large-EN (1024-dim)
- Vector Store: ChromaDB / FAISS, holding 500K+ emission factors
  - Sources: EXIOBASE, ecoinvent, EPA EEIO, GHG Protocol
  - Metadata filters: country, NACE code, scope
  - Output: top-K candidate factors
- LLM Reasoning Layer: GPT-4o / Claude 3.5 selects a factor and explains the choice
- Confidence routing:
  - High (> 0.92): auto-accept
  - Medium (0.75–0.92): flag for review
  - Low (< 0.75): human-in-the-loop
A Fortune 500 company with $10B+ in annual procurement may have 2–5 million purchase order line items per year. Each must be mapped to an emission factor to compute Scope 3 Category 1 emissions.
| Metric | Manual Process | LLM Pipeline |
|---|---|---|
| Time per line item | 3–8 minutes | < 0.5 seconds |
| Annual throughput (1 analyst) | ~15,000 line items | Unlimited |
| Accuracy | 78–85% (expert review) | 92–96% (benchmarked) |
| Audit trail | Inconsistent | Automated, standardized |
| Database coverage | 1–2 databases | 5+ databases simultaneously |
| Uncertainty quantification | None | Confidence intervals per match |

> "If you can't match emission factors at the speed of procurement, your Scope 3 inventory is always a year behind your supply chain reality."
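To make the scale of the comparison concrete, the figures above can be turned into a rough workload estimate. This is a back-of-the-envelope sketch, not a benchmark: it assumes items are processed one after another, and it treats the "< 0.5 seconds" figure as an upper bound per item. The function names are illustrative.

```python
# Figures taken from the comparison above.
ITEMS_PER_YEAR = (2_000_000, 5_000_000)   # purchase order line items
MINUTES_PER_ITEM = (3, 8)                 # manual lookup time
ITEMS_PER_ANALYST_YEAR = 15_000           # manual annual throughput
PIPELINE_SECONDS_PER_ITEM = 0.5           # upper bound, assumed sequential


def manual_hours(items: int, minutes_per_item: float) -> float:
    """Analyst-hours needed to match every item by hand."""
    return items * minutes_per_item / 60


def analyst_years(items: int) -> float:
    """Number of analysts needed to clear the backlog in one year."""
    return items / ITEMS_PER_ANALYST_YEAR


def pipeline_hours(items: int) -> float:
    """Wall-clock hours for the pipeline with no parallelism."""
    return items * PIPELINE_SECONDS_PER_ITEM / 3600


if __name__ == "__main__":
    low, high = ITEMS_PER_YEAR
    print(f"Manual hours: {manual_hours(low, MINUTES_PER_ITEM[0]):,.0f}"
          f" to {manual_hours(high, MINUTES_PER_ITEM[1]):,.0f}")
    print(f"Analysts needed: {analyst_years(low):,.0f} to {analyst_years(high):,.0f}")
    print(f"Pipeline hours: {pipeline_hours(low):,.0f} to {pipeline_hours(high):,.0f}")
```

Under these assumptions the manual process needs roughly 100,000 to 667,000 analyst-hours per year, while a single sequential pipeline worker needs a few hundred hours at most.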
Stage 1: Intelligent Preprocessing

Raw procurement text is parsed to extract chemical names, quantities, units, supplier country, and commodity classification. A fine-tuned NER model identifies substance names and resolves synonyms (e.g., "MEK" → "Methyl Ethyl Ketone" → CAS 78-93-3).
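The synonym resolution and unit extraction described here can be illustrated with a small rule-based stand-in. This is only a sketch: the real stage uses a fine-tuned NER model, whereas the lookup table, regular expression, and names below (`SYNONYMS`, `ParsedItem`, `parse_line_item`) are hypothetical and cover a single substance.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical synonym table: abbreviation -> (canonical name, CAS number).
# A production system would rely on the NER model and a full dictionary.
SYNONYMS = {
    "mek": ("Methyl Ethyl Ketone", "78-93-3"),
}

# A number followed by a short unit token, e.g. "500 kg" or "2.5 t".
QUANTITY_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(kg|g|t|l|units?)\b", re.IGNORECASE)

# Conversion factors to kilograms; non-mass units are left unconverted.
TO_KG = {"kg": 1.0, "g": 0.001, "t": 1000.0}


@dataclass
class ParsedItem:
    substance: Optional[str]
    cas_number: Optional[str]
    quantity: Optional[float]
    unit: Optional[str]
    quantity_kg: Optional[float]


def parse_line_item(text: str) -> ParsedItem:
    """Extract a known substance and the first quantity from free text."""
    substance = cas_number = None
    for token in re.findall(r"[A-Za-z]+", text):
        hit = SYNONYMS.get(token.lower())
        if hit:
            substance, cas_number = hit
            break

    quantity = unit = quantity_kg = None
    match = QUANTITY_RE.search(text)
    if match:
        quantity = float(match.group(1))
        unit = match.group(2).lower()
        factor = TO_KG.get(unit)
        if factor is not None:
            quantity_kg = quantity * factor

    return ParsedItem(substance, cas_number, quantity, unit, quantity_kg)


if __name__ == "__main__":
    print(parse_line_item("200 kg MEK solvent, drum"))
```

Normalizing mass to kilograms at this point keeps later stages free of unit handling, since factors such as `emission_factor_kgco2e_per_kg` are expressed per kilogram.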
Stage 2: Multi-Database Vector Retrieval

The processed query is encoded using `BAAI/bge-large-en-v1.5` and matched against a pre-indexed ChromaDB vector store containing 500,000+ emission factors. Metadata filters narrow results by geography, scope category, and industry.

Stage 3: LLM-Powered Factor Selection

The top-K retrieved candidates are passed to GPT-4o with a carefully engineered prompt that asks the model to select the best match, explain the selection reasoning, assign a confidence score, and flag any uncertainty.
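The retrieval and hand-off steps in Stages 2 and 3 can be sketched without any external services. The example below is a simplified illustration, not the project's implementation: it replaces ChromaDB with an in-memory list, uses toy 3-dimensional vectors instead of real embeddings, and the record fields, `top_k`, and `build_prompt` are assumed names. The prompt wording is likewise a placeholder.

```python
import math
from typing import Any, Dict, List, Sequence, Tuple

Factor = Dict[str, Any]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float], factors: List[Factor], k: int = 3,
          **filters: Any) -> List[Tuple[float, Factor]]:
    """Apply exact-match metadata filters, then rank by similarity."""
    candidates = [f for f in factors
                  if all(f.get(key) == value for key, value in filters.items())]
    scored = [(cosine(query_vec, f["vector"]), f) for f in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_prompt(description: str, ranked: List[Tuple[float, Factor]]) -> str:
    """Format the candidates into a selection prompt for the LLM."""
    lines = [f"Procurement line item: {description}",
             "Candidate emission factors:"]
    for i, (score, f) in enumerate(ranked, start=1):
        lines.append(f"{i}. {f['name']} ({f['database']}, {f['country']}), "
                     f"similarity {score:.2f}")
    lines.append("Select the best match, explain the reasoning, assign a "
                 "confidence score between 0 and 1, and flag any uncertainty.")
    return "\n".join(lines)


# Toy index; the vectors are made up for illustration.
FACTORS = [
    {"name": "phosphoric acid production", "database": "ecoinvent",
     "country": "CN", "vector": [0.9, 0.1, 0.0]},
    {"name": "sulfuric acid production", "database": "ecoinvent",
     "country": "CN", "vector": [0.6, 0.4, 0.0]},
    {"name": "phosphoric acid production", "database": "ecoinvent",
     "country": "US", "vector": [0.9, 0.1, 0.0]},
    {"name": "road freight", "database": "GLEC",
     "country": "CN", "vector": [0.0, 0.0, 1.0]},
]

if __name__ == "__main__":
    ranked = top_k([1.0, 0.0, 0.0], FACTORS, k=2, country="CN")
    print(build_prompt("500 kg Phosphoric acid 85% industrial grade", ranked))
```

Filtering before scoring mirrors the role of metadata filters in the pipeline: a factor from the wrong geography is never ranked, however similar its text.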
Stage 4: Confidence Routing and Audit Trail

High-confidence matches are auto-committed; medium-confidence results are queued for analyst review; low-confidence items escalate to specialist review. All decisions generate an immutable audit log.
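A minimal sketch of the routing and logging logic follows, using the thresholds from the architecture overview (high above 0.92, medium from 0.75 to 0.92, low below 0.75). The label `auto_accept` appears in the API response; `flag_review` and `human_loop` are assumed names. The hash-chained log is one possible way to make entries tamper-evident and is an assumption here, not a description of how the project stores its audit trail.

```python
import hashlib
import json
from typing import Any, Dict, List

HIGH_THRESHOLD = 0.92  # above this: auto-accept
LOW_THRESHOLD = 0.75   # below this: human-in-the-loop
GENESIS_HASH = "0" * 64


def route(confidence: float) -> str:
    """Map a confidence score to a routing label."""
    if confidence > HIGH_THRESHOLD:
        return "auto_accept"
    if confidence >= LOW_THRESHOLD:
        return "flag_review"
    return "human_loop"


def _entry_hash(prev_hash: str, decision: Dict[str, Any]) -> str:
    """Hash the previous entry's hash together with this decision."""
    payload = json.dumps(decision, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_audit_entry(log: List[Dict[str, Any]],
                       decision: Dict[str, Any]) -> Dict[str, Any]:
    """Append a decision, chaining it to the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS_HASH
    entry = {"decision": decision, "prev_hash": prev_hash,
             "hash": _entry_hash(prev_hash, decision)}
    log.append(entry)
    return entry


def verify_log(log: List[Dict[str, Any]]) -> bool:
    """Recompute every hash; any edited entry breaks the chain."""
    prev_hash = GENESIS_HASH
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(prev_hash, entry["decision"]):
            return False
        prev_hash = entry["hash"]
    return True


if __name__ == "__main__":
    log: List[Dict[str, Any]] = []
    for item_id, confidence in [("PO-1", 0.94), ("PO-2", 0.80), ("PO-3", 0.40)]:
        append_audit_entry(log, {"item": item_id, "confidence": confidence,
                                 "routing": route(confidence)})
    print([e["decision"]["routing"] for e in log], verify_log(log))
```

A score of exactly 0.92 falls into the medium band in this sketch, since the high band is defined as strictly greater than 0.92.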
| Requirement | Version |
|---|---|
| Python | 3.10+ |
| OpenAI API Key | GPT-4o access |
| RAM | 8 GB (16 GB for local embeddings) |
| Storage | 10 GB (vector store + databases) |

```bash
git clone https://github.com/virbahu/emissions-factor-llm.git
cd emissions-factor-llm
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Build vector store from emission factor databases
python scripts/build_vector_store.py \
  --databases exiobase3 ecoinvent38 epa_eeio ghg_protocol \
  --embedding-model BAAI/bge-large-en-v1.5 \
  --output data/vector_store/

# Start the API server
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
```

Single-item request:

```python
import httpx

# Send one procurement line item to the matching endpoint
response = httpx.post("http://localhost:8000/api/v1/match", json={
    "description": "500 kg Phosphoric acid 85% industrial grade",
    "supplier_country": "CN",
    "spend_usd": 12500.0,
    "year": 2025,
    "scope3_category": 1
})
print(response.json())
```

Response:

```json
{
  "matched_factor": {
    "database": "ecoinvent_3.8",
    "process_name": "phosphoric acid production, wet process | RoW",
    "emission_factor_kgco2e_per_kg": 1.847,
    "uncertainty_pct": 12.3,
    "scope3_category": 1
  },
  "confidence_score": 0.94,
  "routing": "auto_accept",
  "total_scope3_kgco2e": 923.5,
  "processing_time_ms": 287
}
```

Batch processing:

```python
from pipeline.batch_processor import EmissionFactorBatchProcessor

processor = EmissionFactorBatchProcessor(
    model="gpt-4o",
    embedding_model="BAAI/bge-large-en-v1.5",
    confidence_threshold=0.85
)

# Match every row of the input CSV and write the results to a new file
results = processor.process_csv(
    input_path="data/purchase_orders_2025.csv",
    output_path="data/scope3_matched_2025.csv",
    batch_size=100
)

print(f"Processed: {results.total_items:,} items")
print(f"Auto-accepted: {results.auto_accepted:,} ({results.auto_accepted_pct:.1f}%)")
print(f"Total Scope 3 Cat 1: {results.total_scope3_tco2e:,.1f} tCO2e")
```
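The numbers in the single-item response can be checked by hand: 500 kg multiplied by 1.847 kgCO2e/kg gives 923.5 kgCO2e. The sketch below reproduces that arithmetic and derives a range from `uncertainty_pct`. It assumes the uncertainty is a symmetric relative bound, which the response format does not state, and the helper names are illustrative.

```python
import math
from typing import Tuple


def total_emissions_kgco2e(quantity_kg: float, factor_kgco2e_per_kg: float) -> float:
    """Activity data times emission factor."""
    return quantity_kg * factor_kgco2e_per_kg


def uncertainty_bounds(total: float, uncertainty_pct: float) -> Tuple[float, float]:
    """Lower and upper bound, assuming a symmetric relative uncertainty."""
    delta = total * uncertainty_pct / 100.0
    return total - delta, total + delta


# Values copied from the example request and response.
quantity_kg = 500.0
response = {
    "matched_factor": {
        "emission_factor_kgco2e_per_kg": 1.847,
        "uncertainty_pct": 12.3,
    },
    "total_scope3_kgco2e": 923.5,
}

factor = response["matched_factor"]
expected_total = total_emissions_kgco2e(
    quantity_kg, factor["emission_factor_kgco2e_per_kg"])

# Compare with a tolerance rather than exact float equality.
consistent = math.isclose(expected_total, response["total_scope3_kgco2e"],
                          rel_tol=1e-6)
low, high = uncertainty_bounds(response["total_scope3_kgco2e"],
                               factor["uncertainty_pct"])

if __name__ == "__main__":
    print(f"Consistent: {consistent}")
    print(f"Range: {low:.1f} to {high:.1f} kgCO2e")
```

A client-side check of this kind is a cheap way to catch unit mismatches, for example a quantity supplied in tonnes against a per-kilogram factor.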
```toml
[tool.poetry.dependencies]
python = "^3.10"
transformers = "^4.40"
sentence-transformers = "^3.0"
openai = "^1.30"
langchain = "^0.2"
langchain-community = "^0.2"
chromadb = "^0.5"
fastapi = "^0.110"
uvicorn = "^0.29"
pandas = "^2.0"
numpy = "^1.26"
pydantic = "^2.0"
httpx = "^0.27"
```
| Database | Factors | Geography | Version |
|---|---|---|---|
| ecoinvent | 18,000+ | Global, regionalized | 3.8 |
| EXIOBASE | 7,987 products × 44 countries | Multi-regional IO | 3.8 |
| EPA EEIO | 389 sectors | US-specific | 2.0.1 |
| GHG Protocol | 300+ | Global averages | 2024 Q1 |
| GLEC Framework | 180+ transport | Global | 2023 |
Virbahu Jain, Founder & CEO, Quantisage
Building the AI Operating System for Scope 3 emissions management and supply chain decarbonization.
| Education | MBA, Kellogg School of Management, Northwestern University |
| Experience | 20+ years across manufacturing, life sciences, energy & public sector |
| Scope | Supply chain operations on five continents |
| Research | Peer-reviewed publications on AI in sustainable supply chains |
| Patents | IoT and AI solutions for manufacturing and logistics |
MIT License. See LICENSE for details.