
📊 SEC Filings QA System

Ask complex financial research questions about any public company, answered from real SEC filings with full source attribution.

Python · ChromaDB · OpenAI · SEC EDGAR · License: MIT · Code style: black

⚡ Quick Start · 🏗️ Architecture · 📋 Query Types · 📁 Docs


🌟 Overview

A modular RAG pipeline purpose-built for SEC financial filings. It fetches 10-K and 10-Q documents directly from SEC EDGAR, processes and indexes them with a hybrid semantic + keyword retrieval engine, and answers natural language questions with every claim traced back to its exact filing and section.

Built for financial analysts, quant researchers, and developers who need reliable, cited answers from public company filings, not hallucinations.


✨ Key Features

  • 🔍 Hybrid retrieval: 70% semantic + 30% keyword search with financial domain boosting
  • 🧠 Query intent classification: automatically routes risk, financial, strategic, and governance queries to weighted sections
  • 📅 Temporal intelligence: recent filings boosted for time-sensitive questions
  • 🏢 Multi-company support: comparative queries across multiple tickers in one call
  • 📌 Full source attribution: every answer cites the exact filing, section, and page
  • ⚡ Intelligent caching: SEC EDGAR downloads cached by content hash to avoid re-fetching
  • 🧪 Benchmark suite: built-in evaluation framework with query category scoring
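The 70% semantic + 30% keyword split can be read as a weighted sum of normalized scores. The sketch below is one plausible reading, not the repository's code: `fuse_scores`, `min_max`, the min-max normalization step, and the sample score dictionaries are all illustrative assumptions.

```python
from typing import Dict


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so the two retrievers are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every chunk as equally relevant.
        return {chunk_id: 1.0 for chunk_id in scores}
    return {chunk_id: (s - lo) / (hi - lo) for chunk_id, s in scores.items()}


def fuse_scores(
    semantic: Dict[str, float],
    keyword: Dict[str, float],
    w_semantic: float = 0.7,  # weights taken from the feature list
    w_keyword: float = 0.3,
) -> Dict[str, float]:
    """Weighted sum of normalized semantic and keyword scores per chunk."""
    sem, kw = min_max(semantic), min_max(keyword)
    # A chunk missing from one retriever contributes 0 from that side.
    return {
        chunk_id: w_semantic * sem.get(chunk_id, 0.0) + w_keyword * kw.get(chunk_id, 0.0)
        for chunk_id in set(sem) | set(kw)
    }


# Hypothetical scores: cosine similarities and BM25 scores for three chunks.
semantic = {"c1": 0.82, "c2": 0.40, "c3": 0.61}
keyword = {"c2": 7.5, "c3": 2.5}
fused = fuse_scores(semantic, keyword)
ranked = sorted(fused, key=fused.get, reverse=True)
```

Normalizing before mixing matters because BM25 scores are unbounded while similarity scores usually sit in a fixed range; without it the 70/30 weights would not mean what they appear to mean.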

πŸ—οΈ Architecture

Question
    │
    ▼
QueryProcessor          ← Intent classification, ticker/date/type extraction
    │
    ▼
HybridRetriever         ← Semantic (ChromaDB) + Keyword (BM25) + Domain Boosts
    │                      └── Financial term boost (2x–3x)
    │                      └── Section-aware scoring
    │                      └── Recency boost for temporal queries
    ▼
AnswerGenerator         ← OpenAI GPT with retrieved context + citation formatting
    │
    ▼
Answer + Sources        ← Cited response with filing, section, and date metadata
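The diagram names a recency boost for temporal queries but does not give a formula. One common way to implement such a boost is exponential decay by filing age, sketched here. The function name `recency_boost`, the one-year half-life, and the 1.5x ceiling are assumptions for illustration only.

```python
from datetime import date


def recency_boost(
    filing_date: date,
    today: date,
    half_life_days: float = 365.0,  # assumed: boost halves every year
    max_boost: float = 1.5,         # assumed ceiling for a same-day filing
) -> float:
    """Return a multiplier in (1.0, max_boost] that decays with filing age."""
    age_days = max((today - filing_date).days, 0)  # clamp future-dated filings
    decay = 0.5 ** (age_days / half_life_days)
    return 1.0 + (max_boost - 1.0) * decay


today = date(2024, 1, 1)
fresh = recency_boost(date(2024, 1, 1), today)     # filed today
year_old = recency_boost(date(2023, 1, 1), today)  # exactly one half-life old
```

Because the multiplier never drops below 1.0, older filings are not penalized, only outranked by newer ones when the query is time-sensitive.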

Core Modules

| Module | Responsibility |
| --- | --- |
| SECDataFetcher | Fetches 10-K / 10-Q filings from SEC EDGAR with caching |
| DocumentProcessor | Chunks filings while preserving section structure |
| QueryProcessor | Parses intent, tickers, dates, and filing types from queries |
| VectorStore | Manages semantic embeddings via ChromaDB |
| HybridRetriever | Combines semantic + keyword search with financial domain scoring |
| AnswerGenerator | Synthesizes cited answers via OpenAI GPT |
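To make "chunks filings while preserving section structure" concrete, here is a minimal sketch of section-aware chunking. It is not the project's `DocumentProcessor`: the heading regex, the `chunk_by_section` name, and the word-count chunk size are assumptions chosen to keep the example short.

```python
import re
from typing import Dict, List

# Matches 10-K style headings such as "Item 1A. Risk Factors" at a line start.
ITEM_HEADING = re.compile(
    r"^(Item\s+\d+[A-Z]?\.)\s*(.*)$", re.IGNORECASE | re.MULTILINE
)


def chunk_by_section(text: str, max_words: int = 200) -> List[Dict[str, str]]:
    """Split text at Item headings, then cut each section into word windows.

    Every chunk carries the heading of the section it came from, so a
    retriever can weight or cite by section. Text before the first
    heading is ignored in this sketch.
    """
    matches = list(ITEM_HEADING.finditer(text))
    chunks: List[Dict[str, str]] = []
    for idx, match in enumerate(matches):
        # A section body runs until the next heading or the end of the text.
        end = matches[idx + 1].start() if idx + 1 < len(matches) else len(text)
        section = f"{match.group(1)} {match.group(2)}".strip()
        words = text[match.end():end].split()
        for start in range(0, len(words), max_words):
            chunks.append(
                {"section": section, "text": " ".join(words[start:start + max_words])}
            )
    return chunks


sample = """Item 1. Business
We design and sell devices.
Item 1A. Risk Factors
Supply chain disruption could hurt results.
"""
chunks = chunk_by_section(sample, max_words=4)
```

Chunks never cross a section boundary here, which is what lets each retrieved passage be attributed to a single section.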

📋 Supported Query Types

| Type | Example |
| --- | --- |
| Single company | "What are Apple's main risk factors?" |
| Comparative | "Compare Apple and Microsoft's R&D spending" |
| Temporal / trend | "How has Amazon's revenue guidance changed over time?" |
| Multi-dimensional | "Apple's 2022 10-K cybersecurity risks" |
| Strategic | "How are major tech companies approaching AI investment?" |
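The query types above all depend on pulling an intent, tickers, years, and filing types out of free text. The following is a rough keyword-based sketch of that parsing step, not the actual `QueryProcessor`. The keyword lists, the `COMPANY_TICKERS` lookup, and the `parse_query` name are hypothetical; only the four intent labels come from this README.

```python
import re
from typing import Dict, List

# Hypothetical keyword lists for the four intents named in the README.
INTENT_KEYWORDS: Dict[str, List[str]] = {
    "risk": ["risk", "cybersecurity", "litigation"],
    "financial": ["revenue", "margin", "spending", "guidance"],
    "strategic": ["strategy", "investment", "acquisition"],
    "governance": ["board", "compensation", "governance"],
}
# Hypothetical name-to-ticker lookup; a real system would need a full mapping.
COMPANY_TICKERS = {"apple": "AAPL", "microsoft": "MSFT", "amazon": "AMZN"}


def parse_query(query: str) -> dict:
    """Extract intent, tickers, years, and filing types from a question."""
    lowered = query.lower()
    # Count keyword hits per intent and keep the best-scoring one.
    hits = {
        intent: sum(kw in lowered for kw in keywords)
        for intent, keywords in INTENT_KEYWORDS.items()
    }
    best = max(hits, key=hits.get)
    tickers = [t for name, t in COMPANY_TICKERS.items() if name in lowered]
    return {
        "intent": best if hits[best] > 0 else "general",
        "tickers": tickers,
        "years": [int(y) for y in re.findall(r"\b(?:19|20)\d{2}\b", query)],
        "filing_types": [
            f.upper() for f in re.findall(r"\b10-[KQ]\b", query, flags=re.IGNORECASE)
        ],
        "comparative": len(tickers) > 1,
    }


multi = parse_query("Apple's 2022 10-K cybersecurity risks")
compare = parse_query("Compare Apple and Microsoft's R&D spending")
```

A production classifier would likely need something more robust than substring matching, but the output shape shows what the retriever needs downstream: filters (ticker, year, filing type) plus an intent to select section weights.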

⚡ Quick Start

Prerequisites

  • Python 3.8+
  • OpenAI API key

1. Clone and install

git clone https://github.com/overcastbulb/sec-filings-qa-system.git
cd sec-filings-qa-system

python -m venv .venv
source .venv/bin/activate   # macOS/Linux
# .\.venv\Scripts\activate  # Windows

pip install -r requirements.txt

2. Set environment variables

export OPENAI_API_KEY="your-openai-api-key"

# Optional
export SEC_DATA_DIR="./data"
export SEC_CACHE_DIR="./data/cache"
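For reference, this is how the same variables could be read from Python. It is a sketch only: `load_settings` is a hypothetical helper, and treating the example paths above as defaults is an assumption rather than documented behavior.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the three variables, failing fast if the API key is missing."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    # Assumed defaults: the example paths shown in the export commands.
    data_dir = Path(env.get("SEC_DATA_DIR", "./data"))
    cache_dir = Path(env.get("SEC_CACHE_DIR", str(data_dir / "cache")))
    return {"api_key": api_key, "data_dir": data_dir, "cache_dir": cache_dir}


# Passing a plain dict instead of os.environ keeps the example side-effect free.
settings = load_settings({"OPENAI_API_KEY": "your-openai-api-key"})
```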

3. Run the interactive demo

python sec_qa_system.py

4. Use as a library

from sec_qa_system import SECQASystem

qa = SECQASystem()

# Ingest filings for one or more tickers
qa.ingest_company_data(['AAPL', 'MSFT', 'GOOGL'])

# Ask a question
result = qa.query("What are Apple's primary revenue drivers?")
print(result['answer'])
print(f"Sources: {len(result['sources'])} documents")

📊 Sample Output

Query: "What are the primary revenue drivers for major technology companies?"

Answer:
Technology companies demonstrate diverse revenue streams including subscription
services, cloud computing, advertising, and hardware sales...

Sources: 8 documents across 4 companies
Coverage: 3 companies, 2 filing types (10-K, 10-Q)
Response time: 3.2 seconds

🔧 Advanced Configuration

Tune financial term boosts

# Increase relevance weight for specific terms
# (`retriever` is assumed here to be the HybridRetriever instance in use)
retriever.financial_terms.update({
    'artificial intelligence': 2.5,
    'cloud revenue': 2.7,
    'operating margin': 2.3
})
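How these multipliers feed into ranking is not spelled out here. One simple possibility, shown as an assumption, is to scale a chunk's base score by the largest boost among the terms it contains. `apply_term_boost` is a hypothetical helper, not part of the library.

```python
from typing import Dict


def apply_term_boost(
    base_score: float, chunk_text: str, financial_terms: Dict[str, float]
) -> float:
    """Scale a score by the strongest matching term boost (1.0 if none match)."""
    text = chunk_text.lower()
    matched = [boost for term, boost in financial_terms.items() if term in text]
    # Taking the max, rather than multiplying all matches together, keeps
    # the boost bounded by the largest configured multiplier.
    return base_score * max(matched, default=1.0)


terms = {"artificial intelligence": 2.5, "cloud revenue": 2.7, "operating margin": 2.3}
boosted = apply_term_boost(0.4, "Cloud revenue grew while operating margin held.", terms)
unboosted = apply_term_boost(0.4, "The company leases office space.", terms)
```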

Tune section weights per query type

retriever.section_weights['risk']['Cybersecurity'] = 2.8
retriever.section_weights['financial']['Revenue Recognition'] = 3.2
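The two assignments above suggest that `section_weights` is a nested mapping from query intent to section name to multiplier. Under that assumption, a lookup with a neutral fallback might look like the sketch below. `section_weight` and the 1.0 default are illustrative, not confirmed behavior.

```python
from typing import Dict


def section_weight(
    section_weights: Dict[str, Dict[str, float]],
    intent: str,
    section: str,
    default: float = 1.0,  # assumed neutral weight for unlisted sections
) -> float:
    """Look up the multiplier for a section under a given query intent."""
    return section_weights.get(intent, {}).get(section, default)


weights = {
    "risk": {"Cybersecurity": 2.8},
    "financial": {"Revenue Recognition": 3.2},
}
cyber = section_weight(weights, "risk", "Cybersecurity")
other = section_weight(weights, "risk", "Properties")
```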

Run the benchmark suite

from sec_qa_system import SECQASystem, SystemBenchmark

qa = SECQASystem()
benchmark = SystemBenchmark(qa)
results = benchmark.run_evaluation_suite()
print(results['evaluation_summary'])

📁 Project Structure

sec-filings-qa-system/
├── sec_qa_system.py         # Main pipeline: ingest, query, answer
├── hybrid_retrieval.py      # Hybrid retrieval engine
├── requirements.txt
├── docs/
│   ├── architecture_overview.md
│   ├── technical_documentation.md
│   ├── setup_guide.md
│   └── evaluation_results.md
├── examples/
│   ├── sample_queries.py
│   ├── benchmark_tests.py
│   └── demo_results.md
├── tests/
│   ├── test_query_processor.py
│   └── test_retrieval.py
└── data/                    # Filing cache (gitignored)

📈 Performance

| Metric | Value |
| --- | --- |
| Avg. response time | 2–5 seconds |
| Documents retrieved per query | 5–15 |
| Chunk capacity | 100K+ chunks, handled efficiently |
| Memory per 1,000 chunks | ~100 MB |
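Taken together, the memory and capacity figures imply a rough footprint for a large index. The arithmetic below assumes memory grows linearly with chunk count, which this README does not state, so treat the result as an order-of-magnitude estimate.

```python
MB_PER_1000_CHUNKS = 100  # "~100MB per 1,000 chunks" from the table above


def estimated_memory_gb(num_chunks: int) -> float:
    """Estimate memory in decimal GB, assuming linear scaling with chunk count."""
    return num_chunks / 1000 * MB_PER_1000_CHUNKS / 1000


small_index = estimated_memory_gb(1_000)    # 0.1 GB
large_index = estimated_memory_gb(100_000)  # 10.0 GB at the 100K-chunk mark
```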

🗺️ Roadmap

  • Streamlit UI for non-developer users
  • Support for 8-K filings (earnings releases, material events)
  • Local LLM mode (Ollama / LLaMA), no OpenAI dependency
  • Docker setup for one-command deployment
  • Async batch ingestion for large ticker lists

🤝 Contributing

git checkout -b feature/your-feature
git commit -m "feat: describe your change"
git push origin feature/your-feature
# Open a Pull Request

📄 License

This project is licensed under the MIT License.


Built for financial researchers who need cited, reliable answers from SEC filings.

Report a Bug · Technical Docs · Architecture
