Ask complex financial research questions about any public company, answered from real SEC filings with full source attribution.
Quick Start · Architecture · Query Types · Docs
A modular RAG pipeline purpose-built for SEC financial filings. It fetches 10-K and 10-Q documents directly from SEC EDGAR, processes and indexes them with a hybrid semantic + keyword retrieval engine, and answers natural language questions with every claim traced back to its exact filing and section.
Built for financial analysts, quant researchers, and developers who need reliable, cited answers from public company filings, not hallucinations.
- Hybrid retrieval – 70% semantic + 30% keyword search with financial domain boosting
- Query intent classification – automatically routes risk, financial, strategic, and governance queries to weighted sections
- Temporal intelligence – recent filings are boosted for time-sensitive questions
- Multi-company support – comparative queries across multiple tickers in one call
- Full source attribution – every answer cites the exact filing, section, and page
- Intelligent caching – SEC EDGAR downloads are cached by content hash to avoid re-fetching
- Benchmark suite – built-in evaluation framework with query category scoring
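The 70/30 hybrid weighting with domain boosting can be sketched roughly as follows. The weight constants, boost table, and function name are illustrative assumptions for this README, not the system's actual API:

```python
# Sketch of hybrid scoring: blend semantic and keyword similarity,
# then boost chunks containing financial domain terms.
# Weights and boost values here are illustrative, not the shipped defaults.
SEMANTIC_WEIGHT = 0.7
KEYWORD_WEIGHT = 0.3

FINANCIAL_TERM_BOOSTS = {
    "operating margin": 2.3,
    "risk factors": 2.5,
}

def hybrid_score(semantic_sim, keyword_sim, chunk_text):
    """Weighted blend of the two similarity signals, with domain boosts."""
    score = SEMANTIC_WEIGHT * semantic_sim + KEYWORD_WEIGHT * keyword_sim
    text = chunk_text.lower()
    for term, boost in FINANCIAL_TERM_BOOSTS.items():
        if term in text:
            score *= boost
    return score
```

A chunk that matches a boosted term can thus outrank one with a slightly higher raw similarity, which is the intended behavior for filing-specific language.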
```
Question
   │
   ▼
QueryProcessor   → Intent classification, ticker/date/type extraction
   │
   ▼
HybridRetriever  → Semantic (ChromaDB) + Keyword (BM25) + Domain Boosts
   │   ├── Financial term boost (2x–3x)
   │   ├── Section-aware scoring
   │   └── Recency boost for temporal queries
   ▼
AnswerGenerator  → OpenAI GPT with retrieved context + citation formatting
   │
   ▼
Answer + Sources → Cited response with filing, section, and date metadata
```
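The intent-classification step at the top of the pipeline can be approximated with simple keyword voting. The category names match the ones listed above, but the keyword lists and fallback label are assumptions for illustration, not the actual classifier:

```python
import re

# Illustrative keyword lists per intent category; the real QueryProcessor
# may use a richer model.
INTENT_KEYWORDS = {
    "risk": ["risk", "threat", "cybersecurity", "litigation"],
    "financial": ["revenue", "margin", "earnings", "dividend"],
    "strategic": ["strategy", "investment", "acquisition", "competition"],
    "governance": ["board", "executive", "compensation"],
}

def classify_intent(query):
    """Pick the intent whose keywords overlap the query the most."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    scores = {
        intent: len(words & set(kws))
        for intent, kws in INTENT_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "general"
```

The returned label would then select the section-weight table used during retrieval.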
| Module | Responsibility |
|---|---|
| `SECDataFetcher` | Fetches 10-K / 10-Q filings from SEC EDGAR with caching |
| `DocumentProcessor` | Chunks filings while preserving section structure |
| `QueryProcessor` | Parses intent, tickers, dates, and filing types from queries |
| `VectorStore` | Manages semantic embeddings via ChromaDB |
| `HybridRetriever` | Combines semantic + keyword search with financial domain scoring |
| `AnswerGenerator` | Synthesizes cited answers via OpenAI GPT |
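Section-preserving chunking, as performed by `DocumentProcessor`, can be sketched as below. The function name, input shape (pre-split `(section_title, text)` pairs), and chunk size are assumptions, not the module's real interface:

```python
def chunk_filing(sections, max_words=200):
    """Split each (section_title, text) pair into word-bounded chunks,
    tagging every chunk with its source section so citations survive."""
    chunks = []
    for title, text in sections:
        words = text.split()
        for i in range(0, len(words), max_words):
            chunks.append({
                "section": title,
                "text": " ".join(words[i:i + max_words]),
            })
    return chunks
```

Keeping the section title on every chunk is what lets answers cite the exact filing section later in the pipeline.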
| Type | Example |
|---|---|
| Single company | "What are Apple's main risk factors?" |
| Comparative | "Compare Apple and Microsoft's R&D spending" |
| Temporal / trend | "How has Amazon's revenue guidance changed over time?" |
| Multi-dimensional | "Apple's 2022 10-K cybersecurity risks" |
| Strategic | "How are major tech companies approaching AI investment?" |
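Comparative queries like the examples above require mapping company names to tickers. A minimal sketch, assuming a static name-to-ticker table (the real `QueryProcessor` likely resolves names differently):

```python
# Hypothetical lookup table; the actual system's mapping may be larger
# or resolved via SEC EDGAR's company index.
COMPANY_TO_TICKER = {
    "apple": "AAPL",
    "microsoft": "MSFT",
    "amazon": "AMZN",
    "google": "GOOGL",
}

def extract_tickers(query):
    """Return tickers for every known company name mentioned in the query."""
    q = query.lower()
    return [ticker for name, ticker in COMPANY_TO_TICKER.items() if name in q]
```

A comparative question then fans out into one retrieval pass per extracted ticker.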
- Python 3.8+
- OpenAI API key
```bash
git clone https://github.com/overcastbulb/sec-filings-qa-system.git
cd sec-filings-qa-system

python -m venv .venv
source .venv/bin/activate    # macOS/Linux
# .\.venv\Scripts\activate   # Windows

pip install -r requirements.txt
```

```bash
export OPENAI_API_KEY="your-openai-api-key"

# Optional
export SEC_DATA_DIR="./data"
export SEC_CACHE_DIR="./data/cache"
```

```bash
python sec_qa_system.py
```

```python
from sec_qa_system import SECQASystem

qa = SECQASystem()

# Ingest filings for one or more tickers
qa.ingest_company_data(['AAPL', 'MSFT', 'GOOGL'])

# Ask a question
result = qa.query("What are Apple's primary revenue drivers?")
print(result['answer'])
print(f"Sources: {len(result['sources'])} documents")
```

```
Query: "What are the primary revenue drivers for major technology companies?"

Answer:
Technology companies demonstrate diverse revenue streams including subscription
services, cloud computing, advertising, and hardware sales...

Sources: 8 documents across 4 companies
Coverage: 3 companies, 2 filing types (10-K, 10-Q)
Response time: 3.2 seconds
```
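For temporal questions, recent filings receive a recency boost. One plausible form is exponential decay by filing age; the function name, half-life value, and decay shape are assumptions for illustration, not the system's actual formula:

```python
from datetime import date

def recency_boost(filing_date, today, half_life_days=365):
    """Exponential decay: a filing half_life_days old scores half as much
    as one filed today. half_life_days is an illustrative default."""
    age_days = (today - filing_date).days
    return 0.5 ** (age_days / half_life_days)
```

Multiplying retrieval scores by this factor only for temporally classified queries leaves point-in-time questions (e.g. about a specific fiscal year) unaffected.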
```python
# Increase relevance weight for specific terms
retriever.financial_terms.update({
    'artificial intelligence': 2.5,
    'cloud revenue': 2.7,
    'operating margin': 2.3
})
```

```python
retriever.section_weights['risk']['Cybersecurity'] = 2.8
retriever.section_weights['financial']['Revenue Recognition'] = 3.2
```

```python
from sec_qa_system import SECQASystem, SystemBenchmark

qa = SECQASystem()
benchmark = SystemBenchmark(qa)
results = benchmark.run_evaluation_suite()
print(results['evaluation_summary'])
```

```
sec-filings-qa-system/
├── sec_qa_system.py        # Main pipeline: ingest, query, answer
├── hybrid_retrieval.py     # Hybrid retrieval engine
├── requirements.txt
├── docs/
│   ├── architecture_overview.md
│   ├── technical_documentation.md
│   ├── setup_guide.md
│   └── evaluation_results.md
├── examples/
│   ├── sample_queries.py
│   ├── benchmark_tests.py
│   └── demo_results.md
├── tests/
│   ├── test_query_processor.py
│   └── test_retrieval.py
└── data/                   # Filing cache (gitignored)
```
| Metric | Value |
|---|---|
| Avg. response time | 2β5 seconds |
| Documents retrieved per query | 5β15 |
| Indexed chunk capacity | 100K+ chunks |
| Memory per 1,000 chunks | ~100 MB |
- Streamlit UI for non-developer users
- Support for 8-K filings (earnings releases, material events)
- Local LLM mode (Ollama / LLaMA) – no OpenAI dependency
- Docker setup for one-command deployment
- Async batch ingestion for large ticker lists
```bash
git checkout -b feature/your-feature
git commit -m "feat: describe your change"
git push origin feature/your-feature
# Open a Pull Request
```

This project is licensed under the MIT License.
Report a Bug · Technical Docs · Architecture