Ask complex financial research questions about any public company, answered from real SEC filings with full source attribution.
Quick Start · Architecture · Query Types · Docs
A modular RAG pipeline purpose-built for SEC financial filings. It fetches 10-K and 10-Q documents directly from SEC EDGAR, processes and indexes them with a hybrid semantic + keyword retrieval engine, and answers natural language questions with every claim traced back to its exact filing and section.
Built for financial analysts, quant researchers, and developers who need reliable, cited answers from public company filings, not hallucinations.
- Hybrid retrieval – 70% semantic + 30% keyword search with financial domain boosting
- Query intent classification – automatically routes risk, financial, strategic, and governance queries to weighted sections
- Temporal intelligence – recent filings are boosted for time-sensitive questions
- Multi-company support – comparative queries across multiple tickers in one call
- Full source attribution – every answer cites the exact filing, section, and page
- Intelligent caching – SEC EDGAR downloads are cached by content hash to avoid re-fetching
- Benchmark suite – built-in evaluation framework with query category scoring
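The 70/30 hybrid weighting with domain boosting can be sketched roughly as follows. The weight constants, boost table, and function name are illustrative assumptions for this README, not the system's actual API:

```python
# Sketch of hybrid scoring: blend semantic and keyword similarity,
# then boost chunks containing financial domain terms.
# Weights and boost values here are illustrative, not the shipped defaults.
SEMANTIC_WEIGHT = 0.7
KEYWORD_WEIGHT = 0.3

FINANCIAL_TERM_BOOSTS = {
    "operating margin": 2.3,
    "risk factors": 2.5,
}

def hybrid_score(semantic_sim, keyword_sim, chunk_text):
    """Weighted blend of the two similarity signals, with domain boosts."""
    score = SEMANTIC_WEIGHT * semantic_sim + KEYWORD_WEIGHT * keyword_sim
    text = chunk_text.lower()
    for term, boost in FINANCIAL_TERM_BOOSTS.items():
        if term in text:
            score *= boost
    return score
```

A chunk that matches a boosted term can thus outrank one with a slightly higher raw similarity, which is the intended behavior for filing-specific language.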
```
Question
   │
   ▼
QueryProcessor   → Intent classification, ticker/date/type extraction
   │
   ▼
HybridRetriever  → Semantic (ChromaDB) + Keyword (BM25) + Domain Boosts
   │   ├── Financial term boost (2x–3x)
   │   ├── Section-aware scoring
   │   └── Recency boost for temporal queries
   ▼
AnswerGenerator  → OpenAI GPT with retrieved context + citation formatting
   │
   ▼
Answer + Sources → Cited response with filing, section, and date metadata
```
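The intent-classification step at the top of the pipeline can be approximated with simple keyword voting. The category names match the ones listed above, but the keyword lists and fallback label are assumptions for illustration, not the actual classifier:

```python
import re

# Illustrative keyword lists per intent category; the real QueryProcessor
# may use a richer model.
INTENT_KEYWORDS = {
    "risk": ["risk", "threat", "cybersecurity", "litigation"],
    "financial": ["revenue", "margin", "earnings", "dividend"],
    "strategic": ["strategy", "investment", "acquisition", "competition"],
    "governance": ["board", "executive", "compensation"],
}

def classify_intent(query):
    """Pick the intent whose keywords overlap the query the most."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    scores = {
        intent: len(words & set(kws))
        for intent, kws in INTENT_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "general"
```

The returned label would then select the section-weight table used during retrieval.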
| Module | Responsibility |
|---|---|
| `SECDataFetcher` | Fetches 10-K / 10-Q filings from SEC EDGAR with caching |
| `DocumentProcessor` | Chunks filings while preserving section structure |
| `QueryProcessor` | Parses intent, tickers, dates, and filing types from queries |
| `VectorStore` | Manages semantic embeddings via ChromaDB |
| `HybridRetriever` | Combines semantic + keyword search with financial domain scoring |
| `AnswerGenerator` | Synthesizes cited answers via OpenAI GPT |
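Section-preserving chunking, as performed by `DocumentProcessor`, can be sketched as below. The function name, input shape (pre-split `(section_title, text)` pairs), and chunk size are assumptions, not the module's real interface:

```python
def chunk_filing(sections, max_words=200):
    """Split each (section_title, text) pair into word-bounded chunks,
    tagging every chunk with its source section so citations survive."""
    chunks = []
    for title, text in sections:
        words = text.split()
        for i in range(0, len(words), max_words):
            chunks.append({
                "section": title,
                "text": " ".join(words[i:i + max_words]),
            })
    return chunks
```

Keeping the section title on every chunk is what lets answers cite the exact filing section later in the pipeline.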
| Type | Example |
|---|---|
| Single company | "What are Apple's main risk factors?" |
| Comparative | "Compare Apple and Microsoft's R&D spending" |
| Temporal / trend | "How has Amazon's revenue guidance changed over time?" |
| Multi-dimensional | "Apple's 2022 10-K cybersecurity risks" |
| Strategic | "How are major tech companies approaching AI investment?" |
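Comparative queries like the examples above require mapping company names to tickers. A minimal sketch, assuming a static name-to-ticker table (the real `QueryProcessor` likely resolves names differently):

```python
# Hypothetical lookup table; the actual system's mapping may be larger
# or resolved via SEC EDGAR's company index.
COMPANY_TO_TICKER = {
    "apple": "AAPL",
    "microsoft": "MSFT",
    "amazon": "AMZN",
    "google": "GOOGL",
}

def extract_tickers(query):
    """Return tickers for every known company name mentioned in the query."""
    q = query.lower()
    return [ticker for name, ticker in COMPANY_TO_TICKER.items() if name in q]
```

A comparative question then fans out into one retrieval pass per extracted ticker.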
- Python 3.8+
- OpenAI API key
```bash
git clone https://github.com/overcastbulb/sec-filings-qa-system.git
cd sec-filings-qa-system

python -m venv .venv
source .venv/bin/activate    # macOS/Linux
# .\.venv\Scripts\activate   # Windows

pip install -r requirements.txt
```

```bash
export OPENAI_API_KEY="your-openai-api-key"

# Optional
export SEC_DATA_DIR="./data"
export SEC_CACHE_DIR="./data/cache"
```

```bash
python sec_qa_system.py
```

```python
from sec_qa_system import SECQASystem

qa = SECQASystem()

# Ingest filings for one or more tickers
qa.ingest_company_data(['AAPL', 'MSFT', 'GOOGL'])

# Ask a question
result = qa.query("What are Apple's primary revenue drivers?")
print(result['answer'])
print(f"Sources: {len(result['sources'])} documents")
```

```
Query: "What are the primary revenue drivers for major technology companies?"

Answer:
Technology companies demonstrate diverse revenue streams including subscription
services, cloud computing, advertising, and hardware sales...

Sources: 8 documents across 4 companies
Coverage: 3 companies, 2 filing types (10-K, 10-Q)
Response time: 3.2 seconds
```
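For temporal questions, recent filings receive a recency boost. One plausible form is exponential decay by filing age; the function name, half-life value, and decay shape are assumptions for illustration, not the system's actual formula:

```python
from datetime import date

def recency_boost(filing_date, today, half_life_days=365):
    """Exponential decay: a filing half_life_days old scores half as much
    as one filed today. half_life_days is an illustrative default."""
    age_days = (today - filing_date).days
    return 0.5 ** (age_days / half_life_days)
```

Multiplying retrieval scores by this factor only for temporally classified queries leaves point-in-time questions (e.g. about a specific fiscal year) unaffected.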
```python
# Increase relevance weight for specific terms
retriever.financial_terms.update({
    'artificial intelligence': 2.5,
    'cloud revenue': 2.7,
    'operating margin': 2.3
})
```

```python
retriever.section_weights['risk']['Cybersecurity'] = 2.8
retriever.section_weights['financial']['Revenue Recognition'] = 3.2
```

```python
from sec_qa_system import SECQASystem, SystemBenchmark

qa = SECQASystem()
benchmark = SystemBenchmark(qa)
results = benchmark.run_evaluation_suite()
print(results['evaluation_summary'])
```

```
sec-filings-qa-system/
├── sec_qa_system.py        # Main pipeline: ingest, query, answer
├── hybrid_retrieval.py     # Hybrid retrieval engine
├── requirements.txt
├── docs/
│   ├── architecture_overview.md
│   ├── technical_documentation.md
│   ├── setup_guide.md
│   └── evaluation_results.md
├── examples/
│   ├── sample_queries.py
│   ├── benchmark_tests.py
│   └── demo_results.md
├── tests/
│   ├── test_query_processor.py
│   └── test_retrieval.py
└── data/                   # Filing cache (gitignored)
```
| Metric | Value |
|---|---|
| Avg. response time | 2β5 seconds |
| Documents retrieved per query | 5β15 |
| Indexed chunk capacity | 100K+ chunks |
| Memory per 1,000 chunks | ~100 MB |
- Streamlit UI for non-developer users
- Support for 8-K filings (earnings releases, material events)
- Local LLM mode (Ollama / LLaMA) – no OpenAI dependency
- Docker setup for one-command deployment
- Async batch ingestion for large ticker lists
```bash
git checkout -b feature/your-feature
git commit -m "feat: describe your change"
git push origin feature/your-feature
# Open a Pull Request
```

This project is licensed under the MIT License.
Report a Bug · Technical Docs · Architecture