Merged
2 changes: 1 addition & 1 deletion README.md
@@ -357,4 +357,4 @@ Current focus:
- add stronger market metrics
- expand pandas-based analytics workflows
- improve dashboard usefulness without adding unnecessary chart noise
-- document metric definitions, assumptions and data-source behavior
+- document metric definitions, assumptions and data-source behavior
12 changes: 12 additions & 0 deletions docs/forecast_research.md
@@ -1,38 +1,50 @@
# Research: First Forecasting Approach for Market Time Series

## 1. Realistic First Prediction Task for ARGUS

A realistic first prediction task for ARGUS is **next-day exchange-rate movement** or **trend direction**. Predicting the exact next value (point forecast) is generally much harder and often less useful for trading/signal workflows than predicting the direction of the movement (up/down). A directional classification task serves as a simple, actionable signal for basic workflows.

## 2. Baseline Methods to Implement First

Before jumping into complex models, the following baselines should be implemented to evaluate the added value of any machine learning model:

- **Naive last-value forecast**: The prediction for the next period is exactly the value from the current period. This is surprisingly hard to beat in random-walk-like financial time series.
- **Moving average forecast**: A simple rolling average to predict the next value or determine trend direction.
- **Simple linear regression**: To capture basic linear trends over a given historical window.
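
The three baselines above can be sketched in a few lines. This is a minimal illustration rather than the ARGUS implementation: the sample rates, the function names and the window sizes are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical daily closing rates; the values are made up for illustration.
rates = pd.Series(
    [1.10, 1.12, 1.11, 1.13, 1.15, 1.14],
    index=pd.date_range("2024-01-01", periods=6, freq="D"),
    name="close",
)


def naive_forecast(series: pd.Series) -> float:
    """Predict that the next value equals the last observed value."""
    return float(series.iloc[-1])


def moving_average_forecast(series: pd.Series, window: int = 3) -> float:
    """Predict the mean of the last `window` observations."""
    return float(series.iloc[-window:].mean())


def linear_trend_forecast(series: pd.Series, window: int = 5) -> float:
    """Fit a straight line to the last `window` points and extrapolate one step."""
    y = series.iloc[-window:].to_numpy()
    x = np.arange(len(y))
    slope, intercept = np.polyfit(x, y, deg=1)
    return float(slope * len(y) + intercept)


print(naive_forecast(rates))
print(moving_average_forecast(rates))
print(linear_trend_forecast(rates))
```

Each function returns a single next-step value, so the same helpers can also produce a direction signal by comparing the forecast with the last observed value.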

## 3. Libraries: NumPy, pandas, or scikit-learn?

The first implementation should use **pandas** and **scikit-learn**:

- **pandas**: Excellent for time-series manipulation, rolling windows, lagging features, and handling missing data.
- **scikit-learn**: Offers robust implementations of simple models (e.g., Linear Regression, Logistic Regression for direction) and provides standardized metrics and cross-validation tools designed for time series (e.g., `TimeSeriesSplit`).
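
As a small illustration of why `TimeSeriesSplit` suits this task, the sketch below prints the folds it produces for ten ordered samples. The array is a placeholder for real features, and the choice of three splits is an assumption made for the example.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten placeholder observations in chronological order.
X = np.arange(10).reshape(-1, 1)

splitter = TimeSeriesSplit(n_splits=3)
folds = list(splitter.split(X))

for train_idx, test_idx in folds:
    # Every training index precedes every test index, so no future data leaks in.
    print("train:", train_idx, "test:", test_idx)
```

Unlike shuffled k-fold cross-validation, each fold trains only on observations that come before the ones it is evaluated on.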

## 4. Evaluation Metrics

For the initial approaches, we should focus on:

- **Directional accuracy**: The percentage of times the model correctly predicts the direction of the price movement (up vs down). This is often more relevant than magnitude errors.
- **MAE (Mean Absolute Error)**: If point forecasting is used, MAE penalizes errors linearly, which makes it less sensitive to outliers than RMSE.
- **RMSE (Root Mean Squared Error)**: Useful to penalize larger errors more heavily, but should be secondary to directional accuracy for basic signal generation.
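
The metrics above can be written directly with NumPy. The function names and the sample arrays below are illustrative assumptions, not part of the ARGUS codebase.

```python
import numpy as np


def directional_accuracy(previous, actual, predicted) -> float:
    """Share of steps where the predicted move has the same sign as the actual move."""
    previous = np.asarray(previous, dtype=float)
    actual_move = np.sign(np.asarray(actual, dtype=float) - previous)
    predicted_move = np.sign(np.asarray(predicted, dtype=float) - previous)
    return float(np.mean(actual_move == predicted_move))


def mae(actual, predicted) -> float:
    """Mean absolute error: every error contributes linearly."""
    diff = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(diff)))


def rmse(actual, predicted) -> float:
    """Root mean squared error: large errors are weighted more heavily."""
    diff = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean(diff**2)))


# Made-up values: the value at t, the true value at t+1, and the forecast for t+1.
previous = [1.0, 2.0, 3.0, 4.0]
actual = [2.0, 1.0, 4.0, 3.0]
predicted = [1.5, 2.5, 3.5, 3.5]

print(directional_accuracy(previous, actual, predicted))
print(mae(actual, predicted))
print(rmse(actual, predicted))
```

Note that a naive last-value forecast predicts no movement at all, so under this definition it only scores when the actual value is unchanged. For directional accuracy, a momentum or majority-class baseline is therefore the more meaningful comparison.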

## 5. Why is LSTM not the first implementation step?

LSTMs are highly complex, require a large amount of well-structured data to train effectively without overfitting, and are notoriously difficult to tune. For financial time series, which suffer from low signal-to-noise ratios, an LSTM is likely to overfit the training data or collapse to predicting the last known value. Starting with an LSTM obscures whether the underlying data has any predictive power and sets a high barrier for debugging and infrastructure.

## 6. Prerequisites for an LSTM Ticket

Before considering LSTMs or other deep learning approaches, the following must be established:

- A reliable data ingestion and preprocessing pipeline.
- Established baseline performance metrics (e.g., a naive model and a linear regression model) to compare against.
- A sufficiently large historical dataset.
- A robust backtesting and cross-validation framework to ensure the LSTM isn't just memorizing data or overfitting.
- Hardware/infrastructure to support longer training times and hyperparameter tuning.

## 7. Recommended First Implementation Approach

**Recommendation**: Start with **directional trend prediction** (predicting whether the next value is higher or lower than the current value) using a simple **Logistic Regression** model via **scikit-learn**.

- Use **pandas** to create basic lagged features (e.g., previous returns, moving averages).
- Evaluate using **directional accuracy**.
- Compare performance strictly against a **naive momentum** (predicting the trend continues) or **majority-class** baseline.
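
The recommendation could look roughly like the sketch below. It is an assumption-laden illustration: the price series is a synthetic random walk, and the feature names, the 5-day window and the 80/20 chronological split are choices made for this example only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic random-walk series, made up for illustration only.
rng = np.random.default_rng(seed=0)
close = pd.Series(1.10 + rng.normal(0.0, 0.005, size=300).cumsum(), name="close")

# Lagged features that are all known at time t.
returns = close.pct_change()
features = pd.DataFrame(
    {
        "ret_lag0": returns,  # return from t-1 to t
        "ret_lag1": returns.shift(1),  # return from t-2 to t-1
        "ma_gap_5": close / close.rolling(5).mean() - 1.0,
    }
)

# Target: 1 if the next value is higher than the current one, else 0.
target = (close.shift(-1) > close).astype(int).rename("target")

# Drop the last row (it has no next value) and rows with incomplete features.
data = features.join(target).iloc[:-1].dropna()

# Chronological split: train on the past, evaluate on the most recent 20%.
split = int(len(data) * 0.8)
train, test = data.iloc[:split], data.iloc[split:]
feature_cols = ["ret_lag0", "ret_lag1", "ma_gap_5"]

model = LogisticRegression()
model.fit(train[feature_cols], train["target"])
predicted = model.predict(test[feature_cols])

# Directional accuracy of the model and of the two baselines.
model_accuracy = float((predicted == test["target"].to_numpy()).mean())
momentum_pred = (test["ret_lag0"] > 0).astype(int)  # "the trend continues"
momentum_accuracy = float((momentum_pred == test["target"]).mean())
majority_class = int(train["target"].mean() >= 0.5)
majority_accuracy = float((test["target"] == majority_class).mean())

print(f"model:    {model_accuracy:.3f}")
print(f"momentum: {momentum_accuracy:.3f}")
print(f"majority: {majority_accuracy:.3f}")
```

On a pure random walk the model should not be expected to beat either baseline reliably. That is the purpose of the comparison: a model only earns further investment if it clears both baselines on real data.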
File renamed without changes.
53 changes: 29 additions & 24 deletions docs/research-databases-and-storage.md
@@ -110,7 +110,7 @@ PostgreSQL should be introduced later when ARGUS moves toward a server-based or
## Local, Server and Cloud Options

| Option | Meaning | Fit Now | Fit Later |
-|---|---|---:|---:|
+| --- | --- | ---: | ---: |
| Local storage | Database runs locally inside or next to the project | High | High |
| Server database | Database runs as a separate service, for example PostgreSQL | Medium | High |
| Cloud storage/database | Managed storage or database in the cloud | Low | High |
@@ -203,19 +203,24 @@ ARGUS should not use a narrow `date | value` table as the main market-data model

That would work for simple exchange rates, but it would become limiting once ARGUS adds stocks, ETFs, indices or broader market APIs.

-The first model should focus on three tables:
+The first model should focus on three related entities:

```text
data_sources
instruments
price_bars
```

+> [!NOTE]
+> The fields below describe the future database-oriented structure.
+> Technical fields such as `id`, `instrument_id`, `source_id`, `created_at` and `updated_at` are expected to appear in the database layer.
+> Internal Python models may reference related objects directly, for example `source` and `instrument`, before database IDs exist.

### data_sources

Stores where data came from.

-Recommended first fields:
+Recommended first database fields:

```text
id
@@ -226,13 +231,13 @@ created_at
updated_at
```

-Example:
+Example internal/source records:

-| name | provider_kind | requires_api_key |
-|---|---|---:|
-| Frankfurter | fx_rates | false |
-| yfinance | market_prices | false |
-| FRED | macro_data | true |
+| name | provider_kind | requires_api_key |
+| ---------------- | ------------- | ---------------: |
+| ExchangeRate API | fx_rates | true |
+| yfinance | market_prices | false |
+| FRED | macro_data | true |

### instruments

@@ -246,7 +251,7 @@ Examples:
* S&P 500
* BTC-USD

-Recommended first fields:
+Recommended first database fields:

```text
id
@@ -261,19 +266,19 @@ created_at
updated_at
```

-Example:
+Example instrument records:

-| symbol | name | asset_class | currency | exchange | base_currency | quote_currency |
-|---|---|---|---|---|---|---|
-| EUR/USD | Euro / US Dollar | fx | null | null | EUR | USD |
-| AAPL | Apple Inc. | stock | USD | NASDAQ | null | null |
-| SPY | SPDR S&P 500 ETF | etf | USD | NYSE Arca | null | null |
+| symbol | name | asset_class | currency | exchange | base_currency | quote_currency |
+| ------- | ---------------- | ----------- | -------- | --------- | ------------- | -------------- |
+| EUR/USD | Euro / US Dollar | fx | null | null | EUR | USD |
+| AAPL | Apple Inc. | stock | USD | NASDAQ | null | null |
+| SPY | SPDR S&P 500 ETF | etf | USD | NYSE Arca | null | null |

### price_bars

Stores historical market data in an OHLCV-ready structure.

-Recommended first fields:
+Recommended first database fields:

```text
id
@@ -291,16 +296,16 @@ created_at
updated_at
```

-For Frankfurter, the exchange rate can be stored in `close`.
+FX-style exchange-rate data can be represented as a price bar by storing the rate in `close`.

The other OHLCV fields can stay empty until ARGUS uses data sources that provide them.

-Example:
+Example price bar records shown with joined source and instrument information for readability:

-| symbol | timestamp | timeframe | open | high | low | close | adjusted_close | volume |
-|---|---|---|---:|---:|---:|---:|---:|---:|
-| EUR/USD | 2024-01-02 | 1d | null | null | null | 1.095 | null | null |
-| AAPL | 2024-01-02 | 1d | 187.15 | 188.44 | 183.89 | 185.64 | 184.25 | 50200000 |
+| source | instrument | timestamp | timeframe | open | high | low | close | adjusted_close | volume |
+| -------- | ---------- | ---------- | --------- | -----: | -----: | -----: | -----: | -------------: | -------: |
+| yfinance | EUR/USD | 2024-01-02 | 1d | null | null | null | 1.095 | null | null |
+| yfinance | AAPL | 2024-01-02 | 1d | 187.15 | 188.44 | 183.89 | 185.64 | 184.25 | 50200000 |

---

@@ -332,7 +337,7 @@ Later sprints can expand the storage layer step by step.
Possible later additions:

| Future Area | Possible Additions |
-|---|---|
+| --- | --- |
| Better source mapping | source-specific symbols, provider metadata |
| Watchlists | user-selected instruments |
| Reports | generated report metadata and history |
34 changes: 34 additions & 0 deletions src/argus/domain/internal_models.py
@@ -0,0 +1,34 @@
from dataclasses import dataclass
from datetime import date


@dataclass
class DataSource:
    name: str
    provider_kind: str
    requires_api_key: bool = False


@dataclass
class Instrument:
    symbol: str
    name: str
    asset_class: str
    currency: str | None = None
    exchange: str | None = None
    base_currency: str | None = None
    quote_currency: str | None = None


@dataclass
class PriceBar:
    source: DataSource
    instrument: Instrument
    timestamp: date
    timeframe: str
    close: float
    open: float | None = None
    high: float | None = None
    low: float | None = None
    adjusted_close: float | None = None
    volume: float | None = None
101 changes: 101 additions & 0 deletions tests/test_internal_models.py
@@ -0,0 +1,101 @@
from argus.domain.internal_models import DataSource, Instrument, PriceBar
from datetime import date


def test_data_source_can_be_created() -> None:
    source = DataSource(
        name="yfinance",
        provider_kind="fx_rates",
    )

    assert source.name == "yfinance"
    assert source.provider_kind == "fx_rates"
    assert source.requires_api_key is False


def test_instrument_can_be_created() -> None:
    instrument = Instrument(
        symbol="EUR/USD",
        name="Euro / US Dollar",
        asset_class="fx",
        base_currency="EUR",
        quote_currency="USD",
    )

    assert instrument.symbol == "EUR/USD"
    assert instrument.name == "Euro / US Dollar"
    assert instrument.asset_class == "fx"
    assert instrument.base_currency == "EUR"
    assert instrument.quote_currency == "USD"
    assert instrument.currency is None
    assert instrument.exchange is None


def test_rate_bar_can_be_created() -> None:
    source = DataSource(
        name="yfinance",
        provider_kind="fx_rates",
    )

    instrument_rate = Instrument(
        symbol="EUR/USD",
        name="Euro / US Dollar",
        asset_class="fx",
        base_currency="EUR",
        quote_currency="USD",
    )

    price_bar = PriceBar(
        source=source,
        instrument=instrument_rate,
        timestamp=date(2026, 1, 1),
        timeframe="1d",
        close=1.89,
    )

    assert price_bar.source == source
    assert price_bar.instrument == instrument_rate
    assert price_bar.timestamp == date(2026, 1, 1)
    assert price_bar.timeframe == "1d"
    assert price_bar.close == 1.89
    assert price_bar.open is None
    assert price_bar.high is None
    assert price_bar.low is None
    assert price_bar.adjusted_close is None
    assert price_bar.volume is None


def test_stock_ohlcv_bar_can_be_created() -> None:
    source = DataSource(
        name="yfinance",
        provider_kind="market_prices",
    )

    instrument = Instrument(
        symbol="AAPL",
        name="Apple Inc.",
        asset_class="stock",
        currency="USD",
        exchange="NASDAQ",
    )

    price_bar = PriceBar(
        source=source,
        instrument=instrument,
        timestamp=date(2026, 1, 1),
        timeframe="1d",
        open=187.15,
        high=188.44,
        low=183.89,
        close=185.64,
        adjusted_close=184.25,
        volume=50_200_000,
    )

    assert price_bar.instrument.symbol == "AAPL"
    assert price_bar.open == 187.15
    assert price_bar.high == 188.44
    assert price_bar.low == 183.89
    assert price_bar.close == 185.64
    assert price_bar.adjusted_close == 184.25
    assert price_bar.volume == 50_200_000