MeridianFormer is a state-of-the-art 283M-parameter transformer-based language model utilizing a Sparse Mixture-of-Experts (SMoE) architecture. The project is specifically engineered for financial intelligence, high-precision quantitative reasoning, and algorithmic math tasks. It introduces an innovative training paradigm optimized for continuous, hourly execution on standard CPU runners, using Elastic Weight Consolidation (EWC) to prevent catastrophic forgetting.
This project is developed by Ishaan M. All source code has been formatted and edited with the assistance of Antigravity AI. This is far from the first implementation of the project, but this iteration pushes for architectural excellence.
MeridianFormer departs from standard dense architectures to maximize knowledge capacity while minimizing computational overhead during inference and training.
The model employs a sparse gating system consisting of 8 distinct experts per MoE layer. Utilizing a load-balanced Top-2 routing mechanism, only 196M parameters are active per token. This architectural choice provides a 1.44x increase in computational efficiency over a dense model of equivalent total parameter count. The router uses a noisy top-k mechanism to ensure expert diversity and is regularized by an auxiliary load-balancing loss to prevent expert collapse.
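A minimal sketch of this routing scheme (illustrative only, not the repository's exact code), assuming hidden size 768, 8 experts, Top-2 selection, and a Switch-style load-balancing loss:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyTop2Router(nn.Module):
    def __init__(self, hidden_dim=768, num_experts=8, top_k=2):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        self.gate = nn.Linear(hidden_dim, num_experts, bias=False)
        self.noise = nn.Linear(hidden_dim, num_experts, bias=False)

    def forward(self, x):
        # x: (tokens, hidden_dim)
        logits = self.gate(x)
        if self.training:
            # Noisy gating encourages expert diversity during training.
            noise_std = F.softplus(self.noise(x))
            logits = logits + torch.randn_like(logits) * noise_std
        probs = logits.softmax(dim=-1)
        top_w, top_idx = probs.topk(self.top_k, dim=-1)
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)  # renormalize over the 2 chosen experts

        # Auxiliary load-balancing loss: penalize mismatch between the fraction of
        # tokens dispatched to each expert and its mean gate probability.
        dispatch = F.one_hot(top_idx, self.num_experts).float().sum(dim=1)  # (tokens, experts)
        tokens_per_expert = dispatch.mean(dim=0) / self.top_k
        prob_per_expert = probs.mean(dim=0)
        aux_loss = self.num_experts * (tokens_per_expert * prob_per_expert).sum()
        return top_idx, top_w, aux_loss

# Usage: indices and weights select 2 of the 8 experts for each token.
router = NoisyTop2Router()
idx, w, aux = router(torch.randn(16, 768))
```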
To optimize memory utilization and KV-cache efficiency, MeridianFormer implements Grouped Query Attention. With a ratio of 12 Query heads to 4 Key-Value heads, the model significantly reduces the memory bandwidth required for long-context generation. This allows for higher throughput in CPU-bound environments, where memory bandwidth is often the primary bottleneck.
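For illustration, a grouped-query attention sketch with the 12/4 head split (head_dim = 768 / 12 = 64 is assumed; the repository's implementation may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    def __init__(self, hidden_dim=768, n_q_heads=12, n_kv_heads=4):
        super().__init__()
        self.n_q, self.n_kv = n_q_heads, n_kv_heads
        self.head_dim = hidden_dim // n_q_heads
        self.q_proj = nn.Linear(hidden_dim, n_q_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(hidden_dim, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(hidden_dim, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_q_heads * self.head_dim, hidden_dim, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_q, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv, self.head_dim).transpose(1, 2)
        # Each of the 4 KV heads serves a group of 12 / 4 = 3 query heads,
        # shrinking the KV cache 3x relative to full multi-head attention.
        k = k.repeat_interleave(self.n_q // self.n_kv, dim=1)
        v = v.repeat_interleave(self.n_q // self.n_kv, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))

attn = GroupedQueryAttention()
print(attn(torch.randn(2, 8, 768)).shape)  # torch.Size([2, 8, 768])
```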
The project utilizes Rotary Position Embeddings with a base frequency (theta) of 500,000. This configuration enables superior relative position awareness and supports context windows of up to 2,048 tokens. The high theta value is specifically tuned for the high-frequency nature of financial data tokens, ensuring stable embeddings even at the edges of the context window.
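A standard RoPE sketch with base theta = 500,000 and the 2,048-token window (head_dim = 64 is assumed from the 12-head configuration; this is the generic formulation, not necessarily the repo's code):

```python
import torch

def rope_cache(seq_len=2048, head_dim=64, theta=500_000.0):
    # Per-channel inverse frequencies; a larger theta slows the rotation of
    # low-frequency channels, keeping distant positions distinguishable.
    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)  # (seq, head_dim / 2)
    return angles.cos(), angles.sin()

def apply_rope(x, cos, sin):
    # x: (batch, heads, seq, head_dim); rotate channel pairs by position-dependent angles.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)

cos, sin = rope_cache()
q = torch.randn(1, 12, 2048, 64)
print(apply_rope(q, cos, sin).shape)  # torch.Size([1, 12, 2048, 64])
```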
Following state-of-the-art transformer standards (e.g., Llama 3), MeridianFormer replaces standard GELU activations with SwiGLU. The gated linear unit, combined with the SiLU activation function, provides improved gradient flow and representational capacity. The intermediate size is tuned to 1792 for dense layers and 896 for experts to maintain a balanced parameter distribution.
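A minimal SwiGLU block using the quoted sizes (illustrative only):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, hidden_dim=768, intermediate=1792):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_dim, intermediate, bias=False)
        self.up_proj = nn.Linear(hidden_dim, intermediate, bias=False)
        self.down_proj = nn.Linear(intermediate, hidden_dim, bias=False)

    def forward(self, x):
        # SiLU-gated linear unit: silu(W_gate x) * (W_up x), projected back down.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

dense_ffn = SwiGLU(768, 1792)   # dense layers
expert_ffn = SwiGLU(768, 896)   # per-expert FFN inside MoE layers
```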
We utilize Root Mean Square Layer Normalization (RMSNorm) for faster computation and stable training. Additionally, input and output embeddings are tied (shared weights), which reduces the total parameter count by ~116M while maintaining superior performance in causal language modeling tasks.
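An illustrative RMSNorm with eps = 1e-6 (matching the spec table), plus the weight-tying pattern; the placeholder vocabulary size below is an assumption, not taken from this README:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim=768, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # Normalize by the root-mean-square of the features; no mean subtraction,
        # which is what makes RMSNorm cheaper than LayerNorm.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

# Weight tying: reuse the input embedding matrix as the output projection.
vocab_size = 32_000                      # placeholder only; not stated in this README
embed = nn.Embedding(vocab_size, 768)
hidden = RMSNorm()(torch.randn(1, 4, 768))
logits = hidden @ embed.weight.T         # tied output head: no separate lm_head matrix
```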
Standard tokenization strategies often fail to capture the quantitative magnitude of numeric data. MeridianFormer introduces a learned Numeracy Encoding layer that injects magnitude-aware signals into the hidden states. By mapping numeric tokens to their respective logarithmic scales, the model develops an inherent understanding of quantitative relationships (e.g., the order-of-magnitude difference between pricing levels), rather than treating numbers as purely semantic strings.
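The Numeracy Encoding layer is specific to this project; the sketch below is one plausible reading of the description (a learned projection of each token's log10 magnitude added to the hidden states), not the actual implementation:

```python
import torch
import torch.nn as nn

class NumeracyEncoding(nn.Module):
    def __init__(self, hidden_dim=768):
        super().__init__()
        # Learned projection from a scalar log-magnitude into the hidden space.
        self.proj = nn.Linear(1, hidden_dim)

    def forward(self, hidden_states, token_values):
        # token_values: (batch, seq) with the parsed numeric value of each token,
        # or 0.0 for non-numeric tokens (which contribute log10(1) = 0).
        log_mag = torch.log10(token_values.abs().clamp(min=1.0)).unsqueeze(-1)
        return hidden_states + self.proj(log_mag)

enc = NumeracyEncoding()
h = torch.randn(1, 4, 768)
values = torch.tensor([[0.0, 19.5, 1_250_000.0, 0.0]])  # e.g. a price and a market cap
print(enc(h, values).shape)  # torch.Size([1, 4, 768])
```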
MeridianFormer is designed for perpetual learning. To mitigate the issue of catastrophic forgetting during frequent, incremental training runs, the system implements Elastic Weight Consolidation.
- Fisher Information Matrix: Following each training iteration, the model computes the diagonal Fisher matrix to identify weights that are critical to previously acquired knowledge.
- Penalty Mechanism: During subsequent runs, an L2-style penalty is applied to the loss function for changes made to these critical weights, ensuring the model retains its "core" financial reasoning while adapting to new market data (see the sketch below).
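A minimal EWC sketch consistent with the two steps above, using a diagonal (empirical) Fisher estimate; function and variable names here are illustrative, not the repo's:

```python
import torch

def diagonal_fisher(model, data_loader, loss_fn):
    """Estimate the diagonal Fisher information after a training iteration."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters() if p.requires_grad}
    model.eval()
    for inputs, targets in data_loader:
        model.zero_grad()
        loss_fn(model(inputs), targets).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2   # squared gradients approximate importance
    return {n: f / max(len(data_loader), 1) for n, f in fisher.items()}

def ewc_penalty(model, fisher, old_params, lam=1.0):
    """L2-style penalty anchoring Fisher-important weights to their previous values."""
    penalty = 0.0
    for n, p in model.named_parameters():
        if n in fisher:
            penalty = penalty + (fisher[n] * (p - old_params[n]) ** 2).sum()
    return lam * penalty

# During the next hourly run:
#   total_loss = task_loss + ewc_penalty(model, fisher, old_params)
```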
| Feature | Specification |
|---|---|
| Total Parameters | 283,121,536 |
| Active Parameters | 196,417,408 |
| Layers | 14 (Alternating Dense and MoE) |
| Hidden Dimension | 768 |
| Attention Heads | 12 Query / 4 KV |
| Expert Count | 8 (Top-2 Activated) |
| FFN Intermediate Size | 1792 (Dense) / 896 (Expert) |
| Normalization | RMSNorm (eps=1e-6) |
| Weight Tying | Tied Embeddings (Input/Output) |
| Repo | MeridianAlgo/FinAI-Lite |
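As a sanity check on the table, the gap between total and active parameters is consistent with 7 of the 14 layers being MoE, each leaving 6 of its 8 SwiGLU experts (three 768 x 896 projections apiece) inactive per token; the layer split is inferred, not quoted:

```python
# Back-of-the-envelope check of the spec table (layer allocation is an assumption).
expert_params = 3 * 768 * 896                 # gate, up, and down projections per expert
inactive = 7 * 6 * expert_params              # 7 MoE layers x 6 unselected experts
print(f"{inactive:,}")                        # 86,704,128
print(f"{283_121_536 - 196_417_408:,}")       # 86,704,128 -> matches the table
print(round(283_121_536 / 196_417_408, 2))    # 1.44 -> the quoted efficiency factor
```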
The training pipeline uses a weighted streaming curriculum to ensure a balanced foundation in language, mathematics, and finance:
- FinanceAlpaca (40%): Specialized instruction sets for financial analysis, portfolio management, and market sentiment.
- OpenMathInstruct-2 (30%): Advanced math reasoning to sharpen the model's logic and calculation capabilities.
- FineWeb-Edu (30%): High-quality educational data to maintain general semantic proficiency and world knowledge.
Data is streamed via a round-robin generator with support for state persistence (`dataset_state.json`), ensuring that the model never trains on the same sample twice across hourly runs.
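A hedged sketch of such a weighted, resumable streaming loop; the source names, interleaving scheme, and state format below are illustrative assumptions, not the repo's exact pipeline:

```python
import itertools
import json
import os
import random

WEIGHTS = {"finance_alpaca": 0.4, "open_math_instruct_2": 0.3, "fineweb_edu": 0.3}
STATE_FILE = "dataset_state.json"

def load_state():
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    return {name: 0 for name in WEIGHTS}

def weighted_stream(iterators, state, seed=0):
    """Interleave the sources 40/30/30, skipping samples consumed by earlier runs."""
    rng = random.Random(seed)
    for name, it in iterators.items():
        for _ in range(state[name]):      # fast-forward past previously seen samples
            next(it)
    names, weights = zip(*WEIGHTS.items())
    while True:
        name = rng.choices(names, weights=weights, k=1)[0]
        state[name] += 1
        yield name, next(iterators[name])

# Toy usage with stand-in iterators; real runs would wrap streamed HF datasets.
state = load_state()
streams = {name: itertools.count() for name in WEIGHTS}
gen = weighted_stream(streams, state)
for _ in range(5):
    print(next(gen))
with open(STATE_FILE, "w") as f:
    json.dump(state, f)                   # persisted so the next hourly run resumes here
```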
The repository includes a comprehensive GitHub Actions ecosystem (`.github/workflows/`) that manages the model's lifecycle:
- Hourly Training: Automated execution on standard runners, managing checkpoint synchronization with the HuggingFace Hub repository `MeridianAlgo/FinAI-Lite`.
- HuggingFace Integration: Seamless weight management using the `safetensors` format for secure and efficient distribution.
- Validation: Integrated CI/CD suite performing architecture smoke tests, parameter validation, and unit testing on every commit.
Dependencies are managed via `requirements.txt`:

```bash
python -m pip install -r requirements.txt
```

Use the included script to verify the MoE efficiency and parameter distribution:

```bash
python scripts/count_params.py
```

The `train.py` script supports environment variables for configuration:

```bash
# Smoke test (validate architecture)
SMOKE_TEST=1 python train.py

# Full training iteration
python train.py
```

To test generation:

```bash
python scripts/test_generation.py
```

This project is licensed under the MIT License.