
🌌 Lumenspark2: Next-Gen Lightweight Transformer

Lumenspark2 is the second-generation implementation of Lumenspark, a lightweight transformer for efficient large-scale language modeling. It integrates modern architectural components, Hugging Face’s training ecosystem, and parameter-efficient fine-tuning methods to provide a flexible, research-friendly framework.


🚀 Features

  • 🔥 Modern Transformer Architecture

    • Rotary Position Embeddings (RoPE)
    • RMSNorm normalization
    • SwiGLU feed-forward networks
    • Efficient scaled-dot-product attention (SDPA)
  • ⚙️ Training Framework

    • Hugging Face Trainer integration
    • Streaming dataset support (FineWeb-Edu)
    • Gradient accumulation & mixed precision (bf16)
    • Custom callback for loss plots and inline text generation
  • 🧩 Extensible & Modular

    • LoRA adapters for efficient fine-tuning (see the sketch after this list)
    • Dynamic sequence chunking collator
    • Configurable via LumensparkConfig
  • 📊 Evaluation & Monitoring

    • Live loss plotting (training_loss_plot.png)
    • Text generation evaluation during training
    • Parameter counting utility
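
As a quick sketch of the parameter-efficient fine-tuning path, LoRA adapters are controlled through the adapter_rank field of LumensparkConfig (listed under Model Architecture below). The snippet assumes that any rank above zero attaches the adapters; the rank of 8 is illustrative, not a project default.

from lumenspark_model import LumensparkModel, LumensparkConfig
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Assumption: adapter_rank > 0 enables LoRA adapters; 8 is an illustrative rank.
config = LumensparkConfig(adapter_rank=8)
model = LumensparkModel(config, tokenizer=tokenizer)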

📦 Installation

Clone and install dependencies:

git clone https://github.com/anto18671/lumenspark2.git
cd lumenspark2
pip install -r requirements.txt

Dependencies:

  • torch
  • transformers
  • datasets
  • safetensors
  • huggingface_hub
  • matplotlib

🏗️ Model Architecture

  • Config Parameters (LumensparkConfig), as sketched after this list:

    • seq_length: 1536
    • d_model: 1024
    • n_layers: 12
    • n_heads: 16
    • ffn_mult: 4.0
    • dropout: 0.1
    • rope_theta: 10,000
    • adapter_rank: 0 (LoRA disabled by default)
  • Core Components:

    • Token embeddings (tied with LM head)
    • Transformer blocks with RMSNorm + RoPE + SDPA
    • SwiGLU feed-forward networks
    • Causal LM head
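
A minimal sketch of instantiating the configuration with the defaults above, assuming the keyword arguments match the field names as listed:

from lumenspark_model import LumensparkConfig

config = LumensparkConfig(
    seq_length=1536,   # maximum sequence length
    d_model=1024,      # hidden size
    n_layers=12,       # number of transformer blocks
    n_heads=16,        # attention heads
    ffn_mult=4.0,      # SwiGLU expansion factor
    dropout=0.1,
    rope_theta=10000,  # RoPE base frequency
    adapter_rank=0,    # 0 keeps LoRA adapters disabled
)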

📝 Training

Run training with:

python train.py

Default hyperparameters (see the sketch after this list):

  • Batch size: 8
  • Gradient accumulation: 20
  • Learning rate: 1e-4
  • Weight decay: 1e-2
  • Dataset: FineWeb-Edu (streaming)
  • Steps: 10,000 (via MAX_STEPS)
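
The exact setup lives in train.py. As a rough sketch, the defaults above correspond to a Hugging Face TrainingArguments configuration like the one below, with FineWeb-Edu loaded in streaming mode; the dataset path and output directory are assumptions, not values copied from train.py.

from datasets import load_dataset
from transformers import TrainingArguments

# Assumed Hub path for FineWeb-Edu; streaming avoids downloading the full corpus.
dataset = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)

# Illustrative mirror of the default hyperparameters listed above.
args = TrainingArguments(
    output_dir="checkpoints",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=20,
    learning_rate=1e-4,
    weight_decay=1e-2,
    max_steps=10_000,
    bf16=True,
)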

Training outputs:

  • Loss curves saved to training_loss_plot.png
  • Generated samples printed at evaluation intervals

🔍 Generation

Lumenspark2 has a built-in .generate() method supporting top-k, top-p, temperature, and repetition penalty.

from lumenspark_model import LumensparkModel, LumensparkConfig
from transformers import GPT2Tokenizer

# Load the GPT-2 tokenizer and reuse the EOS token for padding.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Build the model from the default config.
config = LumensparkConfig()
model = LumensparkModel(config, tokenizer=tokenizer)

# Sample a continuation with nucleus sampling.
prompt = "The year is 2050, and humans have colonized Mars."
print(model.generate(prompt, max_length=64, top_p=0.9, temperature=0.7))
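
The sampling controls can be combined; for example, a lower temperature with top-k filtering and a repetition penalty yields more conservative output. The keyword names top_k and repetition_penalty are assumed to match the features listed above (only top_p and temperature appear in the example).

# Assumed keyword names for the remaining sampling controls.
print(model.generate(prompt, max_length=64, top_k=50, temperature=0.3, repetition_penalty=1.2))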

📊 Parameter Counting

from utils import count_parameters
count_parameters(model)

Outputs total, trainable, and non-trainable parameter counts.


📂 Project Structure

lumenspark2/
├── train.py              # Training loop with Hugging Face Trainer
├── lumenspark_model.py   # Transformer architecture, config, generate()
├── utils.py              # Helper functions (collator, parameter counting)
├── requirements.txt      # Dependencies
├── README.md             # Documentation
└── LICENSE               # MIT License

📜 License

MIT License – see LICENSE.


🙌 Acknowledgments

  • Hugging Face transformers & datasets
  • FineWeb-Edu dataset
  • OpenAI GPT-2 tokenizer
