This report documents the implementation of a minimal GPT-like model (MinGPT) trained on the arXiv dataset. The model is built using PyTorch and leverages the Hugging Face transformers library for tokenization. The dataset is fetched using the arxiv Python package, and the model is trained on abstracts from arXiv papers. The goal of this project is to demonstrate the feasibility of training a small-scale transformer-based language model on a domain-specific dataset (arXiv) and to generate coherent text based on a given prompt.
Transformer-based models, such as GPT (Generative Pre-trained Transformer), have revolutionized natural language processing (NLP) by achieving state-of-the-art results in various tasks. These models are trained on large corpora of text and can generate human-like text. However, training such models from scratch requires significant computational resources. In this project, we implement a minimal version of GPT (MinGPT) and train it on a subset of the arXiv dataset to demonstrate the process of training a transformer-based language model.
The main objectives of this project are:
- Fetch and preprocess the arXiv dataset.
- Implement a minimal GPT-like model using PyTorch.
- Train the model on arXiv abstracts.
- Generate text based on a given prompt.
The arXiv dataset contains metadata and abstracts of scholarly papers across various fields. For this project, we focus on the abstracts of papers related to "machine learning." The dataset is fetched using the arxiv Python package.
The fetch_arxiv_data function queries the arXiv API to retrieve abstracts of papers related to a specific query (e.g., "machine learning"). The abstracts are saved to a text file for further processing.
import arxiv

def fetch_arxiv_data(query="machine learning", max_results=1000):
    search = arxiv.Search(query=query, max_results=max_results, sort_by=arxiv.SortCriterion.SubmittedDate)
    # Use a Client to page through results (Search.results() is deprecated
    # in recent versions of the arxiv package).
    client = arxiv.Client()
    return [result.summary for result in client.results(search)]

arxiv_data = fetch_arxiv_data()
with open("arxiv_data.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(arxiv_data))

A Byte-Pair Encoding (BPE) tokenizer is trained on the arXiv abstracts using the tokenizers library. The tokenizer is saved for later use.
The tokenizer is trained on the arXiv abstracts with a vocabulary size of 50,000 tokens. Special tokens such as <s>, <pad>, </s>, <unk>, and <mask> are added to the vocabulary.
import os
from tokenizers import ByteLevelBPETokenizer
# Create the directory if it doesn't exist
os.makedirs("tokenizer", exist_ok=True)
# Train a Byte-Pair Encoding (BPE) tokenizer
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(["arxiv_data.txt"], vocab_size=50_000, min_frequency=2, special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])
# Save tokenizer
tokenizer.save_model("tokenizer")

The MinGPT model is a simplified version of the GPT architecture, consisting of multiple transformer blocks. Each block includes multi-head self-attention and a feed-forward neural network.
The TransformerBlock class implements a single transformer block with multi-head self-attention, layer normalization, and a feed-forward network.
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, embed_size, heads, dropout, forward_expansion):
        super().__init__()
        # batch_first=True so inputs are (batch, seq, embed), matching the
        # shapes produced by the embedding layers in the GPT class below.
        self.attention = nn.MultiheadAttention(embed_dim=embed_size, num_heads=heads, batch_first=True)
        self.norm1 = nn.LayerNorm(embed_size)
        self.norm2 = nn.LayerNorm(embed_size)
        self.feed_forward = nn.Sequential(
            nn.Linear(embed_size, forward_expansion * embed_size),
            nn.ReLU(),
            nn.Linear(forward_expansion * embed_size, embed_size)
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, value, key, query, mask):
        attn_output, _ = self.attention(query, key, value, attn_mask=mask)
        x = self.dropout(self.norm1(attn_output + query))
        forward_out = self.feed_forward(x)
        return self.dropout(self.norm2(forward_out + x))

The GPT class implements the MinGPT model, which consists of multiple transformer blocks, token embeddings, positional embeddings, and a final linear layer for token prediction.
class GPT(nn.Module):
    def __init__(self, vocab_size, embed_size=256, num_layers=8, heads=8, dropout=0.1, forward_expansion=4):
        super().__init__()
        self.embed_size = embed_size
        self.word_embedding = nn.Embedding(vocab_size, embed_size)
        # Learned positional embeddings; 512 is the maximum context length.
        self.position_embedding = nn.Embedding(512, embed_size)
        self.layers = nn.ModuleList([TransformerBlock(embed_size, heads, dropout, forward_expansion) for _ in range(num_layers)])
        self.fc_out = nn.Linear(embed_size, vocab_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        N, seq_length = x.shape
        if mask is None:
            # Causal mask: True entries are blocked, so each position can only
            # attend to itself and earlier positions. Without this, the model
            # sees future tokens during training.
            mask = torch.triu(torch.ones(seq_length, seq_length, device=x.device), diagonal=1).bool()
        positions = torch.arange(0, seq_length, device=x.device).expand(N, seq_length)
        out = self.dropout(self.word_embedding(x) + self.position_embedding(positions))
        for layer in self.layers:
            out = layer(out, out, out, mask)
        return self.fc_out(out)

The model is trained on the tokenized arXiv abstracts. The training loop uses the AdamW optimizer and cross-entropy loss.
The train_loop function implements the training process. The model is trained for a specified number of epochs, and the loss is logged for each batch.
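The training code below consumes tokenized_data["input_ids"], a list of fixed-length token-ID sequences that the original code never shows being built. The following is a minimal sketch of that missing preprocessing step, assuming the trained tokenizer from above and a chunk length matching the SEQ_LENGTH constant defined in the training code; each chunk carries one extra token so that inputs and shifted targets can be sliced from the same sequence.

# Hypothetical preprocessing (not shown in the original code): encode the raw
# abstracts and split the token stream into fixed-length chunks for batching.
SEQ_LENGTH = 128

with open("arxiv_data.txt", "r", encoding="utf-8") as f:
    ids = tokenizer.encode(f.read()).ids

# Each chunk holds SEQ_LENGTH + 1 tokens: the first SEQ_LENGTH serve as input,
# the last SEQ_LENGTH (shifted by one) serve as the prediction target.
chunks = [ids[i : i + SEQ_LENGTH + 1]
          for i in range(0, len(ids) - SEQ_LENGTH, SEQ_LENGTH)]
tokenized_data = {"input_ids": chunks}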
import torch.optim as optim
from tqdm import tqdm

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# ByteLevelBPETokenizer exposes the vocabulary size via get_vocab_size().
VOCAB_SIZE = tokenizer.get_vocab_size()
BATCH_SIZE = 32
SEQ_LENGTH = 128
EPOCHS = 5

model = GPT(vocab_size=VOCAB_SIZE).to(device)
optimizer = optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

def train_loop(data, model, optimizer, loss_fn, epochs):
    model.train()
    for epoch in range(epochs):
        loop = tqdm(range(0, len(data), BATCH_SIZE), leave=True)
        for i in loop:
            batch = torch.tensor(data[i: i + BATCH_SIZE], dtype=torch.long).to(device)
            # Next-token prediction: targets are the inputs shifted left by one.
            target = batch[:, 1:].contiguous()
            input_data = batch[:, :-1].contiguous()
            optimizer.zero_grad()
            output = model(input_data)
            loss = loss_fn(output.view(-1, VOCAB_SIZE), target.view(-1))
            loss.backward()
            optimizer.step()
            loop.set_description(f"Epoch {epoch+1}/{EPOCHS}")
            loop.set_postfix(loss=loss.item())

train_loop(tokenized_data["input_ids"], model, optimizer, loss_fn, EPOCHS)

After training, the model can generate text based on a given prompt.
The generate_text function generates text by iteratively predicting the next token based on the previous tokens.
import torch

def generate_text(model, tokenizer, start_text, max_len=100):
    model.eval()
    # ByteLevelBPETokenizer.encode returns an Encoding object; its .ids field
    # holds the token IDs (the tokenizers library has no return_tensors argument).
    tokens = torch.tensor([tokenizer.encode(start_text).ids], dtype=torch.long).to(device)
    for _ in range(max_len):
        with torch.no_grad():
            # Truncate to the model's 512-position context window if needed.
            output = model(tokens[:, -512:])
        logits = output.logits if hasattr(output, "logits") else output
        # Greedy decoding: always pick the most probable next token.
        next_token = torch.argmax(logits[:, -1, :], dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokenizer.decode(tokens.squeeze().cpu().tolist())
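Greedy argmax decoding is prone to exactly the kind of repetitive, degenerate output shown in the results below. A common remedy is to sample from a temperature-scaled distribution instead. The following sketch is a drop-in replacement for the argmax step; the temperature value of 0.8 is an arbitrary choice, not part of the original code.

import torch.nn.functional as F

def sample_next_token(logits, temperature=0.8):
    # Soften or sharpen the distribution with temperature, then draw one
    # token at random instead of always taking the argmax.
    probs = F.softmax(logits[:, -1, :] / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)  # shape: (batch, 1)

Swapping this in for the argmax line inside generate_text turns greedy decoding into stochastic sampling, which typically reduces repetition at the cost of some determinism.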
# Example usage:
print(generate_text(model, tokenizer, "In this paper, we propose a novel approach to"))

The model was trained for 5 epochs, and the training loss decreased steadily over time. The final loss was approximately 3.19, which corresponds to a token-level perplexity of roughly e^3.19 ≈ 24.
The model was able to generate text based on a given prompt. For example, when prompted with "In this paper, we propose a novel approach to," the model generated the following text:
In this paper, we propose a novel approach to . , , , . , . . , . . , . . . , . , , , , . , , . . , , , , , , , . , , , , , , , , . . . , . ,
The generated text is not coherent: after the prompt, the model collapses into emitting only punctuation tokens. This indicates that it has so far learned little beyond the most superficial statistics of the corpus, such as the high frequency of periods and commas, which is a common failure mode for a small model trained briefly on limited data with greedy decoding.
In this project, we implemented a minimal GPT-like model (MinGPT) and trained it on the arXiv dataset. The model was able to generate text based on a given prompt, although the results were not entirely coherent. This demonstrates the feasibility of training a small-scale transformer-based language model on a domain-specific dataset.
- Larger Dataset: Train the model on a larger dataset to improve text generation quality.
- Hyperparameter Tuning: Experiment with different hyperparameters (e.g., learning rate, number of layers) to optimize model performance.
- Fine-Tuning: Fine-tune the model on specific arXiv categories (e.g., cs.CL for computation and language) to generate more domain-specific text, as sketched below.
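One way to build a category-specific corpus is to reuse the fetch_arxiv_data helper from above: the arXiv search API accepts a cat: field prefix for filtering by category. The query string below follows that standard syntax but is an assumption for illustration, not code from the original project.

# Fetch abstracts from a single arXiv category using the standard
# "cat:" field prefix of the arXiv search API.
cs_cl_data = fetch_arxiv_data(query="cat:cs.CL", max_results=1000)

with open("arxiv_cs_cl.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(cs_cl_data))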
This report provides a detailed overview of the MinGPT implementation and training process. The code and methodology can be extended for further experimentation and improvement.