Complete MLOps pipeline for Abstractive Text Summarization using Hugging Face's BART-base model, fine-tuned on the scientific_papers/arxiv dataset. Features include structured training, ROUGE/BERTScore evaluation, and deployment to the Hugging Face Hub and Spaces.
This project goes beyond a simple fine-tuning task, showcasing expertise in building reproducible, production-ready NLP systems:
- Domain Specialization: The model is fine-tuned on the `scientific_papers/arxiv` corpus (article → abstract), ensuring high-quality, domain-appropriate summarization.
- Structured MLOps Pipeline: A clean `src/` package structure with explicit stages (data ingestion, tokenization, training, evaluation, artifact management).
- Reproducibility & Stability: Deterministic seeds, disk-cached datasets (for efficiency), pinned dependency requirements, and specific stability flags for multi-OS support.
- Comprehensive Evaluation: Utilizes both ROUGE-1/2/L (lexical overlap) and BERTScore (semantic similarity) to provide a complete view of model performance.
- Deployment Ready: The fine-tuned model is versioned and hosted on the Hugging Face Hub (`ashbix23/bart-summariser-model`) and consumed by a public Gradio Space demo.
| Resource | Description | Hugging Face ID (user: ashbix23) |
|---|---|---|
| Hosted Model | Model checkpoint & versioning | `ashbix23/bart-summariser-model` |
| Interactive Demo | Live Gradio deployment | `ashbix23/text-summarisation-hf` |
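app.py is not reproduced in this README; the sketch below shows what a minimal Gradio Space serving the hosted checkpoint could look like (the `summarize` helper and generation limits are illustrative assumptions, not the repository's exact code):

```python
# Illustrative app.py-style sketch: load the hosted checkpoint and expose a Gradio UI.
import gradio as gr
from transformers import pipeline

# Pull the fine-tuned model from the Hub; downloads on first run.
summarizer = pipeline("summarization", model="ashbix23/bart-summariser-model")

def summarize(article: str) -> str:
    # Generation limits mirror the defaults described below (~128-token summaries, max 256).
    result = summarizer(article, max_length=256, min_length=30, truncation=True)
    return result[0]["summary_text"]

demo = gr.Interface(
    fn=summarize,
    inputs=gr.Textbox(lines=12, label="Scientific article text"),
    outputs=gr.Textbox(label="Generated abstract"),
    title="BART Scientific Paper Summariser",
)

if __name__ == "__main__":
    demo.launch()
```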
Evaluation was performed on a validation subset after fine-tuning on a 40k train / 4k val slice, a common practice for rapid iteration and establishing a strong baseline.
| Metric | Score | Interpretation |
|---|---|---|
| ROUGE-1 | 36.80 | High lexical overlap with human abstracts. |
| ROUGE-2 | 13.29 | Good performance on capturing key bi-grams/phrases. |
| ROUGE-L | 22.33 | Strong longest common subsequence overlap. |
| BERTScore-F1 | 85.17 | Confirms high semantic similarity between prediction and reference. |
.
├── README.md
├── requirements.txt
├── app.py
├── notebooks/
│ ├── 01_dataset_exploration.ipynb
│ ├── 02_model_training.ipynb
│ └── 03_model_evaluation.ipynb
└── src/
├── __init__.py
├── data_loader.py # download → tokenize → cache (project-root paths)
├── model.py # tokenizer/model getters
├── train.py # CLI training entry (optional; notebooks preferred)
├── evaluate.py # scriptable evaluation
├── utils.py # shared helpers
└── seed_utils.py # deterministic seeds
Outputs (created at runtime):
outputs/
├── model/ # final exported checkpoint (for local eval)
└── eval/
├── metrics.json
└── examples.json
Cache:
data/cache/
└── scientific_papers_arxiv_bartbase_1024_256_{raw,tok}/
Important: All paths resolve relative to the project root, not the notebook working directory. This prevents duplicate caches under `notebooks/`.
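A minimal sketch of this kind of project-root resolution, assuming the helper lives one level below the root (constant names like `PROJECT_ROOT` are illustrative, not necessarily what `src/data_loader.py` uses):

```python
# Sketch: resolve paths from the project root so notebooks and CLI runs share one cache.
from pathlib import Path

# Assumes this module sits in src/, one level below the project root.
PROJECT_ROOT = Path(__file__).resolve().parents[1]
CACHE_DIR = PROJECT_ROOT / "data" / "cache"
OUTPUT_DIR = PROJECT_ROOT / "outputs"

CACHE_DIR.mkdir(parents=True, exist_ok=True)
```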
Targeted for Python 3.10+. Supports standard CPU and Apple Silicon (MPS).
- Installation: `pip install -r requirements.txt`
- Stability flags (for macOS/Jupyter): If you encounter import stalls or parallelism issues, set these environment variables before importing `transformers` or `torch`:

import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["HF_DATASETS_DISABLE_MULTIPROCESSING"] = "1"
# ... (other stability flags)
- Dataset: `scientific_papers` → `arxiv` split
- Fields: `article` (input), `abstract` (target)
- Default lengths: `max_input_len=1024`, `max_target_len=256`

The loader will download and cache both raw and tokenized datasets to `data/cache/` on first run.
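The loader's exact API is not shown here; the sketch below illustrates the equivalent steps with the `datasets`/`transformers` libraries (the cache path and column handling are assumptions):

```python
# Sketch: download, tokenize, and disk-cache the arxiv split of scientific_papers.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
# Newer `datasets` releases may require trust_remote_code=True for this script-based dataset.
raw = load_dataset("scientific_papers", "arxiv")  # provides article / abstract fields

def tokenize_fn(batch):
    model_inputs = tokenizer(batch["article"], max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["abstract"], max_length=256, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tok = raw.map(tokenize_fn, batched=True, remove_columns=raw["train"].column_names)
tok.save_to_disk("data/cache/scientific_papers_arxiv_bartbase_1024_256_tok")
```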
The notebooks are the authoritative way to run the project end-to-end.
01_dataset_exploration.ipynb:
- Loads raw + tokenized datasets.
- Prints samples and plots token length distributions (articles vs abstracts).
- Verifies cache integrity.
02_model_training.ipynb:
- Loads BART-base and tokenized data.
- Uses `Seq2SeqTrainer` with MPS/CPU-friendly defaults.
- Example fast path: subset training (40k train / 4k val) with 1–2 epochs for a strong baseline.
- Saves the final model to `outputs/model/`.
Key training args (representative):
per_device_train_batch_size=1
per_device_eval_batch_size=1
gradient_accumulation_steps=4
learning_rate=5e-5
num_train_epochs=2
predict_with_generate=True
save_strategy="epoch"
load_best_model_at_end=True
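A minimal sketch of how these arguments could be wired into `Seq2SeqTrainer` (the cache path, subset sizes, and the added evaluation strategy are assumptions; `load_best_model_at_end=True` requires evaluation each epoch):

```python
# Sketch: wire the representative arguments into a Seq2SeqTrainer run.
from datasets import load_from_disk
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")
tok = load_from_disk("data/cache/scientific_papers_arxiv_bartbase_1024_256_tok")

args = Seq2SeqTrainingArguments(
    output_dir="outputs/model",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,
    num_train_epochs=2,
    predict_with_generate=True,
    save_strategy="epoch",
    eval_strategy="epoch",  # "evaluation_strategy" on transformers < 4.41
    load_best_model_at_end=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tok["train"].select(range(40_000)),      # fast-path subset
    eval_dataset=tok["validation"].select(range(4_000)),
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
trainer.save_model("outputs/model")
```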
03_model_evaluation.ipynb:
- Loads the saved checkpoint from `outputs/model/`.
- Generates summaries for a validation subset (configurable).
- Computes ROUGE and BERTScore; writes JSON artifacts.
- Plots a simple ROUGE bar chart.
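A minimal sketch of the scoring step using the Hugging Face `evaluate` library (the placeholder texts and the exact `metrics.json` layout are assumptions):

```python
# Sketch: score generated summaries against reference abstracts and persist metrics.
import json
import os

import evaluate

# Placeholders; in the notebook these come from model generation and the dataset.
predictions = ["the model's generated abstract ..."]
references = ["the human-written reference abstract ..."]

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

rouge_scores = rouge.compute(predictions=predictions, references=references)
bert = bertscore.compute(predictions=predictions, references=references, lang="en")

metrics = {
    "rouge1": rouge_scores["rouge1"],
    "rouge2": rouge_scores["rouge2"],
    "rougeL": rouge_scores["rougeL"],
    "bertscore_f1": sum(bert["f1"]) / len(bert["f1"]),
}

os.makedirs("outputs/eval", exist_ok=True)
with open("outputs/eval/metrics.json", "w") as f:
    json.dump(metrics, f, indent=2)
```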
Notebooks are recommended. If you prefer CLI, you can run:
`python -m src.train`

If you encounter libc++abi or mutex issues on macOS with CLI runs, prefer the notebooks or add the environment flags shown in the Environment section.
Evaluation from CLI:
`python -m src.evaluate`

- Backbone: `facebook/bart-base`
- Tokenizer: BPE (uses `vocab.json` + `merges.txt`)
- Generation: Beam search by default (`num_beams=4`, `early_stopping=True`); see the sketch after the beam-width notes below.
- Input limits: Inputs truncated to 1024 tokens; summaries targeted at ~128 tokens (min 30, max 256, configurable)
Beam Search Width (num_beams)
- 1: greedy, fastest, can be dull
- 4–6: balanced quality (default)
- 8+: thorough but slower; may bias toward generic phrasing
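A minimal generation sketch under these defaults (the checkpoint path and placeholder article text are illustrative):

```python
# Sketch: beam-search generation with the defaults described above.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
model = AutoModelForSeq2SeqLM.from_pretrained("outputs/model")  # or the Hub checkpoint

article = "..."  # raw article text
inputs = tokenizer(article, max_length=1024, truncation=True, return_tensors="pt")

summary_ids = model.generate(
    **inputs,
    num_beams=4,
    early_stopping=True,
    min_length=30,
    max_length=256,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```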
- Seeding: `seed_utils.set_seed(42)` applied across NumPy/PyTorch.
- Caching: Raw and tokenized datasets saved under `data/cache/`, keyed by model + length settings.
- Artifacts: Metrics and examples saved under `outputs/eval/`.
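`seed_utils.set_seed` is not reproduced here; a typical implementation looks like the sketch below (the Python `random` and CUDA lines are assumptions beyond the NumPy/PyTorch scope stated above):

```python
# Sketch: deterministic seeding across Python, NumPy, and PyTorch.
import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

set_seed(42)
```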
To reset and rebuild caches cleanly:
import shutil, os
shutil.rmtree("data/cache", ignore_errors=True)
os.makedirs("data/cache", exist_ok=True)

Common issues:

- Jupyter import hang on macOS: Set the environment flags under Environment → Stability flags before imports.
- Duplicate caches under `notebooks/`: Caused by relative paths; this repo resolves paths from the project root to prevent it.
- Transformers/Tokenizers version conflicts: Ensure `transformers>=4.40.1` and remove overly pinned `tokenizers` versions.
- Long training times on CPU: Start with a subset (e.g., 40k/4k) to validate the pipeline; scale up on GPU later.
This section outlines planned enhancements to further industrialize the summarization pipeline and explore advanced GenAI techniques.
| Feature | Technical Goal |
|---|---|
| Parameter-Efficient Fine-Tuning (LoRA/QLoRA) | Implement Parameter-Efficient Fine-Tuning (PEFT) using LoRA or QLoRA to reduce memory footprint and training time. |
| Integrate T5 or Pegasus | Benchmark an alternative model architecture (e.g., T5 or Pegasus) on the same task/dataset. |
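As a possible starting point for the PEFT item above, assuming the `peft` library is adopted (the rank and target modules are illustrative, not a committed design):

```python
# Sketch: wrap BART-base with LoRA adapters via peft so only adapter weights train.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSeq2SeqLM

base = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")
lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # BART attention projections
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # reports the small trainable fraction
```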
| Feature | Technical Goal |
|---|---|
| MLflow Experiment Tracking | Integrate MLflow to log training runs, hyperparameters, and ROUGE/BERTScore metrics for systematic experiment comparison. |
| Custom Data Preprocessing | Add PDF parsing or LaTeX stripping to allow direct ingestion of raw scientific papers (not just pre-cleaned Hugging Face data). |
If you use this code/model in academic work, please cite BART and the Hugging Face ecosystem:
- Lewis et al., 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension.
- Hugging Face Transformers, Datasets, and Evaluate libraries.
This project is released under the MIT License. See LICENSE for details.