A production-ready, modular text preprocessing and feature extraction pipeline for Natural Language Processing, designed for clarity, extensibility, and real-world use.
Raw text from the real world is messy: it contains HTML tags, slang, emojis, typos, URLs, and inconsistent casing. Before feeding text into any machine learning or NLP model, it must be cleaned, normalized, and transformed into a numerical representation.
This pipeline provides a comprehensive, step-by-step, modular workflow that takes raw, noisy text through three major stages (cleaning, preprocessing, and feature extraction), turning it into model-ready input. Each function is independently usable, well-documented, and easy to customize for your domain.
- Text classification & sentiment analysis
- Topic modeling & document clustering
- Machine translation & text summarization
- Social media analysis & content moderation
- Information retrieval & semantic search
- Any downstream NLP or ML task
Raw Text Input
                                   │
                                   ▼
┌────────────────────────────────────────────────────────────────────┐
│  🧹 STAGE 1: TEXT CLEANING                                         │
│                                                                    │
│   1. Remove HTML Tags (strip web markup)                           │
│   2. Case Folding (lowercase)                                      │
│   3. Expand Contractions (don't → do not)                          │
│   4. Remove URLs (strip links)                                     │
│   5. Remove Non-ASCII Characters (strip encoding noise)            │
│   6. Handle Emojis (remove or encode)                              │
│   7. Handle Numbers (normalize digits)                             │
│   8. Remove Punctuation (strip symbols)                            │
│   9. Remove Extra Whitespace (clean spaces)                        │
│  10. Text Normalization (fix typos/slang)                          │
│  11. Spelling Correction (fix misspellings)                        │
└────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌────────────────────────────────────────────────────────────────────┐
│  ⚙️ STAGE 2: TEXT PREPROCESSING                                    │
│                                                                    │
│  12. Tokenization (split into words)                               │
│  13. Remove Stopwords (filter noise words)                         │
│  14. POS Tagging (grammatical annotation)                          │
│  15. Lemmatization (base word form)                                │
│  16. Stemming (root word form)                                     │
│  17. Rare Word Removal (trim vocabulary)                           │
│  18. Language Detection (identify input language)                  │
└────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌────────────────────────────────────────────────────────────────────┐
│  🔢 STAGE 3: FEATURE EXTRACTION                                    │
│                                                                    │
│  19. Bag of Words (BoW / CountVectorizer)                          │
│  20. TF-IDF                                                        │
│  21. Word2Vec                                                      │
│  22. GloVe                                                         │
│  23. FastText                                                      │
│  24. BERT (Contextual Embeddings)                                  │
└────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
Model-Ready Numerical Representation
Note: Not every step is required for every task. Use this pipeline as a menu: pick what your task needs. See the Step Selection Guide for task-specific recommendations.
| # | Step | Purpose |
|---|---|---|
| 1 | Remove HTML Tags | Strip web markup |
| 2 | Case Folding | Normalize casing |
| 3 | Expand Contractions | Standardize short forms |
| 4 | Remove URLs | Remove hyperlinks |
| 5 | Remove Non-ASCII Characters | Strip encoding noise |
| 6 | Handle Emojis | Remove or encode emojis |
| 7 | Handle Numbers | Normalize digits |
| 8 | Remove Punctuation | Strip symbols |
| 9 | Remove Whitespace | Clean spacing |
| 10 | Text Normalization | Fix slang and typos |
| 11 | Spelling Correction | Fix misspellings |
| # | Step | Purpose |
|---|---|---|
| 12 | Tokenization | Split into tokens |
| 13 | Remove Stopwords | Filter noise words |
| 14 | POS Tagging | Grammatical annotation |
| 15 | Lemmatization | Reduce to base form |
| 16 | Stemming | Reduce to root form |
| 17 | Rare Word Removal | Trim vocabulary |
| 18 | Language Detection | Identify input language |
| # | Step | Purpose |
|---|---|---|
| 19 | Bag of Words | Sparse word-count matrix |
| 20 | TF-IDF | Weighted word importance |
| 21 | Word2Vec | Dense semantic embeddings |
| 22 | GloVe | Pre-trained global vectors |
| 23 | FastText | Subword-aware embeddings |
| 24 | BERT | Contextual deep embeddings |
| 📊 | Comparison | Choose the right method |
| 💡 | spaCy Alternative Pipeline | Faster all-in-one processing |
| ⚙️ | Full Pipeline Example | End-to-end usage |
| 📋 | Step Selection Guide | Task-specific recommendations |
# Core NLP libraries
pip install nltk contractions beautifulsoup4 emoji spacy textblob
# Word embeddings & classical ML
pip install gensim scikit-learn
# Deep learning & BERT
pip install transformers tensorflow tensorflow-hub tensorflow-text
# Language detection
pip install langdetect pyicu pycld2 polyglot
# Download NLTK corpora
python -m nltk.downloader punkt stopwords wordnet averaged_perceptron_tagger omw-1.4
# Download spaCy model
python -m spacy download en_core_web_sm

| Library | Version |
|---|---|
| Python | 3.8+ |
| pandas | 1.5+ |
| numpy | 1.23+ |
| scikit-learn | 1.2+ |
| nltk | 3.x |
| gensim | 4.x |
| spaCy | 3.x |
| transformers | 4.x |
What it does: Strips all HTML markup (e.g., `<p>`, `<br>`, `<div>`) from text scraped from websites, articles, or comment systems.
Why it matters: HTML tags are structural syntax; they carry zero semantic value for NLP models. Leaving them in adds noise that confuses tokenizers and embeddings.
Example:
Input: <p>The product is <b>amazing</b>!</p>
Output: The product is amazing!
import re
from bs4 import BeautifulSoup
def remove_html_tags(text: str) -> str:
"""Remove HTML tags using BeautifulSoup for robustness."""
return BeautifulSoup(text, "html.parser").get_text(separator=" ")
# Lightweight regex alternative (for simple cases only)
def remove_html_tags_regex(text: str) -> str:
return re.sub(r'<[^>]+>', '', text)
⚠️ Prefer `BeautifulSoup` over regex for HTML: regex can fail on malformed or nested tags.
What it does: Converts all characters in the text to lowercase.
Why it matters: Without this, "Apple", "APPLE", and "apple" are treated as three different words by most models, even though they mean the same thing. Case folding ensures vocabulary consistency.
Example:
Input: "The QUICK Brown Fox"
Output: "the quick brown fox"
def to_lowercase(text: str) -> str:
"""Normalize text casing to lowercase."""
    return text.lower()

💡 Exception: Some tasks (e.g., Named Entity Recognition) may benefit from preserving casing. Apply this step selectively.
What it does: Expands informal short forms into their full grammatical equivalents.
Why it matters: Models may treat "don't" and "do not" as entirely different expressions. Expanding contractions reduces vocabulary size and improves consistency, especially for sentiment analysis.
Example:
Input: "I'm not sure they wouldn't've done it."
Output: "I am not sure they would not have done it."
import contractions
def expand_contractions(text: str) -> str:
"""Expand English contractions to their full forms."""
    return contractions.fix(text)

📦 Install:
pip install contractions
What it does: Detects and removes all web links, whether they begin with http://, https://, www., or other URL patterns.
Why it matters: URLs are almost never semantically meaningful for NLP tasks. They bloat the vocabulary, break tokenizers, and add noise to the feature space.
Example:
Input: "Check this out: https://example.com/article?id=123 - it's great!"
Output: "Check this out: - it's great!"
import re
def remove_urls(text: str) -> str:
"""Remove all URL patterns from text."""
url_pattern = r'https?://\S+|www\.\S+|ftp://\S+'
    return re.sub(url_pattern, '', text).strip()

What it does: Strips characters outside the standard ASCII range (0–127), including accented letters, special Unicode symbols, and encoding artifacts.
Why it matters: Text scraped from the web or copied across systems often contains non-ASCII noise (e.g., \xa0, \u200b, curly quotes). These characters can break tokenizers, corrupt vocabulary, and cause inconsistencies, especially in English-only pipelines.
Example:
Input: "Héllo wörld – thís ís à tëst\xa0"
Output: "Hllo wrld ths s tst"
def remove_non_ascii(text: str) -> str:
"""Remove all non-ASCII characters from text."""
    return text.encode('ascii', errors='ignore').decode('ascii')

💡 For multilingual pipelines, skip this step: non-ASCII characters are meaningful in languages like Arabic, Chinese, and French.
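If you need ASCII-only text but want to keep the base letters rather than dropping accented characters wholesale, a common variation (a minimal sketch using only the standard library) decomposes accents first and then strips the combining marks:

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Transliterate accented characters to their ASCII base letters,
    e.g. 'Héllo wörld' -> 'Hello world' (rather than 'Hllo wrld')."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', errors='ignore').decode('ascii')
```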
What it does: Either removes emojis entirely or converts them to their text description (e.g., ☺️ → :smiling_face:).
Why it matters: Emojis are extremely common in social media, product reviews, and chat data. They often carry strong sentiment signals, and ignoring or blindly removing them can hurt model performance. The right approach depends on your task.
Example:
Input: "This is awesome 🔥💯"
Output (remove): "This is awesome"
Output (encode): "This is awesome :fire: :hundred_points:"
import emoji
def remove_emojis(text: str) -> str:
"""Strip all emoji characters from text."""
return emoji.replace_emoji(text, replace='').strip()
def encode_emojis(text: str) -> str:
"""Convert emojis to their text aliases (e.g., π β ':smiling_face:')."""
return emoji.demojize(text)π¦ Install:
pip install emojiπ‘ When to encode vs. remove: For sentiment analysis, encoding emojis often outperforms removal. For topic modeling, removal is usually preferable.
What it does: Either removes digits entirely or replaces them with a normalized placeholder token like <NUM>.
Why it matters: Raw numbers rarely contribute semantic meaning in text classification. Normalizing them reduces vocabulary size and helps models generalize across different numeric values.
Example:
Input: "There were 1,200 attendees at the event on 05/12/2024."
Output (remove): "There were attendees at the event on ."
Output (replace): "There were <NUM> attendees at the event on <NUM>."
import re
def remove_numbers(text: str) -> str:
"""Remove all digit sequences from text."""
return re.sub(r'\b\d+[\d,\.]*\b', '', text).strip()
def replace_numbers(text: str, token: str = '<NUM>') -> str:
"""Replace numeric values with a placeholder token."""
    return re.sub(r'\b\d+[\d,\.]*\b', token, text)

💡 In financial or scientific NLP tasks, numbers may be critical: skip this step or use domain-specific logic.
What it does: Removes symbols, punctuation marks, and non-alphanumeric characters from text.
Why it matters: Punctuation is generally not meaningful for bag-of-words or embedding-based models. Keeping it can cause "word" and "word!" to be treated as different tokens.
Example:
Input: "Wow!!! This is... #amazing @user 100%"
Output: "Wow This is amazing user 100"
import re
def remove_punctuation(text: str) -> str:
"""Remove all characters except letters, digits, and whitespace."""
    return re.sub(r'[^a-zA-Z0-9\s]', '', text)

💡 If you've already replaced numbers with `<NUM>` tokens, use r'[^a-zA-Z0-9<>\s]' to preserve them.
What it does: Collapses multiple consecutive spaces, tabs, and newlines into a single space, and trims leading/trailing whitespace.
Why it matters: Previous steps (URL/punctuation removal) often leave behind gaps and extra spaces. Whitespace noise can affect tokenization and downstream processing.
Example:
Input: " Hello world \n\t today "
Output: "Hello world today"
import re
def remove_extra_whitespace(text: str) -> str:
"""Normalize all whitespace to single spaces and strip edges."""
    return re.sub(r'\s+', ' ', text).strip()

What it does: Standardizes informal, abbreviated, or misspelled words commonly found in social media and user-generated content. This includes fixing slang, repeated characters, and common abbreviations.
Why it matters: Text from Twitter, Reddit, or product reviews is often written in informal language. "luvvvv", "luv", and "love" all mean the same thing, but a model will treat them as three different words without normalization.
Example:
Input: "omg this is sooooo goooood lolll!!!!"
Output: "oh my god this is soo good loll!!" (residual forms like "soo" and "loll" can be cleaned up by spelling correction in step 11 or by extending the slang map)
import re
# Custom slang/abbreviation dictionary β extend as needed for your domain
SLANG_MAP = {
"omg": "oh my god",
"lol": "laughing out loud",
"brb": "be right back",
"idk": "i do not know",
"imo": "in my opinion",
"tbh": "to be honest",
"smh": "shaking my head",
"ngl": "not going to lie",
"irl": "in real life",
"luv": "love",
"u": "you",
"r": "are",
"ur": "your",
"gr8": "great",
"b4": "before",
"2day":"today",
"thx": "thanks",
"pls": "please",
"w/": "with",
}
def reduce_repeated_chars(text: str, max_repeat: int = 2) -> str:
"""
Reduce sequences of the same character to at most `max_repeat` occurrences.
    E.g., 'soooooo' → 'soo' (spelling correction in step 11 can then reduce it to 'so').
"""
return re.sub(r'(.)\1{' + str(max_repeat) + r',}', r'\1' * max_repeat, text)
def normalize_slang(text: str, slang_map: dict = SLANG_MAP) -> str:
"""Replace known slang/abbreviations with their full forms."""
words = text.split()
return ' '.join(slang_map.get(word, word) for word in words)
def normalize_text(text: str) -> str:
"""Apply full social media text normalization."""
text = reduce_repeated_chars(text)
text = normalize_slang(text)
    return text

💡 You can expand `SLANG_MAP` with domain-specific terms (medical, legal, financial, etc.) to adapt this step to your use case.
What it does: Detects and corrects misspelled words using statistical language models.
Why it matters: User-generated text is riddled with typos and spelling errors. "amazng" and "amazing" represent the same word, but without correction, most models treat them as unrelated terms. Spelling correction is especially valuable for social media and noisy short-text corpora.
Example:
Input: "I havv a qiuet life and enjoiy readng."
Output: "I have a quiet life and enjoy reading."
from textblob import TextBlob
def correct_spelling(text: str) -> str:
"""
Correct spelling using TextBlob's spell checker.
Note: Can be slow on large corpora β consider batching or caching.
"""
    return str(TextBlob(text).correct())

📦 Install:
pip install textblob

⚠️ Performance note: Spelling correction is computationally expensive. For large datasets, apply it selectively (e.g., only on short social media texts) or use faster alternatives like `pyspellchecker`.
⚠️ Accuracy note: Automatic correction can introduce errors for domain-specific jargon. Always validate on a sample of your data.
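As a faster drop-in, here is a rough sketch of the `pyspellchecker` alternative mentioned above (install with pip install pyspellchecker; the calls below reflect that library's documented API, but verify against your installed version):

```python
from spellchecker import SpellChecker

spell = SpellChecker()

def correct_spelling_fast(tokens: list) -> list:
    """Correct only the tokens the checker flags as misspelled."""
    misspelled = spell.unknown(tokens)
    # spell.correction() can return None when no candidate is found
    return [(spell.correction(t) or t) if t in misspelled else t for t in tokens]
```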
What it does: Splits a continuous string of text into individual units called tokens β typically words, but sometimes subwords or characters depending on the approach.
Why it matters: Tokenization is the bridge between raw text and numerical representation. Every subsequent step operates on individual tokens. The quality of tokenization directly affects everything downstream.
Example:
Input: "The quick brown fox jumps."
Output: ["The", "quick", "brown", "fox", "jumps", "."]
import nltk
from nltk.tokenize import word_tokenize, TweetTokenizer
nltk.download('punkt', quiet=True)
def tokenize(text: str, mode: str = 'word') -> list:
"""
Tokenize text into a list of tokens.
Args:
text: Input string.
mode: 'word' for standard text, 'tweet' for social media.
Returns:
List of string tokens.
"""
if mode == 'tweet':
# TweetTokenizer handles hashtags, mentions, and emoticons better
tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
return tokenizer.tokenize(text)
    return word_tokenize(text)

💡 For social media data, `TweetTokenizer` is far more robust than standard word tokenization: it correctly handles `#hashtags`, `@mentions`, and emoticons.
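A quick usage sketch; the printed tokens are illustrative and may vary slightly across NLTK versions:

```python
print(tokenize("@user OMG #NLP is sooooo cool!!", mode='tweet'))
# -> ['omg', '#nlp', 'is', 'sooo', 'cool', '!', '!']
# strip_handles drops '@user'; reduce_len caps repeated characters at three
```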
What it does: Filters out high-frequency, low-information words, known as stopwords, that appear in almost every sentence but carry little semantic weight on their own.
Why it matters: Words like "the", "is", "in", and "a" account for a massive portion of any text corpus but add noise to models. Removing them reduces dimensionality and helps models focus on meaningful words.
Example:
Input: ["the", "cat", "is", "on", "the", "mat"]
Output: ["cat", "mat"]
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords', quiet=True)
def remove_stopwords(tokens: list, language: str = 'english',
extra_stopwords: list = None) -> list:
"""
Remove stopwords from a token list.
Args:
tokens: List of word tokens.
language: Language for NLTK stopwords corpus.
extra_stopwords: Additional domain-specific words to filter.
Returns:
Filtered list of tokens.
"""
stop_words = set(stopwords.words(language))
if extra_stopwords:
stop_words.update(extra_stopwords)
    return [word for word in tokens if word not in stop_words]

💡 Default stopword lists are generic. For domain-specific tasks (e.g., medical NLP), consider adding domain-specific noise words to `extra_stopwords`.
What it does: Assigns a grammatical category (noun, verb, adjective, adverb, etc.) to each token, known as Part-of-Speech tagging.
Why it matters: POS tags give the pipeline awareness of grammatical structure. This is essential for accurate lemmatization (e.g., "better" as adjective → "good" vs. "better" as verb → "better"), and is also independently useful for grammar-aware feature engineering, filtering content words, or syntactic analysis.
Example:
Input: ["The", "cats", "are", "running", "quickly"]
Output: [("The", "DT"), ("cats", "NNS"), ("are", "VBP"), ("running", "VBG"), ("quickly", "RB")]
import nltk
nltk.download('averaged_perceptron_tagger', quiet=True)
def pos_tag_tokens(tokens: list) -> list:
"""
Assign POS tags to a list of tokens using NLTK's averaged perceptron tagger.
Returns:
List of (token, POS_tag) tuples.
"""
return nltk.pos_tag(tokens)
# Common POS tag groups for filtering
CONTENT_TAGS = {'NN', 'NNS', 'NNP', 'NNPS', # Nouns
'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', # Verbs
'JJ', 'JJR', 'JJS', # Adjectives
'RB', 'RBR', 'RBS'} # Adverbs
def keep_content_words(tagged_tokens: list) -> list:
"""Filter to keep only content words (nouns, verbs, adjectives, adverbs)."""
    return [word for word, tag in tagged_tokens if tag in CONTENT_TAGS]

💡 POS tagging also feeds directly into the lemmatization step below: accurate POS tags produce significantly better lemmas.
What it does: Reduces each word to its canonical dictionary form (called a lemma), using vocabulary and morphological analysis.
Why it matters: "running", "ran", and "runs" all derive from the same root: "run". Lemmatization collapses these variants so models treat them as the same concept, reducing vocabulary size without losing meaning.
Example:
Input: ["running", "better", "geese", "studies"]
Output: ["run", "good", "goose", "study"]
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
import nltk
nltk.download('wordnet', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)
lemmatizer = WordNetLemmatizer()
def get_wordnet_pos(word: str) -> str:
"""Map POS tag to WordNet POS constant for accurate lemmatization."""
tag = nltk.pos_tag([word])[0][1][0].upper()
tag_map = {'J': wordnet.ADJ, 'V': wordnet.VERB, 'R': wordnet.ADV}
return tag_map.get(tag, wordnet.NOUN)
def lemmatize_tokens(tokens: list) -> list:
"""
Lemmatize tokens using POS-aware WordNet lemmatizer.
    POS tagging improves accuracy: 'better' → 'good' (adj) vs 'better' (noun).
"""
    return [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in tokens]

⚡ Lemmatization vs. Stemming: Lemmatization is slower but always produces real words. Use it when output interpretability matters (e.g., keyword extraction, topic modeling).
⚠️ Apply either lemmatization or stemming; applying both is redundant.
What it does: Strips word suffixes (and sometimes prefixes) using heuristic rules to reduce words to their root stem β which may not be a real dictionary word.
Why it matters: Stemming is faster than lemmatization and still achieves vocabulary consolidation. It's a practical choice when speed matters more than linguistic precision (e.g., large-scale information retrieval).
Example:
Input: ["running", "runner", "runs", "easily", "fairly"]
Output: ["run", "runner", "run", "easili", "fairli"]
from nltk.stem import PorterStemmer, SnowballStemmer
porter = PorterStemmer()
snowball = SnowballStemmer(language='english')
def stem_tokens(tokens: list, algorithm: str = 'porter') -> list:
"""
Stem tokens using either Porter or Snowball stemmer.
Args:
tokens: List of word tokens.
algorithm: 'porter' (English only) or 'snowball' (multilingual).
Returns:
List of stemmed tokens.
"""
stemmer = porter if algorithm == 'porter' else snowball
return [stemmer.stem(word) for word in tokens]
⚠️ Choose either stemming or lemmatization; applying both is redundant. For most modern NLP tasks, lemmatization is preferred for its linguistic accuracy.
What it does: Removes words that appear extremely infrequently across your corpus β below a defined minimum frequency threshold.
Why it matters: Words that appear only once or twice in a large dataset are likely typos, gibberish, or highly specific terms that the model can't learn meaningful patterns from. Removing them reduces vocabulary size and improves training efficiency.
Example:
Corpus: ["cat", "cat", "dog", "xyzabc", "qwerty", "dog", "cat"]
Min frequency = 2 → Remove: ["xyzabc", "qwerty"]
Output tokens: ["cat", "cat", "dog", "dog", "cat"]
from collections import Counter
def remove_rare_words(token_lists: list, min_freq: int = 2) -> list:
"""
Remove tokens that appear fewer than `min_freq` times across all documents.
Args:
token_lists: List of token lists (one per document).
min_freq: Minimum frequency threshold (default: 2).
Returns:
Filtered list of token lists.
"""
# Build global frequency map
all_tokens = [token for doc in token_lists for token in doc]
freq = Counter(all_tokens)
# Filter tokens below threshold
vocab = {word for word, count in freq.items() if count >= min_freq}
    return [[token for token in doc if token in vocab] for doc in token_lists]

💡 `min_freq` is a hyperparameter: tune it based on corpus size. For large corpora (millions of documents), a threshold of 5–10 is common.
What it does: Automatically identifies the language of a given text string.
Why it matters: Real-world datasets, especially from social media or multilingual platforms, often contain mixed-language text. Language detection lets you route text to language-specific pipelines, filter out non-target languages, or apply language-appropriate stopword lists and stemmers.
Example:
Input: "Bonjour le monde" β "fr" (French)
Input: "Hello world" β "en" (English)
Input: "Ω
Ψ±ΨΨ¨Ψ§ Ψ¨Ψ§ΩΨΉΨ§ΩΩ
" β "ar" (Arabic)
from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0  # langdetect is non-deterministic by default; fix the seed

def detect_language(text: str) -> str:
    """
    Detect the language of the input text using langdetect.
    (TextBlob's detect_language() relied on the Google Translate API and has
    been deprecated and removed in recent versions, so langdetect is used here.)
    Returns:
        ISO 639-1 language code (e.g., 'en', 'fr', 'ar').
    """
    try:
        return detect(text)
    except Exception:
        return 'unknown'
def filter_by_language(texts: list, target_lang: str = 'en') -> list:
"""Keep only texts matching the target language."""
    return [t for t in texts if detect_language(t) == target_lang]

📦 Install: pip install langdetect. Alternatives: `pycld2`, `fasttext` (language ID model), or `polyglot` for higher accuracy on short texts.
⚠️ Language detection can be unreliable on very short strings (< 20 characters). Always validate on a sample.
What it does: Converts text into a sparse matrix of word occurrence counts. Each document is represented as a vector where each dimension corresponds to a word in the vocabulary.
Why it matters: BoW is the simplest numerical representation of text. It is fast, interpretable, and often surprisingly effective for text classification tasks, especially with a logistic regression or Naive Bayes classifier on top.
Example:
Corpus: ["I love NLP", "NLP is great", "I love great movies"]
Vocabulary: {great, i, is, love, movies, nlp}
"I love NLP" β [0, 1, 0, 1, 0, 1]
"NLP is great" β [1, 0, 1, 0, 0, 1]
from sklearn.feature_extraction.text import CountVectorizer
def bag_of_words(corpus: list, max_features: int = 5000, ngram_range: tuple = (1, 1)):
"""
Convert text corpus to a BoW count matrix.
Args:
corpus: List of preprocessed text strings.
max_features: Vocabulary size limit.
ngram_range: Tuple (min_n, max_n) for n-gram extraction.
Returns:
(sparse matrix, fitted vectorizer)
"""
vectorizer = CountVectorizer(max_features=max_features, ngram_range=ngram_range)
X = vectorizer.fit_transform(corpus)
    return X, vectorizer

💡 Use `ngram_range=(1, 2)` to capture bigrams (e.g., "not good"); this is especially important for sentiment analysis.
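A usage sketch on the toy corpus above. One practical wrinkle worth knowing: scikit-learn's default token_pattern only keeps tokens of two or more characters, so the single-character word "i" will not actually appear in the fitted vocabulary:

```python
corpus = ["I love NLP", "NLP is great", "I love great movies"]
X, vec = bag_of_words(corpus, ngram_range=(1, 2))
print(vec.get_feature_names_out())  # unigrams and bigrams, e.g. 'love nlp'
print(X.toarray().shape)            # (3, vocabulary_size)
```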
What it does: Weights each word by how frequently it appears in a document (TF) relative to how many documents contain it (IDF). Rare but informative words get higher scores; common words like "the" get lower scores.
Why it matters: TF-IDF outperforms raw BoW because it automatically downweights words that appear in nearly every document, even if stopwords weren't removed. It remains one of the most effective baselines for text classification.
Example:
"disaster" appears often in one tweet but rarely across tweets β high TF-IDF
"the" appears in nearly every document β low TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
def tfidf_vectorize(corpus: list, max_features: int = 5000,
ngram_range: tuple = (1, 2), sublinear_tf: bool = True):
"""
Convert text corpus to a TF-IDF matrix.
Args:
corpus: List of preprocessed text strings.
max_features: Vocabulary size limit.
ngram_range: Tuple (min_n, max_n) for n-gram extraction.
sublinear_tf: Apply log normalization to term frequency.
Returns:
(sparse matrix, fitted vectorizer)
"""
vectorizer = TfidfVectorizer(
max_features=max_features,
ngram_range=ngram_range,
sublinear_tf=sublinear_tf
)
X = vectorizer.fit_transform(corpus)
    return X, vectorizer

💡 `sublinear_tf=True` applies `1 + log(tf)` instead of raw `tf`, which dampens the effect of very high-frequency terms and often improves classification performance.
What it does: Trains a neural network to learn dense, low-dimensional vector representations of words based on their surrounding context. Words that appear in similar contexts end up with similar vectors.
Why it matters: Word2Vec captures semantic relationships that BoW and TF-IDF cannot. "king" - "man" + "woman" ≈ "queen" is a classic example. Document-level embeddings are created by averaging token vectors.
Example:
from gensim.models import Word2Vec
import numpy as np
def train_word2vec(token_lists: list, vector_size: int = 100,
window: int = 5, min_count: int = 2,
workers: int = 4) -> Word2Vec:
"""
Train a Word2Vec model on a tokenized corpus.
Args:
token_lists: List of token lists (one per document).
vector_size: Dimensionality of the word vectors.
window: Maximum distance between the current and predicted word.
min_count: Minimum frequency threshold for vocabulary inclusion.
Returns:
Trained Word2Vec model.
"""
model = Word2Vec(
sentences=token_lists,
vector_size=vector_size,
window=window,
min_count=min_count,
workers=workers
)
return model
def get_document_vector(tokens: list, model: Word2Vec) -> np.ndarray:
"""Average word vectors to produce a document-level embedding."""
vectors = [model.wv[word] for word in tokens if word in model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)

💡 Two training algorithms are available: `sg=1` for Skip-gram (better for rare words) and `sg=0` for CBOW (faster, better for frequent words).
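A usage sketch on a toy corpus (vectors trained on a handful of sentences are not meaningful; this only demonstrates the API):

```python
docs = [["i", "love", "nlp"], ["nlp", "is", "great"], ["i", "love", "great", "movies"]]
model = train_word2vec(docs, vector_size=50, min_count=1)
print(model.wv.most_similar("nlp", topn=2))  # nearest neighbours by cosine similarity
print(get_document_vector(["love", "nlp"], model).shape)  # (50,)
```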
What it does: Uses pre-trained GloVe vectors β dense word embeddings trained on billions of tokens from Wikipedia and Common Crawl β as a strong, ready-to-use semantic baseline.
Why it matters: Training your own word vectors requires a large corpus. Pre-trained GloVe vectors provide strong semantic representations out of the box, often matching or exceeding custom-trained Word2Vec on small datasets.
Example:
import numpy as np
def load_glove_embeddings(glove_path: str) -> dict:
"""
Load pre-trained GloVe embeddings from a .txt file.
Args:
glove_path: Path to the GloVe file (e.g., 'glove.6B.100d.txt').
Returns:
Dictionary mapping word β numpy vector.
"""
embeddings = {}
with open(glove_path, 'r', encoding='utf-8') as f:
for line in f:
values = line.split()
word = values[0]
vector = np.asarray(values[1:], dtype='float32')
embeddings[word] = vector
return embeddings
def embed_with_glove(tokens: list, embeddings: dict, dim: int = 100) -> np.ndarray:
"""Average GloVe vectors for tokens present in the embedding vocabulary."""
vectors = [embeddings[word] for word in tokens if word in embeddings]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

📥 Download pre-trained GloVe vectors from the Stanford NLP GloVe page.
💡 Available sizes: `glove.6B` (6B tokens, 50d/100d/200d/300d), `glove.840B.300d` (840B tokens, 300d).
What it does: Extends Word2Vec by representing each word as a bag of character n-grams. The final word vector is the sum of its subword n-gram vectors.
Why it matters: Because FastText is built from subwords, it can generate meaningful embeddings for words it has never seen (OOV words), including misspellings, hashtags, and compound words. This makes it especially effective for social media and noisy text.
Example:
"running" β subwords: ["run", "runn", "runni", "unnin", "nning", ...]
OOV word "amazng" β still gets a vector from its character n-grams
from gensim.models import FastText
import numpy as np
def train_fasttext(token_lists: list, vector_size: int = 100,
window: int = 5, min_count: int = 2) -> FastText:
"""
Train a FastText model with subword embeddings.
Returns:
Trained FastText model with OOV support.
"""
model = FastText(
sentences=token_lists,
vector_size=vector_size,
window=window,
min_count=min_count
)
return model
def get_fasttext_vector(tokens: list, model: FastText) -> np.ndarray:
"""Get document embedding β handles OOV words via subword averaging."""
vectors = [model.wv[word] for word in tokens] # FastText handles OOV natively
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)

💡 Unlike Word2Vec and GloVe, `model.wv[word]` never raises a `KeyError` in FastText, even for OOV words.
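A small demonstration of that OOV behavior (toy corpus; the vector values themselves are meaningless here):

```python
model = train_fasttext([["amazing", "product"], ["great", "product"]], min_count=1)
print("amazng" in model.wv.key_to_index)  # False: not in the trained vocabulary
print(model.wv["amazng"].shape)           # (100,): built from character n-grams
```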
What it does: Encodes text using a pre-trained Transformer model (BERT) that produces contextual embeddings β the vector for each word depends on its surrounding context in the sentence.
Why it matters: Unlike Word2Vec, GloVe, or FastText, BERT produces different vectors for the same word in different contexts (e.g., "bank" as a financial institution vs. a river bank). BERT consistently achieves state-of-the-art results across virtually all NLP benchmarks.
Example:
Input sentence: "I love NLP with all my heart"
Output: Tensor of shape (1, sequence_length, 768) for BERT-Base
        Tensor of shape (1, sequence_length, 1024) for BERT-Large
from transformers import BertTokenizer, BertModel
import torch
import numpy as np
# Load pre-trained BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()
def bert_encode(texts: list, max_len: int = 128) -> np.ndarray:
"""
Encode a list of texts into BERT sentence embeddings.
Uses mean pooling over the last hidden state to produce a
fixed-size (768,) sentence vector for each input.
Args:
texts: List of raw text strings.
max_len: Maximum token sequence length (truncates longer inputs).
Returns:
numpy array of shape (len(texts), 768).
"""
embeddings = []
with torch.no_grad():
for text in texts:
inputs = tokenizer(
text,
return_tensors='pt',
truncation=True,
max_length=max_len,
padding='max_length'
)
outputs = model(**inputs)
# Mean pool over the token dimension β shape (768,)
embedding = outputs.last_hidden_state.mean(dim=1).squeeze().numpy()
embeddings.append(embedding)
    return np.array(embeddings)

📦 Install:
pip install transformers torch

💡 For production use, consider fine-tuning BERT on your task-specific dataset. A fine-tuned BERT will significantly outperform a frozen encoder.
💡 For faster inference, try `distilbert-base-uncased` (40% smaller, 60% faster, retains ~97% of BERT's accuracy).
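Swapping in DistilBERT is mostly a matter of changing the checkpoint name. A sketch using the generic Auto* classes (DistilBERT's hidden size is also 768, so the rest of bert_encode works unchanged):

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
model = AutoModel.from_pretrained('distilbert-base-uncased')
model.eval()
# bert_encode() above only relies on tokenizer(...) and the model returning
# last_hidden_state, so it works as-is with this smaller checkpoint.
```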
| Method | Representation | Captures Order | Captures Semantics | Handles OOV | Notes |
|---|---|---|---|---|---|
| BoW / CountVectorizer | Sparse | ❌ | ❌ | ❌ | Fast, interpretable, ignores word order |
| TF-IDF | Sparse | ❌ | ❌ | ❌ | Downweights common words automatically |
| Word2Vec | Dense | ❌ | ✅ | ❌ | Fixed vectors per word; fast inference |
| GloVe | Dense | ❌ | ✅ | ❌ | Excellent pre-trained baseline; no training needed |
| FastText | Dense | ❌ | ✅ | ✅ | Best for morphologically rich or noisy text |
| BERT | Contextual | ✅ | ✅✅ | ✅ | Handles polysemy; expensive but most accurate |
| Task | Recommended Method |
|---|---|
| Simple text classification, fast iteration | TF-IDF + Logistic Regression |
| Semantic similarity search | Word2Vec / GloVe sentence averages |
| Noisy or social media text | FastText |
| State-of-the-art classification accuracy | Fine-tuned BERT / RoBERTa |
| Low-resource or real-time inference | TF-IDF or BoW |
| Multilingual tasks | FastText (multilingual) or mBERT |
spaCy offers a faster, more production-ready alternative to NLTK, with built-in tokenization, lemmatization, POS tagging, and NER in a single efficient pass.
When to use spaCy over NLTK:
- Processing large volumes of text (spaCy is significantly faster)
- When you need multiple annotations in one pass (tokens + POS + lemmas + entities)
- Production systems where latency matters
import spacy
nlp = spacy.load("en_core_web_sm")
def spacy_preprocess(text: str, remove_stop: bool = True,
remove_punct: bool = True) -> list:
"""
Process text through the spaCy pipeline.
Returns lemmatized tokens with optional stopword/punctuation filtering.
Example:
Input: "The cats were running quickly."
Output: ["cat", "run", "quickly"]
"""
doc = nlp(text)
tokens = [
token.lemma_.lower()
for token in doc
if not (remove_stop and token.is_stop)
and not (remove_punct and token.is_punct)
and not token.is_space
]
    return tokens

📦 Install:
pip install spacy && python -m spacy download en_core_web_sm
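For large corpora, nlp.pipe() streams documents in batches and is much faster than calling nlp() once per text. A sketch (disabling the parser and NER is an assumption that you only need lemmas and stopword flags):

```python
texts = ["The cats were running quickly.", "spaCy processes text in batches."]
for doc in nlp.pipe(texts, batch_size=64, disable=["parser", "ner"]):
    print([t.lemma_.lower() for t in doc if not t.is_stop and not t.is_punct])
```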
Here is a complete end-to-end example combining all cleaning and preprocessing steps, followed by TF-IDF feature extraction:
import re
import nltk
import emoji
import contractions
from bs4 import BeautifulSoup
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
# Download required NLTK data
for resource in ['punkt', 'stopwords', 'wordnet', 'averaged_perceptron_tagger', 'omw-1.4']:
nltk.download(resource, quiet=True)
lemmatizer = WordNetLemmatizer()
STOP_WORDS = set(stopwords.words('english'))
SLANG_MAP = {"gr8": "great", "u": "you", "thx": "thanks"}  # minimal demo map; extend as needed
def get_wordnet_pos(word: str) -> str:
tag = nltk.pos_tag([word])[0][1][0].upper()
return {'J': wordnet.ADJ, 'V': wordnet.VERB, 'R': wordnet.ADV}.get(tag, wordnet.NOUN)
def full_preprocess(text: str,
keep_emojis: bool = False,
replace_numbers: bool = True) -> list:
"""
    Full NLP preprocessing pipeline (Stages 1 and 2).
Args:
text: Raw input string.
keep_emojis: If True, encodes emojis; if False, removes them.
        replace_numbers: If True, replaces numbers with <num>; if False, removes them.
Returns:
List of preprocessed, lemmatized tokens.
"""
    # Stage 1: Text Cleaning
    text = BeautifulSoup(text, "html.parser").get_text()              # 1. Remove HTML
    text = text.lower()                                               # 2. Lowercase
    text = contractions.fix(text)                                     # 3. Expand contractions
    text = re.sub(r'https?://\S+|www\.\S+', '', text)                 # 4. Remove URLs
    # 6. Emojis are handled *before* the ASCII filter, which would otherwise strip them
    text = emoji.demojize(text) if keep_emojis else emoji.replace_emoji(text, replace='')
    text = text.encode('ascii', errors='ignore').decode()             # 5. Remove non-ASCII
    text = ' '.join(SLANG_MAP.get(w, w) for w in text.split())        # 10. Normalize slang
    text = re.sub(r'\b\d+[\d,\.]*\b', '<num>' if replace_numbers else '', text)  # 7. Numbers
    # 8. Remove punctuation; ':', '_' and '-' are kept so emoji aliases survive
    text = re.sub(r'[^a-z0-9<>:_\-\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()                          # 9. Whitespace

    # Stage 2: Text Preprocessing
    tokens = word_tokenize(text)                                      # 12. Tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]               # 13. Remove stopwords
    tokens = [lemmatizer.lemmatize(t, get_wordnet_pos(t)) for t in tokens]  # 15. Lemmatize
    return tokens
# --- Example Usage ---
raw_text = """
<p>I'm <b>absolutely</b> loving this product!!! 😍🔥
Check: https://example.com/reviews – it's gr8 value for the price.
I've used it 3 times already and it's AMAZING!</p>
"""
tokens = full_preprocess(raw_text, keep_emojis=True, replace_numbers=True)
print(tokens)
# Approximate output (exact tokens vary with NLTK version; word_tokenize may
# additionally split the emoji aliases and the <num> placeholder on ':' and '<>'):
# ['absolutely', 'love', 'product', 'smiling_face_with_heart-eyes', 'fire',
#  'check', 'great', 'value', 'price', 'use', '<num>', 'time', 'already', 'amazing']

| Task | Recommended Steps |
|---|---|
| Text Classification | 1, 2, 3, 4, 7, 8, 9, 12, 13, 15 + TF-IDF or Word2Vec |
| Sentiment Analysis | 1, 2, 3, 4, 6 (encode), 8, 9, 12, 15 + FastText or BERT |
| Topic Modeling | 1, 2, 4, 6 (remove), 7, 8, 9, 12, 13, 15, 17 + BoW or TF-IDF |
| Social Media NLP | 1, 2, 3, 4, 6, 10, 8, 9, 12, 13, 15 + FastText |
| Information Retrieval | 2, 4, 8, 9, 12, 13, 16 + TF-IDF |
| Machine Translation | 1, 3, 4, 9 (minimal preprocessing) + BERT / mBERT |
| Named Entity Recognition | Skip step 2 (preserve casing); 1, 3, 4, 12, 14 + BERT |
| Semantic Similarity | 1, 2, 3, 4, 8, 9, 12 + GloVe or BERT |
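To make the "menu" concrete, here is a sketch composing one row of the table above (Topic Modeling) from the functions defined in earlier sections; it assumes those functions are in scope or importable:

```python
def topic_modeling_pipeline(raw_docs: list):
    """Steps 1, 2, 4, 6 (remove), 7, 8, 9, 12, 13, 15, 17 + BoW."""
    docs = []
    for text in raw_docs:
        text = remove_html_tags(text)           # 1
        text = to_lowercase(text)               # 2
        text = remove_urls(text)                # 4
        text = remove_emojis(text)              # 6 (remove variant)
        text = remove_numbers(text)             # 7
        text = remove_punctuation(text)         # 8
        text = remove_extra_whitespace(text)    # 9
        tokens = tokenize(text)                 # 12
        tokens = remove_stopwords(tokens)       # 13
        tokens = lemmatize_tokens(tokens)       # 15
        docs.append(tokens)
    docs = remove_rare_words(docs, min_freq=2)  # 17
    return bag_of_words([' '.join(d) for d in docs])  # Stage 3: BoW
```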
- Modular by design: each function is independent and can be used in isolation or composed in any order that suits your task.
- Three-stage architecture: the pipeline is organized into the three natural phases of NLP data preparation (cleaning, preprocessing, and feature extraction), making it easy to enter or exit at any stage.
- Production-ready code: proper function signatures, docstrings, type hints, and safe handling of edge cases throughout.
- Extensible for any domain: customize slang maps, stopword lists, and frequency thresholds for medical, legal, financial, or social media text.
- Compatible with the modern ML stack: works seamlessly with scikit-learn, pandas, Hugging Face Transformers, Gensim, spaCy, and more.
nlp-preprocessing-pipeline/
├── preprocessing/
│   ├── __init__.py
│   ├── cleaner.py        # HTML, URLs, non-ASCII, punctuation, whitespace
│   ├── normalizer.py     # Case, contractions, slang, numbers, emojis, spelling
│   ├── tokenizer.py      # Tokenization logic
│   └── reducer.py        # Stopwords, POS tagging, lemmatization, stemming, rare words
├── features/
│   ├── __init__.py
│   ├── classical.py      # BoW, TF-IDF
│   └── embeddings.py     # Word2Vec, GloVe, FastText, BERT
├── examples/
│   └── full_pipeline.py  # End-to-end usage example
├── tests/
│   └── test_pipeline.py
├── requirements.txt
└── README.md
- Text Classification Algorithms: A Survey – Kowsari et al., 2019
- Distributed Representations of Words and Phrases (Word2Vec) – Mikolov et al., 2013
- GloVe: Global Vectors for Word Representation – Pennington et al., 2014
- Enriching Word Vectors with Subword Information (FastText) – Bojanowski et al., 2017
- BERT: Pre-training of Deep Bidirectional Transformers – Devlin et al., 2018
- Natural Language Processing in Action – Hobson Lane et al.
- Speech and Language Processing – Jurafsky & Martin
- Getting Started with Text Preprocessing
- NLP with Disaster Tweets β EDA, Cleaning & BERT
- NLP EDA β BoW, TF-IDF, GloVe, BERT
- Natural Language Processing Pipeline
- NLTK Documentation
- spaCy Documentation
- Hugging Face Transformers
Contributions, issues, and feature requests are welcome. Feel free to open a pull request or start a discussion.
This project is licensed under the MIT License.