ASL Hand Sign Detection is an end-to-end computer vision learning project focused on detecting American Sign Language (ASL) signs A, B, and C. The repository provides tools for custom data collection, preprocessing (including MediaPipe landmark extraction), training two model approaches (landmark-based and CNN), and running real-time inference from a webcam.
Purpose: Bridge the gap between ML theory and practical implementation by building a full pipeline: data collection → preprocessing → training → inference → evaluation.
Learning objectives:
- Custom data collection and dataset curation
- Feature engineering with MediaPipe landmarks vs end-to-end CNNs
- Model evaluation and comparison (accuracy, confusion matrices, FPS)
- Real-time inference and performance tuning
Two complementary approaches are supported:
- Landmarks-based: Use MediaPipe to extract hand landmarks and train a feedforward neural network on those features.
- CNN-based: Train a convolutional neural network directly on images.
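To make the landmarks-based approach concrete, here is a minimal sketch of how the 21 hand landmarks that MediaPipe returns (each with x, y, z) could be flattened into a 63-value feature vector for a feedforward network. The wrist-relative translation, the scale normalization, and the function name `landmarks_to_features` are assumptions for illustration and may differ from what this repository does.

```python
from typing import List, Sequence, Tuple

Landmark = Tuple[float, float, float]  # (x, y, z) for one hand keypoint

NUM_LANDMARKS = 21  # MediaPipe Hands returns 21 keypoints per detected hand


def landmarks_to_features(landmarks: Sequence[Landmark]) -> List[float]:
    """Flatten 21 (x, y, z) landmarks into a 63-value feature vector.

    Assumed normalization (hypothetical, for illustration):
    1. translate so the wrist (landmark 0) sits at the origin;
    2. divide by the largest absolute coordinate so values fall in [-1, 1].
    """
    if len(landmarks) != NUM_LANDMARKS:
        raise ValueError(f"expected {NUM_LANDMARKS} landmarks, got {len(landmarks)}")

    wx, wy, wz = landmarks[0]
    shifted = [(x - wx, y - wy, z - wz) for x, y, z in landmarks]

    flat = [coord for point in shifted for coord in point]
    scale = max(abs(v) for v in flat) or 1.0  # avoid dividing by zero
    return [v / scale for v in flat]


# Synthetic hand: every landmark is offset a little further from the wrist.
fake_hand = [(0.5 + 0.01 * i, 0.5 - 0.02 * i, 0.0) for i in range(NUM_LANDMARKS)]
features = landmarks_to_features(fake_hand)
print(len(features))
```

Normalizing relative to the wrist makes the features less sensitive to where the hand appears in the frame, which is one reason a small network on landmarks can be much cheaper than a CNN on pixels.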
Key technologies: OpenCV, MediaPipe, PyTorch + torchvision, Python 3.8+.
```mermaid
graph TD
    A[Data Collection] --> B[Preprocessing]
    B --> C[Landmarks Extraction]
    B --> D[Image Dataset]
    C --> E[Landmarks Model Training]
    D --> F[CNN Model Training]
    E --> G[Inference]
    F --> G[Inference]
    G --> H[Results & Comparison]
```
```
HandSignDetection/
├── src/
│   ├── __init__.py
│   ├── collect_data.py
│   ├── preprocess_data.py (TBD)
│   ├── train_landmarks.py (TBD)
│   ├── train_cnn.py (TBD)
│   ├── compare_models.py
│   └── utils/
│       ├── __init__.py
│       ├── config.py
│       ├── logger.py
│       ├── data.py
│       ├── mediapipe_utils.py
│       ├── metrics.py
│       └── visualization.py
├── config/
│   └── config.yaml
├── data/
│   ├── raw/
│   ├── processed/
│   └── landmarks/
├── models/
│   ├── landmarks/
│   └── cnn/
├── logs/
├── results/
├── README.md
├── requirements.txt
└── LICENSE
```
- `src/`: Source code and scripts
- `config/`: YAML configuration files
- `data/raw/`: Collected images arranged by sign label
- `data/processed/`: Validated and split datasets
- `data/landmarks/`: MediaPipe landmark arrays
- `models/`: Trained models (timestamped)
- `logs/`: Runtime logs
- `results/`: Evaluation reports and visualizations
Prerequisites: Python 3.8+, webcam for data collection.
Installation:
- Create and activate a virtual environment:

  macOS/Linux:
  ```bash
  python -m venv venv
  source venv/bin/activate
  ```
  Windows:
  ```
  python -m venv venv
  venv\Scripts\activate
  ```
- Install dependencies:
  ```bash
  pip install -r requirements.txt
  ```
- Verify installation by importing core packages (OpenCV, MediaPipe, PyTorch).
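One way to run the verification step is a small script that checks whether each package can be found. This is only a sketch: the module list follows the key technologies named earlier (OpenCV is imported as `cv2`, PyTorch as `torch`), and the helper name `check_modules` is hypothetical.

```python
import importlib.util
from typing import Dict, Iterable

# Import names for the core packages; "cv2" is the module installed by OpenCV.
CORE_MODULES = ("cv2", "mediapipe", "torch", "torchvision")


def check_modules(names: Iterable[str] = CORE_MODULES) -> Dict[str, bool]:
    """Return {module_name: found?} by looking up each top-level module spec."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_modules().items():
        print(f"{name:12s} {'OK' if found else 'MISSING'}")
```

A `MISSING` entry usually means the virtual environment is not activated or `pip install -r requirements.txt` did not complete.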
Auto-create directories: Running any script that calls create_directories() from src.utils.config will create required folders automatically.
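For readers curious what such a helper might look like, here is a minimal sketch built on `pathlib`. The folder list comes from the project tree, but the body and the `root` parameter are assumptions; the real `create_directories()` in `src.utils.config` may read its paths from `config/config.yaml` instead.

```python
import tempfile
from pathlib import Path
from typing import List

# Folder layout taken from the project tree; the function body is a guess.
REQUIRED_DIRS = (
    "data/raw",
    "data/processed",
    "data/landmarks",
    "models/landmarks",
    "models/cnn",
    "logs",
    "results",
)


def create_directories(root: Path) -> List[Path]:
    """Create every required folder under `root`; safe to call repeatedly."""
    created = []
    for rel in REQUIRED_DIRS:
        path = root / rel
        path.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        created.append(path)
    return created


# Demo in a throwaway directory so nothing is written into the repository.
with tempfile.TemporaryDirectory() as tmp:
    created = create_directories(Path(tmp))
    all_exist = all(p.is_dir() for p in created)
print(len(created), all_exist)
```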
- Purpose: Capture hand sign images via webcam (`src/collect_data.py`).
- Usage: Run the script, then press the `A`, `B`, or `C` key to save frames for the respective sign. Press `q` to quit.
- Output: Files are saved to `data/raw/{sign}/`.
Collect 100–200 varied images per sign (different lighting, backgrounds, hand poses).
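The key handling described above can be sketched without a camera. The snippet assumes key codes arrive as `cv2.waitKey(1) & 0xFF` (the usual OpenCV pattern), that lowercase and uppercase keys are treated alike, and that files are named `{sign}_{index}.jpg`; the helper names and the file naming are hypothetical.

```python
from pathlib import Path
from typing import Optional

SIGNS = ("A", "B", "C")
RAW_DIR = Path("data/raw")


def sign_for_key(key_code: int) -> Optional[str]:
    """Map a key code (e.g. cv2.waitKey(1) & 0xFF) to a sign label, or None."""
    char = chr(key_code).upper()
    return char if char in SIGNS else None


def is_quit(key_code: int) -> bool:
    """True when the pressed key is q or Q."""
    return chr(key_code).lower() == "q"


def output_path(sign: str, index: int) -> Path:
    """Build data/raw/{sign}/{sign}_{index:04d}.jpg (file naming is an assumption)."""
    return RAW_DIR / sign / f"{sign}_{index:04d}.jpg"


print(sign_for_key(ord("a")), output_path("C", 7).as_posix())
```

In a capture loop, a non-`None` result from `sign_for_key` would trigger saving the current frame to `output_path(sign, counter)`.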
- Purpose: Validate images, extract landmarks, and create train/val/test splits (`src/preprocess_data.py`, TBD).
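As an illustration of the splitting step, the sketch below performs a per-class (stratified) split so that each sign keeps roughly the same train/val/test proportions. The 70/15/15 ratios, the seed, and the function name are assumptions, not values taken from this project.

```python
import random
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[str, str]  # (image path, label)


def stratified_split(
    samples: Sequence[Sample],
    ratios: Tuple[float, float, float] = (0.7, 0.15, 0.15),  # assumed ratios
    seed: int = 42,
) -> Dict[str, List[Sample]]:
    """Split per label so each class keeps roughly the same split ratios."""
    by_label: Dict[str, List[Sample]] = {}
    for item in samples:
        by_label.setdefault(item[1], []).append(item)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    splits: Dict[str, List[Sample]] = {"train": [], "val": [], "test": []}
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        n_train = int(len(items) * ratios[0])
        n_val = int(len(items) * ratios[1])
        splits["train"] += items[:n_train]
        splits["val"] += items[n_train:n_train + n_val]
        splits["test"] += items[n_train + n_val:]  # remainder goes to test
    return splits


samples = [(f"data/raw/{s}/{s}_{i:04d}.jpg", s) for s in "ABC" for i in range(100)]
splits = stratified_split(samples)
print({name: len(items) for name, items in splits.items()})
```

Splitting per class matters with small datasets: a plain random split of a few hundred images can leave one sign under-represented in the test set.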
- `src/train_landmarks.py` (TBD): Train a feedforward network (FFN) on landmark features.
- `src/train_cnn.py` (TBD): Train a CNN on images.
Models are saved under models/{landmarks,cnn}/ with timestamps and metrics.
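A timestamped checkpoint path could be built as in the sketch below. The `model_{timestamp}.pth` pattern and the helper name are assumptions; the `YYYYMMDD_HHMMSS` format is borrowed from the suffix this README describes for files in `results/`.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def checkpoint_path(kind: str, when: Optional[datetime] = None) -> Path:
    """Return models/{kind}/model_{YYYYMMDD_HHMMSS}.pth for kind in {landmarks, cnn}."""
    if kind not in ("landmarks", "cnn"):
        raise ValueError(f"unknown model kind: {kind!r}")
    stamp = (when or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return Path("models") / kind / f"model_{stamp}.pth"


example = checkpoint_path("cnn", datetime(2026, 5, 3, 11, 33, 34))
print(example.as_posix())  # models/cnn/model_20260503_113334.pth
```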
- `src/inference_landmarks.py` and `src/inference_cnn.py` (TBD)
- Run either script to perform real-time predictions on the webcam feed. Press `q` to quit.
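When tuning real-time inference it helps to display a frames-per-second estimate. The class below is a minimal sketch of a rolling FPS counter; its name and the 30-frame window are assumptions, and the optional `now` argument exists only so the example can run with synthetic timestamps.

```python
import time
from collections import deque
from typing import Deque, Optional


class FPSCounter:
    """Rolling frames-per-second estimate over the last `window` frames."""

    def __init__(self, window: int = 30) -> None:
        self._stamps: Deque[float] = deque(maxlen=window)

    def tick(self, now: Optional[float] = None) -> float:
        """Record one frame; return the current FPS (0.0 until two frames exist)."""
        self._stamps.append(time.perf_counter() if now is None else now)
        if len(self._stamps) < 2:
            return 0.0
        elapsed = self._stamps[-1] - self._stamps[0]
        return (len(self._stamps) - 1) / elapsed if elapsed > 0 else 0.0


# Synthetic timestamps: one frame every 0.5 s.
counter = FPSCounter(window=5)
for t in (0.0, 0.5, 1.0):
    fps = counter.tick(t)
print(fps)  # 2.0
```

In a webcam loop, calling `tick()` once per processed frame (with no argument) gives a smoothed reading that can be drawn on the frame.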
Run:
```bash
python -m src.compare_models
```

Prerequisites:
- Both model checkpoints must exist (i.e., training has been run for both pipelines):
  - `models/landmarks/model_latest.pth`
  - `models/cnn/model_latest.pth`
- Test data must be populated by `python -m src.preprocess_data`:
  - `data/landmarks/test.npy`
  - `data/processed/test/{A,B,C}/`
Outputs written to results/ with a YYYYMMDD_HHMMSS timestamp suffix:
| File | Description |
|---|---|
| `comparison_{ts}.json` | Full comparison report (accuracy, confusion matrices, inference times, winners) |
| `confusion_matrices_{ts}.png` | Side-by-side confusion matrix plots for both models |
| `accuracy_comparison_{ts}.png` | Side-by-side overall accuracy bar chart |
| `per_class_accuracy_{ts}.png` | Per-class accuracy breakdown |
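The metrics in the comparison report can be illustrated with a small, dependency-free sketch. It shows one common way to compute a confusion matrix, per-class accuracy, and overall accuracy for the three signs; it is not the code in `src/utils/metrics.py`, and the labels below are toy data.

```python
from typing import Dict, List, Sequence

CLASSES = ("A", "B", "C")


def confusion_matrix(y_true: Sequence[str], y_pred: Sequence[str]) -> List[List[int]]:
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(CLASSES)}
    matrix = [[0] * len(CLASSES) for _ in CLASSES]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def per_class_accuracy(matrix: List[List[int]]) -> Dict[str, float]:
    """Fraction of each true class predicted correctly (diagonal / row sum)."""
    return {
        label: (row[i] / sum(row) if sum(row) else 0.0)
        for i, (label, row) in enumerate(zip(CLASSES, matrix))
    }


def overall_accuracy(matrix: List[List[int]]) -> float:
    """Correct predictions (the diagonal) divided by all predictions."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# Toy labels, not project data.
y_true = ["A", "A", "B", "B", "C", "C"]
y_pred = ["A", "B", "B", "B", "C", "A"]
cm = confusion_matrix(y_true, y_pred)
print(cm)                      # [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
print(per_class_accuracy(cm))  # {'A': 0.5, 'B': 1.0, 'C': 0.5}
```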
Latest results (results/comparison_20260503_113334.json, 242 test samples, classes A/B/C):
| Metric | Landmarks | CNN |
|---|---|---|
| Test accuracy | 99.17% | 99.59% |
| Avg inference time | 0.019 ms/sample | 1.313 ms/sample |
| Speed advantage | 70× faster | — |
| Class | Winner |
|---|---|
| A | CNN (100% vs 98.7%) |
| B | Landmarks (100% vs 98.8%) |
| C | CNN (100% vs 98.8%) |
Overall winner: CNN (accuracy) · Faster model: Landmarks (about 70× faster, at a cost of 0.42 percentage points of accuracy)
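As a quick sanity check, the headline figures can be re-derived from the table values. The misclassification counts below are an inference from the rounded accuracies, not numbers read from the report; the computed speed ratio of about 69 is consistent with the reported 70× if the timings were rounded to three decimals.

```python
# Figures copied from the results table above.
n_test = 242
acc_landmarks, acc_cnn = 99.17, 99.59  # percent
ms_landmarks, ms_cnn = 0.019, 1.313    # milliseconds per sample

speedup = ms_cnn / ms_landmarks
gap_points = acc_cnn - acc_landmarks

# Misclassification counts implied by the rounded accuracies (an inference).
errors_landmarks = round(n_test * (1 - acc_landmarks / 100))
errors_cnn = round(n_test * (1 - acc_cnn / 100))

print(f"speedup ~ {speedup:.1f}x")       # speedup ~ 69.1x
print(f"gap = {gap_points:.2f} points")  # gap = 0.42 points
print(errors_landmarks, errors_cnn)      # 2 1
```

With 242 test samples, the two models differ by a single misclassified image, so the accuracy ranking should be read with caution.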
All runtime parameters (paths, hyperparameters, thresholds) are centralized in config/config.yaml. Modify this file to change behavior without editing code.
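Reading a value from a nested configuration might look like the sketch below. The dictionary stands in for whatever `config/config.yaml` parses to; apart from `min_detection_confidence`, every key name, value, and the `get_setting` helper are hypothetical.

```python
from typing import Any, Dict


def get_setting(config: Dict[str, Any], dotted_key: str, default: Any = None) -> Any:
    """Look up 'section.key' style paths in a nested dict, with a fallback."""
    node: Any = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node


# Stand-in for the parsed YAML file. Key names and values are hypothetical.
config = {
    "mediapipe": {"min_detection_confidence": 0.5},
    "paths": {"raw": "data/raw"},
}
print(get_setting(config, "mediapipe.min_detection_confidence"))  # 0.5
print(get_setting(config, "training.batch_size", default=32))     # 32
```

Falling back to a default keeps scripts usable when a key is missing from the YAML file, at the cost of hiding typos in key names.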
Use the centralized logger:
```python
from src.utils.logger import setup_logger

logger = setup_logger(__name__)
logger.info("Starting data collection...")
```

Log files are written to `logs/` with a timestamped filename.
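For reference, a logger factory with this behavior can be written with the standard `logging` module alone. The sketch below is an assumption about how `setup_logger` might work (console plus a timestamped file); the `log_dir` and `level` parameters, the message format, and the filename pattern are hypothetical.

```python
import logging
import tempfile
from datetime import datetime
from pathlib import Path


def setup_logger(name: str, log_dir: str = "logs", level: int = logging.INFO) -> logging.Logger:
    """Return a logger writing to the console and to <log_dir>/<name>_<timestamp>.log."""
    logger = logging.getLogger(name)
    if logger.handlers:  # already configured; avoid duplicate log lines
        return logger

    Path(log_dir).mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    log_file = Path(log_dir) / f"{name}_{stamp}.log"

    formatter = logging.Formatter("%(asctime)s | %(name)s | %(levelname)s | %(message)s")
    for handler in (logging.StreamHandler(), logging.FileHandler(log_file, encoding="utf-8")):
        handler.setFormatter(formatter)
        logger.addHandler(handler)

    logger.setLevel(level)
    return logger


# Demo in a temporary directory so the repository's logs/ folder is untouched.
demo_dir = tempfile.mkdtemp()
demo = setup_logger("demo", log_dir=demo_dir)
demo.info("Starting data collection...")
```

The early return on existing handlers matters because several modules may call `setup_logger` with the same name; without it each call would add another pair of handlers and every message would be logged repeatedly.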
- Webcam not detected: Check camera permissions, or try a different camera index.
- MediaPipe failures: Improve lighting, or adjust `min_detection_confidence` in the config.
- Slow inference: Lower the resolution or use the landmarks-based model.
- Imports failing: Ensure virtualenv is activated and dependencies are installed.
- Expand supported signs (D–Z)
- Add more robust preprocessing and augmentation
- Evaluate MobileNet/ResNet architectures
- Package the project for deployment (CLI, web, or mobile)
See LICENSE for licensing details. Acknowledgements: the MediaPipe, PyTorch, and OpenCV projects.