Build a multiclass emotion‐recognition system that classifies six emotional states—anger, disgust, fear, happiness, neutral, and sadness—from short voice recordings by extracting audio features (e.g., MFCCs and mel-spectrograms) and training four separate models (SVM, KNN, MLP, and CNN) to compare their performance.
🤢 😄 😢 😠 😨 😐
EmotionRecognition/
├── Data/ # Raw audio files and CSV label files, CREMA-D
├── Utils/ # Feature‐extraction and preprocessing helpers
├── Models/ # Model definitions (e.g., cnn_model.py, svm_model.py, etc.)
├── Training/ # Training scripts for each model
├── Trained_Models/ # Saved model weights (.pth, .pkl)
├── Frontend/ # Streamlit app code
├── Test/ # Unit and integration tests
└── requirements.txt # Python dependencies
1) Our KNN model is larger than 100 MB, so we host it on Hugging Face instead of committing it to the repo.
2) Python version: Python 3.10. Installing the dependencies on other versions may fail.
1) Clone the repo
git clone https://github.com/shuklashreyas/EmotionRecognition
cd EmotionRecognition
brew install git-lfs
git lfs install
2) Create conda environment
conda create -n emotion-voice python=3.10 -y
conda activate emotion-voice
3) Install dependencies
pip install -r requirements.txt
pip install streamlit
pip install streamlit-webrtc
pip install huggingface_hub
pip install soundfile
pip install torch torchvision torchaudio
4) Add secrets to the environment
(Mac)
export HUGGINGFACE_HUB_TOKEN=<your_huggingface_token>
(Windows)
setx HUGGINGFACE_HUB_TOKEN <your_huggingface_token>
5) Run the app
streamlit run Frontend/app.py
Open your browser and navigate to: http://localhost:8501
Upload Audio tab → Choose a WAV file & model → See the predicted emotion
Record Audio tab → Click Start/Stop → Save & Predict → Review your recording + prediction
Our emotion recognition system achieves the following accuracies on the test dataset:
| Model | Accuracy | Notes |
|---|---|---|
| CNN | 98% | Deep learning model using mel-spectrograms |
| SVM | 96% | Support Vector Machine with feature scaling |
| MLP | 65% | Multi-Layer Perceptron neural network |
| KNN | 48% | K-Nearest Neighbors classifier |
Performance metrics are based on 6-class emotion classification (anger, disgust, fear, happiness, neutral, sadness) using the CREMA-D dataset.
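The accuracies above can be reproduced with standard scikit-learn metrics; a minimal sketch (the label arrays here are placeholders, not the project's actual predictions):

```python
# Compute 6-class accuracy the same way the table above reports it.
from sklearn.metrics import accuracy_score

# 0..5 map to anger, disgust, fear, happiness, neutral, sadness.
y_true = [0, 1, 2, 3, 4, 5, 0, 1]   # ground-truth labels (placeholder)
y_pred = [0, 1, 2, 3, 4, 4, 0, 1]   # model predictions (placeholder)

acc = accuracy_score(y_true, y_pred)  # fraction of correct predictions
print(f"accuracy = {acc:.1%}")        # 7 of 8 correct -> 87.5%
```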
- Load WAV → Extract 128×128 mel-spectrogram from audio signal
- Apply feature normalization and data augmentation techniques
- Convert audio to standardized format for consistent processing
- CNN: Deep convolutional model trained directly on mel-spectrograms for pattern recognition
- SVM/MLP/KNN: Traditional ML pipeline:
  - Flatten mel-spectrogram into a feature vector
  - Apply feature scaling for normalization
  - Use PCA for dimensionality reduction
  - Feed into the classical machine learning model
The CNN approach leverages spatial patterns in spectrograms, while traditional ML models rely on statistical features extracted from the flattened audio representations.
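The traditional-ML branch (flatten → scale → PCA → classifier) can be expressed as a single scikit-learn pipeline; the PCA dimensionality, RBF kernel, and random data below are illustrative assumptions:

```python
# Sketch of the SVM branch: flattened spectrograms through scaling, PCA, SVC.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# Flattened 128x128 mel-spectrograms -> 16384-dim vectors (random placeholders).
X = np.random.rand(60, 128 * 128)
y = np.random.randint(0, 6, size=60)   # 6 emotion labels

clf = Pipeline([
    ("scale", StandardScaler()),       # feature scaling
    ("pca", PCA(n_components=32)),     # dimensionality reduction
    ("svm", SVC(kernel="rbf")),        # classical classifier
])
clf.fit(X, y)
preds = clf.predict(X)                 # one label per clip
```

The same pipeline shape works for the MLP (`MLPClassifier`) and KNN (`KNeighborsClassifier`) branches by swapping the final step.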
If you want to retrain or fine-tune the models with your own data:
# Train CNN model
python Training/cnn_train.py
# Train SVM model
python Training/svm_train.py
# Train MLP model
python Training/mlp_train.py
# Train KNN model
python Training/knn_train.py
Each training script includes hyperparameter tuning, cross-validation, and model evaluation metrics.
- Audio Processing: librosa, soundfile, pyaudio
- Machine Learning: scikit-learn, tensorflow/keras, torch
- Feature Extraction: librosa (MFCC, mel-spectrograms), numpy
- Web Interface: streamlit, streamlit-webrtc
- Data Handling: pandas, numpy, matplotlib
- Model Persistence: pickle, joblib
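The persistence pattern for the scikit-learn models can be sketched with joblib; the file name `knn_demo.pkl` and the toy data are illustrative, not the repo's actual artifacts:

```python
# Save a trained sklearn model to disk and load it back for inference.
import numpy as np
import joblib
from sklearn.neighbors import KNeighborsClassifier

X = np.random.rand(30, 64)             # placeholder feature vectors
y = np.random.randint(0, 6, size=30)   # 6 emotion labels

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
joblib.dump(knn, "knn_demo.pkl")       # persist the fitted model
restored = joblib.load("knn_demo.pkl") # reload; predicts identically
```

PyTorch models (the CNN's `.pth` weights) are saved with `torch.save(model.state_dict(), path)` instead.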
This project uses the CREMA-D (Crowdsourced Emotional Multimodal Actors Dataset) containing:
- 7,442 audio clips from 91 actors
- 6 emotion categories: anger, disgust, fear, happiness, neutral, sadness
- Balanced dataset with demographic diversity
- Please refer to the original dataset for licensing and extraction details
- Shreyas Shukla - CNN, Code Modularity, Frontend
- Pavithra Ponnolu - SVM, Research
- Kashvi Mehta - MLP, Dataset Extraction
- Josh Len - KNN, Research
- Expand Dataset: Collect noisy, accented, multi-device audio and diverse dialects for greater robustness.
- Multi-Modal Fusion: Integrate audio with video or physiological signals to enrich emotion cues.
- On-Device Deployment: Compress the model for low-latency, real-time inference on mobile/embedded devices.
- CREMA-D dataset creators for providing high-quality emotional speech data
- Open-source community for the excellent libraries that made this project possible
- This project was built during the summer term of CS4100 (Artificial Intelligence) at Northeastern University

