CodeWithSrish/CineMatch

🎬 CineMatch — MovieLens 1M Recommendation System

Author: Srishti
Date: February 2026
Dataset: MovieLens 1M (via Kaggle)


Overview

CineMatch is an end-to-end movie recommendation system built on the MovieLens 1M dataset. It explores multiple recommendation approaches — from a simple popularity-based baseline to collaborative filtering and content-based machine learning models — and culminates in a production-ready Streamlit web application that delivers personalized movie recommendations.


Dataset

The MovieLens 1M dataset contains:

  • 1,000,209 ratings from 6,040 users on approximately 3,900 movies
  • Three files: ratings.csv, movies.csv, and users.csv
  • Whole-star ratings on a 1–5 scale (no half-star increments)
  • User demographics: gender, age group, and occupation

Project Structure

The notebook is organized into six sections:

Section 1 — Data Loading & Cleaning

Loads all three dataset files, performs a data quality check for missing values, and merges them into a single master dataframe containing ratings, movie metadata, and user demographics.
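
The load, check, and merge steps could look roughly like the sketch below. It uses tiny in-memory stand-ins for the three files, and the column names (`user_id`, `movie_id`, and so on) are assumptions for illustration, not necessarily the names used in the notebook.

```python
import pandas as pd

# Tiny stand-ins for ratings.csv, movies.csv, and users.csv (column names are assumed).
ratings = pd.DataFrame({
    "user_id": [1, 1, 2],
    "movie_id": [10, 20, 10],
    "rating": [5, 3, 4],
})
movies = pd.DataFrame({
    "movie_id": [10, 20],
    "title": ["Movie A", "Movie B"],
    "genres": ["Drama", "Comedy|Romance"],
})
users = pd.DataFrame({
    "user_id": [1, 2],
    "gender": ["F", "M"],
    "age": [25, 35],
    "occupation": [4, 7],
})

# Data quality check: total number of missing values in each table.
tables = {"ratings": ratings, "movies": movies, "users": users}
missing = {name: int(df.isna().sum().sum()) for name, df in tables.items()}

# Merge into one master dataframe. validate= raises an error if a movie or
# user key is unexpectedly duplicated on the right-hand side.
master = (
    ratings
    .merge(movies, on="movie_id", how="left", validate="many_to_one")
    .merge(users, on="user_id", how="left", validate="many_to_one")
)
```

A left join keeps every rating row even if a movie or user record is missing, which makes gaps visible as NaN values instead of silently dropping ratings.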

Section 2 — Exploratory Data Analysis (EDA)

  • Rating Distribution: Ratings skew positive, with a median and mode of 4 stars
  • User Demographics: ~72% male users; top occupations include college students, executives, and engineers
  • Long Tail Analysis: A small number of movies attract most ratings; the majority have very few
  • Sparsity Analysis: The user-movie matrix is ~95% sparse, making matrix completion the core challenge
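
The sparsity figure above is the share of empty cells in the user-movie matrix. A minimal sketch of that calculation, using a hypothetical helper named `sparsity` and toy data:

```python
def sparsity(ratings):
    """Return the fraction of empty cells in the user-movie matrix.

    `ratings` is an iterable of (user_id, movie_id) pairs, one per rating.
    """
    pairs = set(ratings)  # de-duplicate in case a pair appears twice
    n_users = len({u for u, _ in pairs})
    n_movies = len({m for _, m in pairs})
    return 1.0 - len(pairs) / (n_users * n_movies)


# Toy data: 4 filled cells in a 3 x 3 matrix.
toy = [(1, 10), (1, 20), (2, 10), (3, 30)]
toy_sparsity = sparsity(toy)
```

Plugging in the approximate counts from the Dataset section (about one million ratings, 6,040 users, roughly 3,900 movies) gives a value in the 0.95 to 0.96 range; the exact figure depends on whether movies with no ratings are counted.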

Section 3 — Model 1: Baseline (Popularity-Based)

Implements a weighted rating formula inspired by IMDb's Top 250:

$$WR = \frac{v}{v+m} \times R + \frac{m}{v+m} \times C$$

Here v is the number of votes a movie has received, m is the minimum-votes threshold (set at the 80th percentile), R is the movie's average rating, and C is the global mean rating. Movies with few votes are pulled toward C, so a niche title with a handful of high ratings does not outrank a widely rated favorite.
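
As a sketch of how the formula behaves, the function below applies it to two made-up movies. The values of C and m here are hypothetical, chosen only to make the shrinkage effect visible.

```python
def weighted_rating(v, R, m, C):
    """Weighted rating: shrink a movie's mean R toward the global mean C.

    v: number of votes for the movie
    R: the movie's average rating
    m: minimum-votes threshold
    C: global mean rating
    """
    return (v / (v + m)) * R + (m / (v + m)) * C


# Hypothetical numbers: global mean 3.5 and a vote threshold of 100.
C, m = 3.5, 100
niche = weighted_rating(v=5, R=5.0, m=m, C=C)       # few votes, perfect average
popular = weighted_rating(v=2000, R=4.4, m=m, C=C)  # many votes, strong average
```

With these inputs the five-vote movie lands near 3.57 despite its perfect average, while the heavily rated movie stays near 4.36.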

Section 4 — Advanced Model Training

Three models are trained and evaluated:

| Model | Type | Library |
| --- | --- | --- |
| SVD | Matrix Factorization | scikit-surprise |
| KNN (Item-Based) | Collaborative Filtering | scikit-surprise |
| Random Forest | Content-Based (demographics + genres) | scikit-learn |
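
To give a feel for what the matrix factorization model learns, here is a small pure-Python sketch of biased matrix factorization trained with stochastic gradient descent. It predicts a rating as the global mean plus a user bias, an item bias, and a dot product of latent factors, which is the prediction rule documented for scikit-surprise's SVD. This is an illustrative reimplementation on toy data, not the notebook's code; the function name `train_mf` and all hyperparameter values are assumptions.

```python
import random


def train_mf(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit r_hat(u, i) = mu + b_u[u] + b_i[i] + dot(p[u], q[i]) by SGD.

    `ratings` is a list of (user, item, rating) triples.
    Returns a predict(user, item) function.
    """
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    b_u = {u: 0.0 for u in users}
    b_i = {i: 0.0 for i in items}
    # Small random initial factors so the dot product starts near zero.
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    def predict(u, i):
        # Unknown users or items fall back to whatever terms are available.
        est = mu + b_u.get(u, 0.0) + b_i.get(i, 0.0)
        if u in p and i in q:
            est += sum(pf * qf for pf, qf in zip(p[u], q[i]))
        return est

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            # Regularized gradient steps on biases and factors.
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            for f in range(n_factors):
                pu, qi = p[u][f], q[i][f]
                p[u][f] += lr * (err * qi - reg * pu)
                q[i][f] += lr * (err * pu - reg * qi)
    return predict


# Toy ratings: (user, item, rating).
toy = [(1, "A", 5), (1, "B", 4), (2, "A", 4), (2, "C", 2), (3, "B", 5), (3, "C", 1)]
predict = train_mf(toy)
```

Calling `predict(1, "C")` estimates a rating for a pair that never appears in the toy data, which is the matrix completion task the sparsity analysis describes.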

Section 5 — Model Evaluation & Final Selection

All models are compared using RMSE and MAE on an 80/20 train-test split. SVD achieves the best performance:

| Model | RMSE | MAE |
| --- | --- | --- |
| Baseline (Weighted Avg) | ~1.03 | ~0.82 |
| SVD (Matrix Factorization) | ~0.87 | ~0.68 |
| KNN (Item-Based) | ~0.93 | ~0.74 |
| Random Forest | ~0.96 | ~0.76 |
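
For reference, the two metrics and an 80/20 split can be sketched in plain Python as follows. The helper names and the seed are assumptions; the notebook may rely on library utilities instead.

```python
import math
import random


def split_80_20(rows, seed=42):
    """Shuffle rows reproducibly and return (train, test) with an 80/20 split."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(0.8 * len(rows))
    return rows[:cut], rows[cut:]


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large misses more heavily."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error: average size of a miss, in stars."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up ratings and predictions for illustration.
actual = [5, 3, 4, 1]
predicted = [4.5, 3.5, 4.0, 2.0]
example_rmse = rmse(actual, predicted)
example_mae = mae(actual, predicted)
train_rows, test_rows = split_80_20(range(10))
```

RMSE is never smaller than MAE on the same predictions, which matches the pattern in the table: every model's RMSE exceeds its MAE.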

Section 6 — Production Streamlit App

The winning model (SVD) is serialized and packaged into a Streamlit application that allows users to:

  • Select a User ID and view their rating history
  • Generate personalized movie recommendations
  • Browse movie details including genres and predicted ratings
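
Generating recommendations for a selected user typically means scoring every movie the user has not rated and keeping the highest predictions. A minimal sketch, assuming a `predict(user_id, movie_id)` callable that wraps the trained model (the function and variable names here are hypothetical):

```python
def recommend_top_n(predict, user_id, all_movie_ids, rated_movie_ids, n=10):
    """Rank unseen movies for one user by predicted rating, highest first."""
    candidates = [m for m in all_movie_ids if m not in rated_movie_ids]
    scored = [(m, predict(user_id, m)) for m in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Hypothetical stand-in for the trained model's prediction function.
fake_scores = {10: 4.8, 20: 3.1, 30: 4.2, 40: 2.5}


def fake_predict(user_id, movie_id):
    return fake_scores[movie_id]


top = recommend_top_n(
    fake_predict,
    user_id=1,
    all_movie_ids=[10, 20, 30, 40],
    rated_movie_ids={10},
    n=2,
)
```

Movie 10 has the highest score but is excluded because the user already rated it, so the result contains movies 30 and 20.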

Installation & Usage

Prerequisites

pip install numpy pandas matplotlib seaborn scipy scikit-learn scikit-surprise kagglehub streamlit

Running the Notebook

  1. Clone this repository or open the notebook in Kaggle/Colab
  2. Ensure Kaggle API credentials are configured (for kagglehub download)
  3. Run all cells in order

Running the Streamlit App

After running the notebook (which saves model artifacts via pickle):

streamlit run app.py
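
The pickle hand-off between the notebook and the app can be sketched as a save and load round trip. The artifact below is a placeholder dictionary, not the real model, and the file name `model.pkl` is an assumption.

```python
import os
import pickle
import tempfile

# Placeholder artifact standing in for the trained model object.
artifact = {"model_name": "SVD", "global_mean": 3.5}

# Notebook side: write the artifact to disk (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(artifact, f)

# App side: load the artifact back before serving predictions.
with open(path, "rb") as f:
    restored = pickle.load(f)
```

Only load pickle files that you created yourself, since unpickling can execute arbitrary code.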

Key Findings

  • Positive rating bias: Most ratings fall between 3 and 4 stars
  • Long tail challenge: Most movies have very few ratings, which creates a cold-start problem for them
  • High sparsity (~95%): Makes collaborative filtering inherently challenging but solvable via matrix factorization
  • SVD wins: Latent factor models outperform neighborhood-based and content-based approaches on this dataset

Tech Stack

  • Python 3.12
  • Pandas & NumPy — data manipulation
  • Matplotlib & Seaborn — visualization
  • scikit-learn — Random Forest, preprocessing, evaluation
  • scikit-surprise — SVD and KNN collaborative filtering
  • Streamlit — production web application
  • Kaggle — dataset source and notebook hosting

Skills Demonstrated

  • Data engineering and merging large datasets
  • Exploratory data analysis and visualization
  • Machine learning model development and comparison
  • Model evaluation (RMSE, MAE)
  • Production-ready application development
  • Clear documentation and communication

👩‍💻 Author

Srishti Rajput
CineMatch: Your Personalized Movie Recommendation System

About

A movie recommendation system built using collaborative filtering and machine learning techniques, delivering personalized movie suggestions through an interactive application.
