🎬 CineMatch — MovieLens 1M Recommendation System

Author: Srishti
Date: February 2026
Dataset: MovieLens 1M (via Kaggle)

Overview

CineMatch is an end-to-end movie recommendation system built on the MovieLens 1M dataset. It explores multiple recommendation approaches — from a simple popularity-based baseline to collaborative filtering and content-based machine learning models — and culminates in a production-ready Streamlit web application that delivers personalized movie recommendations.

Dataset

The MovieLens 1M dataset contains:

1,000,209 ratings from 6,040 users on approximately 3,900 movies
Three files: ratings.csv, movies.csv, and users.csv
Whole-star ratings on a scale of 1–5 (no half-star increments)
User demographics: gender, age group, and occupation

Project Structure

The notebook is organized into six sections:

Section 1 — Data Loading & Cleaning

Loads all three dataset files, performs a data quality check for missing values, and merges them into a single master dataframe containing ratings, movie metadata, and user demographics.
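The loading, checking, and merging steps above can be sketched with pandas as follows. This is a minimal illustration, not the notebook's actual code: the column names (`UserID`, `MovieID`, `Rating`, and so on) are assumptions, and tiny in-memory frames stand in for the three CSV files.

```python
import pandas as pd

# Tiny stand-ins for ratings.csv, movies.csv, and users.csv.
# Column names are assumed; adjust them to match the real files.
ratings = pd.DataFrame(
    {"UserID": [1, 1, 2], "MovieID": [10, 20, 10], "Rating": [5, 3, 4]}
)
movies = pd.DataFrame(
    {"MovieID": [10, 20], "Title": ["Movie A", "Movie B"],
     "Genres": ["Drama", "Comedy|Romance"]}
)
users = pd.DataFrame(
    {"UserID": [1, 2], "Gender": ["F", "M"], "Age": [25, 35], "Occupation": [4, 7]}
)

# Data quality check: count missing values per column in each table.
missing = {
    name: int(frame.isna().sum().sum())
    for name, frame in {"ratings": ratings, "movies": movies, "users": users}.items()
}

# Merge into one master dataframe, one row per rating.
# validate="many_to_one" raises if a movie or user key is duplicated.
master = (
    ratings
    .merge(movies, on="MovieID", how="left", validate="many_to_one")
    .merge(users, on="UserID", how="left", validate="many_to_one")
)
```

Left joins keep every rating even if a movie or user record is missing, so a non-zero count from `master.isna().sum()` after the merge would point to unmatched keys.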

Section 2 — Exploratory Data Analysis (EDA)

Rating Distribution: Ratings skew positive, with a median and mode of 4 stars
User Demographics: ~72% male users; top occupations include college students, executives, and engineers
Long Tail Analysis: A small number of movies attract most ratings; the majority have very few
Sparsity Analysis: The user-movie matrix is ~95% sparse, making matrix completion the core challenge
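As a rough check on the sparsity figure, the calculation can be sketched from the dataset counts listed earlier. The movie count used here is the approximate total, so this is an estimate rather than the notebook's exact number.

```python
def matrix_sparsity(n_ratings: int, n_users: int, n_movies: int) -> float:
    """Fraction of empty cells in the user-movie rating matrix."""
    density = n_ratings / (n_users * n_movies)
    return 1.0 - density

# Counts from the Dataset section; n_movies is approximate.
sparsity = matrix_sparsity(n_ratings=1_000_209, n_users=6_040, n_movies=3_900)
print(f"Sparsity: {sparsity:.1%}")
```

The result lands between 95% and 96%; the exact value depends on whether all listed movies or only movies with at least one rating are counted.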

Section 3 — Model 1: Baseline (Popularity-Based)

Implements a weighted rating formula inspired by IMDb's Top 250:

$$WR = \frac{v}{v+m} \times R + \frac{m}{v+m} \times C$$

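Where v = number of votes (ratings) for the movie, m = minimum votes threshold (the 80th percentile of per-movie vote counts), R = the movie's average rating, and C = the global mean rating. The formula balances quality and popularity: a movie with only a handful of ratings is pulled toward the global mean, so niche titles with a few high ratings do not dominate the list.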
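A small sketch of the formula in Python may make the shrinkage effect concrete. The values of `C` and `m` below are made up for illustration and are not taken from the notebook.

```python
def weighted_rating(v: float, R: float, m: float, C: float) -> float:
    """Weighted rating: shrink a movie's mean R toward the global mean C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical values, for illustration only.
C = 3.5   # global mean rating
m = 100   # minimum-votes threshold

popular = weighted_rating(v=2000, R=4.5, m=m, C=C)  # many votes: stays near R
niche = weighted_rating(v=5, R=5.0, m=m, C=C)       # few votes: pulled toward C

print(round(popular, 2), round(niche, 2))  # 4.45 3.57
```

Even though the niche movie has a perfect 5.0 average, its five votes carry little weight against `m = 100`, so it ranks below the widely rated movie.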

Section 4 — Advanced Model Training

Three models are trained and evaluated:

| Model | Type | Library |
|---|---|---|
| SVD | Matrix Factorization | `scikit-surprise` |
| KNN (Item-Based) | Collaborative Filtering | `scikit-surprise` |
| Random Forest | Content-Based (demographics + genres) | `scikit-learn` |
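To give a feel for what the matrix factorization model learns, here is a from-scratch sketch of biased matrix factorization trained with stochastic gradient descent. It illustrates the general idea only: the notebook uses the `SVD` class from `scikit-surprise`, and the function name `train_mf`, the hyperparameters, and the toy ratings below are all assumptions made for this example.

```python
import random

def train_mf(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit rating ~ mu + b_u + b_i + p_u . q_i on (user, item, rating) triples."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean
    bu = {u: 0.0 for u in users}                       # user biases
    bi = {i: 0.0 for i in items}                       # item biases
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    def raw(u, i):
        """Unclipped estimate; unseen users or items fall back to the biases."""
        est = mu + bu.get(u, 0.0) + bi.get(i, 0.0)
        if u in p and i in q:
            est += sum(a * b for a, b in zip(p[u], q[i]))
        return est

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - raw(u, i)
            # Gradient steps with L2 regularization.
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                pu, qi = p[u][f], q[i][f]
                p[u][f] += lr * (err * qi - reg * pu)
                q[i][f] += lr * (err * pu - reg * qi)

    def predict(u, i):
        """Estimate clipped to the 1-5 rating scale."""
        return min(5.0, max(1.0, raw(u, i)))

    return predict

# Toy data, for illustration only.
sample_ratings = [
    ("u1", "m1", 5), ("u1", "m2", 3), ("u2", "m1", 4), ("u2", "m3", 2),
    ("u3", "m2", 4), ("u3", "m3", 1), ("u3", "m1", 5),
]
predict = train_mf(sample_ratings)
print(predict("u2", "m2"))  # estimate for a pair not seen in training
```

Each user and movie gets a short vector of latent factors, and the dot product of the two vectors adjusts the bias-only estimate. That is how the model fills in the empty cells of a sparse rating matrix.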

Section 5 — Model Evaluation & Final Selection

All models are compared using RMSE and MAE (lower is better for both) on an 80/20 train-test split. SVD achieves the lowest error on both metrics:

| Model | RMSE | MAE |
|---|---|---|
| Baseline (Weighted Avg) | ~1.03 | ~0.82 |
| SVD (Matrix Factorization) | ~0.87 | ~0.68 |
| KNN (Item-Based) | ~0.93 | ~0.74 |
| Random Forest | ~0.96 | ~0.76 |
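For reference, the two metrics in the table can be computed as in the sketch below. The rating lists are toy values, not the notebook's test split.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large misses more heavily."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error: average size of the miss, in stars."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy ratings, for illustration only.
actual = [5, 3, 4, 1]
predicted = [4, 3, 5, 3]

print(round(rmse(actual, predicted), 3))  # 1.225
print(mae(actual, predicted))             # 1.0
```

RMSE is always at least as large as MAE on the same predictions, and the gap widens when a few predictions are far off, which is consistent with every row of the table above.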

Section 6 — Production Streamlit App

The winning model (SVD) is serialized and packaged into a Streamlit application that allows users to:

Select a User ID and view their rating history
Generate personalized movie recommendations
Browse movie details including genres and predicted ratings
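The recommendation step can be sketched as ranking the movies a user has not yet rated by predicted rating. The helper name `recommend_top_n` and its arguments are hypothetical. With a fitted `scikit-surprise` model, the `predict` argument could be something like `lambda u, m: model.predict(u, m).est`, but how the actual app wires this up is not shown in this README.

```python
def recommend_top_n(user_id, all_movie_ids, rated_movie_ids, predict, n=10):
    """Return the n unrated movies with the highest predicted rating."""
    candidates = [m for m in all_movie_ids if m not in rated_movie_ids]
    scored = [(m, predict(user_id, m)) for m in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

# Stand-in predictor backed by fixed scores, for illustration only.
fake_scores = {10: 4.2, 20: 3.1, 30: 4.8, 40: 2.5}
recs = recommend_top_n(
    user_id=1,
    all_movie_ids=[10, 20, 30, 40],
    rated_movie_ids={30},               # already rated, so excluded
    predict=lambda u, m: fake_scores[m],
    n=2,
)
print(recs)  # [(10, 4.2), (20, 3.1)]
```

Excluding already rated movies keeps the list focused on new suggestions, and the predicted ratings returned alongside each movie ID can be shown in the movie details view.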

Installation & Usage

Prerequisites

pip install numpy pandas matplotlib seaborn scipy scikit-learn scikit-surprise kagglehub streamlit

Running the Notebook

Clone this repository or open the notebook in Kaggle/Colab
Ensure Kaggle API credentials are configured (for kagglehub download)
Run all cells in order

Running the Streamlit App

After running the notebook (which saves model artifacts via pickle):

streamlit run app.py
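The save-and-load round trip behind the model artifacts can be sketched with the standard `pickle` module. The file name `svd_model.pkl` is hypothetical, and a plain dictionary stands in for the trained model.

```python
import pickle
import tempfile
from pathlib import Path

# Placeholder for the trained model; any picklable object works the same way.
model = {"algorithm": "SVD"}

# Hypothetical artifact name, written to a temporary directory for this demo.
artifact_path = Path(tempfile.mkdtemp()) / "svd_model.pkl"

# Notebook side: serialize the model to disk.
with artifact_path.open("wb") as f:
    pickle.dump(model, f)

# App side: load the model back before serving predictions.
with artifact_path.open("rb") as f:
    restored = pickle.load(f)
```

Pickle files can execute arbitrary code when loaded, so the app should only load artifacts produced by a trusted run of the notebook.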

Key Findings

Positive rating bias: Ratings cluster at 3–4 stars, with 4 stars the most common rating
Long tail challenge: Most movies have few ratings, which creates a cold start problem for them
High sparsity (~95%): Makes collaborative filtering inherently challenging but solvable via matrix factorization
SVD wins: Latent factor models outperform neighborhood-based and content-based approaches on this dataset

Tech Stack

Python 3.12
Pandas & NumPy — data manipulation
Matplotlib & Seaborn — visualization
scikit-learn — Random Forest, preprocessing, evaluation
scikit-surprise — SVD and KNN collaborative filtering
Streamlit — production web application
Kaggle — dataset source and notebook hosting

Skills Demonstrated

Data engineering and merging large datasets
Exploratory data analysis and visualization
Machine learning model development and comparison
Model evaluation (RMSE, MAE)
Production-ready application development
Clear documentation and communication

👩‍💻 Author

Srishti Rajput
Cinematch — Your Personalized Movie Recommendation System
