Author: Srishti
Date: February 2026
Dataset: MovieLens 1M (via Kaggle)
CineMatch is an end-to-end movie recommendation system built on the MovieLens 1M dataset. It explores multiple recommendation approaches — from a simple popularity-based baseline to collaborative filtering and content-based machine learning models — and culminates in a production-ready Streamlit web application that delivers personalized movie recommendations.
The MovieLens 1M dataset contains:
- 1,000,209 ratings from 6,040 users on approximately 3,900 movies
- Three files: `ratings.csv`, `movies.csv`, and `users.csv`
- Ratings on a scale of 1–5 (half-star increments not included)
- User demographics: gender, age group, and occupation
The notebook is organized into six sections:
Loads all three dataset files, performs a data quality check for missing values, and merges them into a single master dataframe containing ratings, movie metadata, and user demographics.
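The load, check, and merge step can be sketched with pandas as follows. This is a minimal illustration on tiny in-memory stand-ins for the three files; the column names (`UserID`, `MovieID`, `Rating`, and so on) are assumptions and may not match the notebook.

```python
import pandas as pd

# Tiny stand-ins for ratings.csv, movies.csv, and users.csv.
# Column names and values are illustrative assumptions, not the notebook's.
ratings = pd.DataFrame({
    "UserID": [1, 1, 2],
    "MovieID": [10, 20, 10],
    "Rating": [5, 3, 4],
})
movies = pd.DataFrame({
    "MovieID": [10, 20],
    "Title": ["Movie A", "Movie B"],
    "Genres": ["Comedy|Drama", "Action"],
})
users = pd.DataFrame({
    "UserID": [1, 2],
    "Gender": ["F", "M"],
    "Age": [25, 35],
    "Occupation": [4, 7],
})

# Data quality check: total number of missing values in each table.
tables = {"ratings": ratings, "movies": movies, "users": users}
missing = {name: int(df.isna().sum().sum()) for name, df in tables.items()}

# Attach movie metadata, then user demographics, to every rating.
# validate="many_to_one" raises if a key is duplicated on the right-hand side.
master = (
    ratings
    .merge(movies, on="MovieID", how="left", validate="many_to_one")
    .merge(users, on="UserID", how="left", validate="many_to_one")
)
```

Using left joins keeps every rating row, so any movie or user missing from the lookup tables shows up as NaN in `master` instead of silently dropping ratings.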
- Rating Distribution: Ratings skew positive, with a median and mode of 4 stars
- User Demographics: ~72% male users; top occupations include college students, executives, and engineers
- Long Tail Analysis: A small number of movies attract most ratings; the majority have very few
- Sparsity Analysis: The user-movie matrix is ~95% sparse, making matrix completion the core challenge
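The sparsity and long-tail figures above can be computed along these lines. This is a sketch on toy data, and the column names and helper function names are assumptions made for illustration.

```python
import math
import pandas as pd

def sparsity(ratings, user_col="UserID", item_col="MovieID"):
    """Fraction of cells in the user-movie matrix that hold no rating."""
    n_users = ratings[user_col].nunique()
    n_items = ratings[item_col].nunique()
    n_observed = len(ratings.drop_duplicates([user_col, item_col]))
    return 1.0 - n_observed / (n_users * n_items)

def top_share(ratings, frac, item_col="MovieID"):
    """Share of all ratings that go to the most-rated `frac` of movies."""
    counts = ratings[item_col].value_counts()  # sorted, most-rated first
    k = max(1, math.ceil(frac * len(counts)))
    return counts.iloc[:k].sum() / counts.sum()

# Toy data: 3 users x 3 movies, 4 ratings observed.
toy = pd.DataFrame({"UserID": [1, 1, 2, 3], "MovieID": [10, 20, 10, 30]})
toy_sparsity = sparsity(toy)       # 5 of 9 cells are empty
toy_share = top_share(toy, 0.2)    # the single most-rated movie holds 2 of 4 ratings
```

Note that this version counts only users and movies that appear in the ratings table; counting every movie in the catalogue instead would give a slightly higher sparsity.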
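Implements a weighted rating formula inspired by IMDb's Top 250:

$$WR = \frac{v}{v + m} \cdot R + \frac{m}{v + m} \cdot C$$

Where v = number of votes, m = minimum votes threshold (80th percentile), R = movie's average rating, and C = global mean rating. A movie with few votes is pulled toward the global mean C, while a movie with many votes keeps a score close to its own average R. This balances quality and popularity and avoids surfacing niche movies with few ratings.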
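A minimal pandas sketch of this baseline is shown below. It assumes the standard IMDb-style form `WR = v/(v+m) * R + m/(v+m) * C` and takes m as the 80th percentile of per-movie vote counts; the column names and the `weighted_ratings` helper are illustrative, not taken from the notebook.

```python
import pandas as pd

def weighted_ratings(ratings, item_col="MovieID", rating_col="Rating", quantile=0.80):
    """Rank movies by WR = v/(v+m) * R + m/(v+m) * C."""
    # v = number of votes per movie, R = mean rating per movie
    stats = ratings.groupby(item_col)[rating_col].agg(v="count", R="mean")
    C = ratings[rating_col].mean()        # global mean rating
    m = stats["v"].quantile(quantile)     # minimum-votes threshold (assumed definition)
    v = stats["v"]
    stats["WR"] = (v / (v + m)) * stats["R"] + (m / (v + m)) * C
    return stats.sort_values("WR", ascending=False)

# Toy data: movie 2 has a perfect average but only a single vote.
toy = pd.DataFrame({
    "MovieID": [1, 1, 1, 1, 1, 2, 3, 3, 3, 3],
    "Rating":  [5, 5, 5, 5, 4, 5, 2, 3, 2, 3],
})
ranked = weighted_ratings(toy)
```

On this toy data, movie 2 has the highest raw average, but its single vote shrinks its score toward the global mean, so the well-supported movie 1 ranks first.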
Three models are trained and evaluated:
| Model | Type | Library |
|---|---|---|
| SVD | Matrix Factorization | scikit-surprise |
| KNN (Item-Based) | Collaborative Filtering | scikit-surprise |
| Random Forest | Content-Based (demographics + genres) | scikit-learn |
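To illustrate the idea behind the matrix factorization row, here is a plain NumPy sketch of a biased latent-factor model trained with stochastic gradient descent. It is a swapped-in teaching example in the same family as the SVD algorithm in scikit-surprise, not that library's implementation and not the notebook's code; the hyperparameters and function names are assumptions.

```python
import numpy as np

def train_mf(ratings, n_users, n_items, n_factors=2, lr=0.02, reg=0.02,
             n_epochs=300, seed=0):
    """Fit r_ui ~ mu + b_u + b_i + p_u . q_i on (user, item, rating) triples."""
    rng = np.random.default_rng(seed)
    mu = float(np.mean([r for _, _, r in ratings]))     # global mean rating
    bu, bi = np.zeros(n_users), np.zeros(n_items)       # user and item biases
    P = rng.normal(0.0, 0.1, size=(n_users, n_factors)) # user latent factors
    Q = rng.normal(0.0, 0.1, size=(n_items, n_factors)) # item latent factors
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mu + bu[u] + bi[i] + P[u] @ Q[i])
            # Gradient steps with L2 regularization on every parameter.
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            p_old = P[u].copy()
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_old - reg * Q[i])
    return mu, bu, bi, P, Q

def predict(model, u, i):
    """Predicted rating, clipped to the 1-5 scale."""
    mu, bu, bi, P, Q = model
    return float(np.clip(mu + bu[u] + bi[i] + P[u] @ Q[i], 1.0, 5.0))

def rmse_on(model, ratings):
    return float(np.sqrt(np.mean([(r - predict(model, u, i)) ** 2
                                  for u, i, r in ratings])))

# Toy data: users 0 and 1 share tastes, user 2 prefers the opposite movies.
toy = [(0, 0, 5), (0, 1, 4), (0, 2, 1),
       (1, 0, 4), (1, 1, 5), (1, 2, 2),
       (2, 0, 1), (2, 1, 2), (2, 2, 5)]
model = train_mf(toy, n_users=3, n_items=3)
train_rmse = rmse_on(model, toy)
mean_rmse = float(np.std([r for _, _, r in toy]))  # error of always predicting the mean
```

The comparison here is on training data only, so it shows that the factors capture the taste pattern, not how well the model generalizes; a held-out split is needed for that.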
All models are compared using RMSE and MAE on an 80/20 train-test split. SVD achieves the best performance:
| Model | RMSE | MAE |
|---|---|---|
| Baseline (Weighted Avg) | ~1.03 | ~0.82 |
| SVD (Matrix Factorization) | ~0.87 | ~0.68 |
| KNN (Item-Based) | ~0.93 | ~0.74 |
| Random Forest | ~0.96 | ~0.76 |
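For reference, the two metrics in the table can be computed as in the sketch below. The numbers in the example are toy values chosen for easy arithmetic, not results from the notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large misses more heavily."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error: average size of a miss, in stars."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))

actual = [4, 3, 5, 2]
predicted = [3.5, 3, 4, 3]
toy_rmse = rmse(actual, predicted)  # 0.75
toy_mae = mae(actual, predicted)    # 0.625
```

RMSE is never smaller than MAE on the same predictions, which is consistent with every row of the table above.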
The winning model (SVD) is serialized and packaged into a Streamlit application that allows users to:
- Select a User ID and view their rating history
- Generate personalized movie recommendations
- Browse movie details including genres and predicted ratings
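The serialize-then-recommend flow can be sketched without Streamlit as follows. The artifact here is a hypothetical dictionary of item biases standing in for the trained model, and `top_n` and `predict` are illustrative names; the actual app would call the unpickled SVD model instead.

```python
import pickle

def top_n(predict, user_id, candidate_ids, already_rated, n=10):
    """Return the n unseen movies with the highest predicted rating."""
    unseen = [m for m in candidate_ids if m not in already_rated]
    scored = [(m, predict(user_id, m)) for m in unseen]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]

# Hypothetical model artifact (a stand-in for the trained model).
artifact = {"global_mean": 3.5, "item_bias": {10: 0.8, 20: -0.4, 30: 0.3, 40: 0.1}}

# Round trip through pickle, as the notebook-to-app handoff would do via a file.
blob = pickle.dumps(artifact)
restored = pickle.loads(blob)

def predict(user_id, movie_id):
    # Stand-in scorer: ignores the user and uses only the item bias.
    return restored["global_mean"] + restored["item_bias"].get(movie_id, 0.0)

recs = top_n(predict, user_id=1, candidate_ids=[10, 20, 30, 40],
             already_rated={10}, n=2)
```

Filtering out already-rated movies before scoring keeps the recommendations to titles the user has not seen. Pickle files should only be loaded from a trusted source, since unpickling can execute arbitrary code.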
`pip install numpy pandas matplotlib seaborn scipy scikit-learn scikit-surprise kagglehub streamlit`
- Clone this repository or open the notebook in Kaggle/Colab
- Ensure Kaggle API credentials are configured (for the `kagglehub` download)
- Run all cells in order
After running the notebook (which saves model artifacts via pickle):
`streamlit run app.py`
- Positive rating bias: Most users rate movies between 3–4 stars
- Long tail challenge: Most movies have few ratings, which creates a cold start problem for them
- High sparsity (~95%): Makes collaborative filtering inherently challenging but solvable via matrix factorization
- SVD wins: Latent factor models outperform neighborhood-based and content-based approaches on this dataset
- Python 3.12
- Pandas & NumPy — data manipulation
- Matplotlib & Seaborn — visualization
- scikit-learn — Random Forest, preprocessing, evaluation
- scikit-surprise — SVD and KNN collaborative filtering
- Streamlit — production web application
- Kaggle — dataset source and notebook hosting
- Data engineering and merging large datasets
- Exploratory data analysis and visualization
- Machine learning model development and comparison
- Model evaluation (RMSE, MAE)
- Production-ready application development
- Clear documentation and communication
Srishti Rajput
CineMatch: Your Personalized Movie Recommendation System