An end-to-end data analysis project on Netflix's content library using Python — covering data cleaning, exploratory data analysis, statistical hypothesis testing, probability distributions, and advanced analytical deep dives.
This project performs a comprehensive analysis of Netflix's content catalog (netflix_titles.csv) to uncover patterns in content type, genre distribution, country contributions, release trends, and maturity ratings. The analysis progressively builds from basic data cleaning to advanced statistical reasoning and probabilistic modelling.
Dataset: Netflix Movies and TV Shows — Kaggle
Records: ~8,800 titles | Features: 12 columns
Tools: Python, Pandas, NumPy, Matplotlib, Seaborn, SciPy, WordCloud
netflix-content-analysis/
│
├── data/
│ ├── netflix_titles.csv ← Raw dataset (download from Kaggle)
│ └── netflix_cleaned.csv ← Cleaned dataset (output after preprocessing)
│
├── images/ ← All plot outputs saved here
│ └──....
├── notebook/
│ └── Netflix_data.ipynb ← Main analysis notebook
│
└── README.md
- Handled missing values in `director`, `cast`, `country`, `date_added`
- Parsed `date_added` to extract `year_added` and `month_added`
- Engineered features: `duration_num`, `duration_unit`, `genre_list`
- Removed duplicates on `(title, release_year)`
- Standardised text columns (`type`, `rating`)
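
The cleaning steps listed above could look roughly like the sketch below. It is not the notebook's actual code: the input column names (`duration`, `listed_in`, and so on) are assumed to match the Kaggle file, the `"Unknown"` placeholder and the date format string are assumptions, and a tiny hand-made DataFrame stands in for `netflix_titles.csv` so the example runs on its own.

```python
import pandas as pd

# Tiny stand-in for data/netflix_titles.csv (illustrative rows, not real data).
raw = pd.DataFrame({
    "title": ["A", "A", "B", "C"],
    "type": [" movie", " movie", "TV Show ", "Movie"],
    "director": ["X", "X", None, "Y"],
    "cast": [None, None, "P, Q", "R"],
    "country": ["India", "India", None, "United States"],
    "date_added": ["September 25, 2021", "September 25, 2021",
                   " January 1, 2020", None],
    "release_year": [2020, 2020, 2019, 2018],
    "rating": ["tv-ma", "tv-ma", "TV-14 ", "PG"],
    "duration": ["90 min", "90 min", "2 Seasons", "101 min"],
    "listed_in": ["Dramas, Comedies", "Dramas, Comedies", "Kids' TV", "Dramas"],
})


def clean(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # Missing text fields get a placeholder (the label is an assumption).
    for col in ["director", "cast", "country"]:
        df[col] = df[col].fillna("Unknown")

    # Parse dates; missing or unparseable values become NaT.
    df["date_added"] = pd.to_datetime(
        df["date_added"].str.strip(), format="%B %d, %Y", errors="coerce"
    )
    df["year_added"] = df["date_added"].dt.year
    df["month_added"] = df["date_added"].dt.month

    # Split "90 min" / "2 Seasons" into a number and a unit.
    parts = df["duration"].str.extract(r"(\d+)\s*(\w+)")
    df["duration_num"] = pd.to_numeric(parts[0])
    df["duration_unit"] = parts[1]

    # Turn the comma-separated genre string into a list.
    df["genre_list"] = df["listed_in"].str.split(", ")

    # Standardise the categorical text columns.
    type_map = {"movie": "Movie", "tv show": "TV Show"}
    df["type"] = df["type"].str.strip().str.lower().map(type_map)
    df["rating"] = df["rating"].str.strip().str.upper()

    # Keep the first row for each (title, release_year) pair.
    return df.drop_duplicates(subset=["title", "release_year"]).reset_index(drop=True)


cleaned = clean(raw)
print(cleaned[["title", "type", "rating", "year_added", "duration_num", "duration_unit"]])
```

With the real file, `raw` would instead come from `pd.read_csv("data/netflix_titles.csv")`, and the result could be written to `data/netflix_cleaned.csv`.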
| Analysis | Key Finding |
|---|---|
| Movies vs TV Shows | ~70% of content is Movies |
| Top countries | USA dominates, followed by India & UK |
| Rating distribution | TV-MA is the most common rating |
| Top genres | International Movies, Dramas, Comedies lead |
| Top directors & actors | Rajiv Chilaka, Anupam Kher among top contributors |
| Movie duration | Right-skewed distribution, mean ~99 min |
| Monthly trends | January & July see peak content additions |
Applied real-world probability distributions to Netflix data:
- Bernoulli — P(randomly picked title is a Movie) = 0.696
- Binomial — P(exactly 7 out of 10 random picks are Movies)
- Poisson — Modelling average new content arrivals per year (λ ≈ 119)
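
As a hedged illustration of the three distributions above, the following sketch plugs the reported values (p = 0.696 and λ ≈ 119) into `scipy.stats`. The Poisson query (at most 100 arrivals in one interval) is a made-up example question, not one taken from the notebook.

```python
from scipy import stats

p_movie = 0.696  # P(title is a Movie), as reported above
lam = 119        # Poisson rate, as reported above

# Bernoulli: probability that a single random pick is a Movie.
p_single = stats.bernoulli.pmf(1, p_movie)

# Binomial: probability that exactly 7 of 10 independent picks are Movies.
p_seven_of_ten = stats.binom.pmf(7, 10, p_movie)

# Poisson: probability of at most 100 arrivals in one interval
# (hypothetical question, chosen only to show the call).
p_at_most_100 = stats.poisson.cdf(100, lam)

print(f"Bernoulli P(Movie)       = {p_single:.3f}")
print(f"Binomial  P(X = 7 of 10) = {p_seven_of_ten:.3f}")
print(f"Poisson   P(X <= 100)    = {p_at_most_100:.4f}")
```

The binomial probability comes out at roughly 0.27, which follows directly from C(10, 7) · 0.696⁷ · 0.304³.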
| Test | Question | Result |
|---|---|---|
| Independent t-test | Is movie duration significantly different from TV show seasons? | Reject H₀ (p < 0.05) |
| Shapiro-Wilk + QQ-Plot | Are movie durations normally distributed? | Not normal — right skewed |
| Chi-Square test | Is there an association between content type and rating? | Reject H₀ — significant association |
| WordCloud | Most frequent words in Netflix descriptions | family, love, life, friend… |
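
The normality and association tests in the table can be sketched with `scipy.stats` as follows. Everything here runs on synthetic data: the log-normal "durations" and the type-by-rating counts are invented for illustration, so the printed statistics are not the notebook's results.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic right-skewed sample standing in for movie durations (minutes).
durations = rng.lognormal(mean=4.5, sigma=0.5, size=500)

# Shapiro-Wilk: H0 is that the sample comes from a normal distribution.
shapiro_stat, shapiro_p = stats.shapiro(durations)

# QQ-plot data without drawing: r close to 1 suggests normality.
(theoretical_q, ordered_vals), (slope, intercept, r) = stats.probplot(
    durations, dist="norm"
)

# Chi-square test of independence on a hypothetical contingency table:
# rows are Movie / TV Show, columns are three made-up rating groups.
table = np.array([
    [2000, 1400, 600],
    [1100, 700, 150],
])
chi2, chi2_p, dof, expected = stats.chi2_contingency(table)

print(f"Shapiro-Wilk: W = {shapiro_stat:.3f}, p = {shapiro_p:.3g}")
print(f"QQ-plot fit:  r = {r:.3f}")
print(f"Chi-square:   chi2 = {chi2:.1f}, dof = {dof}, p = {chi2_p:.3g}")
```

On real data, the contingency table would typically come from `pd.crosstab(df["type"], df["rating"])`.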
- Conditional Probability — P(genre | country): Each country's content personality mapped as a probability heatmap
- Central Limit Theorem Demo — Showed how sample means of skewed movie durations converge to normal at n ≥ 30
- Outlier Detection — IQR, Z-Score, and Modified Z-Score methods compared on movie durations
- Content Gap Analysis — Identified underserved genres per country vs. global average
- YoY Growth Rate + CAGR — Computed Compound Annual Growth Rate; peak absolute additions in 2019
- Multi-Variable EDA — Pair plots, correlation heatmap, hexbin density (duration vs release year), content lag distribution
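
To make the outlier-detection comparison above concrete, here is a minimal sketch of the three methods in NumPy. The thresholds (1.5 × IQR, |z| > 3, |modified z| > 3.5) and the 0.6745 scaling constant are conventional defaults assumed here, not values confirmed from the notebook, and the sample durations are invented.

```python
import numpy as np


def iqr_outliers(x: np.ndarray, k: float = 1.5) -> np.ndarray:
    """Flag points more than k * IQR outside the quartiles."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)


def zscore_outliers(x: np.ndarray, threshold: float = 3.0) -> np.ndarray:
    """Flag points more than `threshold` standard deviations from the mean."""
    z = (x - x.mean()) / x.std()
    return np.abs(z) > threshold


def modified_zscore_outliers(x: np.ndarray, threshold: float = 3.5) -> np.ndarray:
    """Flag points using the median and the median absolute deviation (MAD)."""
    median = np.median(x)
    mad = np.median(np.abs(x - median))
    m = 0.6745 * (x - median) / mad
    return np.abs(m) > threshold


# Invented movie durations in minutes, with one extreme value.
durations = np.array([88, 90, 92, 95, 97, 99, 100, 102, 105, 110, 312], dtype=float)

for name, fn in [
    ("IQR", iqr_outliers),
    ("Z-Score", zscore_outliers),
    ("Modified Z-Score", modified_zscore_outliers),
]:
    print(f"{name:17s} flags {durations[fn(durations)]}")
```

The plain Z-Score is the most fragile of the three on skewed data, because the extreme value inflates both the mean and the standard deviation it is measured against; the median-based methods do not have that problem.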
Investigated why Netflix's YoY growth rate declined from 423% (2016) to -20% (2021):
- Base Effect — % growth naturally falls as cumulative base grows
- Absolute additions — Peaked in 2018–2019, genuine slowdown after
- COVID Impact — Monthly analysis revealed production halt in Mar–Jun 2020
- Strategy shift — Netflix moved from quantity licensing to quality originals post-2020
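
The base effect described above is easy to see in a small calculation. The yearly counts below are hypothetical numbers chosen only to mimic the shape of the trend (small early base, peak in 2019, decline afterwards); they are not the dataset's actual counts.

```python
# Hypothetical titles added per year (illustrative only).
additions = {
    2015: 80, 2016: 420, 2017: 1200, 2018: 1650,
    2019: 2000, 2020: 1880, 2021: 1500,
}


def yoy_growth(series: dict) -> dict:
    """Year-over-year growth in percent, keyed by the later year."""
    years = sorted(series)
    return {
        year: (series[year] - series[prev]) / series[prev] * 100
        for prev, year in zip(years, years[1:])
    }


def cagr(first: float, last: float, periods: int) -> float:
    """Compound annual growth rate over `periods` yearly steps."""
    return (last / first) ** (1 / periods) - 1


growth = yoy_growth(additions)
peak_year = max(additions, key=additions.get)
overall_cagr = cagr(additions[2015], additions[2021], periods=2021 - 2015)

for year, pct in growth.items():
    print(f"{year}: {pct:+7.1f}%  (absolute additions: {additions[year]})")
print(f"Peak absolute additions: {peak_year}")
print(f"CAGR 2015-2021: {overall_cagr:.1%}")
```

In this toy series the largest percentage jump falls in 2016, when the base is tiny, while the largest absolute number of additions falls in 2019, which is why the percentage figure alone is a poor guide.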
- Netflix is movie-heavy — 69.6% of titles are Movies, but TV Show additions grew faster post-2018
- USA + India drive the catalog — Together accounting for over 40% of all content
- Content lag is short — Median gap between a movie's release and its Netflix addition is just 1–2 years
- Rating skews mature — TV-MA and R dominate; Kids/Family content is proportionally underrepresented
- Growth % is misleading alone — 423% in 2016 is a base-effect illusion; absolute additions peaked in 2019
Python 3.x
├── pandas — data manipulation
├── numpy — numerical computing
├── matplotlib — base visualisations
├── seaborn — statistical plots
├── scipy.stats — hypothesis testing, distributions
└── wordcloud — text frequency visualisation
# 1. Clone the repo
git clone https://github.com/byteephantom/netflix-content-analysis.git
cd netflix-content-analysis
# 2. Install dependencies
pip install pandas numpy matplotlib seaborn scipy wordcloud
# 3. Download dataset
# Place netflix_titles.csv inside the data/ folder
# Dataset: https://www.kaggle.com/datasets/shivamb/netflix-shows
# 4. Launch notebook
jupyter notebook notebook/Netflix_data.ipynb
- K-Means Clustering — Group Netflix content into natural content profiles
- Classification Model — Predict Movie vs TV Show from features (Logistic Regression + Random Forest)
- TF-IDF Content Recommender — "You watched X → here are similar titles"
- Sentiment Analysis — VADER on descriptions; do thrillers use more negative language?
- Streamlit App — Interactive dashboard with live recommender, deployed on Streamlit Cloud
Source: Kaggle — Netflix Movies and TV Shows
The dataset is not included in this repository. Download netflix_titles.csv from Kaggle and place it in the data/ folder before running the notebook.
Quantum Nomad (@byteephantom)
This project is open source and available under the MIT License.