JohnnySolo/Data-Analysis-Project---Blockbuster-Movies

🎬 Hollywood Greenlight Analytics: Predicting and Shaping Movie Profitability

📑 Executive Summary

This data analysis project investigates the historical drivers of box office ROI to help film studios reduce investment risk.

Moving beyond intuition-based forecasting, we used Exploratory Data Analysis (EDA), predictive machine learning, and causal inference (Meta-Learners) to simulate "what-if" business scenarios.

The analysis identified the distinct financial impact of talent, release timing, and narrative themes, translating complex causal insights into actionable business strategies for stakeholders.


📌 Project Overview

💼 The Analytical Problem & Business Objective

The motion picture industry is notoriously high-risk, characterized by a "fat-tail" distribution where a small number of films generate the vast majority of profits. Traditionally, greenlighting a project relies heavily on executive intuition.

The Objective: To conduct a rigorous data analysis that identifies the true, statistically significant drivers of commercial success (Return on Investment). By quantifying the impact of specific business interventions—such as talent selection and release timing—the goal is to provide data-driven recommendations that optimize film packaging and minimize financial risk.

📂 Data Sources & Analytical Preprocessing

We built a comprehensive master dataset by analyzing over 7,000 movies from multiple sources, including The Movie Database (TMDB) 5000, IMDb metadata, and Kaggle Global Movie Franchises.

  • Data Engineering & Cleansing: Merged diverse datasets into a unified master table (imdb_score_features, roi_features, master_table).
  • Distribution Handling: Addressed the extreme "fat-tail" distribution of box office revenues using logarithmic transformations to ensure statistical validity.
  • Leakage Prevention: Strictly enforced chronological integrity by removing all post-release metrics (e.g., audience votes, user reviews) to ensure the analysis accurately simulated pre-release forecasting.
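A minimal sketch of the distribution-handling and leakage-prevention steps above, written in plain Python so it runs without dependencies. The field names (`budget`, `revenue`, `vote_count`, `vote_average`, `num_user_reviews`) and the helper `prepare_row` are illustrative assumptions; the README does not list the actual columns.

```python
import math

# Hypothetical post-release fields; the real column names are not given here.
POST_RELEASE_FIELDS = {"vote_count", "vote_average", "num_user_reviews"}


def prepare_row(row):
    """Return a copy of `row` without post-release fields, plus log-scaled money columns."""
    clean = {k: v for k, v in row.items() if k not in POST_RELEASE_FIELDS}
    # log1p(x) = log(1 + x) compresses the fat right tail and is defined at x = 0.
    clean["log_budget"] = math.log1p(row["budget"])
    # Revenue is the outcome side of the analysis, not a pre-release feature.
    clean["log_revenue"] = math.log1p(row["revenue"])
    return clean


movies = [
    {"title": "A", "budget": 1_000_000, "revenue": 0, "vote_count": 120},
    {"title": "B", "budget": 200_000_000, "revenue": 1_500_000_000, "vote_count": 9000},
]
prepared = [prepare_row(m) for m in movies]
```

In a pandas workflow the same idea would typically use `numpy.log1p` on the money columns and `DataFrame.drop(columns=...)` for the post-release fields.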

🛠️ Analytical Methodology & ML Integration

We utilized statistical testing and machine learning as analytical tools to validate hypotheses and measure business impact:

  • Exploratory Data Analysis (EDA): Used the R.I.C.E. framework (Reach, Impact, Confidence, Effort) to prioritize the initial industry patterns surfaced during exploration and direct the analytical focus.
  • Statistical Testing: Validated EDA insights rigorously using Two-Sample T-Tests, One-Way ANOVA, and Variance Inflation Factor (VIF) checks for multicollinearity.
  • Predictive Modeling: Trained supervised models (Logistic Regression, Random Forest, XGBoost) with 5-Fold Cross-Validation to estimate the baseline probability of profitability.
  • Causal Machine Learning (Prescriptive Analysis): Moved beyond simple correlation by implementing Meta-Learners (X-Learner), Inverse Probability of Treatment Weighting (IPTW), and Causal Forests. This allowed us to calculate the Average Treatment Effect (ATE) and simulate "what-if" scenarios across distinct target audience segments.
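To make the IPTW step above concrete, here is a simplified sketch rather than the project's actual pipeline. It assumes a single discrete confounder (a hypothetical budget stratum) and estimates the propensity score as the treated share within each stratum. The function name `iptw_ate` and the toy rows are illustrative only.

```python
from collections import defaultdict


def iptw_ate(rows):
    """Estimate the ATE with normalised (Hajek-style) inverse-probability weights.

    Each row is (stratum, treated, outcome) with treated and outcome in {0, 1}.
    The propensity score is the treated share within the row's stratum.
    """
    counts = defaultdict(lambda: [0, 0])  # stratum -> [n_treated, n_total]
    for stratum, treated, _ in rows:
        counts[stratum][0] += treated
        counts[stratum][1] += 1

    num_t = den_t = num_c = den_c = 0.0
    for stratum, treated, outcome in rows:
        n_treated, n_total = counts[stratum]
        p = n_treated / n_total  # estimated P(treated | stratum)
        if not 0.0 < p < 1.0:
            raise ValueError(f"no treated/control overlap in stratum {stratum!r}")
        if treated:
            w = 1.0 / p
            num_t += w * outcome
            den_t += w
        else:
            w = 1.0 / (1.0 - p)
            num_c += w * outcome
            den_c += w
    # Difference of weighted outcome means: treated minus control.
    return num_t / den_t - num_c / den_c


# Toy data: high-budget films are both more often "treated" and more often profitable.
rows = [
    ("low", 1, 1), ("low", 0, 0), ("low", 0, 0), ("low", 0, 1),
    ("high", 1, 1), ("high", 1, 1), ("high", 1, 1), ("high", 0, 1),
]
ate = iptw_ate(rows)

treated_outcomes = [y for _, t, y in rows if t]
control_outcomes = [y for _, t, y in rows if not t]
naive = sum(treated_outcomes) / len(treated_outcomes) - sum(control_outcomes) / len(control_outcomes)
```

On this toy data the naive difference in means is 0.5, while the weighted estimate is about 0.33, because reweighting removes the part of the gap explained by budget stratum. With continuous or many confounders, the propensity score would come from a fitted classifier instead of stratum shares.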

📊 Key Findings & Business Recommendations

Our causal analysis yielded the following data-driven insights and strategic recommendations:

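  • The Talent Multiplier: Upgrading to a "Proven" director was the strongest risk-mitigation lever in our analysis, with an estimated increase of ~18 percentage points in a movie's baseline probability of profitability. This estimate is likely overstated because of the target leakage in director_tier described under Analytical Limitations.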
    • Recommendation: Independent studios lacking major marketing safety nets should prioritize budget allocation toward proven directing talent over high-cost visual effects.
  • The Counter-Programming Advantage: Releasing an indie or mid-budget alternative genre film (like Comedy or Drama) during the summer blockbuster season provides a ~7.4% to 9.1% lift in profitability probability.
    • Recommendation: Studios should leverage counter-programming, capitalizing on the massive influx of summer cinema foot traffic rather than avoiding major action tentpoles.
  • Narrative Optimization: Anchoring scripts on reliable, human-centric thematic hooks increases the probability of profitability by 8% to 12.8%.
    • Recommendation: This narrative optimization should be strictly applied to R-rated films, maximizing appeal where the potential audience pool is already restricted by age.

⚠️ Analytical Limitations & Trade-offs

To ensure data integrity and stakeholder trust, we must acknowledge the limitations of this analysis:

  • Unobserved Confounding (Marketing Budgets): The dataset lacks comprehensive global marketing (P&A) budgets for every film. Because marketing heavily influences box office gross, our Causal ML models operate under the assumption that marketing spend is generally proportional to production budget, which may skew the Average Treatment Effect (ATE) for outlier viral campaigns.
  • Target Leakage in Treatment Assignment: A post-analysis audit revealed target leakage in the director_tier treatment variable. The tier was derived from a global win_rate that inadvertently included the target variable is_profitable, so the Average Treatment Effect (ATE) for directors is overstated. Future iterations of this analysis must compute director tiers from strictly historical, pre-release rolling averages so that the treatment depends only on information available before each film's release.
  • Cultural Temporal Shifts: The analysis relies on historical data. Sudden macroeconomic shifts or changes in consumer behavior (e.g., the post-COVID streaming pivot) introduce unquantifiable variables that historical modeling cannot fully anticipate.
  • The Subjectivity of "Art": While we categorized narrative themes and genre hooks, the subjective "chemistry" of a cast or the cultural zeitgeist of a script remains unquantifiable. These models act as risk-reduction tools for executives, not guarantees of artistic success.
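The fix proposed above for the director_tier leakage can be sketched as follows. This is an illustration under assumptions, not the project's code: the keys `director`, `release_date`, and `is_profitable` and the helper `add_prior_win_rate` are hypothetical, and the sketch uses an expanding average over all earlier films rather than a fixed rolling window.

```python
from collections import defaultdict


def add_prior_win_rate(films):
    """Attach each film's director win rate computed from earlier releases only.

    `films` is a list of dicts with hypothetical keys `director`,
    `release_date` (ISO date string) and `is_profitable` (0 or 1).
    """
    history = defaultdict(lambda: [0, 0])  # director -> [wins, films] seen so far
    out = []
    # ISO date strings sort chronologically, so a plain sort gives release order.
    for film in sorted(films, key=lambda f: f["release_date"]):
        wins, n = history[film["director"]]
        # None marks a debut: there is no track record yet.
        prior = wins / n if n else None
        out.append({**film, "prior_win_rate": prior})
        # Update the history only after the feature is computed,
        # so a film's own outcome never feeds its own feature.
        history[film["director"]][0] += film["is_profitable"]
        history[film["director"]][1] += 1
    return out


films = [
    {"director": "X", "release_date": "2005-06-01", "is_profitable": 1},
    {"director": "X", "release_date": "2001-03-15", "is_profitable": 0},
    {"director": "X", "release_date": "2010-11-20", "is_profitable": 1},
]
result = add_prior_win_rate(films)
prior_rates = [f["prior_win_rate"] for f in result]
```

A tier label could then be assigned by thresholding `prior_win_rate`. The thresholds, and how debut directors are handled, are left open here.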
