
King County House Sales - Machine Learning Price Prediction


📖 Project Overview

A comprehensive machine learning project to predict house prices in King County, Washington, using multiple regression algorithms, feature engineering techniques and outlier flagging.

This project analyzes the King County House Sales dataset containing 21,613 records and 21 features to build accurate price prediction models. The workflow follows a systematic approach from baseline modeling through feature engineering to hyperparameter optimization.

๐Ÿ† Key Results & Findings

This project improved house price prediction accuracy by ~30% relative to the baseline model, raising test R² from 0.716 to 0.942 and cutting test RMSE by more than half.

Best Model: Gradient Boosting Regressor (v1 Dataset)

Final Performance: R² 0.942 | MAPE 11.7% | RMSE $86,770

Improvement: Significant error reduction from the initial Linear Regression baseline (R² 0.716, RMSE $191,531).

| Model Stage | Key Technique | Test R² | Test RMSE |
|---|---|---|---|
| 1. Baseline | Linear Regression (Raw Data) | 0.716 | $191,531 |
| 2. Intermediate | Random Forest (Default) | 0.896 | $115,994 |
| 3. Optimized | Gradient Boosting (Tuned) | 0.942 | $86,770 |

Critical Insight:

Feature engineering (dropping columns, temporal features) provided diminishing returns. The simplest robust dataset (v1: Cleaned + Outlier Flags) paired with a powerful Gradient Boosting model yielded the best generalization, indicating that model selection and hyperparameter tuning were the primary drivers of performance.

📊 Dataset Description

The dataset includes house sale information with the following key features:

  • Target Variable: price - House sale price
  • Property Features: bedrooms, bathrooms, sqft_living, sqft_lot, floors, sqft_above, sqft_basement
  • Quality Indicators: condition (scale 1-5), grade (scale 1-13), waterfront, view
  • Temporal Features: date, yr_built, yr_renovated
  • Location Features: zipcode, lat, long
  • Neighborhood Metrics: sqft_living15, sqft_lot15 (living area and lot size of the 15 nearest neighboring properties, measured in 2015)

📂 Project Workflow & Key Results

The project consists of 6 main notebooks executed sequentially:

Notebook 1: Data Cleaning & Baseline Models

  • Initial data exploration and minimal cleaning
  • Date feature transformation (year_sold, month_sold, day_sold)
  • Train-test split (80/20, random_state=13) - see the sketch after this list
  • Baseline Linear Regression implementation (normalized and non-normalized)
  • Baseline Random Forest
  • Baseline XGBoost
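
A minimal sketch of this first notebook's pipeline, assuming the Kaggle CSV is saved locally as kc_house_data.csv (the file name and the exact columns dropped before fitting are assumptions):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("kc_house_data.csv")  # hypothetical local path to the Kaggle data

# Date feature transformation described above
df["date"] = pd.to_datetime(df["date"])
df["year_sold"] = df["date"].dt.year
df["month_sold"] = df["date"].dt.month
df["day_sold"] = df["date"].dt.day

X = df.drop(columns=["id", "date", "price"])  # assumed predictor set
y = df["price"]

# 80/20 split with the project's fixed seed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=13
)

# Non-normalized baseline Linear Regression
lr = LinearRegression().fit(X_train, y_train)
print(f"Baseline test R²: {r2_score(y_test, lr.predict(X_test)):.3f}")
```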

Baseline Model Performance

| Model | Test R² | Test RMSE ($) |
|---|---|---|
| LinearRegression | 0.716 | 191,531 |
| LinearRegression (normalized) | 0.683 | 202,287 |
| RandomForestRegressor | 0.896 | 115,994 |
| XGBRegressor | 0.902 | 112,860 |

The complete metrics are available in the Metrics folder.

Notebook 2: Regularization & Feature Importance

  • Lasso (L1 Regularization): identified the most influential features - sqft_living (highest), grade, lat, sqft_above, yr_built (see the sketch after this list)
  • Ridge (L2 Regularization): analyzed multicollinearity among correlated features
  • Random Forest & XGBoost Feature Importance: both models showed strong agreement on the top predictive features (sqft_living, waterfront, grade, lat)
  • No features were immediately dropped based on the regularization analysis
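
A sketch of the Lasso coefficient ranking, continuing from the train-test split above (the scaler and alpha value are assumptions):

```python
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Standardize so coefficient magnitudes are comparable across features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

lasso = Lasso(alpha=1.0, max_iter=10_000).fit(X_train_scaled, y_train)  # alpha is an assumption
coef = pd.Series(lasso.coef_, index=X_train.columns)
print(coef.abs().sort_values(ascending=False).head(5))
# Reported order: sqft_living (highest), grade, lat, sqft_above, yr_built
```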
Notebook 3: Outlier Analysis

  • Analysis: identified 11 high-value properties priced above $4M (max: $7.7M).
  • Quantile Flagging: established 99th ($1.96M) and 95th ($1.15M) percentile thresholds, creating binary flags (q_99, q_95) to retain valuable data points (see the sketch after this list).
  • Performance: flagging outliers yielded superior model performance compared to dropping them.
  • Leakage Study: conducted a statistical analysis of data leakage from outlier flagging. Result: no significant performance difference was found between the leaked and non-leaked methods; the original pipeline was retained for simplicity, given the negligible impact.
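
A minimal sketch of the flagging step. Computing the thresholds on the full dataset, as below, is the "leaked" variant discussed in the leakage study; computing them on the training split alone avoids the leak:

```python
# Percentile thresholds on the sale price (≈ $1.96M and ≈ $1.15M per the analysis)
q_99 = df["price"].quantile(0.99)
q_95 = df["price"].quantile(0.95)

# Binary flags let the models treat luxury homes specially without losing rows
df["q_99"] = (df["price"] >= q_99).astype(int)
df["q_95"] = (df["price"] >= q_95).astype(int)
```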

Outlier Strategy Comparison (Test Results)

| Model | Strategy | Test R² | Test RMSE ($) | Impact |
|---|---|---|---|---|
| Random Forest | Flagging Outliers | 0.932 | 93,599 | Best performance |
| Random Forest | Dropping Top 1% | 0.874 | 104,088 | Significant drop in accuracy |
| Random Forest | Dropping Top 5% | 0.872 | 76,309* | Lower RMSE due to smaller range |
| XGBoost | Flagging Outliers | 0.926 | 97,799 | Strong performance |
| XGBoost | Dropping Top 1% | 0.889 | 97,618 | Worse R² fit |
| XGBoost | Dropping Top 5% | 0.881 | 73,599* | Worse R² fit |

Note (*): RMSE naturally decreases when dropping expensive houses (Top 5%) because the target range is smaller, but the R² (model fit) drops significantly, indicating worse predictive power.

The complete metrics are available in the Metrics folder.

Notebook 4: Feature Selection

  • Dropped features based on statistical significance and performance testing (see the sketch after this list):
    • sqft_lot: high OLS p-value, minimal model impact when removed
    • floors: similar justification
    • month_sold, day_sold: low Ridge/Lasso coefficients
    • yr_renovated: replaced with engineered features
  • Final feature set: 17 features after dropping 4 low-impact variables
  • Models showed slight improvements or maintained performance after feature reduction
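
A sketch of the significance check behind these drops, assuming statsmodels and the training split from above:

```python
import statsmodels.api as sm

# OLS summary: features with high p-values are drop candidates
ols = sm.OLS(y_train, sm.add_constant(X_train)).fit()
print(ols.pvalues.sort_values(ascending=False).head())

# v2 feature set after removing the low-impact variables listed above
X_train_v2 = X_train.drop(columns=["sqft_lot", "floors", "month_sold", "day_sold"])
```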
Notebook 5: Feature Engineering

  • Created new temporal features (see the sketch after this list):
    • Age_at_Sale: calculated from yr_built
    • Year_since_Renovation: time since last renovation
    • was_renovated: binary indicator for renovation history
  • Dropped redundant features: id, date, yr_built, year_sold, month_sold
  • Model Performance:
    • Linear Regression: R² 0.70 (train), 0.72 (test)
    • Random Forest: R² 0.98 (train), 0.90 (test) - showing slight overfitting
    • XGBoost: comparable performance to Random Forest
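
A sketch of the engineered features; the fallback to the house's age for never-renovated homes is an assumption, not documented behavior:

```python
import numpy as np

df["Age_at_Sale"] = df["year_sold"] - df["yr_built"]
df["was_renovated"] = (df["yr_renovated"] > 0).astype(int)

# Assumed definition: years since renovation, falling back to age when never renovated
df["Year_since_Renovation"] = np.where(
    df["was_renovated"] == 1,
    df["year_sold"] - df["yr_renovated"],
    df["Age_at_Sale"],
)

# Drop the now-redundant columns listed above
df = df.drop(columns=["id", "date", "yr_built", "year_sold", "month_sold"])
```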

Notebook 6: Hyperparameter Tuning

We performed hyperparameter tuning using RandomizedSearchCV (50 iterations, 3-fold CV) on three dataset versions; a sketch of the search setup follows the version list below.

Dataset versions:

  • v1: Baseline clean dataset with outlier flags (quantiles 99 and 95).
  • v2: Feature selection over v1.
  • v3: Feature engineering over v2.
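
A sketch of the search setup for Gradient Boosting on v1; the candidate grid and scoring metric are assumptions, chosen to bracket the best parameters reported below:

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical candidate grid; values bracket the reported best parameters
param_distributions = {
    "n_estimators": [100, 300, 500],
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [3, 5, 7],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2"],
    "subsample": [0.6, 0.8, 1.0],
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=13),
    param_distributions=param_distributions,
    n_iter=50,                               # 50 sampled combinations
    cv=3,                                    # 3-fold cross-validation
    scoring="neg_root_mean_squared_error",   # assumed scoring metric
    random_state=13,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_)
```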

Model Comparison (Test Set Results)

| Dataset Version | Model | Test R² | Test RMSE ($) | Test MAPE |
|---|---|---|---|---|
| v1 (Outliers Only) | Gradient Boosting | 0.942 | 86,770 | 11.7% |
| v1 (Outliers Only) | XGBoost | 0.937 | 90,537 | 11.6% |
| v2 (Feature Drop) | Gradient Boosting | 0.937 | 90,424 | 11.9% |
| v2 (Feature Drop) | XGBoost | 0.936 | 91,140 | 11.7% |
| v3 (Feature Eng.) | XGBoost | 0.936 | 91,103 | 11.9% |
| v3 (Feature Eng.) | Gradient Boosting | 0.935 | 91,838 | 12.4% |

Best Model: Gradient Boosting

  • Best parameters: `{'subsample': 0.8, 'n_estimators': 500, 'min_samples_split': 10, 'min_samples_leaf': 4, 'max_features': 'log2', 'max_depth': 5, 'learning_rate': 0.1}` (refit sketched below)
  • Performance: R² 0.9752 (train), 0.9417 (test)
  • Metrics: MAE 57,844, RMSE 86,770, MAPE 11.7%
  • Dataset: v1
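
Refitting with the reported best parameters (the random_state mirroring the project seed is an assumption):

```python
from sklearn.ensemble import GradientBoostingRegressor

# Reported best parameters from the randomized search
best_gb = GradientBoostingRegressor(
    subsample=0.8,
    n_estimators=500,
    min_samples_split=10,
    min_samples_leaf=4,
    max_features="log2",
    max_depth=5,
    learning_rate=0.1,
    random_state=13,  # assumed project seed
)
best_gb.fit(X_train, y_train)
```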

The complete metrics are available in the Metrics folder for v1, v2, and v3.

🛠️ Technologies Used

  • Python Libraries: pandas, numpy, scikit-learn, xgboost, statsmodels, seaborn, matplotlib, plotly
  • Models: Linear Regression, Ridge, Lasso, Random Forest, XGBoost, Gradient Boosting
  • Techniques: StandardScaler, MinMaxScaler, RandomizedSearchCV, train-test split
  • Evaluation Metrics: R², Adjusted R², MAE, RMSE, MAPE (computed as sketched below)
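
A small helper showing how these metrics can be computed with scikit-learn (the function name and the Adjusted R² formula are illustrative, not taken from the notebooks):

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
)

def report(y_true, y_pred, n_features):
    """Hypothetical helper bundling the project's evaluation metrics."""
    r2 = r2_score(y_true, y_pred)
    n = len(y_true)
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)  # Adjusted R²
    return {
        "R2": r2,
        "Adjusted R2": adj_r2,
        "MAE": mean_absolute_error(y_true, y_pred),
        "RMSE": np.sqrt(mean_squared_error(y_true, y_pred)),
        "MAPE": mean_absolute_percentage_error(y_true, y_pred),
    }
```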

🚀 Getting Started

Prerequisites:

  • Python 3.10 or higher

Installation

1. Clone the repository

```bash
git clone git@github.com:coffeedrunkpanda/King-County-Housing-Analysis.git
cd King-County-Housing-Analysis
```

2. Install dependencies

```bash
pip install -r requirements.txt
```

Usage

Download the data from Kaggle - King County House Sales dataset.

Each notebook builds upon the previous one and should be executed in order (1→6). The project uses a consistent random state (seed=13) for reproducibility.

👤 Authors

Built by me (@coffeedrunkpanda) together with @alexcardenasgutierrez-droid and @sheetansh for the Ironhack Data Science & Machine Learning bootcamp.
