# King County House Sales - Machine Learning Price Prediction
A comprehensive machine learning project to predict house prices in King County, Washington, using multiple regression algorithms, feature engineering techniques, and outlier flagging.
This project analyzes the King County House Sales dataset containing 21,613 records and 21 features to build accurate price prediction models. The workflow follows a systematic approach from baseline modeling through feature engineering to hyperparameter optimization.
This project successfully improved house price prediction accuracy by ~30% compared to the baseline model.
Best Model: Gradient Boosting Regressor (v1 Dataset)
Final Performance: R² 0.942 | MAPE 11.7% | RMSE $86,770
Improvement: Significant error reduction from the initial Linear Regression baseline (R² 0.716, RMSE $191,531).
| Model Stage | Key Technique | Test R² | Test RMSE |
|---|---|---|---|
| 1. Baseline | Linear Regression (Raw Data) | 0.716 | $191,531 |
| 2. Intermediate | Random Forest (Default) | 0.896 | $115,994 |
| 3. Optimized | Gradient Boosting (Tuned) | 0.942 | $86,770 |
Critical Insight:
Feature engineering (dropping columns, temporal features) provided diminishing returns. The simplest robust dataset (v1: Cleaned + Outlier Flags) paired with a powerful Gradient Boosting model yielded the best generalization, proving that model selection and hyperparameter tuning were the primary drivers of performance.
The dataset includes house sale information with the following key features:
- Target Variable: price - House sale price
- Property Features: bedrooms, bathrooms, sqft_living, sqft_lot, floors, sqft_above, sqft_basement
- Quality Indicators: condition (scale 1-5), grade (scale 1-11), waterfront, view
- Temporal Features: date, yr_built, yr_renovated
- Location Features: zipcode, lat, long
- Neighborhood Metrics: sqft_living15, sqft_lot15 (neighboring properties in 2015)
The project consists of 6 main notebooks executed sequentially:
- Initial data exploration and minimal cleaning
- Date feature transformation (year_sold, month_sold, day_sold)
- Train-test split (80/20, random_state=13)
- Baseline Linear Regression implementation (normalized and non-normalized)
- Baseline Random Forest
- Baseline XGBoost
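The baseline setup above can be sketched as follows. This is a minimal illustration, not the project's notebook code: the DataFrame here uses synthetic stand-in values (only the column names follow the dataset), while the 80/20 split and `random_state=13` match the README.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

# Synthetic stand-in data; the real notebooks load the Kaggle CSV.
rng = np.random.default_rng(13)
n = 500
df = pd.DataFrame({
    "sqft_living": rng.uniform(500, 5000, n),
    "grade": rng.integers(1, 12, n),
    "bedrooms": rng.integers(1, 6, n),
})
df["price"] = 200 * df["sqft_living"] + 10_000 * df["grade"] + rng.normal(0, 20_000, n)

X, y = df.drop(columns="price"), df["price"]
# 80/20 split with the project-wide seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=13
)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
r2 = r2_score(y_test, pred)
rmse = mean_squared_error(y_test, pred) ** 0.5
print(f"R2={r2:.3f}  RMSE={rmse:,.0f}")
```

The Random Forest and XGBoost baselines follow the same pattern, swapping in `RandomForestRegressor` or `XGBRegressor` with default hyperparameters.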
| Model | Test R² | Test RMSE |
|---|---|---|
| LinearRegression | 0.716 | 191,531 |
| LinearRegression (normalized) | 0.683 | 202,287 |
| RandomForestRegressor | 0.896 | 115,994 |
| XGBRegressor | 0.902 | 112,860 |
The complete metrics are available in the Metrics folder.
- Lasso (L1 Regularization): Identified most influential features - sqft_living (highest), grade, lat, sqft_above, yr_built
- Ridge (L2 Regularization): Analyzed multicollinearity among correlated features
- Random Forest & XGBoost Feature Importance: Both models showed strong agreement on top predictive features (sqft_living, waterfront, grade, lat)
- No features were immediately dropped based on regularization analysis
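The regularization-based feature ranking above can be illustrated with a short sketch (synthetic data with hypothetical coefficients; only the column names come from the dataset). The key step is scaling before fitting: the L1 penalty shrinks all coefficients uniformly, so features must share a common scale for their coefficient magnitudes to be comparable.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: sqft_living is constructed to dominate the target,
# mirroring the finding reported above.
rng = np.random.default_rng(13)
n = 400
X = pd.DataFrame({
    "sqft_living": rng.uniform(500, 5000, n),
    "grade": rng.integers(1, 12, n),
    "yr_built": rng.integers(1900, 2015, n),
})
y = 250 * X["sqft_living"] + 15_000 * X["grade"] + rng.normal(0, 10_000, n)

# Standardize so Lasso coefficients are comparable across features.
Xs = StandardScaler().fit_transform(X)
lasso = Lasso(alpha=1.0).fit(Xs, y)
ranking = pd.Series(np.abs(lasso.coef_), index=X.columns).sort_values(ascending=False)
print(ranking)
```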
- Analysis: Identified 11 high-value properties priced above $4M (max: $7.7M).
- Quantile Flagging: Established 99th ($1.96M) and 95th ($1.15M) percentile thresholds, creating binary flags (q_99, q_95) to retain valuable data points.
- Performance: Flagging outliers yielded superior model performance compared to dropping them.
- Leakage Study: Conducted a statistical analysis of data leakage from outlier flagging. Result: no significant performance difference was found between the leaked and non-leaked methods; the original pipeline was retained for simplicity, noting the negligible impact.
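The quantile-flagging strategy can be sketched as below: every row is retained, but binary columns mark listings above the 99th and 95th price percentiles. The prices here are synthetic stand-ins; the threshold values reported above ($1.96M and $1.15M) come from the real data.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed prices, standing in for the real price column.
rng = np.random.default_rng(13)
df = pd.DataFrame({"price": rng.lognormal(mean=13, sigma=0.5, size=1000)})

q99 = df["price"].quantile(0.99)
q95 = df["price"].quantile(0.95)
df["q_99"] = (df["price"] > q99).astype(int)
df["q_95"] = (df["price"] > q95).astype(int)

# All rows are kept; the flags let tree models treat extreme listings
# differently without discarding information.
print(df["q_99"].sum(), df["q_95"].sum())
```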
| Model | Strategy | Test R² | Test RMSE | Impact |
|---|---|---|---|---|
| Random Forest | Flagging Outliers | 0.932 | 93,599 | Best Performance |
| Random Forest | Dropping Top 1% | 0.874 | 104,088 | Significant drop in accuracy |
| Random Forest | Dropping Top 5% | 0.872 | 76,309* | Lower RMSE due to smaller range |
| XGBoost | Flagging Outliers | 0.926 | 97,799 | Strong performance |
| XGBoost | Dropping Top 1% | 0.889 | 97,618 | Worse R² fit |
| XGBoost | Dropping Top 5% | 0.881 | 73,599* | Worse R² fit |
Note: RMSE naturally decreases when dropping expensive houses (Top 5%) because the target range is smaller, but the Rยฒ (model fit) drops significantly, indicating worse predictive power.
The complete metrics are available in the Metrics folder.
- Dropped features based on statistical significance and performance testing:
  - sqft_lot: high OLS p-value, minimal model impact when removed
  - floors: similar justification
  - month_sold, day_sold: low Ridge/Lasso coefficients
  - yr_renovated: replaced with engineered features
- Final feature set: 17 features after dropping 4 low-impact variables
- Models showed slight improvements or maintained performance after feature reduction
- Created new temporal features:
  - Age_at_Sale: calculated from yr_built
  - Year_since_Renovation: time since last renovation
  - was_renovated: binary indicator for renovation history
- Dropped redundant features: id, date, yr_built, year_sold, month_sold
- Model Performance:
  - Linear Regression: R² 0.70 (train), 0.72 (test)
  - Random Forest: R² 0.98 (train), 0.90 (test) - showing slight overfitting
- XGBoost: Comparable performance to Random Forest
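The engineered temporal features above can be sketched as follows. The sample rows are hypothetical, and the fallback choice for never-renovated homes (using the house's age) is an assumption for illustration; the real notebook derives year_sold from the sale date column.

```python
import pandas as pd

# Hypothetical sample rows; yr_renovated == 0 means never renovated.
df = pd.DataFrame({
    "year_sold": [2014, 2015, 2014],
    "yr_built": [1960, 2005, 1990],
    "yr_renovated": [0, 0, 2010],
})

df["Age_at_Sale"] = df["year_sold"] - df["yr_built"]
df["was_renovated"] = (df["yr_renovated"] > 0).astype(int)
# For renovated homes, measure time since renovation; otherwise fall
# back to time since construction (an illustrative assumption).
df["Year_since_Renovation"] = df["year_sold"] - df["yr_renovated"].where(
    df["yr_renovated"] > 0, df["yr_built"]
)
print(df[["Age_at_Sale", "was_renovated", "Year_since_Renovation"]])
```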
We performed hyperparameter tuning using RandomizedSearchCV (50 iterations, 3-fold CV) on three dataset versions.
- v1: Baseline clean dataset with outlier flags (quantiles 99 and 95).
- v2: Feature selection over v1.
- v3: Feature engineering over v2.
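The tuning setup can be sketched as below. This is a reduced illustration on synthetic data: the project used n_iter=50 with 3-fold CV on the real dataset, and the candidate value lists here are assumptions (chosen to include the winning values reported further down).

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression problem standing in for the housing data.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=13)

param_distributions = {
    "n_estimators": [100, 300, 500],
    "learning_rate": [0.05, 0.1, 0.2],
    "max_depth": [3, 5, 7],
    "subsample": [0.8, 1.0],
    "max_features": ["sqrt", "log2"],
}
search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=13),
    param_distributions,
    n_iter=5,          # 50 in the actual project
    cv=3,
    scoring="r2",
    random_state=13,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The same search is repeated per dataset version (v1, v2, v3) and per model family, and the best estimator is then evaluated on the held-out test split.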
| Dataset Version | Model | Test R² | Test RMSE | Test MAPE |
|---|---|---|---|---|
| v1 (Outliers Only) | Gradient Boosting | 0.942 | 86,770 | 11.7% |
| v1 (Outliers Only) | XGBoost | 0.937 | 90,537 | 11.6% |
| v2 (Feature Drop) | Gradient Boosting | 0.937 | 90,424 | 11.9% |
| v2 (Feature Drop) | XGBoost | 0.936 | 91,140 | 11.7% |
| v3 (Feature Eng.) | XGBoost | 0.936 | 91,103 | 11.9% |
| v3 (Feature Eng.) | Gradient Boosting | 0.935 | 91,838 | 12.4% |
- Best parameters: {'subsample': 0.8, 'n_estimators': 500, 'min_samples_split': 10, 'min_samples_leaf': 4, 'max_features': 'log2', 'max_depth': 5, 'learning_rate': 0.1}
- Performance: R² 0.9752 (train), 0.9417 (test)
- Metrics: MAE 57,844, RMSE 86,770, MAPE 11.7%
- Dataset: v1
The complete Metrics are available in the Metrics folder for: v1, v2, v3.
- Python Libraries: pandas, numpy, scikit-learn, xgboost, statsmodels, seaborn, matplotlib, plotly
- Models: Linear Regression, Ridge, Lasso, Random Forest, XGBoost, Gradient Boosting
- Techniques: StandardScaler, MinMaxScaler, RandomizedSearchCV, train-test split
- Evaluation Metrics: R², Adjusted R², MAE, RMSE, MAPE
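The reported evaluation metrics map directly onto scikit-learn functions, as in this small sketch (the true/predicted values here are toy numbers, not project results):

```python
import numpy as np
from sklearn.metrics import (
    r2_score,
    mean_absolute_error,
    mean_squared_error,
    mean_absolute_percentage_error,
)

# Toy true vs. predicted prices, purely for illustration.
y_true = np.array([300_000, 450_000, 600_000, 750_000])
y_pred = np.array([310_000, 430_000, 620_000, 700_000])

r2 = r2_score(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
rmse = mean_squared_error(y_true, y_pred) ** 0.5   # RMSE = sqrt(MSE)
mape = mean_absolute_percentage_error(y_true, y_pred)
print(f"R2={r2:.3f} MAE={mae:,.0f} RMSE={rmse:,.0f} MAPE={mape:.1%}")
```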
Prerequisites:
- Python 3.10 or higher
- Clone the repository: `git clone git@github.com:coffeedrunkpanda/King-County-Housing-Analysis.git`, then `cd King-County-Housing-Analysis`
- Install dependencies: `pip install -r requirements.txt`
- Usage: download the data from Kaggle - King County House Sales dataset.
Each notebook builds upon the previous one and should be executed in order (1–6). The project uses a consistent random state (seed=13) for reproducibility.
Built by me (@coffeedrunkpanda) together with @alexcardenasgutierrez-droid and @sheetansh for the Ironhack Data Science & Machine Learning bootcamp.

