# King County House Sales - Machine Learning Price Prediction
A comprehensive machine learning project to predict house prices in King County, Washington, using multiple regression algorithms, feature engineering techniques, and outlier flagging.
This project analyzes the King County House Sales dataset containing 21,613 records and 21 features to build accurate price prediction models. The workflow follows a systematic approach from baseline modeling through feature engineering to hyperparameter optimization.
This project successfully improved house price prediction accuracy by ~30% compared to the baseline model.
Best Model: Gradient Boosting Regressor (v1 Dataset)
Final Performance: R² 0.942 | MAPE 11.7% | RMSE $86,770
Improvement: Significant error reduction from the initial Linear Regression baseline (R² 0.716, RMSE $191,531).
| Model Stage | Key Technique | Test R² | Test RMSE |
|---|---|---|---|
| 1. Baseline | Linear Regression (Raw Data) | 0.716 | $191,531 |
| 2. Intermediate | Random Forest (Default) | 0.896 | $115,994 |
| 3. Optimized | Gradient Boosting (Tuned) | 0.942 | $86,770 |
Critical Insight:
Feature engineering (dropping columns, temporal features) provided diminishing returns. The simplest robust dataset (v1: Cleaned + Outlier Flags) paired with a powerful Gradient Boosting model yielded the best generalization, proving that model selection and hyperparameter tuning were the primary drivers of performance.
The dataset includes house sale information with the following key features:
- Target Variable: price - House sale price
- Property Features: bedrooms, bathrooms, sqft_living, sqft_lot, floors, sqft_above, sqft_basement
- Quality Indicators: condition (scale 1-5), grade (scale 1-11), waterfront, view
- Temporal Features: date, yr_built, yr_renovated
- Location Features: zipcode, lat, long
- Neighborhood Metrics: sqft_living15, sqft_lot15 (neighboring properties in 2015)
The project consists of 6 main notebooks executed sequentially:
- Initial data exploration and minimal cleaning
- Date feature transformation (year_sold, month_sold, day_sold)
- Train-test split (80/20, random_state=13)
- Baseline Linear Regression implementation (normalized and non-normalized)
- Baseline Random Forest
- Baseline XGBoost
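The baseline setup above can be sketched as follows. This is a minimal illustration, not the project's notebook code: the DataFrame here uses synthetic stand-in values (only the column names follow the dataset), while the 80/20 split and `random_state=13` match the README.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

# Synthetic stand-in data; the real notebooks load the Kaggle CSV.
rng = np.random.default_rng(13)
n = 500
df = pd.DataFrame({
    "sqft_living": rng.uniform(500, 5000, n),
    "grade": rng.integers(1, 12, n),
    "bedrooms": rng.integers(1, 6, n),
})
df["price"] = 200 * df["sqft_living"] + 10_000 * df["grade"] + rng.normal(0, 20_000, n)

X, y = df.drop(columns="price"), df["price"]
# 80/20 split with the project-wide seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=13
)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
r2 = r2_score(y_test, pred)
rmse = mean_squared_error(y_test, pred) ** 0.5
print(f"R2={r2:.3f}  RMSE={rmse:,.0f}")
```

The Random Forest and XGBoost baselines follow the same pattern, swapping in `RandomForestRegressor` or `XGBRegressor` with default hyperparameters.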
| Model | Test R² | Test RMSE |
|---|---|---|
| LinearRegression | 0.716 | 191,531 |
| LinearRegression (normalized) | 0.683 | 202,287 |
| RandomForestRegressor | 0.896 | 115,994 |
| XGBRegressor | 0.902 | 112,860 |
The complete metrics are available in the Metrics folder.
- Lasso (L1 Regularization): Identified most influential features - sqft_living (highest), grade, lat, sqft_above, yr_built
- Ridge (L2 Regularization): Analyzed multicollinearity among correlated features
- Random Forest & XGBoost Feature Importance: Both models showed strong agreement on top predictive features (sqft_living, waterfront, grade, lat)
- No features were immediately dropped based on regularization analysis
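The regularization-based feature ranking above can be illustrated with a short sketch (synthetic data with hypothetical coefficients; only the column names come from the dataset). The key step is scaling before fitting: the L1 penalty shrinks all coefficients uniformly, so features must share a common scale for their coefficient magnitudes to be comparable.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: sqft_living is constructed to dominate the target,
# mirroring the finding reported above.
rng = np.random.default_rng(13)
n = 400
X = pd.DataFrame({
    "sqft_living": rng.uniform(500, 5000, n),
    "grade": rng.integers(1, 12, n),
    "yr_built": rng.integers(1900, 2015, n),
})
y = 250 * X["sqft_living"] + 15_000 * X["grade"] + rng.normal(0, 10_000, n)

# Standardize so Lasso coefficients are comparable across features.
Xs = StandardScaler().fit_transform(X)
lasso = Lasso(alpha=1.0).fit(Xs, y)
ranking = pd.Series(np.abs(lasso.coef_), index=X.columns).sort_values(ascending=False)
print(ranking)
```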
- Analysis: Identified 11 high-value properties priced above $4M (max: $7.7M).
- Quantile Flagging: Established 99th ($1.96M) and 95th ($1.15M) percentile thresholds, creating binary flags (q_99, q_95) to retain valuable data points.
- Performance: Flagging outliers yielded superior model performance compared to dropping them.
- Leakage Study: Conducted a statistical analysis of data leakage from outlier flagging. Result: no significant performance difference was found between the leaked and non-leaked methods; the original pipeline was retained for simplicity, noting the negligible impact.
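The quantile-flagging strategy can be sketched as below: every row is retained, but binary columns mark listings above the 99th and 95th price percentiles. The prices here are synthetic stand-ins; the threshold values reported above ($1.96M and $1.15M) come from the real data.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed prices, standing in for the real price column.
rng = np.random.default_rng(13)
df = pd.DataFrame({"price": rng.lognormal(mean=13, sigma=0.5, size=1000)})

q99 = df["price"].quantile(0.99)
q95 = df["price"].quantile(0.95)
df["q_99"] = (df["price"] > q99).astype(int)
df["q_95"] = (df["price"] > q95).astype(int)

# All rows are kept; the flags let tree models treat extreme listings
# differently without discarding information.
print(df["q_99"].sum(), df["q_95"].sum())
```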
| Model | Strategy | Test R² | Test RMSE | Impact |
|---|---|---|---|---|
| Random Forest | Flagging Outliers | 0.932 | 93,599 | Best Performance |
| Random Forest | Dropping Top 1% | 0.874 | 104,088 | Significant drop in accuracy |
| Random Forest | Dropping Top 5% | 0.872 | 76,309* | Lower RMSE due to smaller range |
| XGBoost | Flagging Outliers | 0.926 | 97,799 | Strong performance |
| XGBoost | Dropping Top 1% | 0.889 | 97,618 | Worse R² fit |
| XGBoost | Dropping Top 5% | 0.881 | 73,599* | Worse R² fit |
Note: RMSE naturally decreases when dropping expensive houses (Top 5%) because the target range is smaller, but the Rยฒ (model fit) drops significantly, indicating worse predictive power.
The complete metrics are available in the Metrics folder.
- Dropped features based on statistical significance and performance testing:
  - sqft_lot: high OLS p-value, minimal model impact when removed
  - floors: similar justification
  - month_sold, day_sold: low Ridge/Lasso coefficients
  - yr_renovated: replaced with engineered features
- Final feature set: 17 features after dropping 4 low-impact variables
- Models showed slight improvements or maintained performance after feature reduction
- Created new temporal features:
  - Age_at_Sale: calculated from yr_built
  - Year_since_Renovation: time since last renovation
  - was_renovated: binary indicator for renovation history
- Dropped redundant features: id, date, yr_built, year_sold, month_sold
- Model Performance:
  - Linear Regression: R² 0.70 (train), 0.72 (test)
  - Random Forest: R² 0.98 (train), 0.90 (test) - showing slight overfitting
- XGBoost: Comparable performance to Random Forest
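The engineered temporal features above can be sketched as follows. The sample rows are hypothetical, and the fallback choice for never-renovated homes (using the house's age) is an assumption for illustration; the real notebook derives year_sold from the sale date column.

```python
import pandas as pd

# Hypothetical sample rows; yr_renovated == 0 means never renovated.
df = pd.DataFrame({
    "year_sold": [2014, 2015, 2014],
    "yr_built": [1960, 2005, 1990],
    "yr_renovated": [0, 0, 2010],
})

df["Age_at_Sale"] = df["year_sold"] - df["yr_built"]
df["was_renovated"] = (df["yr_renovated"] > 0).astype(int)
# For renovated homes, measure time since renovation; otherwise fall
# back to time since construction (an illustrative assumption).
df["Year_since_Renovation"] = df["year_sold"] - df["yr_renovated"].where(
    df["yr_renovated"] > 0, df["yr_built"]
)
print(df[["Age_at_Sale", "was_renovated", "Year_since_Renovation"]])
```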
We performed hyperparameter tuning using RandomizedSearchCV (50 iterations, 3-fold CV) on three dataset versions.
- v1: Baseline clean dataset with outlier flags (quantiles 99 and 95).
- v2: Feature selection over v1.
- v3: Feature engineering over v2.
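The tuning setup can be sketched as below. This is a reduced illustration on synthetic data: the project used n_iter=50 with 3-fold CV on the real dataset, and the candidate value lists here are assumptions (chosen to include the winning values reported further down).

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression problem standing in for the housing data.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=13)

param_distributions = {
    "n_estimators": [100, 300, 500],
    "learning_rate": [0.05, 0.1, 0.2],
    "max_depth": [3, 5, 7],
    "subsample": [0.8, 1.0],
    "max_features": ["sqrt", "log2"],
}
search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=13),
    param_distributions,
    n_iter=5,          # 50 in the actual project
    cv=3,
    scoring="r2",
    random_state=13,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The same search is repeated per dataset version (v1, v2, v3) and per model family, and the best estimator is then evaluated on the held-out test split.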
| Dataset Version | Model | Test R² | Test RMSE | Test MAPE |
|---|---|---|---|---|
| v1 (Outliers Only) | Gradient Boosting | 0.942 | 86,770 | 11.7% |
| v1 (Outliers Only) | XGBoost | 0.937 | 90,537 | 11.6% |
| v2 (Feature Drop) | Gradient Boosting | 0.937 | 90,424 | 11.9% |
| v2 (Feature Drop) | XGBoost | 0.936 | 91,140 | 11.7% |
| v3 (Feature Eng.) | XGBoost | 0.936 | 91,103 | 11.9% |
| v3 (Feature Eng.) | Gradient Boosting | 0.935 | 91,838 | 12.4% |
- Best parameters: {'subsample': 0.8, 'n_estimators': 500, 'min_samples_split': 10, 'min_samples_leaf': 4, 'max_features': 'log2', 'max_depth': 5, 'learning_rate': 0.1}
- Performance: R² 0.9752 (train), 0.9417 (test)
- Metrics: MAE 57,844, RMSE 86,770, MAPE 11.7%
- Dataset: v1
The complete Metrics are available in the Metrics folder for: v1, v2, v3.
- Python Libraries: pandas, numpy, scikit-learn, xgboost, statsmodels, seaborn, matplotlib, plotly
- Models: Linear Regression, Ridge, Lasso, Random Forest, XGBoost, Gradient Boosting
- Techniques: StandardScaler, MinMaxScaler, RandomizedSearchCV, train-test split
- Evaluation Metrics: R², Adjusted R², MAE, RMSE, MAPE
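The reported evaluation metrics map directly onto scikit-learn functions, as in this small sketch (the true/predicted values here are toy numbers, not project results):

```python
import numpy as np
from sklearn.metrics import (
    r2_score,
    mean_absolute_error,
    mean_squared_error,
    mean_absolute_percentage_error,
)

# Toy true vs. predicted prices, purely for illustration.
y_true = np.array([300_000, 450_000, 600_000, 750_000])
y_pred = np.array([310_000, 430_000, 620_000, 700_000])

r2 = r2_score(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
rmse = mean_squared_error(y_true, y_pred) ** 0.5   # RMSE = sqrt(MSE)
mape = mean_absolute_percentage_error(y_true, y_pred)
print(f"R2={r2:.3f} MAE={mae:,.0f} RMSE={rmse:,.0f} MAPE={mape:.1%}")
```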
Prerequisites:
- Python 3.10 or higher
- Clone the repository: `git clone git@github.com:coffeedrunkpanda/King-County-Housing-Analysis.git`, then `cd King-County-Housing-Analysis`
- Install dependencies: `pip install -r requirements.txt`
- Usage: download the data from Kaggle - King County House Sales dataset.
Each notebook builds upon the previous one and should be executed in order (1–6). The project uses a consistent random state (seed=13) for reproducibility.
Built by me (@coffeedrunkpanda) together with @alexcardenasgutierrez-droid and @sheetansh for the Ironhack Data Science & Machine Learning bootcamp.

