With thousands of used cars listed on platforms like CarDekho, it's difficult for sellers to know the right asking price. Pricing too high means no buyers; pricing too low means losing money.
This project builds a Random Forest Regression model that predicts the resale price of a pre-owned vehicle based on key features like engine size, max power, mileage, fuel type, and more — giving sellers a data-driven price suggestion based on real market conditions.
Resale-Car-Price-Prediction/
├── images/
│ ├── selling_price_distribution.png
│ ├── correlation_heatmap.png
│ └── model_comparison.png
│
├── Resale_Car_Prediction.ipynb
├── cardekho_imputated.csv
├── car_price_predictor.pkl
├── preprocessor.pkl
├── .gitignore
└── README.md
scikit-learnpandasnumpymatplotlibseabornjoblib
The dataset is scraped from cardekho.com
- Samples: 15,411
- Features: 13
- Target Variable:
selling_price— resale price in INR
| Feature | Description |
|---|---|
model |
Car model name |
vehicle_age |
Age of the vehicle (in years) |
km_driven |
Total kilometers driven |
seller_type |
Individual / Dealer / Trustmark Dealer |
fuel_type |
Petrol / Diesel / CNG / LPG |
transmission_type |
Manual / Automatic |
mileage |
Fuel efficiency (km/l) |
engine |
Engine displacement (cc) |
max_power |
Maximum power (bhp) |
seats |
Number of seats |
- Load the dataset from
cardekho_imputated.csv - Check for missing values — dataset is pre-imputed, no nulls found
- Drop irrelevant features —
car_nameandbrand(redundant withmodel) - Identify feature types — 7 numerical, 4 categorical
- Exploratory Data Analysis — price distribution and correlation heatmap
- Apply Label Encoding on
modelcolumn (120 unique values) - Train-Test Split — 80% train / 20% test (
random_state=42) - Preprocessing —
OneHotEncoderfor categorical,StandardScalerfor numerical viaColumnTransformer - Train and compare 6 models — Linear Regression, Lasso, Ridge, KNN, Decision Tree, Random Forest
- Hyperparameter tuning using
RandomizedSearchCVwith 100 iterations and 3-fold CV - Evaluate using RMSE, MAE, and R² Score
- Save model and preprocessor using
joblib
Key Insights:
- Most cars are priced between ₹2–15 lakhs — heavily right-skewed distribution
max_powerhas the strongest positive correlation with selling price (0.75)engineis the second strongest predictor (0.59)vehicle_ageandkm_drivenshow negative correlation with price as expectedmileagenegatively correlates with engine size — bigger engines are less fuel efficient
| Model | Train R² | Test R² |
|---|---|---|
| Linear Regression | 0.6218 | 0.6645 |
| Lasso | 0.6218 | 0.6645 |
| Ridge | 0.6218 | 0.6645 |
| K-Neighbors Regressor | 0.8691 | 0.9150 |
| Decision Tree | 0.9995 | 0.8823 |
| Random Forest | 0.9791 | 0.9303 |
Random Forest was selected as the final model — best test R² with no overfitting.
rf_params = {
"max_depth" : [5, 8, 15, None, 10],
"max_features" : [5, 7, "sqrt", 8],
"min_samples_split": [2, 8, 15, 20],
"n_estimators" : [100, 200, 500, 1000]
}Tuning method: RandomizedSearchCV — n_iter=100, cv=3, n_jobs=-1
Best Parameters:
RandomForestRegressor(
n_estimators=1000,
min_samples_split=2,
max_features=7,
max_depth=None
)| Metric | Train | Test |
|---|---|---|
| R² Score | 0.9804 | 0.9403 |
| RMSE | ₹1,26,209 | ₹2,12,015 |
| MAE | ₹38,890 | ₹98,050 |
import joblib
import pandas as pd
rf_model = joblib.load('car_price_predictor.pkl')
preprocessor = joblib.load('preprocessor.pkl')
# Sample car — Ford Ecosport
sample = pd.DataFrame([{
'model' : 38, # label encoded value for Ecosport
'vehicle_age' : 6,
'km_driven' : 30000,
'seller_type' : 'Dealer',
'fuel_type' : 'Diesel',
'transmission_type': 'Manual',
'mileage' : 22.77,
'engine' : 1498,
'max_power' : 98.59,
'seats' : 5
}])
sample_transformed = preprocessor.transform(sample)
predicted_price = rf_model.predict(sample_transformed)
print(f"Predicted Selling Price: ₹ {predicted_price[0]:,.0f}")
# Output: Predicted Selling Price: ₹ 6,08,5001. Clone the repo
git clone https://github.com/AnmolPatel20/Resale-Car-Price-Prediction.git
cd Resale-Car-Price-Prediction2. Install dependencies
pip install pandas numpy matplotlib seaborn scikit-learn joblib3. Run the notebook
jupyter notebook Resale_Car_Prediction.ipynb- Both
car_price_predictor.pklandpreprocessor.pklmust be loaded together for prediction - The preprocessor handles all encoding and scaling — never pass raw data directly to the model
- The
modelcolumn requires the label encoded integer value, not the string name
I'm on my machine learning journey — building, experimenting and documenting as I go. Every notebook here represents something I've genuinely tried to understand, not just run. 🚀
- GitHub: @AnmolPatel20
- Portfolio: anmolpatel20.github.io/Anmol_Portfolio
Thanks to Krish Naik Sir whose Udemy course has been a great resource throughout this learning journey.
"Not all those who wander are lost." — J.R.R. Tolkien
⭐ Star this repo if you found it helpful!



