This project builds a machine learning model to predict the presence of heart disease based on clinical and lifestyle features. It supports early diagnosis and risk stratification for patients, helping healthcare providers make informed decisions.
The dataset includes patient-level data with various diagnostic and demographic attributes.
Key columns:
- Age: Age of the patient
- Sex: Gender (1 = male, 0 = female)
- ChestPainType: Type of chest pain (e.g., typical angina, asymptomatic)
- RestingBP: Resting blood pressure (mm Hg)
- Cholesterol: Serum cholesterol (mg/dl)
- FastingBS: Fasting blood sugar > 120 mg/dl (1 = true, 0 = false)
- RestingECG: Resting electrocardiographic results
- MaxHR: Maximum heart rate achieved
- ExerciseAngina: Exercise-induced angina (1 = yes, 0 = no)
- Oldpeak: ST depression induced by exercise
- ST_Slope: Slope of the peak exercise ST segment
- HeartDisease: Target variable (1 = disease present, 0 = no disease)
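Before training, it can help to confirm that a loaded frame actually matches the column list above. The following is a minimal sketch, not part of the original project: `EXPECTED_COLUMNS`, `check_schema`, and the tiny `sample` frame are hypothetical names and made-up values used only for illustration.

```python
import pandas as pd

# Column names copied from the list above.
EXPECTED_COLUMNS = [
    "Age", "Sex", "ChestPainType", "RestingBP", "Cholesterol", "FastingBS",
    "RestingECG", "MaxHR", "ExerciseAngina", "Oldpeak", "ST_Slope", "HeartDisease",
]


def check_schema(df):
    """Return a list of problems found; an empty list means the frame looks usable."""
    problems = []
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    if "HeartDisease" in df.columns:
        # The target is documented as binary (1 = disease present, 0 = no disease).
        unexpected = set(df["HeartDisease"].dropna().unique()) - {0, 1}
        if unexpected:
            problems.append(f"unexpected target values: {sorted(unexpected)}")
    return problems


# Made-up rows for illustration only; the category labels are placeholders.
sample = pd.DataFrame({
    "Age": [54, 61],
    "Sex": [1, 0],
    "ChestPainType": ["asymptomatic", "typical angina"],
    "RestingBP": [130, 140],
    "Cholesterol": [220, 250],
    "FastingBS": [0, 1],
    "RestingECG": ["normal", "normal"],
    "MaxHR": [150, 120],
    "ExerciseAngina": [0, 1],
    "Oldpeak": [0.0, 1.5],
    "ST_Slope": ["up", "flat"],
    "HeartDisease": [0, 1],
})

print(check_schema(sample))
print(check_schema(sample.drop(columns=["Age"])))
```

Returning a list of problems rather than raising on the first one makes it easier to see every mismatch in a single pass.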
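import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Load the dataset and drop rows with missing values
df = pd.read_csv('/kaggle/input/heart-disease-dataset/heart.csv')
df = df.dropna()

# One-hot encode categorical columns; drop_first avoids redundant dummy columns
df_encoded = pd.get_dummies(df, drop_first=True)

# Separate features from the target
X = df_encoded.drop('HeartDisease', axis=1)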
y = df_encoded['HeartDisease']

# Stratified 80/20 split keeps the class balance the same in both sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

input_df = pd.DataFrame(np.zeros((1, len(X_train.columns))), columns=X_train.columns)
# Fill in values based on confirmed X_train.columns
model.predict(input_df)

Results:
- Accuracy: ~90–95% on test data
- Robust classification across age, cholesterol, and ECG features
- Top predictors: Chest pain type, ST slope, MaxHR, and ExerciseAngina
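The README does not say how the top predictors were identified. One common approach with a random forest is to read the impurity-based `feature_importances_` and sum the one-hot dummy columns back to their source column. The sketch below assumes that approach and runs on synthetic data with made-up category labels, so its ranking says nothing about the real dataset; `source_column` is a hypothetical helper.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (NOT the real dataset); category labels are placeholders.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "Age": rng.integers(30, 80, n),
    "MaxHR": rng.integers(80, 200, n),
    "ChestPainType": rng.choice(["TA", "ATA", "NAP", "ASY"], n),
    "ST_Slope": rng.choice(["Up", "Flat", "Down"], n),
    "ExerciseAngina": rng.integers(0, 2, n),
})
# Artificial target so the forest has a signal to learn.
df["HeartDisease"] = ((df["ST_Slope"] == "Flat") | (df["ExerciseAngina"] == 1)).astype(int)

feature_cols = [c for c in df.columns if c != "HeartDisease"]
encoded = pd.get_dummies(df, drop_first=True)
X = encoded.drop("HeartDisease", axis=1)
y = encoded["HeartDisease"]

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)


def source_column(encoded_name, original_columns):
    """Map a dummy column such as 'ST_Slope_Flat' back to 'ST_Slope'."""
    for col in original_columns:
        if encoded_name == col or encoded_name.startswith(col + "_"):
            return col
    return encoded_name


# Sum the importances of all dummy columns that came from the same source column.
importances = pd.Series(model.feature_importances_, index=X.columns)
grouped = (
    importances.groupby(lambda name: source_column(name, feature_cols))
    .sum()
    .sort_values(ascending=False)
)
print(grouped)
```

Impurity-based importances can favor continuous or high-cardinality features, so `sklearn.inspection.permutation_importance` on the test set is a reasonable cross-check.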
Requirements:
- numpy
- pandas
- scikit-learn
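For the single-patient prediction step, filling a zero-initialized frame by hand is error-prone once categorical columns have been one-hot encoded. One possible alternative is to encode the patient record and then align it to the training columns with `reindex`. This is a sketch on a tiny made-up training frame; `encode_patient` and all values are hypothetical.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Tiny made-up training frame (NOT real patient data), reduced to three features.
train = pd.DataFrame({
    "Age": [45, 62, 51, 38, 70, 57],
    "ChestPainType": ["ASY", "NAP", "ASY", "ATA", "ASY", "NAP"],
    "ST_Slope": ["Flat", "Up", "Flat", "Up", "Down", "Up"],
    "HeartDisease": [1, 0, 1, 0, 1, 0],
})
encoded = pd.get_dummies(train, drop_first=True)
X_train = encoded.drop("HeartDisease", axis=1)
y_train = encoded["HeartDisease"]

model = RandomForestClassifier(n_estimators=20, random_state=42)
model.fit(X_train, y_train)


def encode_patient(patient, train_columns):
    """One-hot encode a single record and align it to the training columns.

    drop_first is deliberately NOT used here: with one row, it would drop the
    only category present. Columns absent from the record are filled with 0,
    which also represents the baseline category dropped during training.
    """
    row = pd.get_dummies(pd.DataFrame([patient]))
    return row.reindex(columns=train_columns, fill_value=0)


patient = {"Age": 60, "ChestPainType": "NAP", "ST_Slope": "Flat"}
input_df = encode_patient(patient, X_train.columns)

print(input_df)
print(model.predict(input_df))
```

Note that `reindex` silently discards any category the model never saw during training, so validating inputs against the known categories first is worth considering.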