An end-to-end machine learning project using the UCI Heart Disease dataset to predict the presence of heart disease based on clinical and demographic features.
Usage
View the notebook:
View Notebook
View the full report:
View Report
Techniques Used
Preprocessing
- Categorical values encoded numerically (e.g. chest pain type, sex, thal).
- Missing data handled with:
  - Mode/Median Imputation
  - K-Nearest Neighbors (KNN) Imputation (tuned for optimal `k`)
- StandardScaler used to standardize continuous variables (zero mean, unit variance).
- Dropped low-information features such as `fbs` and `restecg` based on mutual information.
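The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the data is a synthetic stand-in generated with `make_classification`, the column names `feat_0` to `feat_5` are hypothetical, and the candidate values for `k` are an assumption. It tunes the number of neighbors for KNN imputation by cross-validated accuracy, then ranks features by mutual information with the target.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the clinical features (NOT the real UCI data).
X_arr, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(6)])

# Blank out roughly 10% of the values to simulate missing entries.
rng = np.random.default_rng(0)
X = X.mask(rng.random(X.shape) < 0.10)

# Impute, scale, then classify; the pipeline keeps each step inside the CV folds.
pipe = Pipeline([
    ("impute", KNNImputer()),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Candidate k values are an assumption chosen for illustration.
search = GridSearchCV(
    pipe, {"impute__n_neighbors": [3, 5, 7, 9]}, cv=5, scoring="accuracy"
)
search.fit(X, y)
best_k = search.best_params_["impute__n_neighbors"]

# Rank features by mutual information with the target (lowest first).
X_imputed = KNNImputer(n_neighbors=best_k).fit_transform(X)
mi = pd.Series(
    mutual_info_classif(X_imputed, y, random_state=0), index=X.columns
).sort_values()
print("best k:", best_k)
print(mi)
```

Features at the top of the printed ranking carry the least information about the target and are the natural candidates to drop, which is the kind of reasoning that would lead to removing `fbs` and `restecg`.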
Models Trained
- Logistic Regression
  - Binary classification
  - Multiclass classification
  - Tuned using regularization strength `C`
- Applied PCA for visualization and insight
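As a rough sketch of the modelling steps listed above, the snippet below tunes the regularization strength `C` of a logistic regression with a grid search and projects the standardized features onto two principal components for plotting. The data is synthetic, and the grid of `C` values, the train/test split, and the cross-validation settings are assumptions made for illustration, not the values used in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary-classification data standing in for the real dataset.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scale, then fit logistic regression; smaller C means stronger regularization.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
grid = GridSearchCV(
    pipe, {"logisticregression__C": [0.01, 0.1, 1, 10, 100]}, cv=5
)
grid.fit(X_train, y_train)

best_C = grid.best_params_["logisticregression__C"]
test_accuracy = grid.score(X_test, y_test)
print("best C:", best_C, "test accuracy:", round(test_accuracy, 3))

# Two principal components of the standardized features, e.g. for a scatter plot.
X_2d = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
```

The same pipeline accepts a multiclass target without changes, since scikit-learn's `LogisticRegression` handles more than two classes natively.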
Results
Model | Accuracy | False Negatives |
---|---|---|
Logistic Regression (Binary, Untuned) | 80% | 23 |
Logistic Regression (Binary, Mean Imputation) | 82% | 18 |
Logistic Regression (Binary, KNN Imputation) | 84% | 15 |
Logistic Regression (Multiclass, Tuned) | 57% | 11 |
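For the binary models, accuracy and the false-negative count can both be read off a confusion matrix. The snippet below shows one way to do this with scikit-learn; the labels are made-up toy values for illustration and are unrelated to the results in the table.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels: 1 = heart disease present, 0 = absent (illustrative only).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# With labels=[0, 1], ravel() yields (tn, fp, fn, tp) in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
accuracy = accuracy_score(y_true, y_pred)

# A false negative is a patient with disease that the model predicted as healthy.
print(f"accuracy={accuracy:.2f}, false negatives={fn}")
```

Tracking false negatives alongside accuracy matters here because a missed diagnosis is usually the more costly error in a screening setting.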
Languages
Python (Pandas, scikit-learn, matplotlib, seaborn)