Stroke Risk Classification | Using Random Forest Algorithm and SMOTE Method (to overcome dataset imbalance)
This project focuses on predicting stroke risk using machine learning classification techniques based on demographic, lifestyle, and medical information. The objective is to identify individuals who may have a higher likelihood of experiencing a stroke and evaluate the effectiveness of the Random Forest algorithm in handling healthcare classification problems.
Stroke is one of the leading causes of death and long-term disability worldwide. Early detection of high-risk individuals can assist healthcare providers in implementing preventive measures and improving patient outcomes.
This project aims to build a classification model capable of predicting stroke occurrence using patient health data.
Source: Kaggle Healthcare Stroke Dataset (Link: https://www.kaggle.com/datasets/ranaghulamnabi/stroke-risk-dataset)
The dataset contains demographic and medical information such as:
- Gender
- Age
- Hypertension
- Heart Disease
- Marital Status
- Work Type
- Residence Type
- Average Glucose Level
- BMI
- Smoking Status
Target variable:

| Value | Description |
|-------|-------------|
| 0 | No Stroke |
| 1 | Stroke |
- Load dataset
- Examine data structure
- Identify missing values
- Check duplicate records
- Remove unnecessary columns (`id`)
- Handle missing values
- Encode categorical variables using Label Encoding
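
The preprocessing steps above could look roughly like the sketch below. It runs on a tiny hand-made DataFrame rather than the Kaggle CSV, and the column names, the median imputation for `bmi`, and the list of categorical columns are all assumptions for illustration; the notebook may handle these differently.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny stand-in for the Kaggle CSV; column names and values are assumptions.
df = pd.DataFrame({
    "id": [1, 2, 3, 3, 4],
    "gender": ["Male", "Female", "Female", "Female", "Male"],
    "age": [67.0, 45.0, 80.0, 80.0, 33.0],
    "bmi": [36.6, np.nan, 32.5, 32.5, 24.1],
    "smoking_status": ["smokes", "never smoked", "formerly smoked",
                       "formerly smoked", "never smoked"],
    "stroke": [1, 0, 1, 1, 0],
})

# Examine structure, missing values, and duplicates.
print(df.dtypes)
print(df.isnull().sum())
print("duplicate rows:", df.duplicated().sum())

# Drop duplicate records, then the identifier column, which carries no signal.
df = df.drop_duplicates().drop(columns=["id"])

# Median imputation is one common choice (assumed here, not confirmed by the project).
df["bmi"] = df["bmi"].fillna(df["bmi"].median())

# Label-encode each categorical column, keeping the encoders for later decoding.
categorical_cols = ["gender", "smoking_status"]  # hypothetical subset
encoders = {}
for col in categorical_cols:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

print(df)
```

Keeping the fitted encoders makes it possible to map the integer codes back to the original labels with `inverse_transform` when interpreting results.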
Visualizations performed:
- Stroke Distribution
- Gender Distribution
- Age Distribution
- BMI Distribution
- Smoking Status Distribution
Dataset split into:
- Training Set (80%)
- Testing Set (20%)
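
A minimal sketch of the 80/20 split, using synthetic data from `make_classification` in place of the stroke features. The `stratify=y` argument and the `random_state` value are assumptions; stratifying is a common choice for imbalanced targets because it keeps the class ratio the same in both sets.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the stroke data: roughly 5% positive cases.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.95, 0.05], random_state=42
)

# 80% training, 20% testing; stratify=y preserves the class ratio (assumed, not confirmed).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("train:", X_train.shape, "test:", X_test.shape)
print("positive rate train/test:", y_train.mean(), y_test.mean())
```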
The dataset is highly imbalanced.
To address this issue:
- SMOTE (Synthetic Minority Oversampling Technique) was applied to the training data.
Benefits:
- Better representation of minority class
- Improved model learning
- Reduced bias toward majority class
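Algorithm used: Random Forest Classifier
Parameters:
```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=200,          # number of trees in the forest
    class_weight='balanced',   # reweight classes inversely to their frequency
    random_state=42            # fixed seed for reproducibility
)
```
Reason for choosing Random Forest: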
- Robust against overfitting
- Handles non-linear relationships
- Works well on tabular healthcare data
- Provides feature importance analysis
Evaluation metrics:
- Accuracy
- Precision
- Recall
- F1-Score
- Confusion Matrix
- Feature Importance Analysis
- ROC and AUC Curve
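
As a rough illustration of how these metrics can be computed with scikit-learn, the sketch below trains a Random Forest with the parameters listed in this README on synthetic data. The data, the split settings, and the class names passed to the report are assumptions, so the printed numbers say nothing about the project's actual results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the stroke dataset.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=42
)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)               # hard class predictions
y_score = model.predict_proba(X_test)[:, 1]  # probability of the positive class

cm = confusion_matrix(y_test, y_pred)        # rows: true class, columns: predicted
acc = accuracy_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_score)

print(cm)
print(classification_report(
    y_test, y_pred, target_names=["No Stroke", "Stroke"], zero_division=0
))
print("Accuracy:", acc)
print("ROC AUC:", auc)
```

On imbalanced data, the per-class recall and F1-score in the report are usually more informative than overall accuracy, which is why they are listed alongside it.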
Feature importance was extracted from the trained Random Forest model to identify the most influential factors contributing to stroke prediction.
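
A minimal sketch of extracting and ranking feature importances from a fitted Random Forest. The feature names are hypothetical snake_case versions of the attributes listed earlier, and the model is fitted on synthetic data, so the resulting ranking is illustrative only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical column names mirroring the attributes described in this README.
feature_names = [
    "gender", "age", "hypertension", "heart_disease", "ever_married",
    "work_type", "residence_type", "avg_glucose_level", "bmi", "smoking_status",
]

# Synthetic data with one column per feature name.
X, y = make_classification(
    n_samples=500, n_features=len(feature_names), random_state=42
)

model = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=42
).fit(X, y)

# Impurity-based importances, one value per feature, summing to 1.
importances = pd.Series(
    model.feature_importances_, index=feature_names
).sort_values(ascending=False)

print(importances)
```

The sorted Series can be passed to a bar plot (for example `importances.plot.barh()`) to produce a chart like `feature_importance.png`.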
- Python
- Pandas
- NumPy
- Matplotlib
- Seaborn
- Scikit-Learn
- Imbalanced-Learn (SMOTE)
- Jupyter Notebook or VS Code (`.ipynb`)
stroke-risk-classification/
│
├── Dataset Kaggle Stroke.csv
│
│
├── STROKE_Random_Forest.ipynb
│
│
├── images/
│ ├── stroke_distribution.png
│ ├── confusion_matrix.png
│ ├── feature_importance.png
│ ├── before_smote.png
│ └── after_smote.png
│
├── README.md
│
└── requirements.txt
- The original dataset contains class imbalance between stroke and non-stroke cases.
- SMOTE significantly improved minority class representation.
- Random Forest successfully learned patterns from healthcare attributes, reaching 91% accuracy.
- Feature importance analysis reveals the most influential factors associated with stroke occurrence.
- Hyperparameter Tuning
- Cross Validation
- XGBoost Comparison
- LightGBM Comparison
- Model Deployment using Streamlit
- Use Ensemble Learning
- SMOTE + ENN


