Stroke Risk Classification | Using Random Forest Algorithm and SMOTE Method (to overcome dataset imbalance)

Overview

This project focuses on predicting stroke risk using machine learning classification techniques based on demographic, lifestyle, and medical information. The objective is to identify individuals who may have a higher likelihood of experiencing a stroke and evaluate the effectiveness of the Random Forest algorithm in handling healthcare classification problems.

Problem Statement

Stroke is one of the leading causes of death and long-term disability worldwide. Early detection of high-risk individuals can assist healthcare providers in implementing preventive measures and improving patient outcomes.

This project aims to build a classification model capable of predicting stroke occurrence using patient health data.

Dataset

Source: Kaggle Healthcare Stroke Dataset (Link: https://www.kaggle.com/datasets/ranaghulamnabi/stroke-risk-dataset)

The dataset contains demographic and medical information such as:

Gender
Age
Hypertension
Heart Disease
Marital Status
Work Type
Residence Type
Average Glucose Level
BMI
Smoking Status

Target Variable: stroke occurrence (stroke vs. non-stroke)

Project Workflow

1. Data Understanding

Load dataset
Examine data structure
Identify missing values
Check duplicate records
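The checks above can be sketched with pandas as follows. This is a minimal illustration on a made-up four-row sample; the column names (`id`, `gender`, `age`, `bmi`, `stroke`) are assumptions and may not match the actual CSV, which would be loaded with `pd.read_csv` on the file path instead.

```python
import io
import pandas as pd

# Hypothetical sample that mimics the dataset layout; NOT real patient data.
csv_text = """id,gender,age,bmi,stroke
1,Male,67,36.6,1
2,Female,61,,0
3,Female,49,34.4,1
3,Female,49,34.4,1
"""

# Load dataset (in the project this would point at the Kaggle CSV file)
df = pd.read_csv(io.StringIO(csv_text))

# Examine data structure: shape and column types
print(df.shape)
print(df.dtypes)

# Identify missing values per column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Check duplicate records (rows identical in every column)
duplicate_rows = int(df.duplicated().sum())
print(duplicate_rows)
```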

2. Data Preprocessing

Remove unnecessary columns (id)
Handle missing values
Encode categorical variables using Label Encoding
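A minimal sketch of these preprocessing steps is shown below. The sample frame and column names are hypothetical, and filling missing `bmi` values with the median is an assumption made for illustration; the README does not say which imputation strategy the notebook uses.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample; column names are assumed, values are made up.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "gender": ["Male", "Female", "Female", "Male"],
    "smoking_status": ["never smoked", "smokes", "formerly smoked", "smokes"],
    "bmi": [36.6, None, 34.4, 24.0],
    "stroke": [1, 0, 1, 0],
})

# Remove the identifier column, which carries no predictive information
df = df.drop(columns=["id"])

# Handle missing values (median imputation is an assumption, not the confirmed method)
df["bmi"] = df["bmi"].fillna(df["bmi"].median())

# Encode each categorical column as integers with its own LabelEncoder
for col in ["gender", "smoking_status"]:
    df[col] = LabelEncoder().fit_transform(df[col])

print(df)
```

LabelEncoder assigns integers in sorted order of the category names, so the codes carry no meaningful ranking. Tree-based models such as Random Forest tolerate this reasonably well.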

3. Exploratory Data Analysis (EDA)

Visualizations performed:

Stroke Distribution
Gender Distribution
Age Distribution
BMI Distribution
Smoking Status Distribution

4. Data Splitting

Dataset split into:

Training Set (80%)
Testing Set (20%)
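The 80/20 split could look like the sketch below, which uses scikit-learn's `train_test_split` on toy data. Passing `stratify=y` and `random_state=42` are assumptions here; stratification keeps the stroke/non-stroke ratio the same in both sets, which is a common choice for imbalanced data, but the README does not state whether the notebook does this.

```python
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 10% positive class (made up for illustration)
X = [[i] for i in range(100)]
y = [1] * 10 + [0] * 90

# 80% training, 20% testing; stratify=y preserves the class ratio (assumed setting)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(len(X_train), len(X_test))
print(sum(y_train), sum(y_test))
```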

5. Handling Imbalanced Data

The dataset is highly imbalanced: stroke cases are far outnumbered by non-stroke cases.

To address this issue:

SMOTE (Synthetic Minority Oversampling Technique) was applied to the training data.

Benefits:

Better representation of minority class
Improved model learning
Reduced bias toward majority class
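To illustrate the idea behind SMOTE, the sketch below generates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. It is a simplified, standard-library-only illustration, not the Imbalanced-Learn implementation used in the project; the function name `smote_like_oversample` and the sample points are hypothetical.

```python
import math
import random

def smote_like_oversample(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding the point itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # random position on the segment from base to neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

# Hypothetical scaled two-feature minority samples; NOT real patient data.
minority = [(0.9, 0.8), (0.8, 0.9), (0.7, 0.7), (0.95, 0.6)]
new_points = smote_like_oversample(minority, n_new=6)
print(new_points)
```

With Imbalanced-Learn, the equivalent step is typically a call such as `SMOTE(random_state=42).fit_resample(X_train, y_train)` on the training split only, so that the test set keeps its original class distribution.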

6. Model Development

Algorithm used:

Random Forest Classifier

Parameters:

RandomForestClassifier(
    n_estimators=200,
    class_weight='balanced',
    random_state=42
)

Reason for choosing Random Forest:

Relatively robust to overfitting, since predictions are averaged over many trees
Handles non-linear relationships
Works well on tabular healthcare data
Provides feature importance analysis
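A minimal training sketch with the parameters listed above is shown below. It runs on synthetic data from `make_classification` as a stand-in for the preprocessed stroke data, so the class ratio, feature count, and stratified split are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 10% positives) standing in for the real dataset
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Same parameters as listed in this README
model = RandomForestClassifier(
    n_estimators=200,
    class_weight="balanced",
    random_state=42,
)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)               # hard class labels
y_proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class
print(y_pred[:10])
```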

Model Evaluation

Evaluation metrics:

Accuracy
Precision
Recall
F1-Score
Confusion Matrix
Feature Importance Analysis
ROC and AUC Curve
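The metrics above can be computed with `sklearn.metrics`, as in the sketch below. The labels and scores are made-up toy values chosen so the numbers are easy to verify by hand; they are not results from this project.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Toy ground truth and predicted probabilities (hypothetical, 2 positives out of 10)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.7, 0.9, 0.35]

# Convert probabilities to class labels with a 0.5 threshold
y_pred = [int(s >= 0.5) for s in y_score]

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class
auc = roc_auc_score(y_true, y_score)   # uses scores, not thresholded labels

print(acc, prec, rec, f1, auc)
print(cm)
```

In this toy case accuracy is 0.8 while recall on the positive class is only 0.5, which shows why precision, recall, and F1 are reported alongside accuracy on imbalanced data.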

Stroke Distribution (figure: images/stroke_distribution.png)

Confusion Matrix (figure: images/confusion_matrix.png)

Feature Importance Analysis

Feature importance was extracted from the trained Random Forest model to identify the most influential factors contributing to stroke prediction.
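Extracting importances from a fitted model could look like the sketch below. It uses synthetic data and neutral placeholder feature names, so the resulting ranking says nothing about stroke risk factors; in the project, the index would be the dataset's column names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder names; the real ones would come from the training DataFrame columns
feature_names = ["feature_0", "feature_1", "feature_2", "feature_3"]

# Synthetic data standing in for the preprocessed stroke dataset
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=42
)
model = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)

# Impurity-based importances, one per feature, normalized to sum to 1
importances = pd.Series(
    model.feature_importances_, index=feature_names
).sort_values(ascending=False)

print(importances)
```

The sorted series can be passed directly to a bar plot, for example `importances.plot(kind="barh")` with Matplotlib installed.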

Feature Importance Visualization (figure: images/feature_importance.png)

Tools Used

Python
Pandas
NumPy
Matplotlib
Seaborn
Scikit-Learn
Imbalanced-Learn (SMOTE)
Jupyter Notebook or VS Code (.ipynb)

Project Structure

stroke-risk-classification/
│
├── Dataset Kaggle Stroke.csv
│
├── STROKE_Random_Forest.ipynb
│
├── images/
│   ├── stroke_distribution.png
│   ├── confusion_matrix.png
│   ├── feature_importance.png
│   ├── before_smote.png
│   └── after_smote.png
│
├── README.md
│
└── requirements.txt

Key Findings

The original dataset contains class imbalance between stroke and non-stroke cases.
SMOTE significantly improved minority class representation in the training data.
The Random Forest model learned patterns from the healthcare attributes, reaching 91% accuracy.
Feature importance analysis reveals the most influential factors associated with stroke occurrence.

Future Improvements

Hyperparameter Tuning
Cross Validation
XGBoost Comparison
LightGBM Comparison
Model Deployment using Streamlit
Use Ensemble Learning
SMOTE + ENN
