Fa-hub07/Stroke-Risk-Classification-ML-

Stroke Risk Classification | Using Random Forest Algorithm and SMOTE Method (to overcome dataset imbalance)

Overview

This project focuses on predicting stroke risk using machine learning classification techniques based on demographic, lifestyle, and medical information. The objective is to identify individuals who may have a higher likelihood of experiencing a stroke and evaluate the effectiveness of the Random Forest algorithm in handling healthcare classification problems.


Problem Statement

Stroke is one of the leading causes of death and long-term disability worldwide. Early detection of high-risk individuals can assist healthcare providers in implementing preventive measures and improving patient outcomes.

This project aims to build a classification model capable of predicting stroke occurrence using patient health data.


Dataset

Source: Kaggle Healthcare Stroke Dataset (Link: https://www.kaggle.com/datasets/ranaghulamnabi/stroke-risk-dataset)

The dataset contains demographic and medical information such as:

  • Gender
  • Age
  • Hypertension
  • Heart Disease
  • Marital Status
  • Work Type
  • Residence Type
  • Average Glucose Level
  • BMI
  • Smoking Status

Target Variable

| Value | Description |
|-------|-------------|
| 0 | No Stroke |
| 1 | Stroke |


Project Workflow

1. Data Understanding

  • Load dataset
  • Examine data structure
  • Identify missing values
  • Check duplicate records
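
The checks above can be sketched with pandas. This is a minimal illustration rather than the notebook's actual code: the small inline DataFrame stands in for the Kaggle CSV, and its column names (`id`, `age`, `bmi`, `smoking_status`, `stroke`) are assumptions about the file's schema.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for pd.read_csv("Dataset Kaggle Stroke.csv");
# the column names are assumed, not taken from the real file.
df = pd.DataFrame({
    "id": [1, 2, 3, 3],
    "age": [67.0, 45.0, 80.0, 80.0],
    "bmi": [36.6, np.nan, 32.5, 32.5],
    "smoking_status": ["smokes", "never smoked",
                       "formerly smoked", "formerly smoked"],
    "stroke": [1, 0, 1, 1],
})

# Examine structure: number of rows/columns and column dtypes
print(df.shape)
print(df.dtypes)

# Count missing values per column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Count fully duplicated rows (the last row repeats the third)
n_duplicates = int(df.duplicated().sum())
print(n_duplicates)
```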

2. Data Preprocessing

  • Remove unnecessary columns (id)
  • Handle missing values
  • Encode categorical variables using Label Encoding
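
A possible version of these preprocessing steps is sketched below. The sample rows and column names are hypothetical, and filling missing BMI values with the median is an assumed strategy, since the README does not state how missing values were handled.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample rows; column names are assumptions about the dataset.
df = pd.DataFrame({
    "id": [10, 11, 12, 13],
    "gender": ["Male", "Female", "Female", "Male"],
    "bmi": [30.0, np.nan, 22.0, 26.0],
    "smoking_status": ["smokes", "never smoked", "smokes", "Unknown"],
    "stroke": [1, 0, 0, 0],
})

# Remove the identifier column, which is not a health attribute
df = df.drop(columns=["id"])

# Fill missing BMI with the column median (an assumed strategy)
df["bmi"] = df["bmi"].fillna(df["bmi"].median())

# Label-encode each text column; keep the encoders so codes can be decoded
categorical_cols = ["gender", "smoking_status"]
encoders = {}
for col in categorical_cols:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

print(df)
```

Label Encoding maps each category to an integer in sorted order, so the codes carry no real ranking. Tree-based models such as Random Forest usually tolerate this, while linear models often need one-hot encoding instead.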

3. Exploratory Data Analysis (EDA)

Visualizations performed:

  • Stroke Distribution
  • Gender Distribution
  • Age Distribution
  • BMI Distribution
  • Smoking Status Distribution
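
Two of these plots could be produced roughly as follows. This is only a sketch with Matplotlib on a hypothetical six-row sample; the column names `age` and `stroke` are assumptions, and the notebook may use Seaborn instead.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample; column names are assumptions about the dataset.
df = pd.DataFrame({
    "age": [67, 45, 80, 33, 59, 71],
    "stroke": [1, 0, 1, 0, 0, 0],
})

# Class counts behind the stroke distribution plot
stroke_counts = df["stroke"].value_counts().sort_index()

fig, (ax_stroke, ax_age) = plt.subplots(1, 2, figsize=(10, 4))

# Bar chart of the target classes
ax_stroke.bar([str(c) for c in stroke_counts.index], stroke_counts.values)
ax_stroke.set_title("Stroke Distribution")
ax_stroke.set_xlabel("stroke")
ax_stroke.set_ylabel("count")

# Histogram of a numeric feature
ax_age.hist(df["age"], bins=5)
ax_age.set_title("Age Distribution")
ax_age.set_xlabel("age")

fig.tight_layout()
# plt.show()  # uncomment when running interactively
```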

4. Data Splitting

Dataset split into:

  • Training Set (80%)
  • Testing Set (20%)
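
An 80/20 split like this can be sketched with scikit-learn's `train_test_split`. The synthetic arrays below are hypothetical, and the use of `stratify` is an assumption: the README only states the 80/20 ratio, but stratifying keeps the share of stroke cases equal in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 3 features, 10% positive class
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = np.array([1] * 10 + [0] * 90)

# 80/20 split; stratify=y preserves the class ratio in both subsets
# (stratification is an assumption, not confirmed by the project)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
print(int(y_train.sum()), int(y_test.sum()))
```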

5. Handling Imbalanced Data

The dataset is highly imbalanced: stroke cases (class 1) are far fewer than non-stroke cases (class 0).

To address this issue:

  • SMOTE (Synthetic Minority Oversampling Technique) was applied to the training data.

Benefits:

  • Better representation of minority class
  • Improved model learning
  • Reduced bias toward majority class

6. Model Development

Algorithm used:

Random Forest Classifier

Parameters:

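from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=200,          # number of trees in the forest
    class_weight='balanced',   # weight classes inversely to their frequency
    random_state=42            # fixed seed for reproducible results
)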

Reason for choosing Random Forest:

  • Relatively robust against overfitting, since predictions are averaged over many trees
  • Handles non-linear relationships
  • Works well on tabular healthcare data
  • Provides feature importance analysis
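
Training with these parameters could be sketched as follows. The data comes from `make_classification` and is purely hypothetical, standing in for the preprocessed stroke features, and the stratified split is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data standing in for the stroke dataset
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Same parameters as listed above
model = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=42
)
model.fit(X_train, y_train)

# Predicted labels and positive-class probabilities for the test set
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]
print(y_pred[:10])
print(y_proba[:10])
```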

7. Model Evaluation

Evaluation metrics:

  • Accuracy
  • Precision
  • Recall
  • F1-Score
  • Confusion Matrix
  • Feature Importance Analysis
  • ROC and AUC Curve
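
The metrics above can all be computed with `sklearn.metrics`. The labels and scores in this sketch are made up for illustration and are not results from this project.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Hypothetical true labels and predicted stroke probabilities
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.4, 0.6, 0.2, 0.9, 0.7, 0.4, 0.8]

# Convert probabilities to class labels with a 0.5 threshold
y_pred = [int(s >= 0.5) for s in y_score]

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)   # uses scores, not thresholded labels
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted

print(accuracy, precision, recall, f1, auc)
print(cm)
```

On imbalanced data, recall and precision for the stroke class are usually more informative than accuracy alone, because a model that always predicts "No Stroke" can still score a high accuracy.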

Stroke Distribution

[Figure: stroke distribution plot]


Confusion Matrix

[Figure: confusion matrix]


Feature Importance Analysis

Feature importance was extracted from the trained Random Forest model to identify the most influential factors contributing to stroke prediction.
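
Extracting importances from a fitted forest might look like this sketch. The data is synthetic and the feature names are hypothetical labels attached for readability, so the resulting ranking says nothing about real stroke risk factors.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature names and synthetic data, for illustration only
feature_names = ["age", "avg_glucose_level", "bmi",
                 "hypertension", "heart_disease"]
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3,
    n_redundant=0, random_state=42
)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X, y)

# Impurity-based importances: one value per feature, summing to 1
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```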

Feature Importance Visualization

[Figure: feature importance plot]


Tools Used

  • Python
  • Pandas
  • NumPy
  • Matplotlib
  • Seaborn
  • Scikit-Learn
  • Imbalanced-Learn (SMOTE)
  • Jupyter Notebook or VS Code (.ipynb)

Project Structure

stroke-risk-classification/
│
├── Dataset Kaggle Stroke.csv
│   
│
├── STROKE_Random_Forest.ipynb
│   
│
├── images/
│   ├── stroke_distribution.png
│   ├── confusion_matrix.png
│   ├── feature_importance.png
│   ├── before_smote.png
│   └── after_smote.png
│
├── README.md
│
└── requirements.txt

Key Findings

  • The original dataset contains class imbalance between stroke and non-stroke cases.
  • SMOTE significantly improved minority class representation.
  • Random Forest learned patterns from the healthcare attributes, reaching 91% accuracy.
  • Feature importance analysis reveals the most influential factors associated with stroke occurrence.

Future Improvements

  • Hyperparameter Tuning
  • Cross Validation
  • XGBoost Comparison
  • LightGBM Comparison
  • Model Deployment using Streamlit
  • Use Ensemble Learning
  • SMOTE + ENN
