🎓 World Bank Education Statistics — Data Analysis Pipeline

A complete end-to-end data analysis pipeline on the World Bank Education Statistics dataset — covering data loading, cleaning, EDA, statistical analysis, visualization, MySQL storage, and Excel dashboard export.

📌 Project Overview

This project analyzes global education indicators across 886,930 rows and 69 columns spanning years 1970 to 2075, covering countries worldwide. The goal is to extract meaningful insights on education trends — literacy rates, enrollment rates, government expenditure, and more — using a full data pipeline built in Python.

📂 Dataset

| Property | Details |
| --- | --- |
| Name | Education Statistics (EdStatsData.csv) |
| Source | World Bank via Kaggle |
| Kaggle Link | https://www.kaggle.com/datasets/theworldbank/education-statistics |
| Provider | The World Bank Open Data |
| Size | ~300 MB (886,930 rows × 69 columns) |
| Coverage | 200+ countries, 4,000+ indicators, 1970–2075 |
| License | World Bank Open Data License (CC BY 4.0) |

⚠️ The raw CSV file is NOT included in this repository due to its large size.
Download it directly from Kaggle using the link above.

Dataset Columns

| Column | Description |
| --- | --- |
| `Country Name` | Full name of the country |
| `Country Code` | ISO 3-letter country code |
| `Indicator Name` | Full name of the education indicator |
| `Indicator Code` | World Bank indicator code |
| `1970` – `2075` | Indicator value for each year |
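
The wide layout described above (one column per year) is typically reshaped to a long layout before analysis. The following is a minimal sketch of that reshape with `DataFrame.melt`, not the notebook's exact code. The toy rows, the indicator code `IND.A`, and the output column names `Year` and `Value` are assumptions made for illustration.

```python
import pandas as pd

# The four identifier columns listed in the table above.
ID_COLS = ["Country Name", "Country Code", "Indicator Name", "Indicator Code"]

# Toy wide-format frame; the values and the indicator code are made up.
wide = pd.DataFrame(
    {
        "Country Name": ["India", "Brazil"],
        "Country Code": ["IND", "BRA"],
        "Indicator Name": ["Indicator A", "Indicator A"],
        "Indicator Code": ["IND.A", "IND.A"],
        "1970": [1.0, None],
        "1971": [2.0, 3.0],
    }
)

# Turn every year column into (Year, Value) rows, keeping the ID columns.
long_df = wide.melt(id_vars=ID_COLS, var_name="Year", value_name="Value")

# Drop year/indicator combinations that have no observation.
long_df = long_df.dropna(subset=["Value"]).reset_index(drop=True)

# Year labels arrive as strings ("1970"); store them as small integers.
long_df["Year"] = long_df["Year"].astype("int16")

print(long_df)
```

Dropping missing values right after the melt keeps the long table much smaller than rows × years, since many country and indicator pairs have no value for most years.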

🗂️ Repository Structure

EdStatsAnalysis/
│
├── EdStatsCode.ipynb             # Main analysis notebook (11 sections)
├── EdStats_Dashboard_Data.xlsx   # Excel dashboard (3 summary sheets)
├── README.md                     # Project documentation
├── .gitignore                    # Excludes CSV, cache, checkpoints
│
├── plot_global_trend.png         # Global average value trend (1970–2023)
├── plot_top_countries.png        # Top 10 countries by data availability
├── plot_distribution.png         # Value distribution histogram + KDE
└── plot_regression.png           # Linear regression — actual vs predicted

🔁 Pipeline

Kaggle CSV (886k rows)
        ↓
Chunked Memory-Safe Loading (50k rows/chunk)
        ↓
Data Cleaning & Null Handling
        ↓
Wide → Long Format Reshape (melt)
        ↓
NumPy Statistical Analysis
        ↓
Matplotlib & Seaborn Visualizations
        ↓
Linear Regression (sklearn)
        ↓
MySQL Storage (chunked push, 10k rows/chunk)
        ↓
Excel Dashboard Export (3 sheets)

📓 Notebook Sections

| Section | Description |
| --- | --- |
| 1 | Import libraries |
| 2 | Memory-safe chunked CSV loading with dtype optimization |
| 3 | Basic EDA: shape, dtypes, missing values |
| 4 | Wide → Long format reshape using `melt()` |
| 5 | NumPy statistical analysis (mean, median, std, percentiles, decade stats) |
| 6 | Reusable filter helper functions |
| 7 | 5 visualizations: trend, bar, distribution, country filter, heatmap |
| 8 | Linear regression with StandardScaler, R² and RMSE evaluation |
| 9 | Excel export: 3 summarized sheets for dashboard |
| 10 | MySQL push: chunked, memory-safe |
| 11 | Power BI / Excel dashboard guide |
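
As a rough illustration of the statistics named in Section 5, the sketch below computes NaN-aware summary values and per-decade means with NumPy. The helper names `summarize` and `decade_means` and the sample numbers are hypothetical; the notebook may organize this differently.

```python
import numpy as np

# Made-up observations for one country/indicator pair; NaN marks a missing year.
years = np.array([1970, 1975, 1980, 1985, 1990, 1995])
values = np.array([50.0, np.nan, 60.0, 64.0, 70.0, 74.0])


def summarize(vals):
    """Return NaN-aware summary statistics for a 1-D array."""
    return {
        "mean": float(np.nanmean(vals)),
        "median": float(np.nanmedian(vals)),
        "std": float(np.nanstd(vals)),
        "p25": float(np.nanpercentile(vals, 25)),
        "p75": float(np.nanpercentile(vals, 75)),
    }


def decade_means(yrs, vals):
    """Average the values within each decade (1970s, 1980s, ...)."""
    decades = (yrs // 10) * 10  # 1975 -> 1970, 1985 -> 1980
    return {
        int(d): float(np.nanmean(vals[decades == d]))
        for d in np.unique(decades)
    }


stats = summarize(values)
by_decade = decade_means(years, values)
print(stats)
print(by_decade)
```

The `nan*` variants are used because many year columns in this dataset are sparsely populated; the plain `np.mean` would return NaN as soon as one value is missing.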

📊 Excel Dashboard Sheets

| Sheet | Contents | Best Visual in Excel |
| --- | --- | --- |
| `Country_Year_Avg` | Average value per country per year | Line chart with country slicer |
| `Indicator_Summary` | Data count, mean, std per indicator | Bar chart of top indicators |
| `Decade_Country_Avg` | Average value per country per decade | PivotTable matrix / heatmap |
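
One way the three sheets above could be produced from a long-format table is sketched here. The function names `build_dashboard_sheets` and `export_dashboard`, the `Year` and `Value` column names, and the toy rows are assumptions for illustration rather than the notebook's actual code.

```python
import pandas as pd


def build_dashboard_sheets(long_df):
    """Aggregate a long-format frame into the three summary tables."""
    country_year = long_df.groupby(
        ["Country Name", "Year"], as_index=False
    )["Value"].mean()

    indicator = (
        long_df.groupby("Indicator Name")["Value"]
        .agg(["count", "mean", "std"])
        .reset_index()
    )

    # Derive the decade (1975 -> 1970) before averaging per country.
    with_decade = long_df.assign(Decade=(long_df["Year"] // 10) * 10)
    decade_country = with_decade.groupby(
        ["Country Name", "Decade"], as_index=False
    )["Value"].mean()

    return {
        "Country_Year_Avg": country_year,
        "Indicator_Summary": indicator,
        "Decade_Country_Avg": decade_country,
    }


def export_dashboard(sheets, path):
    """Write each summary table to its own sheet (requires openpyxl)."""
    with pd.ExcelWriter(path, engine="openpyxl") as writer:
        for name, frame in sheets.items():
            frame.to_excel(writer, sheet_name=name, index=False)


# Toy long-format data; the values are made up.
long_df = pd.DataFrame(
    {
        "Country Name": ["India", "India", "India", "Brazil"],
        "Indicator Name": ["Indicator A", "Indicator A", "Indicator B", "Indicator A"],
        "Year": [1970, 1975, 1970, 1980],
        "Value": [30.0, 34.0, 50.0, 60.0],
    }
)

sheets = build_dashboard_sheets(long_df)
for name, frame in sheets.items():
    print(name, frame.shape)
```

Calling `export_dashboard(sheets, "EdStats_Dashboard_Data.xlsx")` would then write the workbook. Only the aggregated tables are written, which keeps the file small compared with exporting every raw row.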

🛠️ Tech Stack

| Tool | Purpose |
| --- | --- |
| Python 3.10+ | Core programming language |
| Pandas | Data loading, cleaning, reshaping |
| NumPy | Vectorized statistical computations |
| Matplotlib | Static plotting |
| Seaborn | Statistical visualizations |
| Scikit-learn | Linear regression, scaling, train-test split |
| SQLAlchemy + PyMySQL | MySQL database connection |
| OpenPyXL | Excel file export |
| MySQL 8.0 | Relational database storage |
| Jupyter Notebook | Interactive development environment |

⚙️ Setup & Installation

1. Clone the repository

git clone https://github.com/yourusername/edstats-analysis.git
cd edstats-analysis

2. Install dependencies

pip install pandas numpy matplotlib seaborn scikit-learn sqlalchemy pymysql openpyxl jupyter

3. Download the dataset

Go to: https://www.kaggle.com/datasets/theworldbank/education-statistics
Download EdStatsData.csv
Place it anywhere on your system and update FILE_PATH in Section 2 of the notebook

4. Set up MySQL (optional — for Section 10)

CREATE DATABASE ed_stats_data_db;
USE ed_stats_data_db;

Then update Section 10 in the notebook with your MySQL credentials.
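
For orientation, a chunked push with `DataFrame.to_sql` might look like the sketch below. To stay runnable without a MySQL server, the demo writes to an in-memory SQLite database from the standard library. The helper name `push_in_chunks`, the table name `ed_stats_long`, and the toy rows are hypothetical. For MySQL you would instead pass a SQLAlchemy engine, for example one created with `create_engine("mysql+pymysql://user:password@localhost/ed_stats_data_db")`, with your own credentials filled in.

```python
import sqlite3

import pandas as pd


def push_in_chunks(df, con, table, chunk_rows=10_000):
    """Write df to `table`, inserting at most `chunk_rows` rows per batch."""
    df.to_sql(
        table,
        con,
        if_exists="replace",  # recreate the table on each run
        index=False,
        chunksize=chunk_rows,
    )


# Toy rows; the values are made up.
demo = pd.DataFrame(
    {
        "country_code": ["IND", "BRA", "KEN"],
        "year": [1970, 1970, 1970],
        "value": [1.0, 2.0, 3.0],
    }
)

# SQLite stands in for MySQL here so the sketch runs anywhere.
con = sqlite3.connect(":memory:")
push_in_chunks(demo, con, "ed_stats_long", chunk_rows=2)

row_count = con.execute("SELECT COUNT(*) FROM ed_stats_long").fetchone()[0]
print(row_count)
```

The `chunksize` argument controls how many rows go into each insert batch, so memory use on the client stays roughly constant regardless of table size.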

5. Run the notebook

jupyter notebook EdStatsCode.ipynb

Run cells top to bottom. Section 10 (MySQL push) is optional — set PUSH_TO_MYSQL = True only when MySQL is configured.

🔑 Key Fixes from Original Notebook

| Original Issue | Fix Applied |
| --- | --- |
| Dropped Country/Indicator columns | Retained all 4 ID columns throughout |
| X = 1970 col, y = 1971 col (meaningless) | X = Year, y = Value per filtered country+indicator |
| OneHotEncoder on numeric data | Removed; not applicable |
| StandardScaler applied after model fit | Scaler now applied before fitting |
| Plots on future NaN columns (2080–2100) | Only historical years 1970–2023 used |
| No memory management | Chunked loading + dtype spec + del after concat |
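
To make the regression fixes above concrete, here is a minimal sketch with X = Year and y = Value, where scaling happens before the fit. Wrapping `StandardScaler` and `LinearRegression` in a scikit-learn pipeline is an assumption about how this could be done, not a description of the notebook's exact cells, and the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic series standing in for one filtered country + indicator.
rng = np.random.default_rng(0)
years = np.arange(1970, 2016)
values = 40.0 + 0.8 * (years - 1970) + rng.normal(0.0, 1.0, size=years.size)

# X must be 2-D for scikit-learn: one row per year, one feature (the year).
X = years.reshape(-1, 1).astype(float)
X_train, X_test, y_train, y_test = train_test_split(
    X, values, test_size=0.25, random_state=0
)

# The pipeline fits the scaler on the training split only, then the model,
# so scaling always happens before fitting and never sees test data.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

pred = model.predict(X_test)
r2 = r2_score(y_test, pred)
rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
print(f"R2={r2:.3f} RMSE={rmse:.3f}")
```

Scaling does not change the fit quality of an ordinary least-squares model with one feature, but keeping it inside the pipeline guarantees the same transform is applied at prediction time.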

📈 Sample Visualizations

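| Plot | File |
| --- | --- |
| Global average indicator trend | `plot_global_trend.png` |
| Top 10 countries by data volume | `plot_top_countries.png` |
| Value distribution (histogram + KDE) | `plot_distribution.png` |
| Regression: actual vs predicted | `plot_regression.png` |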

⚠️ Memory Safety Notes

This dataset is large (~300 MB CSV, 886k rows). The notebook is designed to handle it safely:

CSV is loaded in 50,000-row chunks, which limits peak memory during parsing
dtype is specified upfront, which reduces memory by ~40%
Only historical years (1970–2023) are used, avoiding mostly-NaN future columns
MySQL push is done in 10,000-row chunks, keeping each insert batch small
Excel export uses aggregated summaries only, not the raw 886k rows
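
The chunked loading described above could be sketched roughly as follows. The helper name `load_in_chunks`, the choice of `float32` for year columns, and the tiny inline CSV are assumptions for illustration; in the notebook the source would be the path stored in `FILE_PATH`.

```python
import io

import pandas as pd

ID_COLS = ["Country Name", "Country Code", "Indicator Name", "Indicator Code"]


def load_in_chunks(source, year_cols, chunk_rows=50_000):
    """Read only the needed columns, chunk by chunk, with compact dtypes."""
    reader = pd.read_csv(
        source,
        usecols=ID_COLS + year_cols,              # skip unused year columns
        dtype={col: "float32" for col in year_cols},
        chunksize=chunk_rows,
    )
    chunks = [chunk for chunk in reader]
    df = pd.concat(chunks, ignore_index=True)
    del chunks  # release the per-chunk frames once they are combined

    # Convert repeated text to categories after the concat; doing it per
    # chunk would fall back to object dtype when categories differ.
    for col in ID_COLS:
        df[col] = df[col].astype("category")
    return df


# Tiny stand-in for EdStatsData.csv; the rows and codes are made up.
sample_csv = io.StringIO(
    "Country Name,Country Code,Indicator Name,Indicator Code,1970,1971\n"
    "India,IND,Indicator A,IND.A,1.0,2.0\n"
    "Brazil,BRA,Indicator A,IND.A,,3.0\n"
    "Kenya,KEN,Indicator B,IND.B,4.0,\n"
)

df = load_in_chunks(sample_csv, ["1970", "1971"], chunk_rows=2)
print(df.dtypes)
```

Most of the saving comes from `usecols` and the narrower float type; the category conversion helps further because country and indicator names repeat across many rows.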

👤 Author

Astik Mohapatra
B.Tech Computer Science Engineering
Government College of Engineering Keonjhar (CGPA: 8.09, 2026)
📧 astikm7007@gmail.com
🔗 https://linkedin.com/in/astik-mohapatra
🐙 https://github.com/Astik97

📄 License

This project is licensed under the MIT License.
The dataset is provided by The World Bank under the Creative Commons Attribution 4.0 License (CC BY 4.0).

🙏 Acknowledgements

The World Bank — for providing open education data
Kaggle — for hosting the dataset
IBM / Coursera — inspiration from IBM Data Science curriculum
