A complete end-to-end data analysis pipeline on the World Bank Education Statistics dataset β covering data loading, cleaning, EDA, statistical analysis, visualization, MySQL storage, and Excel dashboard export.
This project analyzes global education indicators across 886,930 rows and 69 columns spanning years 1970 to 2075, covering countries worldwide. The goal is to extract meaningful insights on education trends β literacy rates, enrollment rates, government expenditure, and more β using a full data pipeline built in Python.
| Property | Details |
|---|---|
| Name | Education Statistics (EdStatsData.csv) |
| Source | World Bank via Kaggle |
| Kaggle Link | https://www.kaggle.com/datasets/theworldbank/education-statistics |
| Provider | The World Bank Open Data |
| Size | ~300 MB (886,930 rows Γ 69 columns) |
| Coverage | 200+ countries, 4,000+ indicators, 1970β2075 |
| License | World Bank Open Data License (CC BY 4.0) |
β οΈ The raw CSV file is NOT included in this repository due to its large size.
Download it directly from Kaggle using the link above.
| Column | Description |
|---|---|
Country Name |
Full name of the country |
Country Code |
ISO 3-letter country code |
Indicator Name |
Full name of the education indicator |
Indicator Code |
World Bank indicator code |
1970 β 2075 |
Indicator value for each year |
EdStatsAnalysis/
β
βββEdStatsCode.ipynb # Main analysis notebook (11 sections)
βββ EdStats_Dashboard_Data.xlsx # Excel dashboard (3 summary sheets)
βββ README.md # Project documentation
βββ .gitignore # Excludes CSV, cache, checkpoints
β
βββ plot_global_trend.png # Global average value trend (1970β2023)
βββ plot_top_countries.png # Top 10 countries by data availability
βββ plot_distribution.png # Value distribution histogram + KDE
βββ plot_regression.png # Linear regression β actual vs predicted
Kaggle CSV (886k rows)
β
Chunked Memory-Safe Loading (50k rows/chunk)
β
Data Cleaning & Null Handling
β
Wide β Long Format Reshape (melt)
β
NumPy Statistical Analysis
β
Matplotlib & Seaborn Visualizations
β
Linear Regression (sklearn)
β
MySQL Storage (chunked push, 10k rows/chunk)
β
Excel Dashboard Export (3 sheets)
| Section | Description |
|---|---|
| 1 | Import libraries |
| 2 | Memory-safe chunked CSV loading with dtype optimization |
| 3 | Basic EDA β shape, dtypes, missing values |
| 4 | Wide β Long format reshape using melt() |
| 5 | NumPy statistical analysis (mean, median, std, percentiles, decade stats) |
| 6 | Reusable filter helper functions |
| 7 | 5 visualizations β trend, bar, distribution, country filter, heatmap |
| 8 | Linear regression with StandardScaler, RΒ² and RMSE evaluation |
| 9 | Excel export β 3 summarized sheets for dashboard |
| 10 | MySQL push β chunked, memory-safe |
| 11 | Power BI / Excel dashboard guide |
| Sheet | Contents | Best Visual in Excel |
|---|---|---|
Country_Year_Avg |
Average value per country per year | Line chart with country slicer |
Indicator_Summary |
Data count, mean, std per indicator | Bar chart β top indicators |
Decade_Country_Avg |
Average value per country per decade | PivotTable matrix / heatmap |
| Tool | Purpose |
|---|---|
| Python 3.10+ | Core programming language |
| Pandas | Data loading, cleaning, reshaping |
| NumPy | Vectorized statistical computations |
| Matplotlib | Static plotting |
| Seaborn | Statistical visualizations |
| Scikit-learn | Linear regression, scaling, train-test split |
| SQLAlchemy + PyMySQL | MySQL database connection |
| OpenPyXL | Excel file export |
| MySQL 8.0 | Relational database storage |
| Jupyter Notebook | Interactive development environment |
git clone https://github.com/yourusername/edstats-analysis.git
cd edstats-analysispip install pandas numpy matplotlib seaborn scikit-learn sqlalchemy pymysql openpyxl jupyter- Go to: https://www.kaggle.com/datasets/theworldbank/education-statistics
- Download
EdStatsData.csv - Place it anywhere on your system and update
FILE_PATHin Section 2 of the notebook
CREATE DATABASE ed_stats_data_db;
USE ed_stats_data_db;Then update Section 10 in the notebook with your MySQL credentials.
jupyter notebook EdStatsCode.ipynbRun cells top to bottom. Section 10 (MySQL push) is optional β set PUSH_TO_MYSQL = True only when MySQL is configured.
| Original Issue | Fix Applied |
|---|---|
| Dropped Country/Indicator columns | Retained all 4 ID columns throughout |
| X = 1970 col, y = 1971 col (meaningless) | X = Year, y = Value per filtered country+indicator |
| OneHotEncoder on numeric data | Removed β not applicable |
| StandardScaler applied after model fit | Scaler now applied before fitting |
| Plots on future NaN columns (2080β2100) | Only historical years 1970β2023 used |
| No memory management | Chunked loading + dtype spec + del after concat |
| Plot | File |
|---|---|
| Global average indicator trend | ![]() |
| Top 10 countries by data volume | ![]() |
| Value distribution (histogram + KDE) | ![]() |
| Regression β actual vs predicted | ![]() |
This dataset is large (~300MB CSV, 886k rows). The notebook is designed to handle it safely:
- CSV loaded in 50,000 row chunks β prevents RAM crash
dtypespecified upfront β reduces memory by ~40%- Only historical years (1970β2023) used β avoids mostly-NaN future columns
- MySQL push done in 10,000 row chunks β stable on any system
- Excel export uses aggregated summaries only β not raw 886k rows
Astik Mohapatra
B.Tech Computer Science Engineering
Government College of Engineering Keonjhar (CGPA: 8.09, 2026)
π§ astikm7007@gmail.com
π https://linkedin.com/in/astik-mohapatra
π https://github.com/Astik97
This project is licensed under the MIT License.
The dataset is provided by The World Bank under the Creative Commons Attribution 4.0 License (CC BY 4.0).
- The World Bank β for providing open education data
- Kaggle β for hosting the dataset
- IBM / Coursera β inspiration from IBM Data Science curriculum



