
Data Engineering Projects

License: MIT Python Apache Spark SQL Apache Airflow

A comprehensive collection of data engineering projects, proofs-of-concept (POCs), and proofs-of-knowledge (POKs) demonstrating expertise in building scalable data pipelines, ETL/ELT solutions, and data platform architectures using Python, Spark, SQL, and cloud platforms.



🎯 About

This repository serves as a portfolio of my data engineering work, showcasing practical implementations of:

  • ETL/ELT Pipelines: End-to-end data pipelines from ingestion to consumption
  • Big Data Processing: Large-scale data processing using distributed computing
  • Data Warehouse Design: Dimensional modeling and data warehouse architectures
  • Cloud Data Platforms: Cloud-native data solutions on AWS, Azure, and GCP
  • Data Quality & Governance: Data validation, monitoring, and governance frameworks
  • Real-time Processing: Stream processing and event-driven architectures

Experience Level: 6+ years in Data Engineering & Analytics


✨ Key Features

Data Pipeline Development

  • ✅ Batch and real-time data ingestion pipelines
  • ✅ Incremental and full-load strategies
  • ✅ Error handling and retry mechanisms
  • ✅ Pipeline orchestration and scheduling
  • ✅ Data quality validation and monitoring
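
As an illustration of the retry mechanisms listed above, here is a minimal sketch of retrying a flaky extraction step with exponential backoff. It is not code from this repository: the names `retry` and `flaky_extract`, the attempt count, and the delays are all hypothetical choices made for the example.

```python
import time


def retry(func, attempts=3, base_delay=0.01, sleep=time.sleep):
    """Call func, retrying on any exception with exponential backoff.

    The delay doubles after each failed attempt; the last failure is re-raised.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Hypothetical extraction step that fails twice before succeeding.
calls = {"n": 0}


def flaky_extract():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return ["row-1", "row-2"]


rows = retry(flaky_extract)
print(rows, "after", calls["n"], "calls")
```

Passing `sleep` as a parameter keeps the helper easy to test, since a test can substitute a no-op instead of waiting.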

Scalable Architecture

  • ✅ Distributed data processing with Apache Spark
  • ✅ Cloud-native architectures (AWS, Azure, GCP)
  • ✅ Containerized applications with Docker
  • ✅ Infrastructure as Code (Terraform)
  • ✅ Microservices-based data platforms

Best Practices

  • ✅ Modular and reusable code
  • ✅ Comprehensive testing (unit, integration, E2E)
  • ✅ CI/CD pipeline integration
  • ✅ Detailed documentation
  • ✅ Performance optimization

🛠️ Technology Stack

Programming Languages

  • Python 3.9+ - Primary language for data engineering
  • SQL - Data transformation and querying
  • Bash - Shell scripting and automation

Data Processing

  • Apache Spark - Distributed data processing (PySpark, Spark SQL)
  • Pandas - Data manipulation and analysis
  • dbt - Data transformation framework
  • Apache Kafka - Stream processing

Orchestration & Workflow

  • Apache Airflow - Workflow orchestration and scheduling
  • Prefect - Modern workflow management

Cloud Platforms

  • AWS - S3, Glue, EMR, Lambda, Redshift, Athena
  • Azure - Data Factory, Databricks, Synapse Analytics
  • GCP - BigQuery, Dataflow, Cloud Composer

Data Warehouses

  • Snowflake - Cloud data warehouse
  • Amazon Redshift - AWS data warehouse
  • Google BigQuery - Serverless data warehouse
  • PostgreSQL - Relational database

📊 Projects

1. Scalable ETL Pipeline Framework

A production-ready ETL framework for ingesting data from multiple sources, transforming it, and loading it into a data warehouse.

Tech Stack: Python, Apache Airflow, Snowflake, AWS S3
Key Features:

  • Configurable data sources (APIs, databases, files)
  • Incremental data loading strategies
  • Data quality validation framework
  • Error handling and alerting
  • Performance monitoring and logging
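
One common way to implement incremental loading is a high-watermark on a last-modified column. The sketch below assumes that approach; the function `incremental_load`, the `updated_at` field, and the sample rows are hypothetical and are not taken from the framework itself.

```python
from datetime import datetime


def incremental_load(rows, watermark):
    """Return the rows changed after `watermark` and the advanced watermark."""
    new_rows = [r for r in rows if r["updated_at"] > watermark]
    # If nothing changed, the watermark stays where it was.
    new_watermark = max((r["updated_at"] for r in new_rows), default=watermark)
    return new_rows, new_watermark


# Hypothetical source table with a last-modified column.
SOURCE_ROWS = [
    {"id": 1, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "updated_at": datetime(2024, 1, 2)},
    {"id": 3, "updated_at": datetime(2024, 1, 3)},
]

batch, watermark = incremental_load(SOURCE_ROWS, datetime(2024, 1, 1))
print([r["id"] for r in batch], watermark)

# A second run with the stored watermark picks up nothing new.
next_batch, _ = incremental_load(SOURCE_ROWS, watermark)
print(next_batch)
```

In a real pipeline the watermark would be persisted between runs (for example in a control table) so that a restart resumes from the last successful load.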

2. Real-time Data Streaming Pipeline

A stream processing pipeline that handles real-time events using Apache Kafka and Spark Streaming.

Tech Stack: Apache Kafka, Apache Spark (Streaming), Python, AWS
Key Features:

  • Real-time event ingestion
  • Stream processing and aggregation
  • Windowing operations
  • Fault tolerance and exactly-once semantics
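
To make the windowing idea concrete, here is a sketch of a tumbling-window count in plain Python. It deliberately does not use the Kafka or Spark Streaming APIs; the event tuples, the 60-second window, and the function name `tumbling_window_counts` are assumptions for illustration only.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, key) using fixed, non-overlapping windows."""
    counts = defaultdict(int)
    for timestamp, key in events:
        # Align each timestamp to the start of the window that contains it.
        window_start = timestamp - (timestamp % window_seconds)
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical events as (epoch_seconds, event_type) pairs.
EVENTS = [(0, "click"), (30, "click"), (61, "view"), (119, "click"), (120, "click")]

result = tumbling_window_counts(EVENTS, 60)
for (window_start, key), count in sorted(result.items()):
    print(window_start, key, count)
```

A streaming engine applies the same bucketing continuously and adds concerns this sketch ignores, such as late-arriving events and state checkpointing.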

3. Data Lake Architecture

Implementation of a modern data lake architecture with multiple processing layers (Bronze, Silver, Gold).

Tech Stack: AWS S3, AWS Glue, Apache Spark, Python
Key Features:

  • Raw data ingestion (Bronze layer)
  • Cleansed and conformed data (Silver layer)
  • Business-ready aggregations (Gold layer)
  • Metadata management
  • Data cataloging and discovery
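
The layer responsibilities above can be sketched in a few lines of plain Python. This is an assumed, simplified illustration rather than the project's Spark or Glue code; the record fields and the functions `to_silver` and `to_gold` are hypothetical.

```python
def to_silver(bronze_records):
    """Cleanse raw records: normalize names, parse amounts, drop bad rows and duplicates."""
    seen = set()
    silver = []
    for rec in bronze_records:
        order_id = rec.get("order_id")
        customer = (rec.get("customer") or "").strip().lower()
        try:
            amount = float(rec.get("amount"))
        except (TypeError, ValueError):
            continue  # unparseable or missing amount
        if not order_id or not customer or order_id in seen:
            continue  # missing keys or duplicate order
        seen.add(order_id)
        silver.append({"order_id": order_id, "customer": customer, "amount": amount})
    return silver


def to_gold(silver_records):
    """Aggregate cleansed records into business-ready totals per customer."""
    totals = {}
    for rec in silver_records:
        totals[rec["customer"]] = totals.get(rec["customer"], 0.0) + rec["amount"]
    return totals


# Hypothetical raw (Bronze) records, kept exactly as ingested.
BRONZE = [
    {"order_id": "o1", "customer": " Alice ", "amount": "10.5"},
    {"order_id": "o1", "customer": "alice", "amount": "10.5"},  # duplicate
    {"order_id": "o2", "customer": "Bob", "amount": "n/a"},  # bad amount
    {"order_id": "o3", "customer": "alice", "amount": "4.5"},
]

silver = to_silver(BRONZE)
gold = to_gold(silver)
print(gold)
```

Keeping the Bronze data untouched means the Silver and Gold layers can be rebuilt whenever the cleansing rules change.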

4. Cloud Data Warehouse Migration

Migration of an on-premises database to a cloud data warehouse with minimal downtime.

Tech Stack: PostgreSQL, Snowflake, Python, dbt
Key Features:

  • Schema conversion and optimization
  • Data migration scripts
  • Validation and reconciliation
  • Performance tuning
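
Validation and reconciliation after a migration often comes down to comparing row counts and content checksums between source and target. The sketch below assumes that approach and uses two in-memory SQLite databases as stand-ins for PostgreSQL and Snowflake; the `orders` table and all function names are hypothetical.

```python
import hashlib
import sqlite3


def table_fingerprint(conn, table):
    """Return (row_count, sha256 of ordered rows) for a table.

    The table name is interpolated directly, so it must be a trusted constant.
    """
    rows = conn.execute(f"SELECT id, amount FROM {table} ORDER BY id").fetchall()
    digest = hashlib.sha256(repr(rows).encode("utf-8")).hexdigest()
    return len(rows), digest


def reconcile(source, target, table):
    """Compare source and target fingerprints for one table."""
    src_count, src_hash = table_fingerprint(source, table)
    tgt_count, tgt_hash = table_fingerprint(target, table)
    return {
        "row_count_match": src_count == tgt_count,
        "checksum_match": src_hash == tgt_hash,
    }


def make_db(rows):
    """Build an in-memory stand-in database with a small orders table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    return conn


source = make_db([(1, 100), (2, 250)])
good_target = make_db([(1, 100), (2, 250)])
bad_target = make_db([(1, 100), (2, 999)])

print(reconcile(source, good_target, "orders"))
print(reconcile(source, bad_target, "orders"))
```

Matching row counts alone would miss the corrupted value in `bad_target`, which is why the checksum comparison is included.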

5. Data Quality Monitoring Framework

Automated data quality monitoring system with configurable rules and alerting.

Tech Stack: Python, Great Expectations, Apache Airflow
Key Features:

  • Configurable quality rules
  • Automated validation checks
  • Quality scorecards and dashboards
  • Alerting and notifications
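
Configurable rules and a quality scorecard could look roughly like the following. This sketch is plain Python and does not use the Great Expectations API; the rule format, the 0.9 alert threshold, and the sample rows are assumptions made for the example.

```python
# Hypothetical rule configuration: each rule names a column and a per-value check.
RULES = [
    {"name": "id_not_null", "column": "id", "check": lambda v: v is not None},
    {
        "name": "age_in_range",
        "column": "age",
        "check": lambda v: v is not None and 0 <= v <= 120,
    },
]


def run_rules(rows, rules):
    """Return a scorecard mapping each rule name to its pass rate (0.0 to 1.0)."""
    report = {}
    for rule in rules:
        values = [row.get(rule["column"]) for row in rows]
        passed = sum(1 for v in values if rule["check"](v))
        report[rule["name"]] = passed / len(values) if values else 1.0
    return report


ROWS = [
    {"id": 1, "age": 34},
    {"id": None, "age": 29},
    {"id": 3, "age": 150},
    {"id": 4, "age": 41},
]

scorecard = run_rules(ROWS, RULES)
# Rules scoring below the (assumed) threshold would trigger an alert.
failing = [name for name, score in scorecard.items() if score < 0.9]
print(scorecard, failing)
```

Because the rules are data rather than code paths, new checks can be added without touching `run_rules`.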

🌟 Skills Demonstrated

Data Engineering Core

  • ETL/ELT pipeline design and development
  • Data modeling (dimensional, normalized, denormalized)
  • Data warehouse design and optimization
  • Big data processing with Spark
  • Stream processing and real-time analytics

Cloud & Infrastructure

  • Cloud-native architecture design
  • Infrastructure as Code (Terraform)
  • Containerization and orchestration
  • Serverless computing
  • Cost optimization strategies

Data Quality & Governance

  • Data validation frameworks
  • Data lineage tracking
  • Metadata management
  • Data cataloging
  • Quality metrics and monitoring

🚀 Getting Started

Prerequisites

  • Python 3.9 or higher
  • Docker (optional, for containerized environments)
  • Cloud account (AWS, Azure, or GCP) - for cloud-based projects
  • Apache Spark (for Spark-based projects)

Installation

  1. Clone the repository:

    git clone https://gitlab.com/fuadonates/data-engineering.git
    cd data-engineering
  2. Create and activate a virtual environment:

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:

    pip install -r requirements.txt

🤝 Contributing

While this is primarily a portfolio repository, contributions and suggestions are welcome!

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.


📧 Contact

Fuad - Senior Data Engineer & Data Analyst

LinkedIn GitHub Email


If you found this repository helpful, please give it a ⭐️!

Made with ❤️ by Fuad
