John Gitau
@johngitau
Skilled Data Engineer passionate about building high-throughput data pipelines.
What I'm looking for
I am a skilled Data Engineer with a passion for architecting and building large-scale, high-throughput data pipelines. My experience includes designing batch and real-time systems, optimizing data retrieval and storage, and delivering intuitive analytics tools that drive business value. I have a proven track record of implementing Apache Spark-based ETL pipelines that process large-scale public health datasets, significantly reducing runtime and enhancing data transformation efficiency.
In my current role at Global Programs for Research and Training, I engineered a data lake architecture for storing and retrieving anonymized patient data, which laid the groundwork for future AWS S3 integration. I have also automated data validation processes, ensuring compliance with data governance standards. My adaptability across different programming languages, particularly Python and Scala, has allowed me to contribute effectively to various projects, streamlining deployment pipelines and enhancing overall data engineering practices.
Experience
Work history, roles, and key accomplishments
Data Engineer
Global Programs for Research and Training Affiliate of The U
Jul 2022 - Present (2 years 10 months)
Designed and implemented Apache Spark-based ETL pipelines for large-scale public health datasets, optimizing job runtime and enabling near-real-time insights. Engineered a data lake architecture using S3-compatible storage and automated data validation, reducing errors. Contributed to Scala-based Spark jobs and streamlined deployment pipelines by integrating automated testing.
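Below is a minimal sketch of the kind of Spark ETL described above: raw CSVs from an S3-compatible landing zone are validated and written to a partitioned Parquet data lake. The paths, column names, and validation rule are illustrative placeholders, not the production pipeline.

```python
# Illustrative PySpark ETL: raw CSV -> validated, partitioned Parquet.
# Paths, column names, and the validation rule are hypothetical examples.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("public-health-etl-sketch").getOrCreate()

raw = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv("s3a://landing-zone/surveillance/*.csv")  # S3-compatible object storage
)

# Basic automated validation: drop rows missing key fields and report the count.
validated = raw.dropna(subset=["facility_id", "report_date"])
rejected = raw.count() - validated.count()
print(f"Rejected {rejected} rows failing validation")

# Write to the data lake, partitioned by report date for efficient retrieval.
(
    validated
    .withColumn("report_date", F.to_date("report_date"))
    .write
    .mode("overwrite")
    .partitionBy("report_date")
    .parquet("s3a://data-lake/curated/surveillance/")
)
```

Partitioning the curated layer by report date keeps downstream queries scanning only the dates they need, which is one common way such pipelines reduce job runtime.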
Mid-Level Business Intelligence Developer
Global Programs for Research and Training Affiliate of the U
Jul 2021 - Jun 2022 (11 months)
Authored complex SQL-based ETL stored procedures to flatten multi-source surveillance tables, significantly improving query performance. Pioneered the use of Apache Spark with Airflow for automated daily data ingestion, achieving sub-hourly SLAs. Assisted in proof-of-concept migration to AWS Glue and developed reusable Python modules for data validation and transformation.
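As an illustration of the Spark-plus-Airflow pattern mentioned above, here is a minimal daily-ingestion DAG; the DAG id, schedule, connection, and script path are hypothetical placeholders rather than the actual production configuration.

```python
# Illustrative Airflow DAG: schedule a daily Spark ingestion job.
# DAG id, schedule, connection, and paths are hypothetical placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="daily_surveillance_ingestion",
    start_date=datetime(2021, 7, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
) as dag:
    ingest = SparkSubmitOperator(
        task_id="spark_ingest",
        application="/opt/jobs/ingest_surveillance.py",  # hypothetical job script
        conn_id="spark_default",
        application_args=["--run-date", "{{ ds }}"],  # Airflow-templated run date
    )
```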
Internship
Global Programs for Research and Training
Apr 2021 - Jun 2021 (2 months)
Conducted routine Moodle data analysis for over 5000 users and served as a Moodle learning management system administrator. Wrote ETL scripts using an in-house SQL tool to load data into analysis databases. Developed a Power BI dashboard to track course uptake and provided technical assistance on the learning management system.
Education
Degrees, certifications, and relevant coursework
KCA University
Bachelor of Science, Information Technology
2017 - 2021
Completed the Bachelor of Science in Information Technology program from September 2017 to April 2021.