- Design, build, and maintain reliable, scalable, enterprise-level distributed transactional data processing systems that scale the existing business and support new business initiatives
- Optimize jobs to use Kafka, Hadoop, Presto, Spark, and Kubernetes resources as efficiently as possible
- Monitor and provide transparency into data quality across systems (accuracy, consistency, completeness, etc.)
- Increase accessibility and effectiveness of data (work with analysts, data scientists, and developers to build/deploy tools and datasets that fit their use cases)
- Collaborate within a small team with diverse technology backgrounds
- Provide mentorship and guidance to junior team members
- Ingest, validate, and process internal and third-party data
- Create, maintain, and monitor data flows in Python, Spark, Hive, SQL, and Presto for consistency, accuracy, and lag time
- Maintain and enhance the framework for jobs (primarily aggregate jobs in Spark and Hive)
- Create consumers for Kafka data using Spark Streaming for near-real-time aggregation (see the sketch after this list)
- Evaluate tools
- Manage backups, retention, high availability, and capacity planning
- Review and approve database DDL, Hive framework jobs, and Spark Streaming jobs to make sure they meet our standards
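As a rough illustration of the near-real-time aggregation consumers mentioned above, here is a minimal PySpark Structured Streaming sketch that reads a Kafka topic and computes windowed aggregates. The broker address, topic name, event schema, and console sink are placeholder assumptions rather than the production setup, and the Kafka connector package must be available on the Spark classpath.

```python
# Minimal sketch: consume a JSON "events" topic from Kafka and aggregate it
# in 5-minute windows. Topic, schema, and sink are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import (DoubleType, StringType, StructField,
                               StructType, TimestampType)

spark = SparkSession.builder.appName("near-real-time-aggregation").getOrCreate()

event_schema = StructType([
    StructField("event_type", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Read the topic as a stream and parse the JSON payload.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    .load()
    .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

# Count events and sum amounts per event type in 5-minute windows,
# tolerating 10 minutes of late-arriving data.
aggregated = (
    events.withWatermark("event_time", "10 minutes")
    .groupBy(F.window("event_time", "5 minutes"), "event_type")
    .agg(F.count("*").alias("events"), F.sum("amount").alias("total_amount"))
)

# A production job would write to Hive/HDFS or back to Kafka; the console
# sink is only for the sketch.
query = (
    aggregated.writeStream.outputMode("update")
    .format("console")
    .option("checkpointLocation", "/tmp/checkpoints/events-agg")
    .trigger(processingTime="1 minute")
    .start()
)
query.awaitTermination()
```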
Tech Stack
- Python - primary repo language
- Airflow/Luigi - for job scheduling (a minimal DAG sketch follows this list)
- Docker - packaged container images with all dependencies
- Graphite - for monitoring data flows (see the metrics example after this list)
- Hive - SQL data warehouse layer for data in HDFS
- Kafka - distributed commit log storage
- Kubernetes - distributed cluster resource manager
- Presto/Trino - fast parallel data warehouse and data federation layer
- Spark Streaming - near-real-time aggregation
- SQL Server - reliable OLTP RDBMS
- Apache Iceberg - open table format for large analytic tables
- GCP - BigQuery for performance, Looker for dashboards
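On the scheduling side, a daily Spark aggregate job might be wired up roughly as in the Airflow sketch below; the DAG id, schedule, and spark-submit command are illustrative assumptions rather than the actual pipeline.

```python
# Hypothetical Airflow DAG that schedules a daily Spark aggregate job.
# DAG id, schedule, and the spark-submit command are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-eng",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="daily_event_aggregates",
    default_args=default_args,
    schedule_interval="0 6 * * *",  # once a day at 06:00
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    # Submit the Spark job for the execution date's partition.
    run_aggregates = BashOperator(
        task_id="run_spark_aggregates",
        bash_command=(
            "spark-submit --master yarn jobs/daily_aggregates.py --date {{ ds }}"
        ),
    )
```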
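For monitoring, data flows can report freshness and lag directly to Graphite via Carbon's plaintext protocol (a `metric value timestamp` line sent to port 2003); the host and metric path below are assumptions for illustration.

```python
# Push a single data-flow lag datapoint to Graphite using the plaintext
# protocol. Host and metric path are illustrative placeholders.
import socket
import time


def send_to_graphite(metric_path: str, value: float,
                     host: str = "graphite.internal", port: int = 2003) -> None:
    """Send one datapoint as a '<path> <value> <timestamp>' line to Carbon."""
    line = f"{metric_path} {value} {int(time.time())}\n"
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


# Example: report how far behind the latest processed partition is, in minutes.
send_to_graphite("data.pipelines.events.lag_minutes", 12.5)
```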
Requirements
- 6+ years of data engineering experience
- Fluency in Python and SQL
- Strong recent Spark experience
- Experience working in on-prem environments
- Hadoop and Hive experience
- Experience in Scala/Java is a plus (Polyglot programmer preferred!)
- Proficiency in Linux
- Strong understanding of RDBMS and query optimization
- Passion for engineering and the computer science behind data
- East Coast U.S. hours 9am-6pm EST; you can work fully remotely
- Notice period must be 2 months or less
- Knowledge of and exposure to distributed production systems, e.g., Hadoop
- Knowledge of and exposure to cloud migration (AWS/GCP/Azure) is a plus
- We can hire as an FTE in the U.S., UK, and Netherlands
- We can hire as a long-term contractor (independent or B2B) in most other countries