The Data Engineer is responsible for designing, building, and maintaining the infrastructure and systems required to collect, store, and process large datasets efficiently.
Education: Bachelor's degree in Computer Science with 8+ years of experience
Experience:
- Technical Skills
- Programming Languages: Proficiency in Python, SQL, Java, or Scala for data manipulation and pipeline development.
- Data Processing Frameworks: Experience with tools like Apache Spark, Hadoop, or Apache Kafka for large-scale data processing.
- Data Systems and Platforms
- Databases: Knowledge of both relational databases (e.g., MySQL, PostgreSQL) and NoSQL databases (e.g., MongoDB, Cassandra).
- Data Warehousing: Experience with platforms like Snowflake, Amazon Redshift, and Azure Synapse.
- Cloud Platforms: Familiarity with AWS and Azure for deploying and managing data pipelines; strong experience with Microsoft Fabric is advantageous.
- Experience working with distributed computing systems such as Hadoop HDFS, Hive, or Spark.
- Managing and optimizing data lakes and Delta Lake tables for structured and unstructured data.
- Data Modeling and Architecture
- Expertise in designing efficient data models (e.g., star schema, snowflake schema) and maintaining data integrity (see the star-schema sketch after this list).
- Understanding of modern data architectures like Data Mesh or Lambda Architecture.
- Data Pipeline Development
- Building and automating ETL/ELT pipelines that extract data from diverse sources, transform it, and load it into target systems (see the ETL sketch after this list).
- Monitoring and troubleshooting pipeline performance and failures.
- Workflow Orchestration
- Hands-on experience with orchestration tools such as Azure Data Factory, AWS Glue, AWS DMS, or Prefect to schedule and manage workflows.
- Version Control and CI/CD
- Utilizing Git for version control and implementing CI/CD practices for data pipeline deployments (a pytest sketch that a CI stage could run appears after this list).
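To make the star-schema item above concrete, here is a minimal sketch that creates one dimension table and one fact table with Spark SQL. The table and column names (dim_customer, fact_sales, and so on) are illustrative assumptions, and the sketch assumes a Delta-enabled Spark session such as Databricks or Fabric provides.

```python
from pyspark.sql import SparkSession

# Minimal star-schema sketch: one dimension table plus a fact table
# keyed to it. All table and column names are hypothetical.
spark = SparkSession.builder.appName("star-schema-sketch").getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS dim_customer (
        customer_key BIGINT,      -- surrogate key
        customer_id  STRING,      -- natural/business key
        name         STRING,
        region       STRING
    ) USING DELTA
""")

spark.sql("""
    CREATE TABLE IF NOT EXISTS fact_sales (
        sale_id      BIGINT,
        customer_key BIGINT,      -- foreign key into dim_customer
        sale_date    DATE,
        amount       DECIMAL(18, 2)
    ) USING DELTA
""")
```

Keeping measures and foreign keys in the narrow fact table, and descriptive attributes in the dimensions, is what makes the joins and aggregations in a star schema predictable.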
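The ETL/ELT item above can be illustrated with a minimal PySpark extract-transform-load sketch. The paths, column names, and file format here are hypothetical placeholders; a real pipeline would take them from configuration, and the Delta write assumes a Delta-enabled session.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Extract: read raw CSV files from a hypothetical landing zone.
raw = spark.read.option("header", True).csv("/landing/orders/")

# Transform: enforce types, drop rows missing key fields, stamp the load date.
clean = (
    raw.withColumn("amount", F.col("amount").cast("decimal(18,2)"))
       .withColumn("order_date", F.to_date("order_date", "yyyy-MM-dd"))
       .dropna(subset=["order_id", "amount"])
       .withColumn("load_date", F.current_date())
)

# Load: append the cleaned rows into a curated Delta table.
clean.write.format("delta").mode("append").save("/curated/orders/")
```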
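For the CI/CD item above, a common practice is to factor pipeline logic into plain functions and unit-test them on every commit. The pytest sketch below runs against a local Spark session; add_load_date is a hypothetical helper standing in for real pipeline code.

```python
# test_transforms.py -- a unit test that a CI stage (e.g., GitHub Actions
# or Azure DevOps) could run on every commit. Requires pyspark and pytest.
import pytest
from pyspark.sql import SparkSession, functions as F

def add_load_date(df):
    # Hypothetical pipeline helper under test.
    return df.withColumn("load_date", F.current_date())

@pytest.fixture(scope="session")
def spark():
    # A small local session is enough for transform-level tests.
    return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()

def test_add_load_date_adds_column(spark):
    df = spark.createDataFrame([(1, "a")], ["id", "value"])
    result = add_load_date(df)
    assert "load_date" in result.columns
    assert result.count() == 1
```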
Key Skills:
- Proficiency in programming languages such as Python, SQL, and optionally Scala or Java.
- Proficiency in data processing frameworks like Apache Spark and Hadoop is crucial for handling large-scale and real-time data (see the streaming sketch after this list).
- Expertise in ETL/ELT tools such as Azure Data Factory (ADF) and Fabric Data Pipelines is important for creating efficient and scalable data pipelines.
- A solid understanding of database systems, including relational databases like MySQL and PostgreSQL, as well as NoSQL solutions such as MongoDB and Cassandra, is fundamental.
- Experience with cloud platforms, including AWS and Azure, and their data-specific services such as Amazon S3, Amazon Redshift, and Azure Data Factory, is highly valuable.
- Data modeling skills, including designing star or snowflake schemas, and knowledge of modern architectures like Lambda and Data Mesh, are critical for building scalable solutions.
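As one illustration of the real-time processing mentioned above, the following Spark Structured Streaming sketch reads events from Kafka and lands them in a Delta table. The broker address, topic, and paths are placeholders, and the job assumes the spark-sql-kafka connector package is on the classpath.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# Read a stream of events from a hypothetical Kafka topic.
events = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("subscribe", "events")
         .load()
)

# Kafka delivers the payload as bytes; cast it to a string before parsing.
parsed = events.select(F.col("value").cast("string").alias("payload"))

# Write the stream to Delta with a checkpoint so the job can recover.
query = (
    parsed.writeStream.format("delta")
          .option("checkpointLocation", "/chk/events/")
          .start("/curated/events/")
)
query.awaitTermination()  # block until the streaming job is stopped
```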
Role and Responsibilities:
- Responsible for designing, developing, and maintaining data pipelines and infrastructure to support our data-driven decision-making processes.
- Design, build, and maintain data pipelines to extract, transform, and load data from various sources into our data warehouse and data lake.
- Proficient in Databricks: creating notebooks, working with catalogs and native SQL, creating clusters, parameterizing notebooks, and administering Databricks workspaces; define security models and assign roles as per requirements (see the parameterization sketch after this list).
- Responsible for creating data flows in Synapse Analytics: integrating external source systems, creating external tables, data flows, and data models, and scheduling the pipelines using jobs and triggers.
- Design and develop data pipelines using Fabric pipelines and Spark notebooks that access multiple data sources; proficient in developing Databricks notebooks and in data optimization.
- Develop and implement data models to ensure data integrity and consistency. Manage and optimize data storage solutions, including databases and data warehouses.
- Develop and implement data quality checks and validation procedures to ensure data accuracy and reliability (see the validation sketch after this list).
- Design and implement data infrastructure components, including data pipelines, data lakes, and data warehouses.
- Collaborate with data scientists, analysts, and other stakeholders to understand business requirements and translate them into technical solutions.
- Monitor Azure and Fabric data pipelines and Spark jobs, and deliver fixes according to request priority.
- Responsible for data monitoring activities; good knowledge of reporting tools such as Power BI and Tableau is required.
- Responsible for understanding client requirements and architecting solutions on both the Azure and AWS cloud platforms.
- Monitor and optimize data pipeline performance and scalability to ensure efficient data processing.
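For the notebook-parameterization responsibility above, a minimal Databricks sketch looks like the following. dbutils, spark, and display() are provided only inside a Databricks notebook, and the widget names, defaults, and paths are illustrative assumptions.

```python
# Widgets expose notebook parameters that a Databricks job, ADF, or
# Fabric pipeline can override at run time (names/defaults are examples).
dbutils.widgets.text("source_path", "/curated/orders/")
dbutils.widgets.text("run_date", "2024-01-01")

source_path = dbutils.widgets.get("source_path")
run_date = dbutils.widgets.get("run_date")

# Drive the read from the parameters so one notebook serves many runs
# and environments.
df = spark.read.format("delta").load(source_path)
df_for_day = df.where(df.load_date == run_date)
display(df_for_day)  # display() is a Databricks notebook built-in
```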
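The data quality responsibility above can be sketched as a set of explicit rule checks that raise on violation, so the orchestrator marks the run as failed and alerts on it. The table path, column names, and rules below are hypothetical examples.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-sketch").getOrCreate()

# Load the curated table to validate (placeholder path).
df = spark.read.format("delta").load("/curated/orders/")

# Rule 1: key columns must not be null.
null_keys = df.filter(F.col("order_id").isNull()).count()

# Rule 2: amounts must be non-negative.
bad_amounts = df.filter(F.col("amount") < 0).count()

# Rule 3: the business key must be unique.
dupes = df.count() - df.dropDuplicates(["order_id"]).count()

failures = {"null_keys": null_keys, "bad_amounts": bad_amounts, "dupes": dupes}
if any(v > 0 for v in failures.values()):
    # Failing loudly lets the scheduler surface the run as failed.
    raise ValueError(f"Data quality checks failed: {failures}")
```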