What will you do?
- Design, deploy, and operate Kubernetes clusters across AWS, Azure, and GCP. Optimize cluster performance, ensure high availability, and implement robust security practices.
- Build and maintain cloud-native infrastructure components (load balancers, networking, storage, etc.) to support applications running on Kubernetes. Leverage Infrastructure as Code (IaC) with Terraform to automate and manage infrastructure provisioning and configuration.
- Embrace GitOps principles using ArgoCD to automate deployments and configuration changes and ensure consistency between the desired and actual system state.
- Establish comprehensive monitoring, logging, and alerting systems to gain insights into platform health and performance. Troubleshoot incidents swiftly and apply SRE principles to improve reliability and resilience.
- Develop automation scripts and tools (Python, Go, or other languages) to streamline workflows, eliminate manual tasks, and reduce operational overhead.
- Partner closely with development teams to understand their needs, provide guidance on platform best practices, and enable smooth integration and deployment of their applications.
- Implement and maintain stringent security measures for Kubernetes and cloud environments, ensuring compliance with industry standards and data protection regulations.
- Analyze resource usage and implement optimization strategies to maximize performance while controlling cloud costs.
- Participate in an on-call rotation, troubleshooting and resolving production issues promptly.
What makes you a match?
- 3+ years of experience working with Kubernetes in production environments. Deep understanding of cluster operations, networking, storage, and security within Kubernetes.
- Strong knowledge of AWS, Azure, and GCP, including core services, networking concepts, and security best practices.
- Proven experience implementing GitOps workflows with ArgoCD and managing infrastructure using Terraform.
- Fluency in at least one programming language (such as Python, Go, or Java) for automation, scripting, and tool development.
- Familiarity with SRE practices such as SLOs (Service Level Objectives), error budgets, and blameless postmortems.
- Excellent analytical and troubleshooting skills to identify and resolve issues in complex cloud environments.
- Ability to communicate effectively with development, operations, and security teams to drive cross-functional initiatives.
- Ability to work from 8:30 PM to 5:30 AM IST to provide coverage for US time zones.