As a Site Reliability Engineer at Zefr, you'll apply your expertise in cloud infrastructure, CI/CD, observability, and core SRE concepts to deliver high-quality, reliable solutions. You'll work closely with the Engineering and Data Science teams to keep the infrastructure behind our services robust, efficient, and scalable.
Requirements
- 7+ years of experience designing, managing, deploying, and supporting cloud infrastructure in a production environment on major public cloud providers (GCP experience is a huge bonus)
- Knowledge of GitOps, including an understanding of modern CI/CD pipelines, techniques, and technologies (GitHub Actions, GitLab, CircleCI, Argo CD, Flux)
- Proficiency with IaC and configuration management tools (Terraform, Terragrunt, OpenTofu, Crossplane, Pulumi)
- Production experience architecting, managing, deploying, and supporting container-based workloads on Kubernetes clusters
- Strong problem-solving skills, with a focus on automation
- Proven track record of building and scaling reliability practices, including SLO/SLI frameworks, incident management, and capacity planning
- Extensive production experience with observability platforms and practices (Prometheus, Grafana, Chronosphere, Datadog, OpenTelemetry); ability to design monitoring strategies for complex distributed systems
- Knowledge of cloud networking (Mesh, NAT, Load Balancers, API Gateways, proxies, etc.), cloud security, and cost optimization strategies
- Strong written and verbal communication, organization, and documentation skills
Benefits
- Flexible PTO
- Medical, dental, and vision insurance with FSA options
- Company-paid life insurance
- Paid parental leave
- 401(k) with company match
- Professional development opportunities
- 10+ paid holidays
- Summer Fridays (we leave early)
- In-office, hybrid, and fully-remote work options available
- In-office lunches and lots of free food
- Optional in-person and virtual events (we like to celebrate!)
