Crunchafi is looking for a Site Reliability Engineer to ensure the availability, performance, and scalability of our cloud-based SaaS platform. This role bridges software engineering and operations.
Responsibilities
- Design, build, and maintain scalable and resilient infrastructure on Microsoft Azure to support production SaaS workloads
- Define and track service level objectives (SLOs), service level indicators (SLIs), and error budgets to drive reliability decisions
- Build and maintain comprehensive monitoring, alerting, and observability systems to ensure early detection of issues
- Develop and maintain CI/CD pipelines using GitHub Actions to enable safe, rapid, and repeatable deployments
- Lead incident response and on-call rotations, conduct blameless post-incident reviews, and drive follow-up action items to completion
- Automate operational tasks and eliminate toil through scripting, infrastructure-as-code, and self-healing systems
- Manage and optimize Azure Kubernetes Service (AKS) clusters, container orchestration, and related networking and storage configurations
- Collaborate with software engineering teams to embed reliability into application architecture, including capacity planning, load testing, and chaos engineering
- Maintain and improve infrastructure-as-code using tools such as Terraform, Bicep, or ARM templates
- Partner cross-functionally with Product, Support, and Quality to reduce friction and accelerate delivery
Benefits
- Competitive salary
- Health, dental, and vision plans
- 401(k) retirement savings plan for US-based employees
- Unlimited PTO
- Significant opportunities for professional development and growth
