Core Responsibilities
Team Leadership & Operational Management
- Run the daily operations of the SRE practice: team planning, shift assignments, escalation routing, and workload balancing.
- Maintain a healthy on-call program: define rotation rules, track fatigue, ensure coverage, and continuously improve response maturity.
- Oversee incident management processes—ensuring consistent triage, high-quality postmortems, and follow-through on remediation work.
- Establish operational KPIs for the team (MTTA, MTTR, on-call load, ticket aging, toil reduction) and hold the team accountable for meeting them.
- Coach and develop SREs at all levels through 1:1s, technical guidance, and structured growth plans.
- Ensure the team’s processes, documentation, and runbooks stay current and are audited regularly.
Technical Oversight
- Provide architecture-level guidance on resilience, observability, and reliability patterns; step in directly when the team is blocked or customer-impacting work demands senior technical judgment.
- Validate SLIs/SLOs and error budgets across services; ensure consistent implementation and reporting.
- Review and approve reliability design work—monitoring strategies, automation initiatives, CI/CD changes, deployment safety controls, and cloud cost/performance optimizations.
- Participate in high-severity incidents as the escalation point and technical lead when needed.
- Ensure engineering quality for IaC, CI/CD, observability instrumentation, and Kubernetes platform operations.
Cross-Functional Leadership
- Act as the primary point of contact for internal stakeholders (Dev, Product, Architecture, Cloud) regarding reliability strategy and prioritization.
- Translate business priorities into reliability roadmaps, staffing plans, and operational improvements.
- Align teams around shared reliability objectives—ensuring corrective actions, automation priorities, and capacity planning are actually executed.
- Support customer-facing conversations when reliability posture, operational processes, or technical improvements require leadership representation.
Required Qualifications
- 6–10 years in SRE/Operations/Platform roles, with at least 2 years leading or managing engineers.
- Hands-on technical background across cloud platforms (AWS/Azure/GCP) and Kubernetes.
- Experience defining and operating SLIs/SLOs, incident response, and postmortem programs.
- Strong grounding in Terraform or similar IaC, CI/CD systems, and observability technologies (Prometheus, Grafana, OpenTelemetry, ELK).
- Ability to assess technical work, coach engineers through complex problems, and make informed trade-offs under pressure.
- Excellent operational judgment: triage, prioritization, team load balancing, and process design.
- Professional-level cloud provider certification in AWS (Solutions Architect), Azure (Solutions Architect Expert), GCP (Professional Cloud Architect), or Oracle Cloud (Architect Professional).
Nice-to-Have
- Prior experience running a distributed or follow-the-sun SRE practice.
- Exposure to chaos engineering, fault injection, or reliability stress testing.
- Familiarity with cloud cost governance and rightsizing strategies.
- Experience improving or scaling on-call systems.
