Site Reliability Engineer opportunity to build a reliability practice from the ground up, establish engineering standards, and ensure AI workloads meet a high bar for reliability. You will have the autonomy to set strategy and the trust to execute it.
Requirements
- Define SLIs and SLOs for critical user journeys
- Run live production incident response as an Incident Commander
- Build observability that tells a story
- Design error-budget policies
- Deep technical expertise in AWS
- Mentor engineers through pair debugging and postmortem coaching
- Communicate scheduled downtime and infrastructure changes to stakeholders proactively
- Act as the recognized subject-matter expert for AWS-related questions
Benefits
- Paid Time Off
- 401(k) Matching
- Retirement Plan
- Generous Parental Leave
