We are looking for a Senior Site Reliability Engineer to join our team. As a Sr. SRE, you will be responsible for ensuring the reliability and performance of our platform, architecting and implementing scalable AWS cloud solutions, and fostering a culture of automation, resilience, and continuous improvement across our engineering teams.
Responsibilities
- Design and implement reliable, scalable, and efficient cloud infrastructure to support our global platform's growth
- Maintain and support highly loaded Kubernetes (EKS) clusters and infrastructure-related components
- Develop and continuously improve monitoring and logging systems using Prometheus, Datadog, and Loki stacks
- Participate in the on-call rotation to support the production environment and ensure rapid response to outages
- Lead incident response efforts, ensuring minimal service impact while documenting learnings and implementing preventive measures
- Collaborate with development teams to establish Service Level Objectives (SLOs) and ensure systems meet or exceed reliability targets
- Champion SRE best practices across engineering, mentoring teams on resiliency, performance optimization, and scalability
- Automate platform operations with infrastructure-as-code (Terraform) and configuration management tools
Benefits
- Remote First, Remote Always
- PTO in accordance with local labor requirements
- Free use of 2 corporate apartments for team members (San Diego & São Paulo)
- Monthly Wellness Fridays - enjoy an extra long weekend every month
- Fully Paid Parental Leave
- Home office stipend based on country of residency
- Professional development courses through Cloudbeds University
- Access to professional development, including manager training, upskilling and knowledge transfer
