Company Overview
[$COMPANY_OVERVIEW]
Role Overview
We are looking for a seasoned Site Reliability Engineer (SRE) Manager to lead our SRE team at [$COMPANY_NAME]. In this critical role, you will be responsible for overseeing the reliability, availability, and performance of our production systems, while driving best practices in operational excellence and team development. Your leadership will be pivotal in aligning SRE principles with our organizational goals, ensuring that we deliver high-quality, reliable services to our customers.
Responsibilities
- Lead and mentor a team of Site Reliability Engineers, fostering a culture of collaboration, innovation, and continuous improvement
- Develop and implement strategies to enhance system reliability, performance, and scalability, using metrics and monitoring tools
- Collaborate closely with development teams to define SLAs, SLOs, and SLIs, ensuring alignment with business objectives
- Oversee incident response processes, ensuring effective communication and resolution of production issues
- Drive the adoption of automation and infrastructure as code practices to streamline operational workflows
- Participate in on-call rotations and develop a robust incident management framework to minimize downtime
Required and Preferred Qualifications
Required:
- 5+ years of experience in Site Reliability Engineering or related fields, with a proven track record of managing teams
- Strong understanding of cloud infrastructure (AWS, Azure, or GCP) and of containerization and orchestration technologies (Docker, Kubernetes)
- Hands-on experience with monitoring and logging tools such as Prometheus, Grafana, ELK Stack, or similar
- Demonstrated ability to drive operational excellence through automation and process improvement
Preferred:
- Experience with infrastructure as code tools like Terraform or CloudFormation
- Familiarity with CI/CD pipelines and tools (Jenkins, GitLab CI/CD, etc.)
- Knowledge of incident management frameworks and ITIL best practices
- Previous experience in a leadership role within a fast-paced tech environment
Technical Skills and Relevant Technologies
- Expertise in Linux/Unix system administration and troubleshooting
- Proficiency in scripting languages such as Python, Go, or Bash
- In-depth understanding of networking concepts and protocols (TCP/IP, DNS, HTTP, etc.)
- Experience with database management systems (SQL and NoSQL)
Soft Skills and Cultural Fit
- Exceptional leadership and team management skills, with a focus on developing talent
- Effective communication skills to convey complex technical concepts to non-technical stakeholders
- Strong problem-solving abilities and the capacity to work under pressure
- Passion for building reliable systems and improving the user experience
- A collaborative mindset with a strong belief in the importance of teamwork
Benefits and Perks
Annual salary range: [$SALARY_RANGE]
Additional benefits may include:
- Comprehensive health, dental, and vision insurance
- 401(k) plan with company matching
- Generous paid time off and holidays
- Professional development opportunities
- Flexible work hours and a supportive work environment
Equal Opportunity Statement
[$COMPANY_NAME] is committed to fostering a diverse and inclusive workplace. We are an Equal Opportunity Employer, and all qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, disability, veteran status, or any other characteristic protected by law.
Location
This is an in-person role based at our headquarters in [$COMPANY_LOCATION].
We encourage candidates from diverse backgrounds and experiences to apply, even if they do not meet every requirement listed. We value unique perspectives and believe that they contribute to our innovation and growth.
