Articul8 AI is seeking an experienced Site Reliability Engineer (SRE) to join their team and help ensure the reliability, performance, and scalability of their GenAI SaaS platform.
Responsibilities
- Architect and maintain scalable, highly available infrastructure for our GenAI platform.
- Design and implement robust monitoring, alerting, and observability solutions to proactively ensure system health and performance.
- Automate deployment, scaling, and management of our cloud-native infrastructure, reducing toil and improving efficiency.
- Define, measure, and improve Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to deliver outstanding service quality.
- Participate in on-call rotations and provide rapid response to production incidents, minimizing downtime and user impact.
- Collaborate closely with development teams to build reliable, scalable, and efficient systems for complex AI workloads.
- Lead incident response efforts, conduct thorough post-mortems, and champion continuous improvement initiatives.
- Optimize infrastructure for performance, scalability, and cost-effectiveness, especially for high-demand AI workloads.
- Implement and enforce security best practices across all systems and environments.
- Create and maintain comprehensive documentation, including runbooks and knowledge base articles, to foster a culture of shared knowledge.
