This is a remote position.
We are looking for a Senior Site Reliability Engineer to support advanced AI platforms responsible for production-grade applications and pipelines. The role focuses on building and maintaining reliability, scalability, and operational excellence across multiple AI-driven systems.
The engineer will work on a central operational layer for monitoring and managing AI workloads, improving system stability, and reducing incidents. This is a hands-on role requiring direct involvement in diagnosing production issues, implementing fixes, and optimising monitoring, alerting, and CI/CD processes.
The position requires close collaboration with engineering teams to improve release quality, standardise telemetry, and ensure stable and predictable system behaviour in a distributed cloud environment.
Responsibilities
- Build and maintain a central monitoring and alerting layer for AI applications and pipelines
- Define and implement SLIs, alerts, and operational dashboards
- Manage incidents including triage, coordination, root cause analysis, and prevention
- Standardise telemetry across systems including latency, throughput, and failures
- Optimise CI/CD pipelines and introduce quality gates for reliability
- Work closely with engineering teams to reduce recurring issues and improve stability
Requirements
- 5+ years of experience in SRE, Platform, or Production Engineering
- Strong hands-on experience with Kubernetes and production environments
- Experience with Azure and Azure DevOps
- Experience with monitoring tools such as Datadog
- Strong understanding of incident management and root cause analysis
- Ability to build practical monitoring and alerting systems
- Experience with AI or LLM pipelines
- Experience building monitoring platforms across multiple systems
- Experience with Grafana
- Experience working in large-scale or distributed environments
- Strong ownership mindset and accountability for system stability
- Proactive approach to identifying risks and improvements
- Hands-on engineer who works directly with systems rather than only coordinating
- Comfortable working in dynamic and evolving environments
Benefits
- Solid, competitive salary
- Work in a multinational environment on international projects
- Comprehensive healthcare
- Long-term B2B contract with a stable project pipeline
- Work model: fully remote
