What you'll be doing
- Take ownership of the availability and resiliency of Varo's cloud-based infrastructure; design and maintain disaster recovery scenarios; design self-healing and resiliency patterns
- Write and maintain infrastructure as code for core systems (Terraform, Terraform modules, and Kubernetes Helm charts); build and maintain CI/CD pipelines
- Improve observability and monitoring across Varo's infrastructure by implementing advanced tools and technologies
- Create and maintain monitoring dashboards, alerts, and log systems to quickly identify and resolve issues
- Implement advanced observability tools like distributed tracing and anomaly detection for deeper system insights and efficient troubleshooting
- Help lead high-profile incidents and facilitate blameless post-mortems
- Collaborate with development teams to implement and improve SLIs and SLOs for their services and to promote service ownership
- Use monitoring data to drive actionable insights and contribute to incident response strategies
- Automate operational tasks to save time and improve accuracy
- Write clean and scalable scripts, software and systems to manage platform infrastructure and applications
You'll bring the following required skills and experience
- 4+ years of experience as a Site Reliability, DevOps, or Software Engineer, with proficiency in one or more high-level languages (such as Python, Go, Ruby, Java, or JavaScript)
- Excellent Linux and troubleshooting skills
- Experience in building and supporting high-availability cloud environments in AWS
- Experience using infrastructure as code (IaC) and deployment automation with tools such as Terraform, Helm, GitLab, or equivalent
- Experience running Kubernetes in production
- Istio experience is a plus
- Experience with monitoring, logging and tracing tools such as Prometheus, Grafana, Jaeger/Tempo, ELK/Loki, OpenTelemetry
- Experience instrumenting code (Java/Kotlin, Python, Go, etc.) and creating simple instrumentation frameworks for developers to adopt where auto-instrumentation falls short
- Willingness to participate in an on-call rotation for after-hours production infrastructure incidents
- Experience with SDLC, CI/CD, and related tooling
- Kafka experience is a plus