The fine print (but a bit more exciting):
- This is a remote-first role based in the UK or Poland, with occasional in-person team meetups. Our HQ is in Jersey (UK), and our ~90-person team is spread across the UK, US, and EU.
- Our infrastructure team is intentionally small. You’ll be expected to operate independently as a subject matter expert. There isn’t a large platform team to lean on.
- Our tech stack is pragmatic and maintainable rather than trend-driven. We run a monolithic Ruby on Rails application with a React frontend, deployed and managed via Cloud66. We use GitHub and GitHub Actions for CI, Terraform (OpenTofu) for infrastructure as code, and Datadog for monitoring and observability.
- Pinpoint isn’t for everyone. We’re still operating in startup mode. Things move quickly, priorities evolve, and we value people who take ownership rather than wait for instruction.
- Our values actually matter here. We hire people who reflect them in how they work, collaborate, and make decisions.
About the Role:
- Improve monitoring and alerting across infrastructure and application layers
- Diagnose and reduce production instability, including load spikes and database bottlenecks
- Strengthen our use of Datadog, particularly logging quality and alert signal-to-noise ratio
- Improve capacity planning through testing, monitoring, and forecasting
- Ensure new features ship with appropriate production metrics and reliability safeguards
- Implement and maintain best practices across infrastructure security, compliance, and vulnerability management
- Participate in on-call rotations, incident response, and post-incident analysis
- Maintain clear and up-to-date infrastructure documentation
- Improve our CI/CD pipeline and overall infrastructure performance
What Success Looks Like:
- Fewer surprise production incidents
- Faster diagnosis and recovery when issues occur
- Clear, actionable dashboards and alerts
- More predictable performance under load
- Increased engineering confidence when shipping new features
About You:
- 4+ years of hands-on experience in infrastructure, DevOps, platform, or site reliability roles
- Experience maintaining the reliability and scalability of a production SaaS application
- Experience participating in on-call rotations and incident response
- Strong working knowledge of infrastructure as code (e.g. Terraform), testing practices, containerisation, and orchestration
- Experience with monitoring, metrics visualisation, and alerting platforms, ideally Datadog
- Strong problem-solving ability and sound technical judgment in ambiguous environments
- Clear communicator who documents decisions and contributes to thoughtful post-incident reviews
- Comfortable getting close to the codebase when needed
- Motivated to improve systems through automation and efficiency gains
- Curious and resourceful — AWS documentation doesn’t intimidate you
- Experience optimising MySQL, PostgreSQL, or Redis
- Deep familiarity with AWS
- Advanced Datadog experience, particularly around logging and observability
- Experience working within SOC2 or ISO27001 environments
- Familiarity with Ruby or Ruby on Rails applications
What We Offer:
- Comprehensive healthcare – Excellent medical, dental, and vision coverage for you and your family
- Unlimited holidays – Take the time you need to rest and recharge
- Mental health support – Unlimited, immediate access to professional counselling via Spill
- Retirement contributions – 401(k) or pension contributions depending on your location
- Remote-first – Work where you’re most productive, with flexibility and trust as the default
- Equity with real upside – Share in the long-term value you help create
- Fully paid parental leave – Up to 16 weeks of paid leave for new parents
- Learning budget – Annual funds for courses, books, or anything that supports your growth
