About the Role
The Site Reliability Engineer (SRE) in this role will ensure the reliability, performance, and scalability of our MarTech SaaS platform, which serves millions of users running thousands of marketing campaigns daily. They will be responsible for monitoring systems, responding to incidents, and implementing automation to improve platform reliability.
Key Responsibilities Summary
- Monitor platform health using tools such as Prometheus, Grafana, and/or Datadog, and respond to alerts and incidents
- Lead incident response and conduct root cause analysis
- Build automation tools to reduce manual work and improve reliability
- Work with development teams to implement reliability best practices
- Plan for system scaling needs as the platform grows
What We're Looking For
Required:
- Minimum 2 years of SRE/DevOps experience with monitoring and reliability focus
- Hands-on cloud platform experience (GCP, Azure, or AWS)
- Experience with an APM (application performance monitoring) platform
- SRE best practices knowledge (SLIs/SLOs, error budgets, etc.)
- Backend development experience (Java, PHP and/or Node.js preferred)
- Incident management experience including on-call responsibilities
Preferred:
- SaaS platform experience, especially high-volume environments
- MarTech/AdTech industry background
- Experience scaling systems for millions of users
- Security best practices knowledge
What We Offer
- Remote-first culture with flexible working arrangements
- High-impact role in a small, collaborative team where your contributions directly matter
- Growth opportunities as we scale our platform and expand our engineering team
- Competitive compensation and benefits package
- Learning budget for professional development and certifications
- Modern tech stack with opportunities to work with cutting-edge solutions
