The Site Reliability teams at Datadog keep our high-volume, low-latency environments performing around the clock. These teams collaborate closely with our product engineers so that Datadog can monitor millions of servers and containers and our customers always have dependable, actionable data at their fingertips. You’ll be responsible for shaping the infrastructure of our data-intensive, real-time services as we continue to grow at petabyte scale.
- Keep our service reliable, available and fast
- Respond to, investigate and fix service issues, whether they be deep in the OS kernel or in the application code
- Design, build and maintain the infrastructure we need to support orders of magnitude more customers
- You have a track record working with large-scale distributed systems, preferably in the cloud OR you have a BS/MS/PhD in a scientific field or equivalent experience
- You value correctness and efficiency; you leave no stone unturned when diagnosing production issues
- You handle infrastructure with code because automation lets you focus on the more difficult and rewarding problems
- You have production experience with distributed compute/storage tools, e.g. ZooKeeper, Cassandra, Postgres, Kafka, Elasticsearch, Redis
- You have submitted bug fixes to the aforementioned projects
- You are fully fluent in Python, Ruby and Go
Is this you? Tell us why, and apply now. Include links to your GitHub, Stack Overflow or other online projects.
About this role
August 18th, 2021
Job posted on
October 16th, 2020
About the company
Modern monitoring & analytics. See inside any stack, any app, at any scale, anywhere. Datadog is a monitoring and analytics platform for large-scale application infrastructure and applications. C...