The Staff Platform Infrastructure Engineer: Observability is responsible for designing, building, and maintaining robust telemetry pipelines, crafting observability strategies, and resolving critical incidents. The role requires 10+ years of experience in Software Development, Observability Engineering, or Site Reliability Engineering.
Requirements
- Minimum of 10 years of experience in Software Development, Observability Engineering, or Site Reliability Engineering
- Minimum of 5 years of in-depth experience with Observability platforms like Datadog, Dynatrace, Honeycomb, New Relic, Splunk, or Prometheus
- Minimum of 4 years of cloud experience with Azure, AWS, or GCP
- Minimum of 4 years of experience building and managing telemetry pipelines, including at least 1 year of hands-on experience with OpenTelemetry
- Minimum of 4 years of experience with Kubernetes environments
- Understanding of and experience with declarative infrastructure using Terraform
- Ability to work an on-call rotation that may include weekends as business needs dictate
- Proven ability to collect logs, metrics, and traces and to implement Observability solutions for applications and infrastructure
- Solid understanding of distributed tracing and experience instrumenting applications to analyze performance bottlenecks
- Hands-on experience configuring and deploying OpenTelemetry (OTel) Collectors for telemetry data collection, processing, and export
- Strong understanding of Kubernetes architecture and experience managing observability within K8s environments
- Proficiency in using Kustomize for Kubernetes configurations and Terraform for infrastructure provisioning
- Experience integrating observability practices and tools into CI/CD pipelines for automated deployments
- Exceptional analytical and problem-solving skills to diagnose and resolve complex issues within observability systems, including data pipeline failures, instrumentation errors, and performance bottlenecks
- Strong Site Reliability Engineering (SRE) mindset with a focus on automation, proactive monitoring, and continuous improvement to ensure system reliability and availability
- Experience defining and implementing Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to measure system performance and inform data-driven decisions
Benefits
- Comprehensive benefits designed to support physical, mental, and financial health
- Generous Paid Time Off
- 401k Matching
