We are looking for a Staff Machine Learning Operations Engineer - DevOps/SRE to anchor operations for our LLM- and ML-powered services running on AWS, Kubernetes, Snowflake, and Datadog. As a staff-level engineer, you'll combine deep hands-on expertise with strong communication, project leadership, and architectural judgment to raise the bar on performance, resilience, observability, and maintainability.
Requirements
- AWS (EKS, EC2, VPC, IAM/IRSA, ALB/NLB, S3, KMS)
- Kubernetes (workload autoscaling, rollout strategies, Helm/GitOps patterns, capacity & cost optimization)
- Terraform (modular design, environment separation, policy-as-code, drift control)
- Datadog (APM/tracing, logs, metrics, synthetics, SLOs; building actionable dashboards and alert pipelines)
- Snowflake (warehouse sizing and tuning, tasks/streams, performance optimization, cost/credit governance, RBAC)
- Proven experience running LLM/ML production systems (model serving, data/feature pipelines, evaluation, and guardrails)
- Strong communication and stakeholder management; able to lead cross-functional projects and set architectural direction
- Track record of improving performance, resiliency, observability, and maintainability in complex, distributed systems
- Solid experience with incident command, on-call ownership, and post-mortem leadership
Benefits
- Self-managed PTO
- Flexible work hours
- MacBook (or PC if you prefer!) + $500 setup allowance
- Remote work environment
- Culture built on innovation that values big ideas
