We are looking for a Principal Site Reliability Engineer to own the reliability, scalability, and security of our cloud infrastructure. You will work closely with a small platform engineering team and partner with engineering leads across simulation, ML, and customer-facing teams.
Requirements
- Infrastructure-as-code proficiency - Terraform modules, state management, and multi-environment patterns.
- Deep AWS experience - EKS, EC2, IAM, S3, Storage Gateway, VPC networking, Transit Gateway, CloudFront, KMS, and IRSA.
- Kubernetes expertise - cluster operations, node pools, probes, cordoning, pod scheduling, RBAC, Helm, node autoscaling (Karpenter experience a plus); solid understanding of containerization and AMI lifecycle management.
- CI/CD - experience with GitOps workflows and pipeline tooling (ArgoCD, GitHub Actions, Jenkins).
- Solid networking fundamentals - CIDR design, security groups, DNS, load balancing, VPN, cross-region connectivity.
- Experience with monitoring and observability tooling - Prometheus, Grafana, Elasticsearch.
- Comfort with Python and Bash for tooling and automation.
- Familiarity with working across Linux and Windows environments.
- Operational familiarity with Windows Server is a meaningful advantage.
