We're looking for a Member of Technical Staff to design and improve scheduling and resource allocation for inference and training workloads coexisting on shared GPU clusters, and to build, operate, and scale GPU infrastructure spanning thousands of GPUs.
Requirements
- Deep understanding of the Linux, networking, and storage stacks
- Experience operating and scaling GPU infrastructure, Kubernetes, Slurm, and distributed storage systems
- Experience designing, building, and operating distributed systems at scale
- Track record of running critical infrastructure reliably — monitoring, incident response, and automation that reduces toil
- Enough understanding of training and inference workloads to collaborate with researchers and make sound infrastructure decisions
Benefits
- Competitive salary and equity
- Private health coverage
- Pension contribution (UK, Canada, US)
- Unlimited paid vacation
- Fully-distributed, async-first culture
- Hardware setup of your choice
- Stipends for phone, internet, and meals
