What You'll Do
- Manage and maintain large-scale production Kubernetes clusters for ML workloads, including ML platform infrastructure and the necessary DevOps.
- Contribute to the Spotify ML Platform SDK and build tooling for a range of ML operations.
- Collaborate with Machine Learning Engineers (MLEs), researchers, and various product teams to deliver scalable ML platform tooling that meets agreed timelines and specifications.
- Work independently and collaboratively on squad projects that often require learning and applying new technologies beyond existing skill sets.
- Design, document, and implement reliable, testable, and maintainable solutions for ML infrastructure capabilities.
Who You Are
- You have 3+ years of hands-on experience implementing production ML infrastructure at scale in Python, Go, or similar languages.
- You have 3+ years of experience working with a public cloud provider such as GCP, AWS, or Azure (GCP preferred).
- You have knowledge of deep learning fundamentals, algorithms, and open-source tools such as Hugging Face, Ray, PyTorch, or TensorFlow.
- An understanding of distributed training leveraging GPUs and Kubernetes is a plus.
- You have a general understanding of data processing for ML.
- You have experience with agile software development processes and modular code design following industry standards.
Where You'll Be
- This role is based in Toronto.
- We offer you the flexibility to work where you work best! There will be some in-person meetings, but the role still allows for flexibility to work from home.