We're looking for a Senior Machine Learning Systems Engineer to lead the scaling and optimization of training systems for our large-scale multimodal and foundation models. You'll design distributed training systems using frameworks and tools such as Megatron-LM, NVIDIA NeMo, PyTorch FSDP, and Triton.
Responsibilities
- Design, implement, and optimize large-scale machine learning systems for training and inference
- Improve all aspects of performance, including GPU utilization, communication overhead, and memory efficiency
- Partner with research and modeling teams to align systems with algorithmic needs
- Evaluate and apply best practices for distributed training using industry-leading frameworks
- Dive deep into low-level optimization, including custom CUDA or Triton kernels
- Debug, profile, and tune training workflows to improve scalability and throughput
