Research Infrastructure Engineer, Training Systems
Build and maintain infrastructure for large-scale model training and experimentation. Design APIs and interfaces for complex training workflows. Improve reliability, debuggability, and performance across training and data pipelines.
Requirements
- Systems engineering role focused on ML training infrastructure
- Build and maintain infrastructure for large-scale model training and experimentation
- Design APIs and interfaces that make complex training workflows easier to express and harder to misuse
- Improve reliability, debuggability, and performance across training and data pipelines
- Debug issues spanning Python, PyTorch, distributed systems, GPUs, networking, and storage
- Write tests, benchmarks, and diagnostics that catch meaningful regressions
Benefits
- Diversity and inclusion policy
- Equal employment opportunity
- Background checks administered in accordance with applicable law
- Fair chance hiring policy
