NVIDIA is looking for a Senior AI/HPC Engineer to join its Infrastructure Specialist team. The role involves deploying, managing, and maintaining AI/HPC infrastructure in Linux-based environments for new and existing customers.
Requirements
- BS/MS/PhD or equivalent experience in Computer Science, Electrical/Computer Engineering, Physics, Mathematics, or related fields.
- 5+ years providing in-depth support and deployment services, solving problems for hardware and software products.
- Linux system administration: process management, package management, task scheduling, kernel management, boot procedures/troubleshooting, performance reporting/optimization/logging, and network routing/advanced networking.
- Cluster management technologies.
- Scripting proficiency.
- Good interpersonal skills with the ability to maintain and deliver resolutions for customer-blocking issues as they arise.
- Excellent verbal and written English skills.
- Strong organizational skills and ability to prioritize/multi-task easily with limited supervision.
- Industry-standard Linux certifications.
- Experience with schedulers such as SLURM, LSF, or UGE.
- Hands-on experience with MPI, proficient in distributed communication programming and cluster debugging.
- In-depth understanding of NCCL principles and applications, with expertise in collective communication optimization for NVIDIA GPU clusters.
- Experience in deploying and optimizing high-speed networks (InfiniBand/Ethernet), with a clear understanding of how network architecture impacts GPU cluster performance.
- Familiarity with automation tools (Ansible, Salt, Puppet, etc.), capable of implementing batch configuration and operational automation for GPU clusters and LLM deployment environments.
- Knowledge and hands-on experience with Kubernetes, including container orchestration for AI/ML workloads, resource scheduling, scaling, and integration with HPC environments.
Benefits
- 401(k) Matching
- Retirement Plan
- Generous Paid Time Off
- Visa Sponsorship
- Four Day Work Week
- Generous Parental Leave
- Tuition Reimbursement
- Relocation Assistance
