Staff Software Engineer, GPU Infrastructure (HPC)

Jobgether
United States only

This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Staff Software Engineer, GPU Infrastructure (HPC) in the United States and Canada.

As a Staff Software Engineer in GPU infrastructure, you will design, build, and operate high-performance computing clusters to accelerate AI and machine learning workloads. You will collaborate closely with researchers and engineers to ensure AI workloads run reliably, efficiently, and at scale across cloud environments. The role includes optimizing infrastructure for cost, performance, and stability, while providing self-service tools for ML teams. You will troubleshoot complex issues, implement automation and observability best practices, and drive innovations in distributed GPU/TPU systems. This position offers opportunities to mentor engineers, influence infrastructure strategy, and directly impact the development of cutting-edge AI models. You will work in a fast-paced, collaborative environment where technical excellence and scalability are key priorities.

Accountabilities:
• Design, deploy, and manage Kubernetes-based GPU/TPU superclusters across multiple clouds for AI/ML workloads.
• Optimize HPC infrastructure for distributed training frameworks such as JAX, PyTorch, and TensorFlow.
• Identify and resolve performance bottlenecks, system failures, and infrastructure issues.
• Build self-service tools to enable researchers to monitor, debug, and optimize AI/ML training jobs independently.
• Implement best practices for automation, observability, and infrastructure-as-code (IaC).
• Collaborate closely with AI researchers and ML engineers to translate emerging needs into robust infrastructure solutions.
• Mentor team members, conduct code reviews, document processes, and foster a culture of knowledge sharing.

Requirements

• Deep expertise in ML/HPC infrastructure, including GPU/TPU clusters and distributed training frameworks.
• Proven experience with cloud-native Kubernetes deployments at scale.
• Strong programming skills in Python and Go; open-source contributions are preferred.
• Knowledge of Linux internals, RDMA networking, and performance optimization for ML workloads.
• Demonstrated ability to collaborate with research teams and solve complex infrastructure challenges.
• Self-directed problem-solving mindset, with the ability to drive impact in fast-paced environments.
• Experience in building scalable, resilient, and maintainable infrastructure systems.

Benefits

• Inclusive and collaborative work culture.
• Opportunities to work on cutting-edge AI research and infrastructure projects.
• Weekly lunch stipend, in-office meals, and snacks.
• Comprehensive health and dental benefits, including a mental health budget.
• 100% parental leave top-up for up to six months.
• Personal enrichment benefits for arts, fitness, and workspace improvement.
• Remote-flexible work options, co-working stipend, and offices in major global cities.
• Six weeks of vacation (30 working days).

Jobgether is a Talent Matching Platform that partners with companies worldwide to efficiently connect top talent with the right opportunities through AI-driven job matching.
When you apply, your profile goes through our AI-powered screening process designed to identify top talent efficiently and fairly.
🔍 Our AI evaluates your CV and LinkedIn profile thoroughly, analyzing your skills, experience and achievements.
📊 It compares your profile to the job’s core requirements and past success factors to determine your match score.
🎯 Based on this analysis, we automatically shortlist the 3 candidates with the highest match to the role.
🧠 When necessary, our human team may perform an additional manual review to ensure no strong profile is missed.
The process is transparent, skills-based, and free of bias — focusing solely on your fit for the role.
Once the shortlist is completed, we share it directly with the company that owns the job opening. The final decision and next steps (such as interviews or additional assessments) are then made by their internal hiring team.

Thank you for your interest!

About the job

Job type: Full Time
Experience level: Senior
Hiring timezones: United States +/- 0 hours