Senior Site Reliability Engineer - OPS00023

Dev.Pro is a software engineering services company that helps technology companies meet growth ambitions through talent outsourcing. They provide skilled technology engineers to partner with clients, enabling them to achieve complex technical objectives and create business value.

Employee count: 501-1000

Argentina only

We are a US-based outsourced software development company that has been delivering exceptional software experiences to our clients since 2011, helping technology companies become industry leaders.

Over the past few years, we have hired specialists all over the world while our main development centers were in Ukraine. We continue to expand and are now growing our centers in other parts of the world. Dev.Pro is open to hiring specialists from other countries, as well as Ukrainians who currently live outside Ukraine. We stand with Ukraine and keep supporting our people by offering a friendly remote environment while adhering to the values of democracy, human rights, and state sovereignty.

About this opportunity

We invite a skilled and experienced Senior Site Reliability Engineer to join our fully remote, international team. In this role, you’ll ensure our GPU clusters and supporting AI infrastructure are reliable, resilient, automated, and observable at scale. You’ll work with NVIDIA, Slurm, and Kubernetes to turn bare-metal GPU clusters into high-performance AI infrastructure.

What's in it for you:

  • Join a fast-scaling company shaping the future of AI infrastructure in Europe
  • Scale, optimize, and automate bare-metal GPU clusters for some of the most compute-intensive AI workloads
  • Collaborate with a top-tier international team and grow through global AI and cloud events

Is that you?

  • 5+ years as an SRE, DevOps, or HPC engineer in large-scale compute environments
  • Expertise in HPC workload managers (Slurm, PBS Pro, LSF)
  • Strong Python or Go skills for automation and observability
  • Infrastructure-as-code experience (Terraform, Ansible, Helm)
  • Kubernetes experience for AI workloads (vLLM, Ray, Triton Inference Server)
  • GPU resource management knowledge (MIG, NCCL, CUDA, containers)
  • Experience with storage systems (VAST, WEKA, DDN) and parallel filesystems (GPFS, Lustre)
  • Linux systems engineering, CI/CD, and configuration management skills
  • Strategic thinking with strong technical and business communication
  • Organization, autonomy, and adaptability
  • Advanced English level

Desirable:

  • Exposure to BlueField DPU, NVSwitch, or Slurm-on-Kubernetes hybrid orchestration

Key responsibilities and your contribution

In this role, you’ll apply your expertise to ensure our GPU clusters and AI infrastructure run reliably, efficiently, and at scale.

  • Automate deployment, scaling, and lifecycle management of GPU clusters
  • Optimize HPC scheduling and AI workload orchestration, including job preemption and GPU affinity
  • Implement observability and monitoring across GPU, NVLink, InfiniBand, and storage layers
  • Ensure reliability and uptime through SLOs, error budgets, chaos testing, and automated remediation
  • Collaborate with teams to optimize performance, resources, and fault recovery at petascale

About the job

Job type: Full Time

Experience level: Senior

Hiring timezones: Argentina +/- 0 hours

About Dev.Pro

We are a software development partner that empowers innovative technology companies to realize their growth ambitions and accelerate their time-to-market. Founded in 2011, we've built our reputation on being result-driven and quality-obsessed. Our core mission is to provide custom outsourced software development experiences, assembling tailored teams and delivering solutions that meet any skillset, complexity, or scale. We pride ourselves on our ability to help tech-enabled businesses leverage outsourced talent, enabling them to scale quickly and achieve their strategic objectives. Our skilled technology engineers collaborate closely with our clients, becoming an extension of their teams to tackle complex technical challenges and create tangible business value. Over the years, we've had the privilege of working with a diverse range of businesses, from Fortune 500 companies to ambitious, technology-driven startups. This breadth of experience has allowed us to continuously evolve and refine our processes, ensuring we consistently deliver beyond expectations.

Our global team of IT professionals is the backbone of our success. We embrace a diversified, remote work model, with dedicated development teams operating in over 55 countries across 5 continents. This global footprint provides us with the infrastructure, diverse skillsets, and flexibility necessary to deliver high-quality software development services for projects of any complexity or scale. We currently have over 950 talented experts and are continually growing. Our journey began with the vision to leverage years of experience working with development teams worldwide, offering select companies and partners exceptional outsourced software development services. Now, more than a decade later, Dev.Pro stands as a globally distributed company, committed to helping our partners navigate the ever-changing tech landscape and achieve sustained growth. Our commitment to transparency, collaboration, and continuous improvement ensures that we not only meet but exceed the expectations of our clients, fostering long-term partnerships built on trust and mutual success.

Employee benefits

Paid Time Off

You get 30 paid rest days per year to use for holidays, vacation, or other purposes on the dates you request.

Flexible Business Hours

We offer 2 flexible business hours during your workday to use for a lunch break or to take care of personal matters.

Remote Work

You can choose whether you want to work from the office in one of our locations, from your home office, or in a coworking space.

Healthcare

We offer 5 paid sick leave days and up to 60 days of medical leave if you need them. We also hold wellness marathons and promote activities that keep you motivated while working remotely and support your well-being and mental health.
