Principal Cluster Engineer, Training Infrastructure

DataCrunch is a cloud service provider specializing in high-performance GPU servers and clusters for machine learning, powered by renewable energy.

DataCrunch

Employee count: 11-50

AL, AD + 48 more

Imagine a future where everyone has instant, low-cost access to intelligence. We’re building a fully featured European AI cloud — with everything one needs to train, experiment with, and deploy AI models. In addition, our GPUs run on 100% renewable energy.

We’re ambitious, curious, and gutsy doers. We practice a low hierarchy across the company and high morale in our teams. We’ve already achieved a lot, yet we’re only getting started. Now it’s your chance to join the ride. We offer more than just the job — we offer a career-defining opportunity to be part of building something big.

As a cherry on top, we’ve recently raised a $64M Series A and are ready to reach new heights.

about the role

We’re looking for a Principal Cluster Engineer to own and evolve our InfiniBand-connected GPU training infrastructure. This is a highly technical role focused on building and operating large-scale AI and HPC clusters that power the next generation of machine learning workloads.

You will work closely with ML researchers, cloud platform teams, datacenter operations, and procurement to ensure Verda’s GPU infrastructure is fast, reliable, and ready to support cutting-edge training workloads. In this role you will architect and operate large-scale InfiniBand fabrics, push storage and compute performance to their limits, build automation and observability tooling, and help define the technical and operational standards the team works to.

You’ll play a key role in translating customer and product requirements into real infrastructure capabilities, ensuring clusters are designed for performance, reliability, and scale.

why verda

  • Competitive cash and equity package, plus benefits (healthcare, lunch, wellbeing, etc.)
  • Profitable operations with rapid, sustained growth
  • A genuine once-in-a-lifetime opportunity to join one of Finland’s few true explosive growth stories, shaping a category-defining AI cloud from the ground up
  • Work alongside world-class engineers, researchers, and partners across the global AI ecosystem
  • A small, high-performing team of around 70 people representing 27 nationalities

practicalities

  • Location: Remote - EU
  • Start Date: As soon as possible
  • Contract Type: Full-time
  • Working Language: English

your responsibilities

  • Design, deploy, and continuously improve large-scale InfiniBand-connected GPU training clusters
  • Drive cluster-level storage performance, translating customer SLAs into internal throughput and IOPS performance targets
  • Build and maintain automation for cluster provisioning, OS imaging, firmware management, and day-two operations using Python
  • Contribute to infrastructure-as-code and CI/CD pipelines for cluster and platform management
  • Establish and own performance baselines across compute, network fabric, and storage layers
  • Identify, diagnose, and resolve performance bottlenecks across the full cluster stack
  • Implement and maintain observability tooling including metrics, alerting, and anomaly detection systems
  • Work closely with datacenter operations, cloud platform teams, ML researchers, and procurement to translate requirements into infrastructure architecture
  • Participate in the on-call rotation and help maintain production reliability of the training clusters

your key competencies

  • 7+ years of hands-on infrastructure or systems engineering experience
  • Experience operating large-scale HPC or AI training clusters (1000+ GPU nodes)
  • Strong production experience with InfiniBand fabrics
  • Experience working with NVIDIA GPU hardware in training workloads (Hopper or newer preferred)
  • Proven experience leading or tech-leading engineering teams, setting technical direction, reviewing work, and mentoring engineers
  • Experience with automation and scripting (Python preferred)
  • Experience working with infrastructure-as-code tools such as Terraform, Ansible, or Salt

Nice to have

  • Experience with the NVIDIA HPC software stack or UFM
  • Knowledge of NCCL and debugging distributed GPU training workloads
  • Experience tuning Linux kernels or using eBPF for performance optimization in HPC environments

success criteria for this role in the next 6-12 months

  • Optimized production AI/HPC clusters with measurable improvements in reliability, performance, and job success rates
  • Implemented automation and tooling that significantly reduces operational overhead and speeds up incident resolution
  • Established strong operational practices for monitoring, alerting, capacity planning, and incident management
  • Built strong collaboration with datacenter operations, ML researchers, and cloud platform teams to translate workload requirements into infrastructure improvements
  • Mentored engineers and helped build deeper internal expertise in GPU cluster operations and performance engineering

what the process looks like

  1. Introduction chat with the TA Partner (45 mins): Learn more about Verda and share your career aspirations.
  2. Conversation with the CTO (30 mins): A focused discussion with our CTO to explore technical vision, infrastructure strategy, and how your experience aligns with the future of Verda’s AI platform.
  3. Technical interview with the team (60 mins): Learn about the role and its requirements, dive deeper into your expertise, and discuss technical challenges.
  4. Final interview (45 mins): Meet with our COO for a culture-fit conversation.

what's next

Apply sooner rather than later. This job ad will be removed when we’ve found the right person.

Please submit your application through our Careers page. We don’t accept applications sent by email.

About the job

Job type: Full Time

Experience: 7 years minimum

About DataCrunch

DataCrunch.io is an innovative cloud service provider focused on delivering premium dedicated GPU servers and clusters tailored for machine learning applications. With a commitment to utilizing 100% renewable energy, DataCrunch positions itself as a forward-thinking player in the AI cloud computing space, ready to tackle the challenges of modern machine learning and AI workloads. The company offers a range of GPU instances, from on-demand resources to long-term managed GPU clusters, all designed to provide seamless model inference services.

DataCrunch's infrastructure is powered by high-performance NVIDIA technology, specifically designed to ensure maximum speed and efficiency for complex machine learning models. Their service model allows users to easily sign up and utilize resources on a pay-as-you-go basis, optimizing costs for both individuals and organizations. The platform also features expert support and an intuitive dashboard that empowers users to start, stop, or hibernate their GPU instances instantly. As part of their commitment to security and compliance, DataCrunch holds ISO 27001 certification, ensuring all operations meet stringent international standards. Furthermore, the company promotes environmentally friendly practices by ensuring that all of its infrastructure is powered by renewable energy sources.
