Principal Cluster Engineer, Training Infrastructure

DataCrunch is a cloud service provider specializing in high-performance GPU servers and clusters for machine learning, powered by renewable energy.

DataCrunch

Employee count: 11-50

Imagine a future where everyone has instant, low-cost access to intelligence. We’re building a fully featured European AI cloud — with everything one needs to train, experiment with, and deploy AI models. In addition, our GPUs run on 100% renewable energy.

We’re ambitious, curious, and gutsy doers. We practice a low hierarchy across the company and high morale in our teams. We’ve already achieved a lot, yet we’re only getting started. Now it’s your chance to join the ride. We offer more than just the job — we offer a career-defining opportunity to be part of building something big.

As a cherry on top, we’ve recently raised $64M in Series A funding and are ready to reach new heights.

about the role

We’re looking for a Principal Cluster Engineer to own and evolve our InfiniBand-connected GPU training infrastructure. This is a highly technical role focused on building and operating large-scale AI and HPC clusters that power the next generation of machine learning workloads.

You will work closely with ML researchers, cloud platform teams, datacenter operations, and procurement to ensure Verda’s GPU infrastructure is fast, reliable, and ready to support cutting-edge training workloads. In this role you will architect and operate large-scale InfiniBand fabrics, push storage and compute performance to their limits, build automation and observability tooling, and help define the technical and operational standards the team works to.

You’ll play a key role in translating customer and product requirements into real infrastructure capabilities, ensuring clusters are designed for performance, reliability, and scale.

why verda

Competitive cash and equity package, plus benefits (healthcare, lunch, wellbeing, etc.)
Profitable operations with rapid, sustained growth
A genuine once-in-a-lifetime opportunity to join one of Finland’s few true explosive growth stories, shaping a category-defining AI cloud from the ground up
Work alongside world-class engineers, researchers, and partners across the global AI ecosystem
A small, high-performing team of around 70 people representing 27 nationalities

practicalities

Location: Remote - EU
Start Date: As soon as possible
Contract Type: Full-time
Working Language: English

your responsibilities

Design, deploy, and continuously improve large-scale InfiniBand-connected GPU training clusters
Drive cluster-level storage performance, translating customer SLAs into internal throughput and IOPS performance targets
Build and maintain automation for cluster provisioning, OS imaging, firmware management, and day-two operations using Python
Contribute to infrastructure-as-code and CI/CD pipelines for cluster and platform management
Establish and own performance baselines across compute, network fabric, and storage layers
Identify, diagnose, and resolve performance bottlenecks across the full cluster stack
Implement and maintain observability tooling including metrics, alerting, and anomaly detection systems
Work closely with datacenter operations, cloud platform teams, ML researchers, and procurement to translate requirements into infrastructure architecture
Participate in the on-call rotation and help maintain production reliability of the training clusters

your key competencies

7+ years of hands-on infrastructure or systems engineering experience
Experience operating large-scale HPC or AI training clusters (1000+ GPU nodes)
Strong production experience with InfiniBand fabrics
Experience working with NVIDIA GPU hardware in training workloads (Hopper or newer preferred)
Proven experience leading or tech-leading engineering teams, setting technical direction, reviewing work, and mentoring engineers
Experience with automation and scripting (Python preferred)
Experience working with infrastructure-as-code tools such as Terraform, Ansible, or Salt

Nice to have

Experience with the NVIDIA HPC software stack or UFM
Knowledge of NCCL and debugging distributed GPU training workloads
Experience tuning Linux kernels or using eBPF for performance optimization in HPC environments

success criteria for this role in the next 6-12 months

Optimized production AI/HPC clusters with measurable improvements in reliability, performance, and job success rates
Implemented automation and tooling that significantly reduces operational overhead and speeds up incident resolution
Established strong operational practices for monitoring, alerting, capacity planning, and incident management
Built strong collaboration with datacenter operations, ML researchers, and cloud platform teams to translate workload requirements into infrastructure improvements
Mentored engineers and helped build deeper internal expertise in GPU cluster operations and performance engineering

what the process looks like

Introduction chat with the TA Partner (45 mins): Learn more about Verda and share your career aspirations.
Conversation with the CTO (30 mins): A focused discussion with our CTO to explore technical vision, infrastructure strategy, and how your experience aligns with the future of Verda’s AI platform.
Technical interview with the team (60 mins): Learn about the role and its requirements, dive deeper into your expertise, and discuss technical challenges.
Final interview (45 mins): Meet with our COO for a culture-fit conversation.

what's next

Apply sooner rather than later. This job ad will be removed when we’ve found the right person.

Please submit your application through our Careers page. We don’t accept applications sent by email.

About the job

Apply before: Jun 18, 2026
Posted on: Apr 19, 2026
Job type: Full Time
Experience level: Senior
Experience: 7 years minimum

About DataCrunch

DataCrunch.io is an innovative cloud service provider focused on delivering premium dedicated GPU servers and clusters tailored for machine learning applications. With a commitment to utilizing 100% renewable energy, DataCrunch positions itself as a forward-thinking player in the AI cloud computing space, ready to tackle the challenges of modern machine learning and AI workloads. The company offers a range of GPU instances, from on-demand resources to long-term managed GPU clusters, all designed to provide seamless model inference services.

DataCrunch's infrastructure is powered by high-performance NVIDIA technology, specifically designed to ensure maximum speed and efficiency for complex machine learning models. Their service model allows users to easily sign up and utilize resources on a pay-as-you-go basis, optimizing costs for both individuals and organizations. The platform also features expert support and an intuitive dashboard that empowers users to start, stop, or hibernate their GPU instances instantly. As part of their commitment to security and compliance, DataCrunch holds ISO27001 certification, ensuring all operations meet stringent international standards. Furthermore, the company promotes environmentally friendly practices by ensuring that all of its infrastructure is powered by renewable energy sources.
