Senior IaaS / Kubernetes Platform Engineer (worldwide remote, work anywhere)

CloudLinux provides a commercially supported operating system optimized for shared hosting providers and data centers, enhancing server stability, security, and resource management.

CloudLinux

Employee count: 201-500

Poland only

CloudLinux is a global remote-first company. We are driven by our principles: do the right thing, employees first, and remote first. We deliver high-volume, low-cost Linux infrastructure and security products that help companies increase the efficiency of their operations. Everyone on our team supports one another and does what they can to ensure we all succeed.

Check out our website for more information: https://cloudlinux.com/

We are looking for a Senior IaaS / Kubernetes Platform Engineer to join our Infrastructure Department and become a key contributor to the design, implementation, and operation of our private cloud and multi-tenant Kubernetes platform.

Our infrastructure powers 500+ VMs across multiple datacenters, serving 20+ engineering teams. We are in the process of evolving from an OpenNebula-based virtualization platform toward a Kubernetes-native multi-tenant cloud with KubeVirt for VM orchestration — while maintaining reliability and operational excellence throughout the transition.

You will work alongside the existing IaaS Tech Lead and Network Engineer, and must be capable of independently owning and operating the full IaaS stack (compute, storage, networking, bare metal) if needed. This is not a "Kubernetes-only" role — it requires deep infrastructure generalist skills combined with Kubernetes platform expertise.

What You Will Do

Kubernetes Platform Engineering (Primary Focus — 40%)

Design, build, and operate a multi-tenant Kubernetes platform using Cluster API (CAPI) with bare-metal providers (Metal3/Sidero).
Implement hard multi-tenancy using vCluster (Loft Labs) or similar technology, providing isolated Kubernetes API servers per tenant.
Deploy and manage KubeVirt for VM orchestration within Kubernetes, including CPU pinning, NUMA awareness, and HugePages configuration.
Implement GitOps-driven infrastructure using ArgoCD or Flux as the single source of truth for all cluster configurations.
Deploy and manage Policy-as-Code using Kyverno or OPA Gatekeeper for admission control, resource quotas, and security policies.
Build self-service capabilities using Crossplane or similar Kubernetes-native infrastructure provisioning tools.

Storage Engineering (20%)

Operate and optimize Ceph distributed storage clusters (currently 1 PiB raw, 149 OSDs, Quincy 17.2.5).
Manage Rook-Ceph operator deployments at scale on modern Kubernetes (v1.28+).
Implement storage tiering: Ceph for bulk storage, local NVMe for high-IOPS workloads, LINSTOR/DRBD or TopoLVM for ultra-fast replicated storage.
Design and implement per-VM / per-tenant I/O isolation on shared Ceph clusters.
Manage CDI (Containerized Data Importer) for VM image lifecycle in KubeVirt environments.

Networking (15%)

Deploy and manage overlay networks for pod networking, micro-segmentation, and WireGuard/IPsec encryption.
Implement Cluster Mesh for multi-datacenter pod-to-pod connectivity.
Configure Multus CNI and SR-IOV for multi-NIC VM support in KubeVirt.
Work with physical network infrastructure: Juniper switches (JunOS), BGP (eBGP/iBGP), EVPN/VXLAN, VLANs.
Maintain IPsec site-to-site connectivity between datacenters.

Reliability and Operations (15%)

Practice SRE discipline: define and maintain SLOs with error budgets, implement proactive capacity management with 6-12 month forecasting.
Design and execute chaos engineering experiments to validate system resilience.
Participate in on-call rotation for IaaS infrastructure (OpenNebula, Ceph, networking).
Write and maintain runbooks, DRP documentation, and postmortem analyses.
Drive proactive improvement: identify reliability risks, performance bottlenecks, and toil — then propose and implement solutions without waiting for incidents.

Infrastructure as Code and Automation (10%)

Develop and maintain Terraform/OpenTofu modules for multi-cloud infrastructure provisioning.
Write Ansible playbooks for bare-metal server configuration and fleet management.
Automate infrastructure lifecycle: PXE boot images, hardware provisioning (Foreman), IPMI management.
Implement FinOps practices: cost attribution, resource utilization analysis, right-sizing recommendations using OpenCost/Kubecost.

Requirements
Must have

5+ years in infrastructure/platform engineering roles, with at least 3 years operating production Kubernetes clusters (not just deploying apps on K8s, but building and managing the platform itself).
Production experience with at least 3 of the following:

KubeVirt or similar VM-on-K8s technology
Cluster API (CAPI) for declarative cluster lifecycle management
Cilium or Calico (advanced CNI with eBPF or BGP integration)
Rook-Ceph or other Kubernetes storage operators at scale (100+ OSDs)
ArgoCD or Flux for GitOps-driven infrastructure management

Deep Linux systems knowledge: kernel tuning, networking stack (iptables/nftables, routing, bonding, VLAN), filesystem operations, performance troubleshooting.
Ceph distributed storage experience: cluster operations, OSD lifecycle, pool management, performance tuning, troubleshooting degraded states.
Infrastructure as Code: Terraform/OpenTofu + Ansible at production scale.
Bare-metal infrastructure experience: IPMI/iDRAC, PXE boot, RAID configuration, hardware diagnostics, datacenter operations.
Networking fundamentals: BGP, VLAN, IPsec/WireGuard, DNS, load balancing.
Strong written and verbal English (B2+ minimum) — documentation, postmortems, and cross-team communication are in English.
Proactive mindset: demonstrated history of identifying problems before they become incidents and driving improvements without being asked.

Nice to have

Experience building multi-tenant Kubernetes platforms (vCluster, Capsule, or custom namespace isolation).
Crossplane or similar Kubernetes-native infrastructure abstraction.
Policy-as-Code: Kyverno, OPA Gatekeeper, or Kubewarden.
Container security: image signing (Sigstore/cosign), runtime security (Falco), sandboxed execution (Kata Containers, gVisor).
SRE practices: SLO/SLI design, error budget policies, chaos engineering (LitmusChaos, Chaos Mesh), incident management frameworks.
FinOps: OpenCost, Kubecost, cloud cost optimization.
Immutable OS experience: Talos Linux, Flatcar Container Linux, or similar.
OpenNebula experience (we are migrating FROM it, so understanding it accelerates the transition).
Experience with LINSTOR/DRBD or TopoLVM for local high-performance storage.
SR-IOV and DPDK experience for hardware-accelerated networking.
Experience migrating from traditional virtualization (VMware, OpenNebula, Proxmox) to Kubernetes/KubeVirt.
Grafana LGTM stack (Mimir, Loki, Tempo) for observability.
Compliance environment experience (SOC2, ISO 27001, NIS2).
Go or Python programming for infrastructure tooling.
Experience with Juniper JunOS switch configuration.

What we’re looking for

Proactive mindset. Our current IaaS workload is still around 50% unplanned work, including incidents and ad hoc support requests. We’re looking for someone who can reduce that through better automation, preventive controls, and more resilient systems.
Platform-minded. You look for ways to replace repetitive support work with scalable solutions, for example, building self-service workflows instead of provisioning VMs manually, or introducing automated QoS policies instead of handling limits case by case.
Able to work across the current and future stack. We operate OpenNebula and Ceph today while moving toward a Kubernetes-native platform. This role requires someone who can keep the current environment reliable while helping build the next stage in a practical way.
Transparent in communication. We value technical discussions, architectural decisions, and incident reviews happening in shared channels and documented formats. That includes ADRs, postmortems, and clear written updates.
Focused on knowledge sharing. You document your work, write runbooks as you go, and help make the platform easier for others to operate and support.
Strong English communication. Documentation, postmortems, Jira updates, Slack discussions, and cross-team collaboration are conducted in English.

Benefits

What's in it for you?

A focus on professional development.
Interesting and challenging projects.
Fully remote work with flexible working hours, which lets you schedule your day and work from any location worldwide.
24 days of paid vacation per year, 10 days of national holidays, and unlimited sick leave.
Compensation for private medical insurance.
Co-working and gym/sports reimbursement.
Budget for education.
The opportunity to receive a reward for the most innovative idea that the company can patent.

By applying for this position, you consent to the processing of your personal data as described in our Privacy Policy (https://cloudlinux.com/candidate-privacy-notice), which provides detailed information on how we maintain and handle your data.

About the job

Apply before

Jun 25, 2026

Posted on

Apr 26, 2026

Hiring timezones

Poland +/- 0 hours

About CloudLinux

CloudLinux is dedicated to enhancing the security, stability, and profitability of Linux for hosting providers and data centers. With a collective experience of over 500 years in Linux, the company is transforming how these entities utilize the technology, extending its benefits to millions of their customers. CloudLinux boasts over 500,000 product installations and serves more than 4,000 customers, including prominent names like Liquid Web, 1&1, and Dell. The company merges profound technical expertise in hosting, kernel development, and open source with exceptional client care. Cloud Linux, Inc. was consolidated into Cloud Linux Software, Inc., which now operates under the TUXCARE trade name (DBA).

The core offering, CloudLinux OS, is specifically engineered for shared hosting environments. It isolates each tenant into a Lightweight Virtualized Environment (LVE), which partitions, allocates, and limits server resources such as CPU, memory, I/O, and the number of processes. This prevents any single user from monopolizing server resources and causing performance degradation or downtime for other users on the same server. This LVE technology is a key differentiator, ensuring a more stable and reliable hosting environment. CloudLinux OS also incorporates features like CageFS, a virtualized file system that encapsulates each user, preventing them from seeing each other's sensitive information or accessing server configuration files. This significantly enhances security in a multi-tenant setup. Furthermore, HardenedPHP ensures the security of the host system by automatically patching older and unsupported PHP versions. The OS is compatible with major control panels like cPanel, Plesk, and DirectAdmin, facilitating easier adoption and management for hosting providers. Beyond the operating system, CloudLinux has expanded its product portfolio with solutions like Imunify360, a comprehensive security suite for Linux web servers, and KernelCare, which provides automated, rebootless kernel patching. The company also initiated AlmaLinux OS, a free, open-source, community-driven enterprise-grade Linux distribution intended as a CentOS alternative, and continues to sponsor the AlmaLinux OS Foundation.

Tech stack

Node.js

Python

CentOS

Linux

Red Hat Enterprise Linux

TypeScript

Kubernetes

Webpack

Slite

Ahrefs

Docker

Employee benefits

Competitive pay

CloudLinux offers competitive pay.

Paid vacation

Eligible staffers receive paid vacation.

Medical insurance

Eligible staffers receive medical insurance.

English sessions

CloudLinux offers English language sessions.
