Senior IaaS / Kubernetes Platform Engineer (worldwide remote, work anywhere)

CloudLinux provides a commercially supported operating system optimized for shared hosting providers and data centers, enhancing server stability, security, and resource management.

Employee count: 201-500

Poland only

CloudLinux is a global remote-first company. We are driven by our principles: do the right thing, employees first, and remote first. We deliver high-volume, low-cost Linux infrastructure and security products that help companies increase the efficiency of their operations. Everyone on our team supports one another and does what they can to ensure we are all successful.

For more information, visit our website: https://cloudlinux.com/

We are looking for a Senior IaaS / Kubernetes Platform Engineer to join our Infrastructure Department and become a key contributor to the design, implementation, and operation of our private cloud and multi-tenant Kubernetes platform.

Our infrastructure powers 500+ VMs across multiple datacenters, serving 20+ engineering teams. We are in the process of evolving from an OpenNebula-based virtualization platform toward a Kubernetes-native multi-tenant cloud with KubeVirt for VM orchestration — while maintaining reliability and operational excellence throughout the transition.

You will work alongside the existing IaaS Tech Lead and Network Engineer, and must be capable of independently owning and operating the full IaaS stack (compute, storage, networking, bare metal) if needed. This is not a "Kubernetes-only" role — it requires deep infrastructure generalist skills combined with Kubernetes platform expertise.

What You Will Do

Kubernetes Platform Engineering (Primary Focus — 40%)

  • Design, build, and operate a multi-tenant Kubernetes platform using Cluster API (CAPI) with bare-metal providers (Metal3/Sidero).
  • Implement hard multi-tenancy using vCluster (Loft Labs) or similar technology, providing isolated Kubernetes API servers per tenant.
  • Deploy and manage KubeVirt for VM orchestration within Kubernetes, including CPU pinning, NUMA awareness, and HugePages configuration.
  • Implement GitOps-driven infrastructure using ArgoCD or Flux as the single source of truth for all cluster configurations.
  • Deploy and manage Policy-as-Code using Kyverno or OPA Gatekeeper for admission control, resource quotas, and security policies.
  • Build self-service capabilities using Crossplane or similar Kubernetes-native infrastructure provisioning tools.

Storage Engineering (20%)

  • Operate and optimize Ceph distributed storage clusters (currently 1 PiB raw, 149 OSDs, Quincy 17.2.5).
  • Manage Rook-Ceph operator deployments at scale on modern Kubernetes (v1.28+).
  • Implement storage tiering: Ceph for bulk storage, local NVMe for high-IOPS workloads, LINSTOR/DRBD or TopoLVM for ultra-fast replicated storage.
  • Design and implement per-VM / per-tenant I/O isolation on shared Ceph clusters.
  • Manage CDI (Containerized Data Importer) for VM image lifecycle in KubeVirt environments.

Networking (15%)

  • Deploy and manage overlay networks for pod networking, micro-segmentation, and WireGuard/IPsec encryption.
  • Implement Cluster Mesh for multi-datacenter pod-to-pod connectivity.
  • Configure Multus CNI and SR-IOV for multi-NIC VM support in KubeVirt.
  • Work with physical network infrastructure: Juniper switches (JunOS), BGP (eBGP/iBGP), EVPN/VXLAN, VLANs.
  • Maintain IPsec site-to-site connectivity between datacenters.

Reliability and Operations (15%)

  • Practice SRE discipline: define and maintain SLOs with error budgets, implement proactive capacity management with 6-12 month forecasting.
  • Design and execute chaos engineering experiments to validate system resilience.
  • Participate in on-call rotation for IaaS infrastructure (OpenNebula, Ceph, networking).
  • Write and maintain runbooks, DRP documentation, and postmortem analyses.
  • Drive proactive improvement: identify reliability risks, performance bottlenecks, and toil — then propose and implement solutions without waiting for incidents.

Infrastructure as Code and Automation (10%)

  • Develop and maintain Terraform/OpenTofu modules for multi-cloud infrastructure provisioning.
  • Write Ansible playbooks for bare-metal server configuration and fleet management.
  • Automate infrastructure lifecycle: PXE boot images, hardware provisioning (Foreman), IPMI management.
  • Implement FinOps practices: cost attribution, resource utilization analysis, right-sizing recommendations using OpenCost/Kubecost.

Requirements

Must have

  • 5+ years in infrastructure/platform engineering roles, with at least 3 years operating production Kubernetes clusters (not just deploying apps on K8s, but building and managing the platform itself).
  • Production experience with at least 3 of the following:
    • KubeVirt or similar VM-on-K8s technology
    • Cluster API (CAPI) for declarative cluster lifecycle management
    • Cilium or Calico (advanced CNI with eBPF or BGP integration)
    • Rook-Ceph or other Kubernetes storage operators at scale (100+ OSDs)
    • ArgoCD or Flux for GitOps-driven infrastructure management
  • Deep Linux systems knowledge: kernel tuning, networking stack (iptables/nftables, routing, bonding, VLAN), filesystem operations, performance troubleshooting.
  • Ceph distributed storage experience: cluster operations, OSD lifecycle, pool management, performance tuning, troubleshooting degraded states.
  • Infrastructure as Code: Terraform/OpenTofu + Ansible at production scale.
  • Bare-metal infrastructure experience: IPMI/iDRAC, PXE boot, RAID configuration, hardware diagnostics, datacenter operations.
  • Networking fundamentals: BGP, VLAN, IPsec/WireGuard, DNS, load balancing.
  • Strong written and verbal English (B2+ minimum) — documentation, postmortems, and cross-team communication are in English.
  • Proactive mindset: demonstrated history of identifying problems before they become incidents and driving improvements without being asked.

Nice to have

  • Experience building multi-tenant Kubernetes platforms (vCluster, Capsule, or custom namespace isolation).
  • Crossplane or similar Kubernetes-native infrastructure abstraction.
  • Policy-as-Code: Kyverno, OPA Gatekeeper, or Kubewarden.
  • Container security: image signing (Sigstore/cosign), runtime security (Falco), sandboxed execution (Kata Containers, gVisor).
  • SRE practices: SLO/SLI design, error budget policies, chaos engineering (LitmusChaos, Chaos Mesh), incident management frameworks.
  • FinOps: OpenCost, Kubecost, cloud cost optimization.
  • Immutable OS experience: Talos Linux, Flatcar Container Linux, or similar.
  • OpenNebula experience (we are migrating FROM it, so understanding it accelerates the transition).
  • Experience with LINSTOR/DRBD or TopoLVM for local high-performance storage.
  • SR-IOV and DPDK experience for hardware-accelerated networking.
  • Experience migrating from traditional virtualization (VMware, OpenNebula, Proxmox) to Kubernetes/KubeVirt.
  • Grafana LGTM stack (Mimir, Loki, Tempo) for observability.
  • Compliance environment experience (SOC2, ISO 27001, NIS2).
  • Go or Python programming for infrastructure tooling.
  • Experience with Juniper JunOS switch configuration.

What we’re looking for

  • Proactive mindset. Our current IaaS workload is still around 50% unplanned work, including incidents and ad hoc support requests. We’re looking for someone who can reduce that through better automation, preventive controls, and more resilient systems.
  • Platform-minded. You look for ways to replace repetitive support work with scalable solutions, for example, building self-service workflows instead of provisioning VMs manually, or introducing automated QoS policies instead of handling limits case by case.
  • Able to work across the current and future stack. We operate OpenNebula and Ceph today while moving toward a Kubernetes-native platform. This role requires someone who can keep the current environment reliable while helping build the next stage in a practical way.
  • Transparent in communication. We value technical discussions, architectural decisions, and incident reviews happening in shared channels and documented formats. That includes ADRs, postmortems, and clear written updates.
  • Focused on knowledge sharing. You document your work, write runbooks as you go, and help make the platform easier for others to operate and support.
  • Strong English communication. Documentation, postmortems, Jira updates, Slack discussions, and cross-team collaboration are conducted in English.

Benefits

What's in it for you?

  • A focus on professional development.
  • Interesting and challenging projects.
  • Fully remote work with flexible working hours, which lets you schedule your own day and work from any location worldwide.
  • 24 days of paid vacation per year, 10 days of national holidays, and unlimited sick leave.
  • Compensation for private medical insurance.
  • Co-working and gym/sports reimbursement.
  • Budget for education.
  • The opportunity to receive a reward for the most innovative idea that the company can patent.

By applying for this position, you consent to the processing of your personal data as described in our Privacy Policy (https://cloudlinux.com/candidate-privacy-notice), which provides detailed information on how we maintain and handle your data.

About the job

Job type: Other

Hiring timezones: Poland +/- 0 hours

About CloudLinux

CloudLinux is dedicated to enhancing the security, stability, and profitability of Linux for hosting providers and data centers. With a collective experience of over 500 years in Linux, the company is transforming how these entities utilize the technology, extending its benefits to millions of their customers. CloudLinux boasts over 500,000 product installations and serves more than 4,000 customers, including prominent names like Liquid Web, 1&1, and Dell. The company merges profound technical expertise in hosting, kernel development, and open source with exceptional client care. Cloud Linux, Inc. was consolidated into Cloud Linux Software, Inc., which now operates under the TUXCARE trade name (DBA).

The core offering, CloudLinux OS, is specifically engineered for shared hosting environments. It isolates each tenant into a Lightweight Virtualized Environment (LVE), which partitions, allocates, and limits server resources such as CPU, memory, I/O, and the number of processes. This prevents any single user from monopolizing server resources and causing performance degradation or downtime for other users on the same server. This LVE technology is a key differentiator, ensuring a more stable and reliable hosting environment. CloudLinux OS also incorporates features like CageFS, a virtualized file system that encapsulates each user, preventing them from seeing each other's sensitive information or accessing server configuration files. This significantly enhances security in a multi-tenant setup. Furthermore, HardenedPHP ensures the security of the host system by automatically patching older and unsupported PHP versions. The OS is compatible with major control panels like cPanel, Plesk, and DirectAdmin, facilitating easier adoption and management for hosting providers. Beyond the operating system, CloudLinux has expanded its product portfolio with solutions like Imunify360, a comprehensive security suite for Linux web servers, and KernelCare, which provides automated, rebootless kernel patching. The company also initiated AlmaLinux OS, a free, open-source, community-driven enterprise-grade Linux distribution intended as a CentOS alternative, and continues to sponsor the AlmaLinux OS Foundation.

Employee benefits

Learn about the employee benefits and perks provided at CloudLinux.

Competitive pay

CloudLinux offers competitive pay.

Paid vacation

Eligible staffers receive paid vacation.

Medical insurance

Eligible staffers receive medical insurance.

English sessions

CloudLinux offers English language sessions.
