Open to opportunities

Osborne Andrew

@osborneandrew

Senior DevOps & Platform Engineer architecting Kubernetes-native AI/ML infrastructure—cutting costs, accelerating releases, and ensuring high-SLA reliability.

United States

What I'm looking for

I’m looking to lead Kubernetes-native AI/ML platform engineering—owning reliability, security, and FinOps—while building developer self-service that speeds delivery with measurable cost and SLA improvements.

I’m a Senior DevOps and Platform Engineer focused on designing and operating enterprise AI/ML infrastructure across AWS, Azure, and GCP. At GitLab, I lead AI infrastructure for the GitLab Duo Agent Platform—architecting Kubernetes-native MLOps systems that support 100K+ daily operations with 99.999% SLA on mission-critical LLM workloads.

I turn infrastructure complexity into measurable outcomes: $450K+ annual cloud savings, 40% faster release cycles, and a 3× reduction in P1 incidents year-over-year. I’ve built and scaled RAG and inference platforms (vLLM + Triton, Istio routing, KEDA autoscaling). I’ve also implemented zero-trust security with Vault, OPA/Gatekeeper, and OIDC workload identity federation to reach SOC 2 Type II compliance, and automated multi-region disaster recovery to maintain a sub-15-minute RTO.

Experience

Work history, roles, and key accomplishments

Current

Senior DevOps / Platform Engineer

GitLab

Mar 2020 - Present (6 years 4 months)

Architected Kubernetes-native MLOps for GitLab Duo across AWS EKS and Azure AKS, automating model training and inference orchestration and reducing provisioning from days to under 10 minutes while supporting 29% YoY growth without SLA regression. Delivered multi-cloud LLM inference with vLLM/Triton and zero-trust security (Vault, OPA/Gatekeeper, OIDC) plus multi-region disaster recovery, achieving

Kubernetes, Terraform, vLLM, Triton, Istio, KEDA, Backstage, RAG (Pinecone, Weaviate)

DevOps Engineer

HashiCorp

Jul 2017 - Feb 2020 (2 years 7 months)

Designed 40+ production-grade Terraform modules for AWS/Azure/GCP adopted by 100+ enterprise clients, cutting environment onboarding from 3 weeks to under 4 days. Built Vault dynamic secrets and optimized GitHub Actions + Terraform Cloud CI/CD with Sentinel policy-as-code, reducing deployment errors by 70% and cutting infrastructure change cycles from 3 hours to 28 minutes.

Terraform, GitHub Actions, Policy as Code, Infracost, CI/CD Pipelines, AWS STS, PKI

Cloud Engineer

RunPod

Oct 2016 - Mar 2017 (5 months)

Operated GPU infrastructure for A100/H100/H200 workloads, using Terraform and Ansible to reduce pod spin-up times by 50%+ for enterprise AI teams. Designed serverless LLM inference with KEDA autoscaling and NVMe model weight caching to achieve sub-3-second cold starts, saving customers an average of $120K/month in wasted compute.

Terraform, Ansible, NVMe Caching, Spot Instance Optimization, NCCL RDMA Tuning, DCGM GPU Exporter