Essential Functions:
- Partner with software developers, platform engineers, and IT staff to improve system design, operability, deployment safety, and production support readiness.
- Define and maintain operational standards, runbooks, support procedures, escalation paths, and service-level objectives.
- Evaluate system architecture and changes to ensure they balance functional requirements, service quality, reliability, security, and compliance needs.
- Drive continuous improvement in platform stability, maintenance, and availability.
- Provide advanced technical support and troubleshooting for complex platform and service issues affecting internal users and stakeholders.
Experience and Skills Required:
- 8+ years of experience in Site Reliability Engineering, DevOps, Platform Engineering, Systems Engineering, or related infrastructure roles supporting production services.
- Strong experience with Linux systems administration and troubleshooting in enterprise environments.
- Strong experience operating and maintaining on-prem Kubernetes platforms and all related components including CRI, CNI, and CSI plugins.
- Experience deploying and maintaining applications on Kubernetes using Helm, Kustomize, and similar tooling.
- Experience supporting DevOps tooling such as GitLab, Artifactory, Jira, and Confluence.
- Experience with GitOps tools such as FluxCD or ArgoCD.
- Proficiency scripting with at least one of Python, Go, or Bash.
- Strong experience designing, maintaining, and maturing observability tooling including monitoring, dashboards, logging and tracing, and supporting SLOs.
- Strong understanding of reliability engineering concepts:
  - Service health indicators
  - High availability design, failure reduction, and testing
  - Operational readiness practices, including developing documentation, runbooks, and architectural descriptions
  - Incident response, root cause analysis, and remediation/recovery
- Ability to obtain a security clearance, which requires U.S. citizenship.
Preferred:
- Experience with multiple Linux distributions including Ubuntu.
- Experience with at least one of the following: Tanzu Kubernetes, Nutanix Kubernetes Platform, Canonical Kubernetes.
- Experience with cloud platforms such as AWS and Azure.
- Experience with infrastructure automation and configuration management.
- Experience managing AI tooling on Kubernetes, including MCP Servers, LLM platforms (vLLM, Ollama), and Kubeflow.
- Experience with security and compliance considerations in regulated environments.
- DoD experience.
- Active or inactive Secret Security Clearance.
Education:
- Bachelor’s degree in Computer Science, Software Engineering, or another IT-related field, or equivalent experience.
REMOTE WORK NOTICE: This position may be performed fully remote, hybrid, or onsite at an ARA office. Preference will be given to candidates who can work onsite in the Albuquerque area.
