The Cloud Engineer - Senior (Observability) supports the SEC ISS contract by engineering, operating, and continuously improving the enterprise observability platform across hybrid cloud and containerized environments. This role is hands-on: the engineer instruments services with distributed tracing, code-level profiling, and custom metrics; builds and tunes Datadog (or comparable) dashboards, alerts, APM, log pipelines, RUM, and synthetic monitors; and then uses that telemetry to solve production performance, reliability, and capacity problems. The engineer partners with cloud, platform, and application teams to embed observability into Azure, AWS, and container platforms (OpenShift/Kubernetes), and drives reduction of alert noise, mean time to detect (MTTD), and mean time to resolve (MTTR). This position provides senior technical leadership for APM/distributed tracing strategy, SLO/SLI engineering, and data-driven operational decision-making in a 24x7x365 operating environment.

PRIMARY RESPONSIBILITIES

Observability Platform Engineering

Engineer and operate the enterprise observability stack (Datadog or comparable), including metrics, logs, traces, APM, RUM, synthetic monitoring, and network performance monitoring.

Build, tune, and maintain dashboards, monitors, SLOs/SLIs, and alerting policies that produce actionable signal and minimize noise.

Instrument services, infrastructure, and containerized workloads using agents, OpenTelemetry, and language-specific APM tracers (Java, .NET, Python, Node.js, Go) with consistent span tagging, W3C Trace Context propagation, and unified service tagging across the estate.

Develop and maintain integrations between observability platforms, ITSM (ServiceNow), CI/CD pipelines, and on-call/paging workflows.

Define and enforce a unified tagging standard (environment, service, version, team/ownership, data classification, cost center) across metrics, logs, and traces; manage tag cardinality, governance, and custom business tags to keep telemetry queryable, attributable, and cost-controlled.

Cloud and Container Monitoring Engineering

Design and deliver monitoring coverage for Microsoft Azure and AWS workloads, including PaaS services, serverless, networking, identity, managed databases, and cloud-native data services.

Engineer managed database observability across AWS RDS/Aurora (MySQL, PostgreSQL, SQL Server, Oracle), Azure SQL/PostgreSQL/MySQL, and NoSQL/cache services (DynamoDB, Cosmos DB, ElastiCache/Redis), including query-level performance analytics, slow-query and execution-plan capture, lock/deadlock/wait analysis, connection pool and session monitoring, replication lag, storage/IOPS saturation, and backup/HA health -- correlating database spans with upstream APM traces.

Engineer container-platform observability for OpenShift/Kubernetes, covering cluster health, control plane, nodes, pods, namespaces, ingress, service mesh, and workload APM.

Build standardized, reusable monitoring modules deployable via infrastructure-as-code (Terraform, Bicep, ARM) and CI/CD.

Support hybrid visibility across on-premises, cloud, and containerized workloads with correlated telemetry.

Performance Engineering and Problem Solving

Lead data-driven investigation and resolution of complex performance, latency, saturation, and reliability issues across the estate.

Use APM distributed traces, service/dependency maps, continuous code profiling (CPU, memory, lock contention), database query analytics, exception/error tracking, and RUM-to-backend trace correlation to isolate bottlenecks in applications, platforms, middleware, and downstream dependencies.

Partner with engineering teams to define and implement remediation, tuning, and architectural improvements based on telemetry evidence.

Define and implement trace-based SLOs, deployment tracking, and change-correlation workflows so performance regressions are detected and attributed to specific releases, versions, or configuration changes.

Provide senior technical leadership during major incidents, delivering impact analysis, contributing to root-cause analysis, and owning post-incident observability gaps.

Capacity, Reliability, and Continuous Improvement

Analyze operational telemetry and trend data to identify capacity risks, recurring constraints, and opportunities for efficiency.

Build and maintain capacity and performance dashboards and reports that communicate posture, risk, and recommendations to technical and leadership stakeholders.

Define capacity thresholds, alert baselines, and trigger points for scaling, technology refresh, and resource reallocation.

Drive continuous improvement of observability coverage, alert quality, runbook linkage, and operational maturity aligned to SEC SLA/KPI expectations.

REQUIRED QUALIFICATIONS

Citizenship/Work Authorization: Must meet contract requirements.

Clearance: Ability to obtain and maintain SEC Public Trust (or higher if required).

EXPERIENCE

Minimum 8 years of experience in IT infrastructure or platform engineering roles, including 5+ years focused on observability, performance engineering, or site reliability engineering.

Demonstrated experience engineering and operating an enterprise observability platform (Datadog strongly preferred; equivalent experience with Dynatrace, New Relic, Splunk Observability, or Grafana/Prometheus stacks considered).

Proven experience building APM and distributed tracing coverage for production multi-tier applications -- including language-specific tracer deployment, custom instrumentation of business transactions, service/dependency mapping, continuous profiling, and RUM-to-backend trace correlation -- across cloud and containerized workloads.

Proven experience leading complex production performance and reliability problem-solving from telemetry to remediation.

Hands-on experience monitoring Kubernetes or OpenShift clusters and containerized workloads in production.

TECHNICAL SKILLS

Enterprise observability platforms (Datadog or comparable): metrics, logs, traces, APM, RUM, synthetic, NPM

Instrumentation with OpenTelemetry, Datadog agents/SDKs, and language-specific APM tracers (Java, .NET, Python, Node.js, Go) including custom spans, trace sampling strategies, W3C Trace Context propagation, and continuous profiling

Microsoft Azure and AWS monitoring services and integrations (Azure Monitor, Log Analytics, CloudWatch, AWS X-Ray)

Container and Kubernetes/OpenShift observability, including cluster, workload, and service mesh telemetry

Cloud database monitoring: AWS RDS/Aurora (including Performance Insights), Azure SQL/PostgreSQL/MySQL (Query Performance Insight), and NoSQL/cache (DynamoDB, Cosmos DB, ElastiCache/Redis); query-level performance tuning, execution-plan analysis, and Datadog DBM or equivalent deep database APM

Infrastructure-as-code for monitoring (Terraform, Bicep, ARM) and CI/CD-driven monitor/dashboard deployment

APM and distributed tracing: service/dependency maps, trace analytics, RUM-to-backend correlation, exception/error tracking, deployment tracking, and trace-based SLOs

Unified tagging strategy and cardinality governance across metrics/logs/traces (environment, service, version, ownership, data classification, cost center), including custom tag enrichment and tag-driven access/cost controls

Alert engineering, SLO/SLI design, error budget management, and alert-noise reduction

Performance engineering, capacity analysis, and telemetry-driven root-cause analysis

Integration of observability with ITSM (ServiceNow) and on-call/paging workflows

PREFERRED QUALIFICATIONS

Experience supporting federal agency IT environments under FISMA/FedRAMP/NIST-aligned security and compliance requirements.

Datadog certification (Fundamentals and/or Administrator) or comparable enterprise observability certification.

Hands-on experience with Red Hat OpenShift Virtualization (CNV/KubeVirt) or other KubeVirt-based container virtualization observability.

Experience with eBPF-based observability tooling and service mesh telemetry (Istio, Linkerd).

Experience implementing SLOs and error budgets at enterprise scale and integrating them into operational governance.

Experience with cost-aware observability practices, including telemetry volume optimization and retention tuning.

Experience integrating observability outputs with executive reporting, SLA/KPI dashboards, and capacity forecasting.

ITIL 4 Foundation

AWS Certified Solutions Architect - Associate (or higher)

Microsoft Certified: Azure Administrator Associate (or higher)

Red Hat Certified Specialist in OpenShift Administration (or equivalent)

HashiCorp Terraform Associate

WORK ENVIRONMENT / OTHER

Operational Support: Supports a 24x7x365 operating environment; participates in a defined on-call rotation and may require surge support based on operational needs.

Location: Telework

Travel: As required per contract direction.

EDUCATION & EXPERIENCE

BS and 4 – 8 years of prior relevant experience or Master's with 2 – 6 years of prior relevant experience. Preferred degree in a relevant field (e.g., Information Technology, Computer Science, Engineering).

If you're looking for comfort, keep scrolling. At Leidos, we outthink, outbuild, and outpace the status quo — because the mission demands it. We're not hiring followers. We're recruiting the ones who disrupt, provoke, and refuse to fail. Step 10 is ancient history. We're already at step 30 — and moving faster than anyone else dares.

Original Posting:

May 19, 2026

For U.S. Positions: While subject to change based on business needs, Leidos reasonably anticipates that this job requisition will remain open for at least 3 days with an anticipated close date of no earlier than 3 days after the original posting date as listed above.

Pay Range:

$87,100.00 - $157,450.00

The Leidos pay range for this job level is a general guideline only and not a guarantee of compensation or salary. Additional factors considered in extending an offer include (but are not limited to) responsibilities of the job, education, experience, knowledge, skills, and abilities, as well as internal equity, alignment with market data, applicable bargaining agreement (if any), or other law.

Leidos' story begins in 1969 when Dr. J. Robert Beyster, a visionary scientist, founded Science Applications Incorporated (SAI) in La Jolla, San Diego, California. With a modest investment and a powerful idea, Dr. Beyster embarked on a journey to apply scientific expertise to solve complex problems. The company's early days were marked by a focus on research and engineering, tackling challenges for various government and commercial clients. One of its initial significant projects involved studying radiation-based cancer therapy for the Los Alamos National Laboratory, which laid the groundwork for Leidos' future health business. SAI soon expanded its reach, opening an office in Albuquerque to support the Air Force Weapons Laboratory's work on electromagnetic phenomena, a precursor to the company's Physical Science Group.

Throughout the 1980s, the company, then known as Science Applications International Corporation (SAIC), strategically shifted its focus towards national security and defense, solidifying its position as a key government services provider. This era set the stage for substantial growth and diversification. The 1990s saw SAIC continue to expand its offerings and international presence, securing its first major global contract with the Kuwaiti Defense Forces. A pivotal moment arrived in 2013 when SAIC underwent a significant transformation, splitting into two independent, publicly traded companies: a new company retaining the SAIC name and the original company, which was rebranded as Leidos (a name derived from 'kaleidoscope'). Leidos, as the legal successor to the original SAIC, inherited its pre-2013 stock price and corporate filing history and established its new headquarters in Reston, Virginia. This strategic move allowed Leidos to sharpen its focus on national security, health, and engineering solutions. Another major milestone occurred in August 2016 when Leidos merged with Lockheed Martin's Information Systems & Global Solutions (IS&GS) business, a landmark transaction that created the defense industry's largest IT services provider and significantly expanded Leidos' capabilities and market share. Today, Leidos stands as a Fortune 500® global science and technology leader, employing approximately 47,000 people worldwide and generating billions in annual revenue, committed to making the world safer, healthier, and more efficient through innovation and technology.

Employee benefits

Paid sick days

Health Insurance

Dental Insurance

Vision Insurance
