NVIDIA is hiring Site Reliability Engineers who want to work on the systems that power everything from large-scale data pipelines to model training clusters to real-time decision-making. You'll help design and run NVIDIA's global telemetry backbone and shape how our AI and data systems are built.

Requirements

Bachelor's degree in Computer Science, Engineering, or a related field
10+ years operating large-scale production systems in roles such as SRE, Production Engineer, or Platform Engineer
5+ years designing, building, and running observability platforms at scale
Deep hands-on experience with open-source observability stacks
Strong programming ability in Python and Go
Solid grounding in Linux internals, networking, storage systems, distributed systems, concurrency, and performance engineering
Experience architecting multi-region, multi-tenant telemetry pipelines with high availability and strong durability guarantees
Proven skill in optimizing PromQL, LogQL, trace queries, ingestion paths, indexing strategies, and retention policies
Strong understanding of SLOs, SLIs, error budgets, incident response, and the operational processes that support reliable systems

Benefits

Base salary between 184,000 USD and 287,500 USD
Equity
Benefits

Senior Site Reliability Engineer, Observability

