NVIDIA is hiring Site Reliability Engineers who want to work on the systems that power everything from large-scale data pipelines to model training clusters to real-time decision-making. You'll help design and run NVIDIA's global telemetry backbone and shape how our AI and data systems are built.
Requirements
- Bachelor's degree in Computer Science, Engineering, or a related field
- 10+ years operating large-scale production systems in roles such as SRE, Production Engineer, or Platform Engineer
- 5+ years designing, building, and running observability platforms at scale
- Deep hands-on experience with open-source observability stacks
- Strong programming ability in Python and Go
- Solid grounding in Linux internals, networking, storage systems, distributed systems, concurrency, and performance engineering
- Experience architecting multi-region, multi-tenant telemetry pipelines with high availability and strong durability guarantees
- Proven skill in optimizing PromQL, LogQL, trace queries, ingestion paths, indexing strategies, and retention policies
- Strong understanding of SLOs, SLIs, error budgets, incident response, and the operational processes that support reliable systems
Benefits
- Base salary between 184,000 USD and 287,500 USD
- Equity
- Benefits
