Open to opportunities

Justin Yang

@justinyang3

Senior AI infrastructure engineer building reliable MLOps platforms for large-scale GPU and distributed systems.

United States

What I'm looking for

I’m looking for a role where I can own scalable AI infrastructure (GPU provisioning, orchestration, telemetry, and multitenant reliability) on a team with strong operational standards that values automation, clarity, and measurable production outcomes.

I’m a senior infrastructure engineer with 10+ years of experience building reliable platforms for large-scale production systems. My work spans container observability and tracing, ML platform development, and GPU cluster infrastructure, always grounded in scalable design and hands-on technical ownership.

At NVIDIA, I built core AI infrastructure across DGX SuperPOD and Base Command Manager, focused on provisioning, workload orchestration, telemetry, reliability, and multitenant GPU platform operations. I automated node bring-up and service registration, translated cluster state into scheduler-ready resources for Kubernetes and Slurm, and implemented health monitoring around GPU, node, and fabric signals so operators could respond faster.

Before that, at Databricks, I developed ML platform capabilities across Feature Store and MLflow Model Registry. I built feature-table and training-set workflows, implemented model packaging with feature lookup metadata, and connected Delta/Spark pipelines to low-latency serving stores for offline-to-online publishing.

Earlier at Datadog, I started in reliability engineering and supported container monitoring and early APM, with ownership across agent operations, ingestion reliability, deployment automation, and production response. I still bring that operational mindset—designing telemetry, automation, and runbooks so distributed systems stay dependable under real-world load.

Experience

Work history, roles, and key accomplishments

Current

Senior Software Engineer

NVIDIA

Jan 2022 - Present (4 years 6 months)

Built core AI infrastructure across DGX SuperPOD and Base Command Manager, owning GPU provisioning, workload orchestration, telemetry/health monitoring, and multitenant platform reliability. Designed cluster bring-up and scheduler-facing resource translation for Kubernetes- and Slurm-managed AI jobs, and improved hardware-aware execution across NVLink/NVSwitch and InfiniBand.

Kubernetes, Slurm, Telemetry, Multitenant GPU Platform, NVSwitch, Driver and Firmware Rollout Automation

Software Engineer - ML Platform

Databricks

Jul 2018 - Dec 2021 (3 years 5 months)

Developed ML platform capabilities across Feature Store and MLflow Model Registry, focusing on feature computation, model packaging, metadata, governance, and production-ready workflows. Implemented offline-to-online feature publishing and lineage/metadata capture to connect training artifacts to serving paths without custom retrieval logic.

Delta Lake, Apache Spark, Offline-to-Online Feature Serving, Feature Lineage and Metadata, CI/CD-Friendly Registry Workflows

Site Reliability Engineer

Datadog

Jun 2015 - Jun 2018 (3 years)

Owned reliability work for Datadog’s container monitoring and early APM platform, including agent operations, ingestion reliability, deployment automation, and production incident response. Improved container monitoring automation and metadata-driven discovery, and hardened Kubernetes monitoring paths to keep telemetry and traces flowing reliably.

Container Monitoring, Agent Rollout Automation, Kafka-Based Telemetry Ingestion, RocksDB-Backed Pipelines