Open to opportunities

Justin Yang

@justinyang3

Senior AI infrastructure engineer building reliable MLOps platforms for large-scale GPU and distributed systems.

United States

What I'm looking for

I’m looking for a role where I can own scalable AI infrastructure—GPU provisioning, orchestration, telemetry, and multitenant reliability—paired with strong operational standards and teams that value automation, clarity, and measurable production outcomes.

I’m a senior infrastructure engineer with 10+ years of experience building reliable platforms for large-scale production systems. My work spans container observability and tracing, ML platform development, and GPU cluster infrastructure, always grounded in scalable design and hands-on technical ownership.

At NVIDIA, I built core AI infrastructure across DGX SuperPOD and Base Command Manager, focused on provisioning, workload orchestration, telemetry, reliability, and multitenant GPU platform operations. I automated node bring-up and service registration, translated cluster state into scheduler-ready resources for Kubernetes and Slurm, and implemented health monitoring around GPU, node, and fabric signals so operators could respond faster.

Before that, at Databricks, I developed ML platform capabilities across Feature Store and MLflow Model Registry. I built feature-table and training-set workflows, implemented model packaging with feature lookup metadata, and connected Delta/Spark pipelines to low-latency serving stores for offline-to-online publishing.

Earlier at Datadog, I started in reliability engineering and supported container monitoring and early APM, with ownership across agent operations, ingestion reliability, deployment automation, and production response. I still bring that operational mindset—designing telemetry, automation, and runbooks so distributed systems stay dependable under real-world load.

Experience

Work history, roles, and key accomplishments

NVIDIA
Current

Senior Software Engineer

Jan 2022 - Present (4 years 4 months)

Built core AI infrastructure across DGX SuperPOD and Base Command Manager, owning GPU provisioning, workload orchestration, telemetry/health monitoring, and multitenant platform reliability. Designed cluster bring-up and scheduler-facing resource translation for Kubernetes- and Slurm-managed AI jobs, and improved hardware-aware execution across NVLink/NVSwitch and InfiniBand.

Databricks

Software Engineer - ML Platform

Jul 2018 - Dec 2021 (3 years 5 months)

Developed ML platform capabilities across Feature Store and MLflow Model Registry, focusing on feature computation, model packaging, metadata, governance, and production-ready workflows. Implemented offline-to-online feature publishing and lineage/metadata capture to connect training artifacts to serving paths without custom retrieval logic.

Datadog

Site Reliability Engineer

Jun 2015 - Jun 2018 (3 years)

Owned reliability work for Datadog’s container monitoring and early APM platform, including agent operations, ingestion reliability, deployment automation, and production incident response. Improved container monitoring automation and metadata-driven discovery, and hardened Kubernetes monitoring paths to keep telemetry and traces flowing reliably.

Education

Degrees, certifications, and relevant coursework

University of Texas at Austin

Bachelor of Science, Computer Science

B.S. in Computer Science from The University of Texas at Austin, completed in May 2015.
