Justin Yang
@justinyang3
Senior AI infrastructure engineer building reliable MLOps platforms for large-scale GPU and distributed systems.
What I'm looking for
I’m a senior infrastructure engineer with 10+ years of experience building reliable platforms for large-scale production systems. My work spans container observability and tracing, ML platform development, and GPU cluster infrastructure, always grounded in scalable design and hands-on technical ownership.
At NVIDIA, I built core AI infrastructure across DGX SuperPOD and Base Command Manager, focused on provisioning, workload orchestration, telemetry, reliability, and multitenant GPU platform operations. I automated node bring-up and service registration, translated cluster state into scheduler-ready resources for Kubernetes and Slurm, and implemented health monitoring around GPU, node, and fabric signals so operators could respond faster.
Before that, at Databricks, I developed ML platform capabilities across Feature Store and MLflow Model Registry. I built feature-table and training-set workflows, implemented model packaging with feature lookup metadata, and connected Delta/Spark pipelines to low-latency serving stores for offline-to-online publishing.
Earlier at Datadog, I started in reliability engineering and supported container monitoring and early APM, with ownership across agent operations, ingestion reliability, deployment automation, and production response. I still bring that operational mindset—designing telemetry, automation, and runbooks so distributed systems stay dependable under real-world load.
Experience
Work history, roles, and key accomplishments
Built core AI infrastructure across DGX SuperPOD and Base Command Manager, owning GPU provisioning, workload orchestration, telemetry/health monitoring, and multitenant platform reliability. Designed cluster bring-up and scheduler-facing resource translation for Kubernetes- and Slurm-managed AI jobs, and improved hardware-aware execution across NVLink/NVSwitch and InfiniBand.
Developed ML platform capabilities across Feature Store and MLflow Model Registry, focusing on feature computation, model packaging, metadata, governance, and production-ready workflows. Implemented offline-to-online feature publishing and lineage/metadata capture to connect training artifacts to serving paths without custom retrieval logic.
Owned reliability work for Datadog’s container monitoring and early APM platform, including agent operations, ingestion reliability, deployment automation, and production incident response. Improved container monitoring automation and metadata-driven discovery, and hardened Kubernetes monitoring paths to keep telemetry and traces flowing reliably.
Education
Degrees, certifications, and relevant coursework
University of Texas at Austin
Bachelor of Science, Computer Science
B.S. in Computer Science from The University of Texas at Austin, completed in May 2015.
