Terry Fu
@terryfu
Senior platform and infrastructure engineer building reliable multi-cloud distributed systems and developer platforms to boost deployment speed and efficiency.
What I'm looking for
I specialize in large-scale distributed systems, cloud infrastructure, and internal developer platforms, with a consistent focus on reliability, deployment velocity, and infrastructure cost efficiency for data-intensive workloads. At Databricks, I built a multi-cloud workspace provisioning platform that cut setup time from ~3 hours to ~25 minutes and enabled standardized deployments across 15+ regions.
Beyond provisioning, I engineered Kubernetes-based control plane services for Spark cluster lifecycle orchestration, improving startup success from 96% to 99.5%. I also delivered multi-region high availability that cut incident recovery from ~45 minutes to ~8 minutes, and drove a ~22% reduction in compute cost while maintaining performance SLAs. In addition, I built observability for provisioning and platform services with Prometheus, Grafana, OpenTelemetry, and centralized logging, improving incident detection by ~50% and reducing MTTR from ~40 minutes to ~18 minutes.
Experience
Work history, roles, and key accomplishments
Built a multi-cloud Databricks workspace provisioning platform with Terraform, Python, AWS/Azure APIs, and CI/CD, cutting environment setup from ~3 hours to ~25 minutes across 15+ regions. Engineered Kubernetes-based Spark cluster lifecycle orchestration and multi-region HA, improving startup success from 96% to 99.5% and reducing incident recovery time from ~45 minutes to ~8 minutes.
Built and operated a multi-cluster Kubernetes platform on AWS EKS supporting 70+ clusters across 3 regions, enabling ~4 production releases per week per service. Designed a multi-cluster reconciliation control plane and led a migration from EC2 to containerized workloads on EKS, reducing deployment lead time from ~70 minutes to <10 minutes.
Built and evolved an internal cloud control plane (Nuage) for self-service provisioning and lifecycle management of LinkedIn Data Infrastructure resources. Delivered platform primitives and governance (quotas, approvals, ownership metadata) while improving observability and incident-response workflows for a distributed control plane.
Built control plane APIs for Oracle Cloud Infrastructure Compute Classic using Java and REST services to automate compute provisioning and lifecycle management. Implemented golden-image replication and recovery with Solaris Unified Archive and ZFS, and improved high-availability orchestration with Oracle Solaris Cluster technologies.
Education
Degrees, certifications, and relevant coursework
University of California, Berkeley
EECS
2010 - 2012
Earned a Master's degree in EECS at the University of California, Berkeley from 2010 to 2012.
Tech stack
Software and tools used professionally
Splunk
AWS IAM
Amazon CloudWatch
AWS Step Functions
GitHub
GitLab
Kubernetes
Amazon EKS
AWS CodePipeline
Jenkins
CircleCI
GitHub Actions
GitLab CI
MySQL
PostgreSQL
MongoDB
Cassandra
Amazon Route 53
Gmail
Databricks
Terraform
Azure DevOps
Java
Fluentd
Kafka
Grafana
Prometheus
OpenTelemetry
Oracle Solaris
Datadog
Falco
Ansible
AWS Lambda
Amazon Aurora
Airtable
Amazon EventBridge
Plane
Kyverno
ArgoCD
Bash
