Responsibilities
- Own ultra-low-latency EC2 fleets - Design cluster placement groups with ENA / SR-IOV networking.
- Kernel-level performance tuning - Apply CPU pinning, NUMA alignment, IRQ affinity, hugepages, and TCP/UDP sysctl tweaks to flatten tail latency.
- Immutable infrastructure & automated rollouts - Build Packer AMIs and Terraform Auto Scaling Groups; run GitLab/Jenkins pipelines with blue-green or canary deploys and sub-2-minute automatic rollbacks.
- High-throughput messaging & gateways - Operate Kafka clusters (partition/ISR tuning, rack awareness) and Nginx WebSocket edges serving 100k+ clients with single-digit-ms fan-out.
- Network integrity - Run packet-loss analysis and MTU/ECN/queue-depth tuning; enforce least-privilege security-group micro-segmentation.
- Observability & SLO stewardship - Instrument Prometheus/Grafana dashboards for order-ack latency, queue depth, reject rate; write Alertmanager rules driven by p95/p99 error-budget burn.
- Reliability testing & incident response - Schedule chaos/load drills; take part in 24×7 on-call, use perf/eBPF/FlameGraphs/tcpdump for µs-level RCA, and publish post-mortems with remediation actions.
- Capacity planning around macro events - Pre-warm spot pools and leverage Savings Plans to balance headroom and cost.
- Automation & tooling - Write Go/Python scripts for bootstrap, health probes, latency regression tests, and one-click remediation.
- Cross-team collaboration - Pair with Java/Rust engineers and quants to profile hot-path code and eliminate bottlenecks without causing trading downtime.
Requirements
- Linux low-latency tuning – CPU pinning, NUMA awareness, IRQ affinity, TCP/UDP stack tweaks, hugepages
- AWS operations at scale – EKS, EC2, VPC, NLB/ALB, Auto Scaling, multi-AZ fail-over, cost & quota management
- Infrastructure as Code / GitOps – Terraform (modular state)
- CI/CD pipelines – GitLab CI or Jenkins; blue-green / canary deploys, sub-2-minute rollbacks, latency smoke-test gates
- Observability – Prometheus + Grafana, Alertmanager, high-cardinality metrics, centralized log aggregation, eBPF tracing for µs-level hotspots
- High-throughput messaging – Kafka cluster operations (partition strategy, ISR tuning, < 3 ms end-to-end), Nginx WebSocket termination
- Trading-grade networking – ENA/SR-IOV, packet-loss analysis, security-group hardening
- Performance & reliability engineering – perf, FlameGraph, chaos/load testing, p95/p99 latency SLO ownership
- Automation & scripting – Python or Go for tooling, incident remediation, environment bootstrap
- Bonus – Rust/Go code familiarity, CNCF/AWS certifications, XDP/DPDK experience for kernel-bypass networking
Binance is committed to being an equal opportunity employer. We believe that having a diverse workforce is fundamental to our success.
By submitting a job application, you confirm that you have read and agree to our Candidate Privacy Notice.