Open to opportunities

Brian Lindsay

@brianlindsay

I’m a Staff SRE focused on SLO-driven reliability for distributed, Kafka-based systems.

United States

What I'm looking for

I’m looking for a team where I can own SLO-driven reliability for distributed systems (especially Kafka and Kubernetes), improve observability to cut MTTD and MTTR, and collaborate across engineering to ship safer, lower-latency production changes.

I’m a Staff Site Reliability Engineer with 8+ years designing and operating distributed systems for fintech, SaaS, and enterprise environments. I bring end-to-end ownership of production reliability, from architecture and capacity modeling to incident lifecycle management.

Over the past several years, I’ve built SLO/SLI and error-budget frameworks that were adopted across 30+ services via phased rollout, improving reliability to 99.9–99.95%. I also re-architected Kafka streaming pipelines to eliminate consumer lag and retry storms by adding partition-level isolation, retry backoff, and backpressure controls, supporting throughput of 100K+ events/min.

I improve operational outcomes by consolidating metrics, logs, and traces into unified alerting pipelines, reducing MTTD by 40% through signal-based alert design and noise reduction. I lead with practical reliability governance (structured postmortem templates with severity classification and action tracking) so teams can reduce repeat Sev-1 incidents while continuing to evolve their distributed systems.

Experience

Work history, roles, and key accomplishments

Staff Site Reliability Engineer

Certificial

Jan 2026 - May 2026 (4 months)

Owned end-to-end reliability for ingestion and streaming on a high-throughput platform (500K–1M+ transactions/day), including an org-wide SLO/SLI and error-budget framework rolled out across services. Re-architected Kafka streaming with failure isolation, retries, and backpressure, reducing p95 latency by ~35% and improving incident response (MTTD down ~40%, MTTR down ~30%).

Kafka, Kubernetes, Reliability Engineering, Backpressure and Retries

Senior Site Reliability Engineer

Unqork

Jan 2023 - Jan 2026 (3 years)

Owned reliability and infrastructure strategy for a distributed SaaS platform (~99.9% uptime) handling millions of requests/day. Designed Kafka-based async processing (~100K msgs/min) and rebuilt observability around signal-driven SLO alerting, which reduced alert fatigue and cut MTTD by ~40%. Led the incident lifecycle and lowered Sev-1 recurrence.

Performance Optimization, SLO/SLI Alerting, Observability, Incident Management, Reliability Engineering

Principal Site Reliability Engineer

DAI

Feb 2021 - Jan 2023 (1 year 11 months)

Led reliability for latency-sensitive financial systems on GCP, improving system resilience with fault-tolerant distributed design patterns and re-architecting a performance layer to reduce database query latency by ~40%. Built observability tooling and coordinated cross-team incident response across distributed ownership boundaries.

Distributed Systems, Fault Tolerance, Performance Engineering, Observability, Incident Response

Part-time SRE & Software Engineer

The American Board of Radiology

Aug 2014 - Dec 2021 (7 years 4 months)

Progressed from backend engineering into SRE to improve production reliability and system performance, removing bottlenecks in MERN APIs through indexing and query tuning. Later shifted focus to monitoring and incident response, improving production visibility and reducing time-to-resolution.

Incident Response, Monitoring, API Performance, Reliability Engineering

Software Engineer

IBM

Jan 2012 - Aug 2014 (2 years 7 months)

Built backend services and REST APIs for an enterprise platform and supported early-stage scalability and production readiness. Improved performance through service-layer and database optimization.

Performance Optimization, REST APIs, Backend Engineering, Database Optimization, Scalability