Open to opportunities

Brian Lindsay

@brianlindsay

I’m a Staff SRE focused on SLO-driven reliability for distributed, Kafka-based systems.

United States

What I'm looking for

I’m looking for a team where I can own SLO-driven reliability for distributed systems (especially Kafka and Kubernetes), improve observability to cut MTTR and MTTD, and collaborate across engineering to ship safer, lower-latency production changes.

I’m a Staff Site Reliability Engineer with 8+ years designing and operating distributed systems for fintech, SaaS, and enterprise environments. I bring end-to-end ownership of production reliability, from architecture and capacity modeling to incident lifecycle management.

Over the past several years, I’ve built SLO/SLI and error-budget frameworks that were adopted across 30+ services via phased rollout, improving reliability to 99.9–99.95%. I also re-architected Kafka streaming pipelines to eliminate consumer lag and retry storms by adding partition-level isolation, retry backoff, and backpressure controls, supporting throughput of 100K+ events/min.

I improve operational outcomes by consolidating metrics, logs, and traces into unified alerting pipelines, reducing MTTD by 40% through signal-based design and noise reduction. I lead with practical reliability governance (structured postmortem templates with severity classification and action tracking) so teams can reduce repeat Sev-1 incidents while continuing to evolve their distributed systems.

Experience

Work history, roles, and key accomplishments

Staff Site Reliability Engineer

Certificial

Jan 2026 - May 2026 (4 months)

Owned end-to-end reliability for the ingestion and streaming layers of a high-throughput platform (500K–1M+ transactions/day), including an org-wide SLO/SLI and error-budget framework rolled out across services. Re-architected Kafka streaming with failure isolation, retries, and backpressure, reducing p95 latency by ~35% and improving incident workflow efficiency (MTTD down ~40%, MTTR down ~30%).

Senior Site Reliability Engineer

Unqork

Jan 2023 - Jan 2026 (3 years)

Owned reliability and infrastructure strategy for a distributed SaaS platform (~99.9% uptime) handling millions of requests/day. Designed Kafka-based async processing (~100K msgs/min) and rebuilt observability around signal-driven SLO alerting, reducing alert fatigue and cutting MTTD by ~40%. Led the incident lifecycle and lowered Sev-1 recurrence.

Principal Site Reliability Engineer

DAI

Feb 2021 - Jan 2023 (1 year 11 months)

Led reliability for latency-sensitive financial systems on GCP, improving system resilience with fault-tolerant distributed design patterns and re-architecting a performance layer to reduce database query latency by ~40%. Built observability tooling and coordinated cross-team incident response across distributed ownership boundaries.

Part-time SRE & Software Engineer

The American Board of Radiology

Aug 2014 - Dec 2021 (7 years 4 months)

Progressed from backend engineering into SRE to improve production reliability and system performance, removing bottlenecks in MERN APIs through indexing and query tuning. Later shifted focus to monitoring and incident response, improving production visibility and reducing time-to-resolution.

Education

Degrees, certifications, and relevant coursework

University of Arizona

Bachelor of Science, Computer Science

2009 - 2011

Earned a Bachelor of Science in Computer Science at the University of Arizona from 2009 to 2011.
