Open to opportunities

Brian Lindsay

@brianlindsay

I’m a Staff SRE focused on SLO-driven reliability for distributed, Kafka-based systems.

United States

What I'm looking for

I’m looking for a team where I can own SLO-driven reliability for distributed systems (especially Kafka and Kubernetes), improve observability to cut MTTR and MTTD, and collaborate across engineering to ship safer, lower-latency production changes.

I’m a Staff Site Reliability Engineer with 8+ years designing and operating distributed systems for fintech, SaaS, and enterprise environments. I bring end-to-end ownership of production reliability, from architecture and capacity modeling to incident lifecycle management.

Over the past several years, I’ve built SLO/SLI and error-budget frameworks that were adopted across 30+ services via phased rollout, improving reliability to 99.9–99.95%. I also re-architected Kafka streaming pipelines to eliminate consumer lag and retry storms by adding partition-level isolation, retry backoff, and backpressure controls, supporting throughput of 100K+ events/min.

I improve operational outcomes by consolidating metrics, logs, and traces into unified alerting pipelines, reducing MTTD by 40% through signal-based design and noise reduction. I lead with practical reliability governance (structured postmortem templates with severity classification and action tracking) so teams can reduce repeat Sev-1 incidents while continuously evolving distributed systems.

Experience

Work history, roles, and key accomplishments

Staff Site Reliability Engineer

Certificial

Jan 2026 - May 2026 (4 months)

Owned end-to-end reliability for the ingestion and streaming layers of a high-throughput platform (500K–1M+ transactions/day), including an org-wide SLO/SLI and error-budget framework rolled out across services. Re-architected Kafka streaming with failure isolation, retries, and backpressure, reducing p95 latency by ~35% and improving incident response (MTTD down ~40%, MTTR down ~30%).

Senior Site Reliability Engineer

Unqork

Jan 2023 - Jan 2026 (3 years)

Owned reliability and infrastructure strategy for a distributed SaaS platform (~99.9% uptime) handling millions of requests/day. Designed Kafka-based async processing (~100K msgs/min) and rebuilt observability with signal-driven SLO alerting, reducing alert fatigue and cutting MTTD by ~40%. Also led incident lifecycle management and lowered Sev-1 recurrence.

Principal Site Reliability Engineer

DAI

Feb 2021 - Jan 2023 (1 year 11 months)

Led reliability for latency-sensitive financial systems on GCP, improving system resilience with fault-tolerant distributed design patterns and re-architecting a performance layer to reduce database query latency by ~40%. Built observability tooling and coordinated cross-team incident response across distributed ownership boundaries.

Part-time SRE & Software Engineer

The American Board of Radiology

Aug 2014 - Dec 2021 (7 years 4 months)

Progressed from backend engineering into SRE to improve production reliability and system performance, removing bottlenecks in MERN APIs through indexing and query tuning. Transitioned into a monitoring and incident-response focus, improving production visibility and reducing time-to-resolution.

Education

Degrees, certifications, and relevant coursework

University of Arizona

Bachelor of Science, Computer Science

2009 - 2011

Earned a Bachelor of Science in Computer Science at the University of Arizona from 2009 to 2011.
