Brian Lindsay
@brianlindsay
I’m a Staff SRE focused on SLO-driven reliability for distributed, Kafka-based systems.
What I'm looking for
I’m a Staff Site Reliability Engineer with 8+ years designing and operating distributed systems for fintech, SaaS, and enterprise environments. I bring end-to-end ownership of production reliability, from architecture and capacity modeling to incident lifecycle management.
Over the past several years, I’ve built SLO/SLI and error-budget frameworks that were adopted across 30+ services via phased rollout, improving reliability to 99.9–99.95%. I also re-architected Kafka streaming pipelines to eliminate consumer lag and retry storms by adding partition-level isolation, retry backoff, and backpressure controls, supporting 100K+ events/min throughput.
I improve operational outcomes by consolidating metrics, logs, and traces into unified alerting pipelines, reducing MTTD by 40% through signal-based design and noise reduction. I lead with practical reliability governance (structured postmortem templates with severity classification and action tracking) so teams can reduce repeat Sev-1 incidents while continuing to evolve their distributed systems.
Experience
Work history, roles, and key accomplishments
Staff Site Reliability Engineer
Certificial
Jan 2026 - May 2026 (4 months)
Owned end-to-end reliability for ingestion and streaming on a high-throughput platform (500K–1M+ transactions/day), including an org-wide SLO/SLI and error-budget framework rolled out across services. Re-architected Kafka streaming with failure isolation, retries, and backpressure, reducing p95 latency by ~35% and improving incident workflow efficiency (MTTD down ~40%, MTTR down ~30%).
Senior Site Reliability Engineer
Unqork
Jan 2023 - Jan 2026 (3 years)
Owned reliability and infrastructure strategy for a distributed SaaS platform (~99.9% uptime) handling millions of requests/day. Designed Kafka-based async processing (~100K msgs/min) and rebuilt observability with signal-driven SLO alerting, reducing alert fatigue and cutting MTTD by ~40%. Led incident lifecycle management and lowered Sev-1 recurrence.
Principal Site Reliability Engineer
DAI
Feb 2021 - Jan 2023 (1 year 11 months)
Led reliability for latency-sensitive financial systems on GCP, improving system resilience with fault-tolerant distributed design patterns and re-architecting a performance layer to reduce database query latency by ~40%. Built observability tooling and coordinated cross-team incident response across distributed ownership boundaries.
Part-time SRE & Software Engineer
The American Board of Radiology
Aug 2014 - Dec 2021 (7 years 4 months)
Progressed from backend engineering into SRE to improve production reliability and system performance, optimizing MERN APIs by reducing bottlenecks via indexing and query tuning. Transitioned into a monitoring and incident-response focus, improving production visibility and reducing time-to-resolution.
Software Engineer
IBM
Jan 2012 - Aug 2014 (2 years 7 months)
Built backend services and REST APIs for an enterprise platform and supported early-stage scalability and production readiness. Improved performance through service-layer and database optimization.
Education
Degrees, certifications, and relevant coursework
University of Arizona
Bachelor of Science, Computer Science
2009 - 2011
Earned a Bachelor of Science in Computer Science at the University of Arizona from 2009 to 2011.
