Bright Vision Technologies is seeking an experienced Site Reliability Engineer to join our dynamic team and contribute to our mission of transforming business processes through technology.
Requirements
- Define, instrument, and continually refine service-level objectives (SLOs), service-level indicators (SLIs), and error budgets for critical services, and use those measures to drive concrete engineering and prioritization decisions.
- Lead incident response and resolution for production issues, acting as a calm and effective incident commander when needed, and ensuring high-quality post-incident reviews that drive lasting improvements.
- Design and implement comprehensive monitoring, logging, and tracing strategies using Prometheus, Grafana, OpenTelemetry, ELK/EFK, Datadog, or similar tooling so that operators have rich, actionable visibility into system behavior.
- Build and maintain robust on-call processes, runbooks, and escalation paths that reduce mean time to detect and mean time to resolve while protecting the well-being of the engineers on rotation.
- Automate operational toil aggressively by writing production-grade tooling in Python, Go, Bash, or similar languages, replacing manual workflows with reliable, auditable automation.
- Architect and operate large-scale Kubernetes clusters and container-based workloads, including autoscaling, capacity planning, network policy, and integration with service meshes.
- Design CI/CD pipelines that promote safe, frequent, and observable releases, supported by automated testing, canary deployments, feature flags, and progressive rollout strategies.
- Lead capacity planning and performance engineering activities, building models that predict growth and stress, and validating those models through load testing and chaos experiments.
- Partner closely with application development teams to embed reliability practices early in design — including failure-mode analyses, graceful degradation patterns, and dependency hardening.
- Strengthen the platform’s resiliency through chaos engineering, fault injection, dependency isolation, retries, timeouts, circuit breakers, and well-tested failover paths.
- Drive continuous improvement of security posture in collaboration with security teams, including patch management, vulnerability remediation, and secure-by-default platform defaults.
- Contribute to the technical roadmap for reliability tooling, observability platforms, and developer-experience improvements that reduce friction and improve outcomes for engineering teams.
- Mentor engineers across the organization on SRE practices and foster a strong, blameless culture of operational excellence.
Benefits
- Competitive base salary commensurate with experience, plus benefits.
- 401k Matching
