We’re specifically looking for candidates who have operated in high-growth, early-stage environments, where systems evolve rapidly and engineers are trusted with end-to-end ownership. If you thrive in ambiguity, move fast without sacrificing quality, and know how to scale systems from zero to hundreds of thousands of users — we want to hear from you.
A high-growth, mission-driven tech company is re-imagining how knowledge is shared — making it faster, clearer, and more collaborative across the globe. We're helping millions of users communicate ideas more effectively and are backed by top-tier investors and accelerators. Now, we're looking for an experienced Senior or Staff Site Reliability Engineer (SRE) who has scaled systems within fast-paced startup environments.
This is a rare opportunity to join early, shape foundational infrastructure, and directly impact how engineering operates and scales. We’re searching for a thoughtful, hands-on engineer with the technical depth and startup grit to help architect our next phase of growth.
What You’ll Do
Improve Platform Resilience: Continuously improve system stability, scalability, and deployment efficiency across all environments.
Build Observability Systems: Develop actionable monitoring, alerting, and telemetry to support performance, uptime, and operational excellence.
Architect for Scale: Design and maintain reliable, distributed infrastructure that supports a growing global user base.
Lead Incident Response: Own escalations, guide triage and recovery efforts, and drive post-incident reviews focused on learning and prevention.
Define and Maintain SLOs/SLIs: Establish service-level indicators and objectives that set measurable reliability targets and operational standards for each service.
Mentor & Evangelize Best Practices: Promote architectural quality, tool adoption, and platform improvements throughout the engineering team.
Operate Like a Startup: Move quickly, solve problems creatively, and build systems with speed and purpose.
What You Bring
10+ years of experience in software engineering, DevOps, or SRE roles, with increasing responsibility for platform operations and reliability.
Proven experience working in a startup environment, ideally where you’ve built or evolved critical infrastructure during rapid growth phases.
Proficiency in two or more languages (e.g., Python, Go, JavaScript, TypeScript), ideally used in automation, internal tooling, or backend development.
Deep knowledge of AWS container infrastructure (e.g., ECS, EKS) or comparable services such as Google Cloud Run, and of provisioning tools like Terraform or CDK.
Strong background in observability and performance tooling (e.g., Prometheus, Datadog, OpenTelemetry, ELK).
Experience with CI/CD pipelines, configuration management, and automating platform operations end-to-end.
Expertise with Linux systems, cloud networking, Kubernetes orchestration, and infrastructure security.
Familiarity with database performance tuning, especially in high-load or data-intensive environments.
Bachelor’s degree in Computer Science or a related field (or equivalent practical experience).
Who You Are
You’ve built and shipped systems in a startup and understand how to balance urgency with reliability.
You care deeply about building tools and infrastructure that empower developers and create leverage.
You bring a calm, strategic mindset to incidents and enjoy making systems better through root cause analysis and iteration.
You’re collaborative, solutions-driven, and comfortable leading change in a fast-evolving environment.
You communicate clearly, act with ownership, and love solving real problems with real impact.
Why Join
Join a mission-first team working on a platform that already impacts millions of people worldwide.
Help define foundational infrastructure and best practices from the ground up.
Work with kind, passionate, and high-performing teammates who value excellence and humility.
Enjoy the flexibility of a remote-first culture and a team that values outcomes over hours.