Director of Systems Reliability & Field Resilience

Serve Robotics

Salary: 210k-250k USD

United States only

At Serve Robotics, we’re reimagining how things move in cities. Our personable sidewalk robot is our vision for the future. It’s designed to take deliveries away from congested streets, make deliveries available to more people, and benefit local businesses.

The Serve fleet has been delighting merchants, customers, and pedestrians along the way in Los Angeles while doing commercial deliveries. We’re looking for talented individuals who will grow robotic deliveries from surprising novelty to efficient ubiquity.

Who We Are

We are tech industry veterans in software, hardware, and design who are pooling our skills to build the future we want to live in. We are solving real-world problems leveraging robotics, machine learning and computer vision, among other disciplines, with a mindful eye towards the end-to-end user experience. Our team is agile, diverse, and driven. We believe that the best way to solve complicated dynamic problems is collaboratively and respectfully.

Serve Robotics is seeking a Director of Systems Reliability & Field Resilience, responsible for continuously improving end-to-end operational reliability across our robotic delivery operations infrastructure. In this role, you and your team will proactively identify, triage, and resolve complex, cross-domain issues that impact delivery service quality and efficiency, and will work cross-functionally to build monitoring, alerting, automation, and resiliency into our platform.

You will provide leadership and direction to your team while also contributing directly to defining, building, and deploying solutions. You will work closely with engineering, product, and operations to prioritize the work, and you’ll hire, allocate resources, and support your team to deliver capabilities from concept to production.

The Serve Robotics delivery platform spans a wide range of technologies: cloud and networking infrastructure that powers delivery matching, front-end solutions for robot fleet supervisors and field agents, and on-robot embedded and autonomous systems. All of these must work seamlessly together to support our daily delivery growth and economics. You will lead a team of experts with backgrounds in SRE, DevOps, and cloud infrastructure, and partner across the entire engineering organization to ensure a robust and resilient delivery infrastructure.

The ideal candidate will have a strong track record of hands-on leadership of small, highly technical software engineering teams. You will have experience hiring, mentoring, and coaching senior-level engineers and building a high-performance, collaborative team. You are a highly capable technical generalist who is comfortable working across all components of a complex system and partnering with domain experts and functional teams to identify issues, perform detailed root cause analysis, and develop strategies for short- and long-term solutions. Delivering these solutions will often require highly technical collaboration between your team and other engineering teams.

Responsibilities

  • Full-Stack Troubleshooting & System Deep Dives: Become the go-to expert for identifying root causes of service issues—whether they're in cloud APIs, robot hardware, network layers, or operational workflows—and coordinate with the respective owning teams to resolve and prevent them.

  • Build and Lead a Global Systems Reliability Team: Hire, mentor, and grow a multidisciplinary team of high-context generalists who can investigate system-wide failures, document their learnings, and drive improvements across organizational boundaries.

  • Own the On-Call & Incident Management Process: Take over and evolve the company's on-call process into a mature, well-documented, and inspectable system. Define SLAs, escalation policies, and a best-in-class paging infrastructure that aligns with our service goals.

  • Establish and Maintain a Knowledge Base: Ensure on-call responders have access to actionable documentation, playbooks, and troubleshooting guides. Make knowledge capture a core part of incident response.

  • Reliability Analytics & Intuition Building: Use incident and operational data to build a deep intuition about where our systems are most fragile. Create predictive frameworks and reliability metrics that help the organization stay ahead of failures.

  • Service Health & Performance Dashboards: Build and maintain dashboards that monitor the health of end-to-end services—not just software, but everything that supports customer delivery. Highlight systemic issues, performance regressions, and areas needing investment.

  • Cross-Functional Collaboration: Work closely with engineering, infrastructure, hardware, field ops, customer support, and leadership to align on reliability priorities and drive systemic improvement efforts.

Qualifications

  • 8+ years of experience in a technical engineering or operations role, with at least 3 years in a leadership position. Background in both software engineering and IT/DevOps a plus.

  • Deep experience with complex distributed systems, infrastructure, and system debugging, triage and root cause analysis. Familiarity with observability tools like Datadog, Grafana, Prometheus, ELK, etc. a plus.

  • Strong understanding of hardware/software integration, particularly in cloud-connected device infrastructure, including robotics, consumer electronics, and embedded systems.

  • Proven success leading incident response or SRE-style functions and managing on-call teams.

  • Ability to drive organization-wide improvements by building trusted cross-functional relationships and technical collaboration across teams.

  • Strong data and dashboarding skills; can translate operational data into clear insights and action plans.

  • Excellent communication and organizational skills; comfortable writing high-quality docs and leading blameless postmortems.

What Makes You Stand Out

  • Relentless Drive for Quality: You set high standards for code and system design, continually raising the bar for your team and the organization.

  • Strong Cross-Functional Communicator: You effectively collaborate with product, operations, and executive teams to ensure technology and business goals are aligned.

  • Strategic Vision Paired with Execution: You think beyond immediate tasks to chart a roadmap that ensures platform longevity and innovation. You excel at driving changes that boost overall team cohesion and performance.

  • Passion for Innovation: You bring curiosity and enthusiasm for solving complex challenges in delivery and fleet management, keeping up with the latest trends and technologies in the space.

About the job

Job type

Full Time

Experience level

Director

Hiring timezones

United States +/- 0 hours