Principal Site Reliability Engineer

Arcadia is a healthcare technology company that transforms data into actionable insights to accelerate healthcare transformation for providers, payers, and life sciences organizations.

Arcadia

Employee count: 201-500

Salary: 180k-230k USD

United States only

Arcadia is dedicated to happier, healthier days for all. We believe that there is a better healthcare world – one powered by data. Our platform transforms complex, diverse data into a unified foundation for health, helping organizations deliver better care, boost revenue, and lower costs.

We’re a team of fiercely driven individuals committed to making healthcare more sustainable—and we’re looking for passionate people to help us get there.
For more information, visit arcadia.io

Why This Role is Important to Arcadia

Love building reliable systems, and want to make a difference?
Arcadia’s customers rely on us to securely process and deliver high-value healthcare insights. Reliability, availability, performance, and security are foundational to trust—especially when systems support critical workflows and handle PHI. As a Principal Site Reliability Engineer, you’ll set reliability strategy across teams, drive cross-cutting platform improvements, and ensure we can scale delivery without scaling operational burden.

What Success Looks Like

In 3 months

Build deep context on Arcadia’s platform, production risks, and operational practices. Participate in on-call/incident response and quickly improve signal quality for at least one critical domain (dashboards, alerts, traces, runbooks). Identify a high-leverage reliability initiative and align stakeholders on scope, success metrics, and milestones.

In 6 months

Establish SLOs/error budgets for key customer journeys, drive operational readiness standards for launches, and lead remediation for recurring incidents with measurable reductions in customer impact and MTTR. Deliver major toil-reduction improvements via automation and self-service workflows.

In 12 months

Own and execute a reliability program with cross-org impact (e.g., GitOps delivery guardrails, observability platform evolution, resilience/DR improvements, or secure infrastructure controls). Influence architecture decisions, establish org-wide operational standards, and mentor Staff engineers—raising the reliability and security bar across Arcadia.

What You'll Be Doing

  • Act as the technical leader for reliability for one or more domains; set direction and standards while remaining hands-on where it matters most
  • Drive reliability strategy across critical services: define SLOs/SLIs, error budgets, and reliability KPIs aligned to customer journeys and outcomes
  • Own incident response maturity: lead complex incidents, improve incident command practices, and ensure high-quality RCAs with prioritized, tracked remediation
  • Architect and implement automation to reduce toil and risk: runbook automation, self-service tools, and safe operational workflows (Python + Argo Workflows)
  • Advance GitOps delivery practices using Argo CD: promotion strategies, progressive delivery/canaries, and guardrails that reduce deploy risk
  • Scale infrastructure management with Crossplane and Terraform: reusable patterns, policy controls, and paved roads for teams
  • Lead operational readiness and reliability reviews for new features/architectural changes; reinforce non-functional requirements (availability, latency, security, cost)
  • Improve performance and cost efficiency through capacity planning, load testing, right-sizing, and architecture recommendations across AWS services
  • Champion infrastructure security best practices for environments that handle PHI (least privilege, secrets management, auditability, and defense-in-depth)
  • Mentor Staff and Senior engineers through design reviews, code reviews, pairing, and documentation; raise reliability standards across teams

What You'll Bring

  • 8+ years of experience in SRE, platform engineering, systems engineering, or related roles operating production services at scale
  • Demonstrated principal-level impact: leading cross-team initiatives, influencing architecture decisions, and driving sustained improvements in reliability and operations
  • Expertise in Kubernetes operations and troubleshooting, including safe rollout/rollback patterns, workload debugging, and operational guardrails
  • Strong GitOps experience with Argo CD; experience building delivery workflows and automation using Argo Workflows
  • Strong infrastructure orchestration and provisioning experience with Crossplane and Terraform; ability to define reusable platform patterns and controls
  • Deep AWS experience (IAM, networking/VPC, compute, storage, managed services, observability) and strong understanding of reliability and failure modes in cloud systems
  • Proficiency in Python for building automation, tooling, and reliability improvements
  • Strong incident management and on-call leadership experience, including measurable improvements (availability, MTTR, alert quality, cost, or operational maturity)
  • Excellent communication skills: can translate technical risk and reliability tradeoffs to engineering leadership, product, and stakeholders; produces high-quality docs/runbooks

Would Love For You To Have

  • Experience with ScyllaDB or similar distributed databases (e.g., Cassandra) and their reliability/performance characteristics
  • Experience with Spark or data processing platforms, including reliability and cost considerations for large-scale workloads
  • Familiarity with agentic coding practices and principles (safe automation, reviewable changes, guardrail-first workflows)
  • Strong infrastructure security knowledge: threat modeling for cloud/Kubernetes, RBAC/IAM design, secrets management, supply chain security, and security observability

Principal Engineer Competencies

  • Customer Focus: champions customer impact; drives SLO definition with product partners; participates in incidents to limit customer impact; may engage customers to understand problems
  • Technical Leadership: serves as the lead cross-team technical representative; negotiates interfaces; anticipates edge cases; designs telemetry for availability and reliability
  • Total Ownership: owns outcomes from requirements and design through production support; transitions complex changes with multi-phase rollouts and long-term ownership
  • Effective Communication: communicates to diverse audiences; finalizes key documentation (runbooks, guides, FAQs); synthesizes standards and best practices
  • Proactive Leadership: coaches senior/peer teams primarily through review; delegates appropriately; sets clear expectations (Definition of Done) and improves service processes/rotations

What You'll Get

  • Be part of a mission-driven company that is transforming the healthcare industry by changing the way patients receive care
  • A flexible, remote-friendly company with personality and heart
  • Employee-driven programs and initiatives for personal and professional development
  • Become a member of the talented, energized, diverse and purpose-driven Arcadian Community

This position is responsible for following all Security policies and procedures in order to protect all PHI under Arcadia's custodianship as well as Arcadia Intellectual Properties. For any security-specific roles, the responsibilities would be further defined by the hiring manager.

About Arcadia

Arcadia.io helps innovative providers and payers across the country transform healthcare to reduce cost while improving patient health. We do this by aggregating large amounts of disparate data, applying algorithms to identify opportunities to provide better patient care, and making those opportunities actionable by physicians at the point of care in near-real time. We are passionate about helping our customers drive meaningful outcomes. We are growing fast and have emerged as a market leader in the highly competitive population health management software market and have been recognized by industry analysts KLAS, IDC, Forrester, and Chilmark for our leadership. For a better sense of our brand and products, please explore our website.

Protect Yourself

If you have concerns about the authenticity of a job offer or recruitment-related communication claiming to be from Arcadia, we encourage you to verify by contacting us directly at (781) 202-3600 and selecting option 3. For more information, visit our website.

About the job

Job type: Full Time

Experience: 8 years minimum

Hiring timezones: United States +/- 0 hours

About Arcadia

Arcadia is a healthcare technology company dedicated to transforming diverse data into a unified fabric for health. The company's platform delivers actionable insights that empower healthcare providers, payers, and life sciences organizations to advance care, drive strategic growth, and achieve financial success. Arcadia focuses on aggregating and curating high-quality, comprehensive, and up-to-date data from various sources, including electronic health records (EHRs), claims data, and social determinants of health (SDoH) information. This unified data foundation enables the delivery of relevant, timely, and predictive analytics.

Arcadia's solutions support a range of critical healthcare functions, including population health management, value-based care, risk adjustment, and quality improvement. The company's offerings include advanced analytics dashboards, benchmark reporting, care management tools, and point-of-care insights. By leveraging machine learning and artificial intelligence, Arcadia helps its customers identify at-risk populations, optimize clinical workflows, improve patient engagement, and enhance financial performance. Arcadia's technology integrates with over 2,600 data sources and is built upon a massive data asset of over 170 million clinical patient records. The company is committed to helping its clients outperform industry averages by reducing medical expenses, improving risk coding accuracy, and enhancing the quality of care and patient health outcomes. Arcadia is recognized for its expertise in both fee-for-service optimization and value-based performance environments, supporting healthcare enterprises as they transition to new care models.
