This role requires U.S. Citizenship and eligibility for a Federal Security Clearance
Our Team
Building on our Cloud momentum, Oracle has formed a new organization - Oracle Health Data, Analytics Platform. This team will focus on product development and product strategy for Oracle Health, while building out a complete platform supporting modernized, automated healthcare. This is a net new line of business, constructed with an entrepreneurial spirit that promotes an energetic and creative environment. We are unencumbered and will need your contribution to make it a world-class engineering center with a focus on excellence.
Oracle Health Data, Analytics Platform has a rare opportunity to play a critical role in how Oracle Health products impact and disrupt the healthcare industry by transforming how healthcare and technology intersect.
You will have the opportunity to:
- Reach billions of people with our products & services
- Create technology that truly impacts the world
- Have an immediate impact on developing technology
- Pursue unlimited growth potential through inspiring work
- Work with the best minds in the industry
- Enjoy working in an open, diverse, and productive environment
About The Job
This role provides technical leadership for the core data platforms behind Oracle Health’s Data & Analytics Platform. As a Principal Site Reliability Engineer (SRE), you will own shared, mission-critical systems used by multiple products and teams.
You will lead the design and operation of large-scale, stateful distributed platforms, including Hadoop ecosystem components (HDFS, YARN, HBase) deployed on Oracle Big Data Service (BDS), Kafka, and Storm. These multi-tenant platforms are deployed and operated through Ansible- and Terraform-based automation and require strong architectural ownership to manage scale, change, and broad blast radius.
What You'll Do
Platform Ownership & Technical Leadership
- Own the end-to-end reliability, scalability, and operability of shared data platforms
- Define platform standards, architectural direction, and operational guardrails
- Influence cross-team technical decisions and long-term platform strategy
- Drive long-term platform evolution and influence reliability strategy across the data ecosystem
Architecture & Design
- Lead platform architecture and design reviews
- Clearly articulate system behavior, dependencies, and failure modes
- Make principled trade-offs between reliability, performance, cost, and complexity
- Provide guidance and guardrails that enable downstream teams to use platforms safely and effectively
Operations Engineering
- Establish capacity models, scaling strategies, and operational best practices
- Design platforms that behave predictably under load, failure, and change
- Own platform lifecycle events: upgrades, expansions, decommissioning, and recovery
Distributed Systems Expertise
- Operate and evolve stateful distributed systems where data placement, replication, and recovery are critical
- Reason about failure modes such as backpressure, rebalancing, region movement, replication lag, and rolling upgrades
Security
- Operate and maintain Kerberized platforms, including authentication, authorization, and secure service-to-service communication
- Treat security as a first-class architectural concern
Automation
- Design and evolve an Ansible- and Terraform-driven automation framework
- Treat automation as production software: versioned, reviewed, tested, and improved
- Eliminate operational toil by encoding reliability and safety into the platform
Incident Leadership & Prevention
- Serve as the ultimate escalation point for complex or ambiguous incidents
- Focus on eliminating entire classes of failure, not just resolving individual issues
Representation
- Represent SRE and platform engineering in high-visibility and sensitive forums
- Communicate clearly with engineering leadership and partner teams
