Hitachi Solutions is a global Microsoft solutions integrator seeking a Lead Site Reliability Engineer to design and implement CI/CD tooling using GitHub Actions/Azure DevOps, Azure Kubernetes Service (AKS) clusters, and related technologies. The successful candidate will work closely with application, engineering, security, and operations teams to engineer and build Kubernetes and Azure PaaS and IaaS solutions within an agile, modern, enterprise-grade operating model.
Requirements
- Responsible for availability, latency, performance, efficiency, monitoring/observability, emergency response, and capacity planning; setting and maintaining SLOs, SLIs, and error budgets; and creating dashboards.
- Analyze, troubleshoot, and resolve operational challenges, contributing to defined SLOs.
- Manage site stability, performance, reliability, and maintain uptime for production environments.
- Develop a fully automated, multi-environment observability stack based on the existing system, and extend it to predict capacity needs from usage patterns.
- Strive for automation to reduce toil and increase development velocity.
- Perform application-specific production support, incident management, change management, problem management, RCAs, and service restoration as needed.
- Identify product architecture changes that improve reliability, performance, and availability, using a data-driven approach.
- Analyze and address complex technical challenges and issues that arise during the software development & run lifecycle. Debug, troubleshoot, and resolve technical problems efficiently.
- Create and maintain technical documentation, including design specifications, user guides, run books and best practice guidelines.
- Actively look for opportunities to improve the availability and performance of the system by applying the learnings from monitoring and observation.
- Collaborate with software development teams on release management, help shape the future roadmap, and establish strong operational readiness across teams.
- Participate in Agile ceremonies, such as sprint planning, stand-up meetings, and retrospectives.
- Collaborate with product managers, designers, and other engineers to ensure alignment and efficient project execution.
- Share your expertise and mentor engineers, helping them grow and develop their skills. Foster a culture of continuous learning and improvement within the team.
- Stay up to date with the latest technologies, tools, and developments in cloud computing. Proactively learn and adapt to new technologies to drive innovation.
- Collaborate with customers to understand their needs, gather feedback, and provide technical support and guidance as needed.
- Triage incoming Web Support escalation requests and route them to the applicable internal teams.
- Contribute to incident root cause analysis and service restoration, and serve as an incident commander during outage events.
- Strong background as an SRE supporting a 24x7 highly available production environment for a SaaS or cloud service provider.
- Solid experience with monitoring/APM/observability tools (Datadog, Application Insights, Prometheus, Grafana, etc.).
- Strong background with Azure resources such as Key Vault, Data Factory, Azure Databricks, and Storage Accounts.
- Experience implementing observability plans around logs, metrics, and traces.
- Experience in an agile development team developing software.
- Implement and follow best practices for CI/CD.
- Experience with cloud infrastructure environments, preferably Azure, and infrastructure as code (Terraform, Bicep, ARM).
- Design, develop, and maintain infrastructure using popular IaC tools and technologies such as Terraform and Helm.
- Strong experience with containerization technology and/or Kubernetes.
- Experience with release automation, system administration, and configuration management.
- Experience with programming languages (Python, Go, etc.).
- Strong understanding of Linux, Windows, software development, systems, networking, and cloud concepts.
- Strong interpersonal and teamwork skills, including the ability to set and enforce process and to influence engineers who are not direct reports.
- Strong analytical and programming skills (Python, Go, etc.).
- Experience with MLflow and other MLOps pipeline technologies.
Benefits
- Bonus Plan
- Medical, Dental and Vision Coverage
- Life Insurance and Disability Programs
- Retirement Savings with Company Match
- Paid Time Off
- Flexible Work Arrangements including Remote Work