Cloud Operations Engineers (COEs) are responsible for keeping all production systems at Actian running smoothly. COEs are expected to apply sound engineering principles and operational discipline to develop and deliver automation into our environments, and to create monitoring and telemetry that give insight into the patterns that govern our success and allow Actian to deliver on its uptime commitments. You will also develop and enhance the CD pipeline to deliver successful builds to production. The COE role exists to help drive operational excellence through automation and monitoring, working closely with development teams to build reliability directly into our products and architecture. In the Cloud Operations Team, you will be at the center of driving improvements and change across the organization, in addition to accelerating our adoption of containers and Kubernetes. The successful candidate is customer-focused, a self-starter, and able and willing to work with geo-dispersed teams.
Responsibilities:
- Monitor and debug issues across the platforms (applications, networks, databases)
- Administer, maintain, and automate systems to ensure reliability, resiliency, scalability, and security
- Deploy, maintain, and enhance monitoring solutions and provide technical resolutions and root cause analysis for high severity incidents
- Work closely with Engineering and Software Development teams to design, deploy, and operate components/services that are automated, resilient, and scalable
- Ensure that documented SSAE policies and procedures are followed and enforced
- Create, update, and maintain documentation for all configurations for the production environment
- Maintain and ensure the readiness and availability of disaster recovery environments
- Develop and deliver timely reports on service metrics including but not limited to availability, capacity, performance, and latency across all production systems
- Manage a 24x7x365 regional operational team
Skills & Qualifications:
- Bachelor’s degree in computer science or equivalent experience related to Information Technology
- 3+ years’ experience as a Cloud Operations Engineer or Site Reliability Engineer managing a SaaS / PaaS / IaaS environment
- Experience managing Linux and Windows Server
- Experience with configuration and automation toolsets such as Terraform, Puppet, Chef, and Ansible
- Experience monitoring a global cloud footprint, with hands-on use of modern monitoring platforms and time-series databases such as Grafana, Prometheus, DataDog, SumoLogic, Nagios, or Zenoss
- Experience in the design and/or deployment of Public Cloud technologies (AWS, Azure, GCP)
- Experience with network services such as DNS, DHCP, WAN routing, TCP/IP networking, LDAP, NFS, and SMTP
- Knowledge of relational database management systems (RDBMS) such as MySQL and SQL Server
- Experience with containerization and container orchestration, especially Docker and Kubernetes
- Experience in the deployment and management of microservices
- Experience maintaining and managing Spark, Kafka, Tomcat, Cassandra, and MySQL based systems
- Proficiency in Python, Bash, SQL, or Java
- Ability to write and present effective materials, including presentations, status reports, technical diagrams, and flowcharts
- Ability to use problem-solving techniques, such as root cause analysis, to resolve issues
- Solid understanding of incident management, change management, and problem management
Nice to Haves:
- Experience working with a globally distributed team
- Understanding of software development lifecycle and CI/CD pipelines
- Experience architecting and optimizing cloud platforms
- Certification in one or more of the following: AWS, Azure, or GCP