We are looking for a technically curious, detail-oriented, and proactive Operations Engineer with strong communication skills to join our Global Technical Operations (GTO) team.
Responsibilities
- Serve as the primary dashboard monitor during your shift — continuously watch the GTO Operational Dashboard in Datadog, detect anomalies by correlating signals across APM, logs, metrics, synthetic tests, and Real User Monitoring, and determine whether alerts warrant an incident ticket or can be resolved through immediate investigation.
- Triage and investigate production incidents — create incident tickets in JIRA Service Management, perform initial technical investigation using Datadog (traces, logs, infrastructure and application metrics), determine blast radius and likely root cause domain, and route to the correct team (Product SRE, Infrastructure SRE, or Engineering) using the smart routing model.
- Own lower-severity incidents end-to-end from detection through resolution — diagnose, execute runbook procedures, and resolve without escalation where possible. Escalate promptly when an incident is unresolved within defined thresholds or requires a code-level fix.
- Support the TSO Lead during major incidents as the technical right hand in the war room — surface real-time data (error rates, impact scope, deployment history, related alerts), maintain the incident ticket with live timeline entries and linked evidence, and execute mitigation actions as directed.
- Draft incident communications under TSO Lead direction, including internal Slack updates, stakeholder notifications, and customer-facing status page updates (status.xsolla.com). Support clear, timely communication throughout the incident lifecycle.
- During non-incident periods, analyze incident trends, recurring issues, and production bugs — compile data from Datadog, JIRA, and Slack, identify patterns, and contribute findings to regular reports for product and engineering teams.
- Compile incident timelines and draft initial Post-Incident Review (PIR) documents ahead of each review session. Track PIR action items after the session and flag overdue items to the TSO Lead.
- Build and maintain operational automation (alert enrichment scripts, incident templates, Slack workflows, dashboard widgets) and contribute to runbook development — documenting new resolution procedures so they can be repeated by any Operations Engineer on any shift.
- Conduct structured shift handoffs covering active incidents, at-risk services, upcoming deployments, and follow-up items. Participate in knowledge transfer sessions with SREs to continuously expand independent resolution capability.
- Cover for the TSO Lead during vacations, absences, or emergencies — including severity classification, escalation decisions, stakeholder communications, and basic Incident Commander functions.
- Periodically publish health reports for critical applications.
Benefits
- Medical, dental, and vision
- PTO
- Personalized career roadmap for each employee
- Professional development through training and educational opportunities
