Key Responsibilities
- Monitoring & Incident Response
  - Proactively monitor infrastructure, applications, and services using monitoring tools (e.g., Grafana).
  - Respond to alerts, investigate issues, and escalate incidents according to defined protocols.
  - Maintain accurate incident logs and documentation.
- Troubleshooting & Support
  - Perform initial triage of system, network, and application issues.
  - Provide L1 support for servers, databases, cloud services, and networking equipment.
  - Collaborate with L2/L3 engineers to resolve complex incidents.
- Reliability & Performance
  - Assist in implementing SRE best practices such as monitoring, alerting, and automation.
  - Support routine health checks, capacity monitoring, and performance tuning.
  - Participate in post-incident reviews and contribute to root cause analysis.
Required Skills & Qualifications
- Basic knowledge of:
  - Networking concepts (TCP/IP, DNS, firewalls).
  - Linux/Windows server administration.
  - Cloud platforms (OCI) – fundamentals preferred.
- Basic Oracle PL/SQL skills for query writing, data validation, and troubleshooting.
- Familiarity with monitoring tools and ticketing systems (e.g., Jira).
- Strong analytical and troubleshooting skills.
- Excellent communication and documentation abilities.
- Willingness to work 12-hour shifts in a 24x7 environment.
