Qualifications:
- Bachelor’s degree in Computer Science, Engineering, or related technical field (or equivalent experience).
- Minimum of 2 years of experience in systems engineering, DevOps, or Site Reliability Engineering roles.
- Strong proficiency with Linux/Unix operating systems.
- Experience with scripting and automation using Python, Bash, or similar languages.
- Experience with monitoring and logging tools such as Prometheus, Grafana, ELK Stack, or equivalent.
- Experience supporting CI/CD tools such as GitLab, Jenkins, or ArgoCD.
- Experience with containerization and orchestration platforms including Docker and Kubernetes.
- Understanding of SRE principles including SLIs, SLOs, and error budgets.
- Strong troubleshooting, problem-solving, and documentation skills.
Responsibilities:
- Monitor system health, availability, and performance using centralized monitoring and logging tools.
- Respond to, troubleshoot, and resolve incidents in production environments and provide root cause analysis.
- Conduct after-action reporting and post-incident reviews to improve system resilience.
- Automate repetitive operational tasks including deployments, monitoring, and incident response.
- Administer user accounts, access controls, and authentication mechanisms.
- Maintain and configure workflow templates, user fields, and application configurations.
- Maintain test environments that mirror production and support pre-deployment testing.
- Design and maintain backup, high availability (HA), and disaster recovery (DR) solutions.
- Develop and maintain incident response and disaster recovery plans for supported applications.
- Configure and support integrations with complementary enterprise systems.
- Architect, build, and maintain on-premises and cloud infrastructure supporting applications.
- Administer production, staging, and development environments.
- Manage system logs and monitor for security and operational events.
- Maintain and improve CI/CD pipelines and DevSecOps processes.
- Apply configuration management disciplines including patching, hardening, and documentation.
- Create and maintain dashboards, SLIs, SLOs, and service health metrics.
- Support operational readiness boards and weekly service reviews.
- Provide on-call support for outages, upgrades, and emergency maintenance as required.
- Support surge activities, including Presidential Transition-related data analysis if required.
