Responsibilities:
- Design, build, and maintain scalable and resilient infrastructure at the edge.
- Develop automation and infrastructure-as-code solutions using Terraform, Ansible, and scripting languages (Python, Bash).
- Deploy and manage containerized applications using Docker and related technologies.
- Ensure system observability by building and optimizing monitoring systems, particularly using Prometheus.
- Troubleshoot and optimize Linux-based systems (e.g., Red Hat, CentOS, Ubuntu).
- Collaborate with security teams to implement robust security controls and ensure compliance with best practices.
- Work closely with software engineers to improve system performance, reliability, and deployment pipelines.
- Support and maintain networking infrastructure, including troubleshooting network protocols and configurations.
- Manage cloud and on-premises infrastructure, with a focus on automation and scalability.
- Contribute to incident response, postmortems, and process improvements.
Requirements:
- 5+ years of experience in Site Reliability Engineering, including building and managing infrastructure at scale, particularly at the edge.
- Strong software development experience in one or more programming languages (e.g., Python, Go, Java).
- Proficiency in Docker, Linux systems, and scripting (Bash, Python).
- Strong expertise with infrastructure automation tools (Terraform, Ansible).
- Experience managing observability and monitoring systems, particularly Prometheus.
- Deep understanding of networking concepts and protocols.
- Familiarity with cloud platforms (AWS, Azure, Google Cloud) is a plus.
- Experience with Windows Services/VMs is a plus.
- Excellent problem-solving skills, with attention to detail.
- Strong communication and collaboration skills to work across teams.
- Bachelor’s degree in Computer Science, Information Technology, or a related field, or equivalent experience.