Ford is seeking an experienced Senior Software Engineer to join its team in developing, enhancing, and expanding its suite of chaos engineering and observability generation tooling.
Requirements
- Write, configure, and deploy code in Go, JavaScript & Python that improves service reliability for existing or new systems
- Develop APIs (REST/gRPC), integrations, and high-performance backend services
- Champion test-driven development
- Write documentation: end-user documentation, ADRs and design documents, system analyses, runbooks, playbooks
- Design distributed systems in the cloud, preferably using Google Cloud Platform (GCP)
- Review code and production changes, providing helpful, actionable feedback
- Drive repair and optimization of complex systems, accounting for a wide range of contributing factors
- Lead debugging, troubleshooting, and analysis of service architecture and design
- Participate in the on-call rotation
- Implement and manage a suite of chaos engineering products written in Go, JavaScript & Python
- Collaborate with development teams to enhance system reliability and performance, applying a platform engineering mindset to system administration tasks
- Troubleshoot and resolve issues in our dev, test, and production environments
- Participate in postmortem analysis and create measures to prevent future incidents
- Implement and maintain security best practices across our infrastructure, ensuring compliance with industry standards and internal policies. Participate in security audits and vulnerability assessments
- Participate in capacity planning and forecasting efforts to ensure our systems can handle future growth and demand. Analyze trends and make recommendations for resource allocation
- Identify and address performance bottlenecks through code profiling, system analysis, and configuration tuning. Implement and monitor performance metrics to proactively identify and resolve issues
- Develop, maintain, and test disaster recovery plans and procedures to ensure business continuity in the event of a major outage or disaster. Participate in regular disaster recovery exercises
- Contribute to internal knowledge bases and documentation
