This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Software Reliability Engineer in the United States.
As a Software Reliability Engineer, you’ll play a crucial role in ensuring system stability, rapid issue resolution, and platform resilience across large-scale distributed systems. This role goes beyond traditional DevOps or infrastructure tasks — you’ll dive directly into live systems, diagnosing complex incidents, identifying root causes, and preventing recurrences. Working closely with cross-functional teams, you will strengthen reliability across applications that impact millions of users. This position offers a highly collaborative environment, where your technical insight and problem-solving skills directly protect uptime, customer trust, and overall business continuity.
Accountabilities
- Lead incident response efforts to quickly diagnose and resolve issues in distributed production environments.
- Use observability and monitoring tools (such as Dynatrace and Azure Application Insights) to identify root causes and validate resolutions.
- Collaborate with engineers across APIs, microservices, and data layers to stabilize live systems and prevent future disruptions.
- Write and run targeted automated tests using tools like Jest, Cypress, or Playwright to confirm issue resolution and improve reliability.
- Communicate root causes and fixes effectively to both technical and non-technical stakeholders.
- Partner with platform and DevOps teams to enhance monitoring, alerting, and deployment workflows.
- Participate in on-call rotations for high-priority production incidents and contribute to continuous improvement of reliability practices.
Requirements
- Minimum 2 years of experience in software engineering, production support, or incident response.
- Strong proficiency in JavaScript/TypeScript with the ability to debug live applications and services.
- Solid understanding of SQL and NoSQL databases for tracing and troubleshooting data issues.
- Experience working within Azure or GCP cloud environments.
- Proven success stabilizing distributed or microservice-based architectures.
- Excellent communication and problem-solving skills, with the ability to clearly articulate findings.
- Preferred: experience managing P0/P1 incidents, knowledge of observability tools (Dynatrace, Datadog, or OpenTelemetry), and familiarity with event-driven architectures or message queues.
Benefits
- Competitive salary and comprehensive health coverage (medical, dental, vision).
- Flexible Paid Time Off (PTO) and 13 company holidays.
- 401(k) with company match and paid parental leave, including adoption assistance.
- Remote-first work model within the U.S., with occasional travel for team gatherings.
- Free Fitbit and a fun, mission-driven, and collaborative culture focused on improving lives.
Jobgether is a Talent Matching Platform that partners with companies worldwide to efficiently connect top talent with the right opportunities through AI-driven job matching.
When you apply, your profile goes through our AI-powered screening process designed to identify top talent efficiently and fairly.
🔍 Our AI evaluates your CV and LinkedIn profile thoroughly, analyzing your skills, experience, and achievements.
📊 It compares your profile to the job’s core requirements and past success factors to determine your match score.
🎯 Based on this analysis, we automatically shortlist the 3 candidates with the highest match to the role.
🧠 When necessary, our human team may perform an additional manual review to ensure no strong profile is missed.
The process is transparent, skills-based, and free of bias — focusing solely on your fit for the role. Once the shortlist is completed, we share it directly with the company that owns the job opening. The final decision and next steps (such as interviews or additional assessments) are then made by their internal hiring team.
