Sundar Bishnoi
@sundarbishnoi
Senior Site Reliability Engineer focused on self-healing, automation, and scalable reliability improvements.
What I'm looking for
I’m a Senior Site Reliability Engineer with 7+ years of experience improving reliability, automation, and scalability for large-scale private cloud environments. I focus on reducing MTTR/MTTD, strengthening production systems, and building self-healing infrastructure that keeps services resilient under pressure.
At Morgan Stanley, I designed and deployed Python-based self-healing systems for 20,000+ VMware ESXi hosts, automating remediation of infrastructure failures and eliminating ~85% of manual interventions while reducing incident resolution time (MTTR) by ~70–75%. I also redesigned monitoring and alerting pipelines using PagerDuty and observability tooling to cut MTTD by ~50–55%, and built Ansible-driven ESXi patching automation to orchestrate pre-checks, rollout, and validation across enterprise clusters.
Earlier at Capgemini, I delivered end-to-end observability dashboards, improved alert signal-to-noise through health rule optimization, and improved incident response metrics through proactive monitoring. I’ve also supported chaos testing and production releases, and I mentor others on SRE best practices so that reliability improvements become repeatable and measurable.
Experience
Work history, roles, and key accomplishments
Designed and deployed Python-based self-healing systems for 20,000+ VMware ESXi hosts, eliminating ~85% of manual interventions and reducing MTTR by ~70–75% while improving availability. Reduced MTTD by ~50–55% by redesigning monitoring and alerting pipelines with PagerDuty and observability tooling; also built Ansible-driven ESXi patch automation and a troubleshooting engine that cut manual debug
Automated CPU metric collection from AppDynamics for IHS applications across Zone2/Zone3 during Active/Passive testing using Python, packaged as a reusable utility for team adoption. Built end-to-end observability dashboards, reduced alert noise via health rule optimization, improved incident response metrics (MTTR/MTTD/MTTM), and executed chaos testing to find infrastructure bottlenecks.
Automated REST service health checks using Python and Jenkins pipelines, improving operational efficiency by 10x. Reduced incident volume by 10–15% through monitoring and alert optimization, and supported production releases to minimize customer impact.
Education
Degrees, certifications, and relevant coursework
National Institute of Technology, Tiruchirappalli
Bachelor of Technology (B.Tech.)
2014 - 2018
