Open to opportunities

Sridevi Bejj

@sridevibejj

Staff Site Reliability Engineer focused on automation, reliability, and scalable infrastructure.

United States

What I'm looking for

I seek roles where I can architect and operate scalable, secure platforms, drive automation and observability, mentor teams, and improve production reliability in a collaborative, DevOps-focused culture.

I am a Staff Site Reliability Engineer with 15+ years supporting large-scale Linux, Big Data, cloud, and ML platforms in financial and enterprise environments. I specialize in automation, production reliability, observability, and 24x7 incident management, with proven expertise in Kubernetes, Hadoop ecosystems, GPU-enabled ML platforms, CI/CD pipelines, and Infrastructure as Code.

I have designed and managed cloud and on-prem infrastructure using Terraform, Chef, and Ansible, operated multi-tenant Kubernetes ML platforms, and maintained mission-critical Hadoop clusters. I deliver measurable reliability improvements through automation, monitoring (Prometheus, Grafana, Splunk), security controls (Kerberos, Ranger, LDAP), and disciplined ITSM practices.

Experience

Work history, roles, and key accomplishments

Current

Staff Site Reliability Engineer

Visa

May 2023 - Present (3 years 2 months)

Maintain and support large-scale Hadoop clusters and Kubernetes-based ML platforms, improving availability and performance through upgrades, tuning, automation, and security controls. Lead incident response, vulnerability remediation, and monitoring to ensure production reliability for ETL and ML workloads.

Hadoop Kubernetes Spark Prometheus Ansible Python Kerberos Ambari Grafana

Linux Consultant

Broadridge Financial

Jul 2021 - Apr 2023 (1 year 9 months)

Built cloud and on-prem infrastructure with Terraform and automated provisioning using Chef and Ansible, improving deployment consistency and patching workflows. Implemented enterprise monitoring and scheduled patch automation to support production reliability.

Terraform Chef Ansible RHEL Splunk Zabbix Rundeck

Information Technology Specialist

New York State ITS

Aug 2018 - Jul 2021 (2 years 11 months)

Provided production support and automation for Linux servers, led OS upgrades and migrations (VMware to AWS), and managed configuration frameworks and Kubernetes/Docker environments to maintain 24x7 operations.

Ansible VMware Kubernetes RHEL Chef HCX Shell Scripting

Lead Systems Engineer

Thomson Reuters

Feb 2009 - Feb 2015 (6 years)

Supported 1000+ Linux and Solaris servers, performing kernel tuning, storage management, and data center migrations to sustain production services and reduce incidents. Executed server builds, upgrades, and emergency changes via CAB processes.

Linux ZFS Storage Data Migration CAB System Administration

Infrastructure Specialist

Merrill Lynch

Jan 2008 - Feb 2009 (1 year 1 month)

Supported 2500+ production and development servers across the Americas, handling patching, backup recovery, cluster administration, and incident management to ensure enterprise service continuity.

RHEL Patching Incident Management

Infrastructure Engineer

Genpact

Apr 2003 - Jan 2008 (4 years 9 months)

Managed enterprise Linux and Solaris infrastructure for 6500+ servers, leading incident and change management under ITIL, and administering storage, backups, and kernel tuning to maintain operational stability.

Linux ITIL Incident Management Change Management Backup

Education

Degrees, certifications, and relevant coursework

Osmania University

Master of Information Systems, Information Systems

2001 - 2003

Completed a Master's degree in Information Systems, with coursework and practical work relevant to enterprise IT, systems administration, and infrastructure management.