Job Description:
The Senior Platform Engineer is a hands-on engineering role on a platform development team responsible for building and operating Flexential's IT platforms, including observability, DevOps, ITSM incident and release management, and integration technologies. This role develops and manages critical platform subsystems for high availability, operational resiliency, security, and scalability, utilizing native-AI enablement for all outcomes. This is an individual contributor role with significant technical ownership and direct impact on the critical Flexential technology roadmap. You will work across infrastructure, automation, and application layers: deploying Kubernetes workloads, authoring Terraform modules, building Ansible playbooks, and building GitLab pipelines that other engineers depend on daily.
Key Responsibilities and Essential Job Functions:
Design, develop, and operationally manage automated, resilient, highly available, self-healing, secure platforms with native-AI capabilities for IT needs, serving both internal and customer business capabilities.
Develop and manage the Observability OpenTelemetry Central Backend Stack: Grafana Enterprise, Mimir, Loki, Tempo, and Alertmanager on Kubernetes/RKE2 via Helm and GitLab CI/CD.
Build and manage IaC and CI/CD for automated provisioning and deployment, including Terraform modules for infra/VM/storage provisioning, Ansible AWX playbooks for OS/app bootstrap, and ArgoCD and Helm for Kubernetes configuration.
Develop and manage an OpenTelemetry Prometheus scrape profile library, including SNMP exporters, REST API exporters, and cloud provider exporters (CloudWatch, Azure Monitor, GCP) for multiple device classes.
Develop AIOps capabilities on platforms, e.g., for Observability use cases: anomaly detection integrations, event correlation rules in Alertmanager, and synthetic monitoring patterns to reduce alert noise.
Configure and maintain Zabbix auto-discovery: network range scanning, device classification, and Prometheus service discovery integration.
Build and harden Edge Stack deployments (Prometheus + OTel collector) per data center site using GitOps templates.
Integrate Alertmanager with ServiceNow: webhook routing, ticket enrichment, auto-close logic, and escalation policy configuration.
Maintain platform security: Conjur/CyberArk secret injection at runtime, mTLS between stack components, RBAC in Grafana Enterprise.
Author and maintain Grafana dashboards in JSON/GitLab: facility overview, network health, RED metrics, application telemetry.
Mentor mid-level engineers, lead code reviews, and establish engineering standards for the team. Represent platform engineering in cross-functional architecture reviews and executive-level program updates.
Perform other duties as required and assigned.
Required Qualifications:
DevOps / Automation - 5+ years in a production environment, Kubernetes (RKE2/k3s), Helm chart deployment, system services, Docker/container
LGTM Stack Development and Configuration - 4+ years, Grafana, Mimir, Loki, Tempo configuration, tuning, dashboarding, and production operations; Prometheus required
Senior-level Python / Scripting frameworks - 5+ years, Automation scripts, exporter development, GitLab pipeline scripting, REST API integrations
GitOps / CI/CD - 5+ years, GitLab CI/CD pipeline authoring; Terraform and Ansible as primary IaC tools; ArgoCD or Flux preferred
AIOps / Observability Engineering - 2+ years, Alertmanager rule authoring, anomaly detection integration, event correlation, noise reduction techniques
Working infrastructure (Linux/VM) management knowledge - 5+ years, Linux administration, VMware vCenter/VCF experience, NetApp storage management, network fundamentals (SNMP, TCP/IP)
Secrets Management - 2+ years, CyberArk/Conjur, HashiCorp Vault, or equivalent; runtime secret injection patterns
Minimal travel may be required
Preferred Skills:
Experience and/or knowledge of ITSM processes and workflow automation, e.g., Incident & Response Management (IRM), release management, ServiceNow ITSM integration, alert routing, escalation policy design, SLA-driven on-call workflows
Hands-on experience or working knowledge of Boomi integration PaaS (iPaaS) technologies
Experience working with BAS / BMS systems in a data center / OT environment
Hands-on experience working with AWS products in a Well-Architected Framework and multi-account model to develop various compute, storage, and network IaaS and PaaS services for IT applications
Base Pay Range: Annualized/Hourly salary range offered for this position is estimated to be $150,000 - $165,000. However, the actual pay range depends on each candidate’s experience, location, and qualifications.
Not meeting every single requirement? No problem! We are looking for candidates who possess unique skills that set them apart from the rest. If you're enthusiastic about this role and believe you have the skills and abilities that would make you successful, don't hesitate to apply today!
Benefits of working at Flexential:
• Medical, Telehealth, Dental and Vision
• 401(k)
• Health Savings Accounts (HSA) and Flexible Spending Accounts (FSA)
• Life and AD&D
• Short Term and Long-Term disability
• Flex Paid Time Off (PTO)
• Leave of Absence
• Employee Assistance Program
• Wellness Program
• Rewards and Recognition Program
Benefits are subject to change at the Company's discretion.
Flexential participates in the E-Verify program.
EEOC Statement: Flexential is an equal opportunity employer, and all qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity or expression, pregnancy, age, national origin, disability status, genetic information, protected veteran status, or any other characteristic protected by law.
