
XTN-DAD3686 | SITE RELIABILITY ENGINEER

KMC Solutions Inc
United States only


You will validate and test GPU clusters before production release, ensuring hardware integrity, system reliability, and optimal performance. The role involves provisioning clusters, running performance benchmarks, maintaining automated validation frameworks, and troubleshooting Linux-based systems in high-performance compute environments. You will work closely with engineering and operations teams to ensure smooth handovers and production readiness.

• Health Insurance/HMO
• Enjoy unlimited MadMax Coffee
• Diverse learning & growth opportunities
• Accessible Cloud HR platform (Sprout)
• Above standard leaves

Cluster Validation & Testing

  • Validate GPU clusters of varying sizes to ensure hardware and system integrity prior to production release

  • Perform functional and reliability testing of GPUs, servers, and associated components

  • Verify network connectivity and performance, including InfiniBand where applicable

Orchestration & Benchmarking

  • Provision and configure GPU clusters using automated workflows

  • Execute and analyse performance and stability benchmarks orchestrated via Slurm

  • Validate results against expected performance and reliability thresholds

Test Framework & Automation

  • Maintain and extend the automated validation framework built using Python and Ansible

  • Integrate new test cases to support additional hardware platforms and GPU generations

  • Improve test reliability, coverage, and execution efficiency

Remediation & System Integrity

  • Diagnose and remediate unhealthy nodes through configuration changes or software fixes

  • Coordinate with on-site support and Smart Hands teams for hardware replacements when required

  • Ensure all issues are resolved and documented prior to handover to production operations

Documentation & Handover

  • Produce clear, accurate documentation of test results, hardware states, and remediation actions

  • Ensure smooth handovers to operations and engineering teams

  • Maintain up-to-date runbooks and validation procedures

Essential

• Strong hands-on experience administering and troubleshooting Linux systems (Prio)
• Confident use of CLI tools for diagnostics, including analysis of kernel logs, drivers, and system services
• Excellent written and verbal English communication skills
• High standards for system reliability, consistency, and documentation

Preferred / Desirable

• Experience working with GPU-based or high-performance compute environments
• Familiarity with Slurm or other workload schedulers
• Understanding of datacenter hardware lifecycle and server validation processes
• Exposure to InfiniBand or high-speed networking technologies
• Experience working with distributed or remote infrastructure teams
• Proficiency in Python for automation, test execution, and parsing results (Preferred)
• Proven experience writing and maintaining Ansible playbooks (Preferred)


About the job

Job type: Full Time

Hiring timezones: United States +/- 0 hours
