Infrastructure Reliability Engineer, Bare Metal

CoreWeave is a specialized AI cloud provider delivering a massive scale of GPU compute resources on the industry's fastest and most flexible infrastructure, purpose-built for AI, machine learning, and VFX rendering workloads.

CoreWeave

Employee count: 501-1000

Salary: 122k-163k USD

United States only

CoreWeave is the AI Hyperscaler™, delivering a cloud platform of cutting-edge services powering the next wave of AI. Our technology provides enterprises and leading AI labs with the most performant, efficient, and resilient solutions for accelerated computing. Since 2017, CoreWeave has operated a growing footprint of data centers covering every region of the US and across Europe. CoreWeave was ranked as one of the TIME100 most influential companies of 2024.

As the leader in the industry, we thrive in an environment where adaptability and resilience are key. Our culture offers career-defining opportunities for those who excel amid change and challenge. If you’re someone who thrives in a dynamic environment, enjoys solving complex problems, and is eager to make a significant impact, CoreWeave is the place for you. Join us, and be part of a team solving some of the most exciting challenges in the industry.

CoreWeave powers the creation and delivery of the intelligence that drives innovation.

We seek a highly skilled and driven Infrastructure Reliability Engineer, Bare Metal, to join our team and report to our Senior Director, Customer Experience. In this role, you will be instrumental in ensuring the stability, performance, and ongoing improvement of our intricate bare metal infrastructure. This role demands a deep technical understanding of underlying hardware and related systems, coupled with a proactive approach to problem-solving and operational efficiency. You will collaborate extensively with diverse engineering teams and external vendors, driving automation initiatives and refining operational strategies within a rapidly expanding, technologically advanced environment. Your contributions will directly impact the core of our compute capabilities.

Please note: This role will follow Pacific Time Zone (PST) working hours.

What You'll Do

  • Provide expert-level technical support and in-depth troubleshooting for a wide spectrum of hardware and associated software issues, encompassing server malfunctions, network outages, and performance degradations.
  • Manage the lifecycle of our bare metal infrastructure, including overseeing deployment methodologies, executing maintenance procedures, coordinating upgrades, and managing hardware retirement processes.
  • Architect and implement automation solutions through scripting and tooling to streamline repetitive operational tasks, enhance overall efficiency, and minimize manual intervention across the infrastructure.
  • Lead the development and refinement of critical operational processes, comprehensive technical documentation (SOPs, TSGs, runbooks), and the establishment of engineering best practices to bolster team effectiveness and infrastructure resilience.
  • Engage in close collaboration with Software, Network, and Data Center Operations Engineering teams to facilitate effective issue resolution, contribute to strategic project planning, and ensure the cohesive operation of the entire infrastructure ecosystem.
  • Serve as a key technical point of contact for hardware and software vendors, managing technical support engagements, overseeing the RMA process, and driving the resolution of complex hardware-centric challenges.
  • Design, deploy, and maintain sophisticated monitoring and alerting frameworks to proactively identify and mitigate potential infrastructure anomalies and performance deviations.
  • Participate actively in incident response protocols, conduct thorough root cause analyses (RCAs) for infrastructure events, and contribute to problem management strategies aimed at preventing future occurrences.
  • Contribute technical expertise to, and potentially lead, infrastructure-focused projects, including new hardware deployments, critical system upgrades, and the integration of new operational tooling.
  • Mentor and guide junior engineering team members, fostering technical growth and contributing to the development of internal knowledge resources and training programs.
  • Maintain the integrity of hardware asset tracking and related data within our infrastructure inventory systems (e.g., Snipe-IT).
  • Adhere to and promote stringent security protocols and best practices related to infrastructure access and maintenance activities.

Who You Are

  • Bachelor's degree in Computer Science, Electrical Engineering, or equivalent experience.
  • 5+ years of experience in hands-on management and support of complex bare metal infrastructure environments and data center operations.
  • Comprehensive understanding of modern server hardware architectures, including specialized compute accelerators (GPUs) and high-speed interconnect technologies from leading high-performance computing vendors such as NVIDIA, Dell, or HPE.
  • Demonstrated expertise in Linux system administration, encompassing deep familiarity with command-line operations and system configuration.
  • Proficiency in at least one high-level scripting language (e.g., Python) and practical experience with infrastructure and/or network automation tools, methodologies, and frameworks (e.g., Ansible).
  • Extensive experience with modern infrastructure monitoring and logging tools such as Prometheus, Grafana, and the ELK stack (Elasticsearch, Logstash, Kibana).
  • Working knowledge of enterprise ticketing systems (e.g., Jira) and an understanding of IT Service Management (ITSM) frameworks and best practices.
  • Strong analytical and problem-solving skills, with the ability to systematically diagnose and resolve complex technical issues.
  • Excellent communication and collaboration abilities, with experience working effectively across multidisciplinary technical teams.
  • Self-motivated and proactive, with a demonstrated sense of ownership and a commitment to ensuring infrastructure reliability and performance.
  • Proven ability to manage multiple tasks and priorities effectively in a fast-paced and dynamic environment.

The base salary range for this role is $122,000 to $163,000. The starting salary will be determined based on job-related knowledge, skills, experience, and market location. We strive for both market alignment and internal equity when determining compensation. In addition to base salary, our total rewards package includes a discretionary bonus, equity awards, and a comprehensive benefits program (all based on eligibility).

What We Offer

The range we’ve posted represents the typical compensation range for this role. To determine actual compensation, we review the market rate for each candidate, which can depend on a variety of factors, including qualifications, experience, interview performance, and location.

In addition to a competitive salary, we offer a variety of benefits to support your needs, including:

  • Medical, dental, and vision insurance - 100% paid for by CoreWeave
  • Company-paid Life Insurance
  • Voluntary supplemental life insurance
  • Short and long-term disability insurance
  • Flexible Spending Account
  • Health Savings Account
  • Tuition Reimbursement
  • Ability to Participate in Employee Stock Purchase Program (ESPP)
  • Mental Wellness Benefits through Spring Health
  • Family-Forming support provided by Carrot
  • Paid Parental Leave
  • Flexible, full-service childcare support with Kinside
  • 401(k) with a generous employer match
  • Flexible PTO
  • Catered lunch each day in our office and data center locations
  • A casual work environment
  • A work culture focused on innovative disruption

Our Workplace

While we prioritize a hybrid work environment, remote work may be considered for candidates located more than 30 miles from an office, based on role requirements for specialized skill sets. New hires will be invited to attend onboarding at one of our hubs within their first month. Teams also gather quarterly to support collaboration.

California Consumer Privacy Act - California applicants only

CoreWeave is an equal opportunity employer, committed to fostering an inclusive and supportive workplace. All qualified applicants and candidates will receive consideration for employment without regard to race, color, religion, sex, disability, age, sexual orientation, gender identity, national origin, veteran status, or genetic information.

As part of this commitment and consistent with the Americans with Disabilities Act (ADA), CoreWeave will ensure that qualified applicants and candidates with disabilities are provided reasonable accommodations for the hiring process, unless such accommodation would cause an undue hardship. If reasonable accommodation is needed, please contact: careers@coreweave.com.

Export Control Compliance

This position requires access to export controlled information. To conform to U.S. Government export regulations applicable to that information, applicant must either be (A) a U.S. person, defined as a (i) U.S. citizen or national, (ii) U.S. lawful permanent resident (green card holder), (iii) refugee under 8 U.S.C. § 1157, or (iv) asylee under 8 U.S.C. § 1158, (B) eligible to access the export controlled information without a required export authorization, or (C) eligible and reasonably likely to obtain the required export authorization from the applicable U.S. government agency. CoreWeave may, for legitimate business reasons, decline to pursue any export licensing process.

About the job

Job type

Full Time

Experience level

Mid-level
Senior

Hiring timezones

United States +/- 0 hours

About CoreWeave

We are CoreWeave, the AI Hyperscaler™, and we're on a mission to revolutionize the way large-scale GPU-accelerated workloads are handled in the cloud computing industry. Since our founding in 2017, initially as Atlantic Crypto focused on Ethereum mining, we've pivoted and dedicated ourselves to building a cloud platform specifically designed for the demanding needs of AI and machine learning. We recognized early on the transformative potential of Generative AI and the immense computational power it would require. This foresight led us to repurpose our GPU capacity for high-performance computing, a decision that has positioned us at the forefront of the AI revolution.

Our CoreWeave Cloud Platform is engineered from the ground up, offering cutting-edge software and cloud services that deliver the automation and efficiency necessary to manage complex AI infrastructure at scale. We provide access to a massive scale of NVIDIA GPUs, including the latest H100 and Blackwell architectures, across our growing footprint of data centers in the United States and Europe. We're not just about providing hardware; we're about delivering a comprehensive suite of services, including GPU and CPU compute, high-performance storage, and robust networking solutions. Our Kubernetes-native architecture is designed to support large-scale, GPU-intensive tasks, making it easier for AI labs, enterprises, and innovators to train, fine-tune, and deploy their models faster and more cost-effectively. We pride ourselves on tackling the hard problems in AI infrastructure, working closely with our customers to push the boundaries of what's possible. We're committed to continuous learning and innovation, empowering our employees to take ownership and drive progress as we build the cloud for the AI era.

Employee benefits

  • Vision insurance: offered through VSP
  • Company equity: offered by CoreWeave
  • Childcare benefits: offered by CoreWeave
  • Generous parental leave: offered by CoreWeave
