Infrastructure Operations Engineer

Lightning AI builds the AI development platform that streamlines building, training, and deploying AI models from idea to production with integrated cloud infrastructure.

Lightning AI

Employee count: 201-500

Salary: 160k-200k USD

United States only

Who We Are

Lightning AI is the company behind PyTorch Lightning. Founded in 2019, we build an end-to-end platform for developing, training, and deploying AI systems—designed to take ideas from research to production with less friction.

Through our merger with Voltage Park, a neocloud and AI Factory, Lightning AI combines developer-first software with cost-efficient, large-scale compute. Teams get the tools they need for experimentation, training, and production inference, with security, observability, and control built in.

We serve solo researchers, startups, and large enterprises. Lightning AI operates globally with offices in New York City, San Francisco, Seattle, and London, and is backed by Coatue, Index Ventures, Bain Capital Ventures, and Firstminute.

What We're Looking For

Lightning AI is seeking an experienced Infrastructure Operations Engineer to help scale and operate our next-generation AI infrastructure platform. Our InfraOps team sits at the center of reliability, automation, and operational scale for GPU infrastructure. This team owns break/fix operations, incident response, customer provisioning, observability, and the automation systems that keep complex infrastructure running efficiently.

In this role, you’ll work hands-on with large-scale GPU environments, Linux systems, bare metal infrastructure, provisioning workflows, and platform reliability. You’ll partner closely with Infrastructure Engineering, Network Operations, and Software Platform teams to troubleshoot issues, improve operational efficiency, and build automation that reduces manual toil over time.

We’re flexible on location for this team. This role can work hybrid out of one of our US-based hubs (Seattle, NYC, or SF) or fully remote within the U.S., with occasional company and team offsites. We are not able to provide visa sponsorship for this position at this time.

What You'll Do

  • At the direction of the Manager of Infrastructure Operations, design, build, and roll out new platforms and patterns to minimize incidents and enable customer-facing and internal features.
  • Deploy updates and improvements to support both Voltage Park’s internal and end-customer use cases.
  • Collaborate with colleagues in Infrastructure Engineering, Network Operations, Customer Success, and Software and Platform Development teams.
  • Participate in the on-call rotation, which is evenly distributed across all team members in a primary/secondary pattern: you serve as primary, then move to the secondary position.

What You Will Need

Required Qualifications

  • 8+ years working with Linux as a server/hosting platform. Extra points for Ubuntu experience.
  • 5+ years of experience with AWS.
  • 2+ years of experience with Kubernetes and strong container fundamentals.
  • 2+ years of experience with Terraform and Ansible.
  • 2+ years with network attached storage management (via NFS, Ceph, or other protocols). Extra points for experience with VAST storage systems.
  • Experience with monitoring systems (Prometheus, ELK stack).
  • Familiarity with the GitOps workflow.
  • Software development experience using Python, Go, Bash, or other languages for automation and for connecting systems and APIs.
  • Deep networking fundamentals. Extra points for experience with datacenter-level networks, 400Gb Ethernet, and InfiniBand.
  • Experience building and delivering complex systems.
  • Effective at navigating tradeoffs between design, risk, cost, and outcomes.
  • Comfortable with navigating ambiguity.
  • Strong written and oral communication.

Nice-to-Haves

  • Experience with bare metal hardware troubleshooting and provisioning. Extra points for working with Dell hardware.
  • Experience with GPU servers, either on bare metal or under virtualization.
  • Deep experience with network switches, routers, and firewalls, particularly SONiC switches, Palo Alto firewalls, and Juniper Networks products.
  • Experience with VAST storage systems.

Compensation

We are committed to offering competitive compensation that reflects the value each team member brings to our mission. Final offers are based on factors such as experience, skills, geographic location, and role expectations. In addition to base salary, our total rewards package for eligible roles includes a discretionary bonus, a meaningful equity component, and comprehensive benefits.

The anticipated annual base salary range for this role is:
$160,000-$200,000 USD

Benefits and Perks

We offer a comprehensive and competitive benefits package designed to support our employees’ health, well-being, and long-term success. Benefits may vary by location, team, and role.

Benefits include:

  • Comprehensive medical, dental, and vision coverage (U.S.); private medical and dental insurance (U.K.)
  • Retirement and financial wellness support (U.S.); pension contribution (U.K.)
  • Generous paid time off, plus holidays
  • Paid parental leave
  • Professional development support
  • Wellness and work-from-home stipends
  • Flexible work environment

At Lightning AI, we are committed to fostering an inclusive and diverse workplace. We believe that diverse teams drive innovation and create better products. We provide equal employment opportunities to all employees and applicants without regard to race, color, religion, gender, sexual orientation, gender identity, national origin, age, disability, veteran status, or any other protected characteristic. We are dedicated to building a culture where everyone can thrive and contribute to their fullest potential.

About the job

Job type: Full Time

Salary: 160k-200k USD

Experience: 8 years minimum

Hiring timezones: United States +/- 0 hours

About Lightning AI

Lightning AI builds the first operating system for AI development. The company behind PyTorch Lightning provides an end-to-end platform that streamlines the entire machine learning workflow - from idea and prototyping to training, deployment, and production scaling. Lightning eliminates infrastructure complexity by providing cloud-based GPUs, collaborative development environments, and integrated tools that let developers focus on building intelligent systems rather than managing servers.

The platform serves over 10,000 organizations globally, ranging from solo researchers to Fortune 500 enterprises. Through its merger with Voltage Park, Lightning combines developer-first software with cost-efficient, large-scale compute infrastructure. The company has raised $127 million from investors including Coatue, Index Ventures, Bain Capital Ventures, Firstminute, Cisco Investments, JP Morgan, K5 Global, and NVIDIA. Lightning operates globally with offices in New York, San Francisco, Seattle, and London.

Employee benefits

Health insurance

Comprehensive health insurance coverage for employees.

Meal reimbursement

Company provides meal reimbursement for employees during work hours.

Retirement benefits

401(k) and retirement planning benefits for long-term financial security.

Generous vacation policy

Employees enjoy a generous vacation policy to recharge and maintain sustained performance.