Infrastructure Operations Engineer (GPU Computing) - Enterprise AI

Aethir is a decentralized cloud infrastructure (DCI) provider focused on delivering enterprise-grade GPU-as-a-Service for AI and cloud gaming applications.

Aethir

Employee count: 51-200

Taiwan only

Aethir is a pioneering technology company at the forefront of GPU-based compute infrastructure, specializing in cutting-edge solutions for diverse industries ranging from AI and machine learning to high-performance computing (HPC). We're dedicated to pushing the boundaries of what's possible, leveraging the latest advancements in hardware and software to empower our clients with unparalleled computational capabilities.

About the Role:

We are seeking a highly skilled and motivated Infrastructure Operations Engineer to join our dynamic team. As an integral member of the InfraOps team, you will play a key role in managing and optimizing our GPU-based compute infrastructure (across multiple locations and partners), ensuring maximum performance, scalability, and reliability.

Responsibilities:

  • Infrastructure Management: Deploy, configure, and maintain GPU-based compute infrastructure, including servers, storage, networking, and the associated software stack. Aethir facilitates compute from dozens of providers around the world, on GPUs ranging from 4090s to H200s.
  • Monitoring and Optimization: Implement robust monitoring and alerting systems to proactively identify performance bottlenecks, resource constraints, and potential failures. Continuously optimize infrastructure to improve performance, efficiency, and cost-effectiveness.
  • Automation and Orchestration: Develop automation scripts and tools to streamline deployment, configuration, and management of infrastructure components. Implement infrastructure as code (IaC) principles to enable rapid provisioning and scaling.
  • Security and Compliance: Implement and enforce security best practices to safeguard sensitive data and ensure compliance with relevant regulations and industry standards. Conduct regular security audits and vulnerability assessments.
  • Incident Response and Troubleshooting: Provide tier-3 support for infrastructure-related issues, investigating root causes and implementing timely resolutions. Participate in the on-call rotation to respond to critical incidents outside regular business hours.
  • Capacity Planning and Scaling: Collaborate with cross-functional teams to forecast resource requirements, plan capacity upgrades, and scale infrastructure to accommodate growing workloads and user demands.
  • Documentation and Knowledge Sharing: Maintain comprehensive documentation of infrastructure configurations, procedures, and troubleshooting guidelines. Share knowledge and best practices with team members to foster continuous learning and skill development.

Requirements

  • Experience in infrastructure operations focused on GPU compute, preferably in a DevOps, SRE, Sales Engineering, or Solution Architect role.
  • Proficiency in managing GPU-based compute infrastructure, including NVIDIA GPUs and CUDA programming.
  • Strong expertise in Linux system administration and scripting (e.g., Bash, Python).
  • Experience with configuration management tools (e.g., Ansible, Chef, Puppet) and version control systems (e.g., Git).
  • Familiarity with containerization and orchestration technologies (e.g., Docker, Kubernetes).
  • Solid understanding of networking concepts, protocols, and troubleshooting techniques.
  • Excellent analytical and problem-solving skills, with a proactive and results-oriented mindset.
  • Strong communication skills and the ability to collaborate effectively with cross-functional teams. We operate in English, but speaking Mandarin is a big bonus, as we have engineering teams in China and Southeast Asia.
  • Experience with cloud computing platforms (e.g., AWS, Azure, GCP) and hybrid cloud architectures.
  • Knowledge of HPC frameworks and job scheduling systems (e.g., Slurm, PBS Pro).
  • Familiarity with GPU-accelerated libraries and frameworks (e.g., TensorFlow, PyTorch, CUDA Toolkit).
  • Understanding of cybersecurity principles and practices, including encryption, access controls, and threat detection/prevention.
  • Bonus if you know Web3 (cryptocurrency, tokenization of RWAs, mining/staking, etc.).

Benefits

  • Competitive compensation structure (and flexible on fiat/token mix).
  • Salary and benefits are flexible, depending on location and setup.
  • Flexible work hours and remote work options.

About the job

Job type

Full Time

Experience level

Mid-level

Hiring timezones

Taiwan +/- 0 hours

About Aethir

Aethir is revolutionizing the way enterprise clients and individual users access high-performance GPU computing power, particularly for the demanding needs of AI and cloud gaming. Many businesses and developers face significant hurdles when trying to secure scalable and cost-effective GPU resources. Traditional cloud providers often come with high costs, limited availability of the latest chips like NVIDIA H100s, and latency issues, especially for real-time applications. Aethir addresses these challenges head-on by building a decentralized cloud infrastructure (DCI). This innovative approach aggregates GPU power from a distributed global network of providers, including enterprises with idle capacity, data centers, and even individual contributors.

Our customers in the AI sector, who require immense processing capabilities for machine learning model training, fine-tuning, and inference, can tap into Aethir's network for on-demand access to enterprise-grade GPUs without the hefty investment in physical hardware or the constraints of centralized services. The gaming industry benefits as well: Aethir's infrastructure supports hundreds of thousands of cloud gaming players, delivering best-in-class, low-latency experiences worldwide. By decentralizing the GPU cloud, Aethir not only makes powerful computing more accessible and affordable but also fosters a community-driven ecosystem. This model democratizes access, allowing smaller developers and users in developing regions to leverage cutting-edge technology that was previously out of reach, unlocking new potential for innovation and growth.
