As an HPC Software Engineer, you will be responsible for designing, developing, and optimizing HPC software solutions that leverage NVIDIA GPUs, AMD CPUs, and MPI for distributed and parallel computing. This role will require a strong foundation in computational science, parallel algorithms, and the specific intricacies of GPU and CPU optimization. You’ll also play a key role in designing and implementing scalable solutions for our HPC clusters, ensuring high-performance applications can run efficiently as nodes scale up to meet expanding global workloads.
- Health Insurance/HMO
- Enjoy unlimited MadMax Coffee
- Diverse learning & growth opportunities
- Accessible Cloud HR platform (Sprout)
- Above standard leaves
Develop and Optimize HPC Software: Design and implement software solutions that harness the power of NVIDIA GPUs and AMD CPUs, focusing on maximizing computational efficiency and speed.
Parallel and Concurrent Computing: Employ expertise in MPI, CUDA, OpenMP, and other parallel computing techniques to ensure that applications scale effectively on multi-core and multi-GPU systems.
Scaling and Cluster Management: Contribute to the design, deployment, and scaling of HPC nodes to ensure clusters can expand based on data and computational demands. Optimize software and cluster configurations to enable seamless scaling of nodes.
Algorithm Optimization: Identify and implement algorithms best suited for GPU/CPU computing to handle complex computations and data-intensive tasks, with a focus on scalable solutions that maintain performance as node count increases.
Node Profiling and Performance Analysis: Conduct profiling and benchmarking of HPC applications across scaled environments to identify bottlenecks and ensure consistent performance across nodes. Use tools such as NVIDIA Nsight, AMD μProf, and Intel VTune for performance tuning.
Collaboration Across Teams: Work closely with scientists, engineers, and IT staff to ensure HPC applications are optimized and scalable for various research and production environments.
Documentation and Reporting: Create clear documentation for all code and scaling processes. Report on performance metrics, improvements, and results to stakeholders.
Education: Bachelor's or master's degree in computer science, Electrical Engineering, Computational Science, or a related field. Advanced degrees or relevant certifications are a plus.
Experience: 3+ years of experience in HPC development, with a focus on parallel and distributed computing, and experience with scalable systems. 5+ years previous experience in scientific software development.
Technical Skills:
NVIDIA GPU Programming: Proficiency in CUDA and familiarity with NVIDIA’s HPC software stack, including cuDNN, NCCL, and TensorRT.
Parallel Programming: Expertise in MPI, OpenMP, and other parallel programming paradigms.
Cluster Scaling and Optimization: Strong understanding of scaling clusters and optimizing software for distributed, multi-node environments.
Languages: Strong proficiency in C/C++, Python, and Ansible / Terraform / Packer
Tools and Frameworks: Familiarity with profiling and performance optimization tools (e.g., NVIDIA Nsight, AMD μProf, Intel VTune).
Soft Skills: Ability to work independently and as part of a team, excellent problem-solving abilities, strong communication skills, and a keen attention to detail.
Familiarity with HPC Environment Management: Experience with Slurm, PBS, or other job scheduling and resource management tools.
Experience with Containerization: Knowledge of Singularity or Docker in an HPC context.
Experience with Cloud Scaling Solutions: Understanding of cloud-based HPC scaling solutions or hybrid HPC environments is a plus.
Machine Learning Acceleration: Experience with machine learning frameworks that utilize GPUs, such as TensorFlow, PyTorch, or PySCF is a plus.
