Aleph Alpha is hiring a Senior Performance Engineer- Pretraining to grow its pre-training efficiency team. The role involves engineering systems to train foundation models at scale and optimizing hardware utilization and training throughput on large-scale GPU clusters.
Requirements
- Proficient in Python and the PyTorch library.
- Strong engineering background in parallel and/or distributed systems with proven track record of excellence.
- Hands-on experience with modern machine learning techniques, especially large language models and their life cycle.
- Deeply understand the CUDA programming model.
- Experience in distributed programming with APIs like NCCL or MPI.
- Experience analysing profiling traces with tools such as PyTorch Profiler and Nvidia Nsight.
Benefits
- Competitive salary and equity package
- 30 days of paid vacation
- Access to a variety of fitness & wellness offerings via Wellhub
- Mental health support through nilo.health
- JobRad Bike Lease
- Substantially subsidized company pension plan
- Subsidized Germany-wide transportation ticket
- Budget for additional technical equipment
- Flexible working hours
