
Member of Technical Staff, Training Engineer (Large Scale Foundation Models)

FirstPrinciples is a non-profit foundation established in 2024 to advance our understanding of the universe's fundamental science and marry this knowledge with innovative technologies for the betterment of humanity.

FirstPrinciples

Employee count: 1-10

About FirstPrinciples: FirstPrinciples is a non-profit organization building an autonomous AI Physicist designed to advance humanity's understanding of the fundamental laws of nature. Our goal for the AI Physicist is to achieve a breakthrough that unifies quantum field theory and general relativity, and to explain the deepest unresolved phenomena in our universe by 2035. To do this, we're pioneering a new approach to scientific discovery by creating an intelligent system that can explore theoretical frameworks, reason across disciplines, and generate novel insights. We're a non-profit that operates like a tech start-up, moving quickly and iterating continuously to accelerate scientific progress. By combining AI, symbolic reasoning, and autonomous research capabilities, we're developing a platform that goes beyond analyzing existing knowledge to actively contribute to physics research.

Job Description: We're seeking a Member of Technical Staff, Training Engineer to develop and lead end-to-end pre-training of large language models on GPU clusters as we build the AI Physicist to revolutionize fundamental physics research. You'll make critical modeling choices, guide the development of data pipelines, and perform distributed training at scale, all guided by rigorous evaluation frameworks. This role requires you to combine deep engineering expertise with research intuition to push throughput, stability, and final capability while productionizing successful ideas into repeatable training runs and reusable tooling. You'll be instrumental in building the foundation models that power the AI Physicist, ensuring every training run brings us closer to breakthrough scientific discoveries.

Key Responsibilities:

Model Training Optimization:

  • Design and run large-scale pre-training experiments for both dense and MoE architectures, from experiment planning through multi-week production runs.
  • Tune optimizer configurations (AdamW/Adafactor/Sophia variants), learning rate schedules with warmup strategies, dropout, gradient clipping, weight decay, EMA, and activation checkpointing to ensure stability at scale (a minimal recipe of this kind is sketched after this list).
  • Own model and training recipes end-to-end, making informed decisions about microbatch and global batch configurations.
  • Run ablations and scaling-law studies to set optimal tokens-to-train targets, entropy/perplexity goals, and checkpoint cadence that optimize cost-to-quality ratios.
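
As a concrete illustration of this kind of recipe work, here is a minimal PyTorch sketch of AdamW with decoupled weight decay, linear warmup into cosine decay, and norm-based gradient clipping. The stand-in model and every hyperparameter (peak LR, betas, warmup/total steps) are illustrative assumptions, not FirstPrinciples' actual configuration.

```python
# Minimal sketch, assuming a stand-in model and illustrative hyperparameters.
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(4096, 4096)          # stand-in for a Transformer block

optimizer = AdamW(
    model.parameters(),
    lr=3e-4,                                 # peak learning rate (assumed)
    betas=(0.9, 0.95),                       # common betas for LLM pre-training
    weight_decay=0.1,
)

warmup_steps, total_steps = 2_000, 100_000   # illustrative schedule lengths

def lr_lambda(step: int) -> float:
    """Linear warmup, then cosine decay to 10% of the peak learning rate."""
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.1 + 0.45 * (1.0 + math.cos(math.pi * min(progress, 1.0)))

scheduler = LambdaLR(optimizer, lr_lambda)

for _ in range(5):                           # toy loop; a real run iterates total_steps
    loss = model(torch.randn(8, 4096)).pow(2).mean()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip before step
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    scheduler.step()
```

As a rough starting point for tokens-to-train targets, compute-optimal recipes in the Chinchilla line of work land near 20 training tokens per parameter; the ablations and scaling-law studies described above then refine that target for a given data mix.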

Data Pipeline Engineering:

  • Build and harden high-throughput data pipelines encompassing dataset curation, filtering, deduplication, pack-by-length optimization, and contamination control.
  • Design and implement multilingual and multimodal data ingest systems with intelligent repeat scheduling (e.g., D4-style approaches).
  • Architect comprehensive data pipelines across diverse modalities (web/book/code/speech/vision) with filtering, heuristic and learned scoring, temperature sampling, multilingual balancing, and curriculum learning.
  • Demonstrate measurable impact from data quality work including large-scale deduplication, contamination audits, and repeat/mixture scheduling that improves downstream accuracy (a toy deduplication pass is sketched after this list).
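
To make the deduplication idea concrete, here is a toy Python pass that hashes word 5-gram shingles and drops documents whose Jaccard overlap with an already-kept document exceeds a threshold. Production pipelines use scalable near-duplicate methods (e.g., MinHash/LSH); the shingle size and 0.8 threshold here are illustrative assumptions, not the actual pipeline.

```python
# Toy near-duplicate filter: word 5-gram shingles + pairwise Jaccard overlap.
import hashlib

def shingles(text: str, n: int = 5) -> set[int]:
    words = text.lower().split()
    grams = (" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1)))
    return {int(hashlib.md5(g.encode()).hexdigest()[:16], 16) for g in grams}

def jaccard(a: set[int], b: set[int]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def dedup(docs: list[str], threshold: float = 0.8) -> list[str]:
    kept: list[tuple[str, set[int]]] = []
    for doc in docs:
        sig = shingles(doc)
        if all(jaccard(sig, s) < threshold for _, s in kept):  # keep only novel docs
            kept.append((doc, sig))
    return [doc for doc, _ in kept]

corpus = [
    "the action principle yields the euler lagrange equations of motion",
    "the action principle yields the euler lagrange equations of motion .",
    "gauge invariance constrains the form of the interaction terms",
]
print(dedup(corpus))  # the near-duplicate second document is dropped
```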

Distributed Training Performance:

  • Operate distributed training infrastructure using FSDP/ZeRO, tensor/pipeline/expert/context parallelism, and high-speed interconnects (NCCL, NVLink/InfiniBand).
  • Choose and configure optimal distributed strategies (FSDP vs. ZeRO; 3D/5D hybrid parallelism for MoE) and launch parameters, documenting trade-offs for future reference (a minimal FSDP configuration is sketched after this list).
  • Exploit modern kernels and mixed-precision training (FlashAttention-3, FP8 via NVIDIA Transformer Engine) to maximize tokens/sec while maintaining perplexity targets.
  • Integrate performance primitives including FlashAttention-3, fused optimizers, and custom CUDA/Triton kernels while maintaining convergence guarantees.
  • Write production-grade PyTorch and Triton/CUDA kernels when required to unlock critical performance gains.
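
A minimal sketch of the FSDP-style setup referenced above, assuming a toy Transformer stack, BF16 mixed precision, and per-layer auto-wrapping. Layer sizes, the sharding strategy, and the hypothetical fsdp_sketch.py filename are placeholders; the script is meant to run on a multi-GPU node under `torchrun --nproc_per_node=<gpus> fsdp_sketch.py`.

```python
# Minimal FSDP sketch with full (ZeRO-3-style) sharding and BF16 mixed precision.
import functools
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision, ShardingStrategy
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

layer = torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True)
model = torch.nn.TransformerEncoder(layer, num_layers=8).cuda()  # toy stand-in model

wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={torch.nn.TransformerEncoderLayer},    # shard per layer
)
model = FSDP(
    model,
    auto_wrap_policy=wrap_policy,
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    mixed_precision=MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    ),
    device_id=torch.cuda.current_device(),
)

x = torch.randn(2, 128, 1024, device="cuda")
loss = model(x).float().pow(2).mean()   # dummy forward/backward to exercise sharding
loss.backward()
dist.destroy_process_group()
```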

Reliability & Observability:

  • Debug complex distributed training issues including deadlocks, OOMs, divergence, and stragglers using tools like Nsight, py-spy, TensorBoard, and Weights & Biases (W&B).
  • Build comprehensive observability systems for long-horizon runs tracking throughput/efficiency, gradient statistics, loss spikes, token-mix drift, data freshness, and evaluation dashboards (a minimal loss-spike monitor is sketched after this list).
  • Manage multi-node GPU jobs (SLURM/Kubernetes/Ray), debug NCCL hangs, clock skew issues, and implement elastic restart mechanisms.
  • Shepherd multi-week training jobs through completion, recover gracefully from failures, and deliver stable checkpoints with measurable evaluation wins.
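
A minimal sketch of the observability pattern described above: a rolling loss-spike monitor plus a checkpoint save/load pair for recovery. The window size, spike margin, and checkpoint path are illustrative assumptions, not a production monitoring stack.

```python
# Toy loss-spike detector and checkpoint/resume helpers for long-horizon runs.
import collections
import statistics
import torch

class SpikeMonitor:
    """Flags a step whose loss exceeds the recent median by a fixed margin."""
    def __init__(self, window: int = 200, margin: float = 0.5):
        self.history = collections.deque(maxlen=window)
        self.margin = margin

    def update(self, loss: float) -> bool:
        spiked = bool(self.history) and loss > statistics.median(self.history) + self.margin
        self.history.append(loss)
        return spiked

def save_checkpoint(model, optimizer, step, path="ckpt_latest.pt"):
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "step": step}, path)

def load_checkpoint(model, optimizer, path="ckpt_latest.pt") -> int:
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]

monitor = SpikeMonitor()
for step, loss in enumerate([2.1, 2.0, 1.9, 4.5, 1.8]):   # synthetic loss trace
    if monitor.update(loss):
        print(f"step {step}: loss spike at {loss:.2f}, consider rolling back")
```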

Evaluation & Collaboration:

  • Define evaluation suites and red-team protocols to monitor scaling behavior and catch regression signals over long training runs (a minimal held-out perplexity check is sketched after this list).
  • Partner with safety and alignment teams on SFT/RLAIF/DPO stages and evaluations, ensuring pre-training choices support downstream alignment objectives.
  • Collaborate across research, infrastructure, product, and safety teams to turn research wins into robust model artifacts and services.
  • Lead cross-functional efforts and mentor engineers on distributed training best practices and stabilization techniques.
  • Write crisp RFCs and retrospectives to document learnings and establish institutional knowledge.
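
A minimal sketch of the kind of regression check an evaluation suite would run on each checkpoint: token-level perplexity over a fixed held-out set. The toy model and synthetic batch below stand in for a real checkpoint and data loader.

```python
# Held-out perplexity check on a toy stand-in language model.
import math
import torch

@torch.no_grad()
def heldout_perplexity(model, batches, device="cpu"):
    """Average token-level perplexity over a fixed held-out set."""
    model.eval()
    total_nll, total_tokens = 0.0, 0
    for input_ids, labels in batches:                      # (batch, seq) int tensors
        logits = model(input_ids.to(device))               # (batch, seq, vocab)
        nll = torch.nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            labels.to(device).reshape(-1),
            reduction="sum",
        )
        total_nll += nll.item()
        total_tokens += labels.numel()
    return math.exp(total_nll / total_tokens)

# Toy stand-in for a checkpointed LM: embedding + linear head over a 512-token vocab.
vocab = 512
toy_model = torch.nn.Sequential(torch.nn.Embedding(vocab, 64), torch.nn.Linear(64, vocab))
batch = (torch.randint(0, vocab, (4, 32)), torch.randint(0, vocab, (4, 32)))
print(f"held-out ppl: {heldout_perplexity(toy_model, [batch]):.1f}")  # ~vocab for a random model
```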

Qualifications:

  • Educational Background: Bachelor's or Master's degree in Computer Science, Engineering, or related field.
  • Experience: 7-12+ years of total experience, including 2+ years training large Transformers at scale (10B→100B+ parameters; MoE experience is a plus) with a track record of shipped models or published training methods.
    • Hands-on experience with at least one frontier-style training run where you've shepherded multi-week training jobs, recovered from failures, and delivered stable checkpoints with measurable evaluation improvements.
  • Skills:
    • Expert-level proficiency in PyTorch (including compiled mode/torch.compile), with strong understanding of CUDA/Triton fundamentals.
    • Deep facility with distributed frameworks (PyTorch FSDP or DeepSpeed ZeRO) and multi-dimensional parallelism (TP/PP/EP/DP/CP), ideally with Megatron-Core experience.
    • Proven success operating multi-node GPU jobs with experience debugging NCCL hangs, clock skew, and elastic restarts.
    • Demonstrated impact from data quality work, including deduplication/contamination mitigation and data-mix design that measurably improved evaluation metrics.
    • Strong applied mathematics background for training stability (optimization, numerics, initialization, learning rate scaling) with excellent experiment design and statistical rigor.
  • Collaboration & Communication: Ability to work cross-functionally. Strong communicator who can simplify complex topics for diverse audiences.
  • Mindset: Entrepreneurial, mission-driven, comfortable in a fast-growing, startup-style environment, and motivated by the ambition of tackling one of the greatest scientific challenges in history.
  • Demonstrated passion for physics and for making scientific knowledge accessible and impactful.

Bonus Skills:

  • MoE pre-training experience including router design, load-balancing, expert capacity tuning, z-loss, auxiliary losses, and parallelism mapping across thousands of GPUs (the load-balancing and z-loss terms are sketched after this list).
  • Accelerator-aware optimization expertise (kernel fusion, TMA/warp-specialization, cache locality) and production adoption of FlashAttention-3 and FP8 training on Hopper/Blackwell architectures.
  • Modern evaluation and safety exposure including contamination detection, leakage/membership inference awareness.
  • Experience guiding model design decisions for inference efficiency (KV-cache strategies, quantization, speculative decoding).
  • Advanced throughput optimization techniques: sequence packing with dynamic padding, fused attention/MLP, gradient accumulation tuned to saturate interconnects.
  • Expertise in stability at scale: BF16/FP8 mixed precision with delayed scaling, norm-based clipping, cosine decay with warmup, EMA on very-large runs.
  • MoE reliability expertise: router jitter/noise management, capacity factor tuning, token-dropless routing, and expert parallel + tensor/pipeline co-design.
  • Deep understanding of data quality impact: aggressive deduplication (near-dup fuzzy matching), contamination audits, and intelligent repeat scheduling strategies versus one-epoch-over-everything approaches.
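
For reference, here is a minimal sketch of the Switch-Transformer-style load-balancing auxiliary loss and router z-loss mentioned above, written for a single MoE layer. The token count, expert count, and loss coefficients are illustrative; real MoE stacks weight and sum these terms per layer into the training loss.

```python
# Router auxiliary losses for one MoE layer: load-balancing term and z-loss.
import torch
import torch.nn.functional as F

def router_aux_losses(router_logits: torch.Tensor, num_experts: int):
    """router_logits: (tokens, num_experts) pre-softmax router scores."""
    probs = F.softmax(router_logits, dim=-1)                # (tokens, experts)
    top1 = probs.argmax(dim=-1)                             # chosen expert per token
    tokens_per_expert = F.one_hot(top1, num_experts).float().mean(dim=0)
    mean_prob_per_expert = probs.mean(dim=0)
    load_balance = num_experts * (tokens_per_expert * mean_prob_per_expert).sum()
    z_loss = torch.logsumexp(router_logits, dim=-1).pow(2).mean()
    return load_balance, z_loss

logits = torch.randn(1024, 8)                               # 1024 tokens, 8 experts
lb, z = router_aux_losses(logits, num_experts=8)
print(f"load-balance loss {lb:.3f}, z-loss {z:.3f}")        # load balance ~1.0 when balanced
```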

Application Process:

  • Interested candidates are invited to submit their resume, a cover letter detailing their qualifications and vision for the role, and references. Please include "Member of Technical Staff, Training Engineer, Large Scale Foundation Models" in the cover letter.

Join us at FirstPrinciples and be a part of a transformative journey where science drives progress and unlocks the potential of humanity.

About the job

Job type: Full Time
Experience level: Mid-level
Location requirements: Open to candidates from all countries.
Hiring timezones: Worldwide

About FirstPrinciples


At FirstPrinciples, we envision a transformative future where a deeper understanding of the universe's fundamental principles drives significant advancements in science and technology. Established in 2024 by Ildar Shar, a visionary combining entrepreneurial experience with a passion for physics, FirstPrinciples is committed to unraveling the complex tapestry of reality.

Our mission transcends traditional scientific endeavors; it is about harnessing the power of fundamental science to catalyze innovations that will improve the quality of life for everyone. We focus on fostering a global community of researchers and innovators who are not afraid to challenge the status quo.
