Essential Responsibilities:
- Build and own tools and libraries that accelerate Artera’s ability to develop, launch, and monitor AI products.
- Work with model developers to optimize GPU and CPU efficiency and data throughput for large-scale foundation model and downstream training runs.
- Optimize Artera’s ability to store and process terabytes of digital pathology data efficiently in service of large-scale training regimes.
- Ensure that Artera’s observability infrastructure gives a clear picture of where performance can be further optimized across our model landscape.
Experience Requirements:
- 4+ years of industry software engineering experience
- 3+ years of industry experience using one of PyTorch, TensorFlow, or JAX in Python
- 2+ years of industry experience building with AWS, Docker, and Kubernetes
- 1+ years of industry experience optimizing large-scale, high data-throughput, distributed machine learning training pipelines
Desired:
- Experience using Terraform and SQLAlchemy
- Experience using ML orchestration frameworks such as Ray, Kubeflow, Metaflow, MLflow, Flyte, Dagster, Argo Workflows, or Prefect
- Experience deploying and maintaining infrastructure for machine learning training and production inference
- Familiarity with TorchScript, ONNX Runtime, DeepSpeed, AWS Neuron, or similar approaches to inference optimization