Job Description
Key Responsibilities
- Evaluate and integrate hardware accelerators for the training and inference of deep learning models.
- Collaborate with cross-functional teams to prototype and develop solutions that enhance the performance of our ML stack.
- Integrate new technologies (e.g., torch.compile, XLA, model parallelism, tensor parallelism, FP8) into the existing codebase to enable large-scale deployment of ML models, significantly reducing engineering iteration time.
- Profile system performance, identify bottlenecks, and optimize runtime performance.
- Work closely with external partners to ensure smooth integration of hardware accelerators with our existing infrastructure.
- Stay up-to-date with the latest advancements in hardware accelerators and deep learning technologies.
Minimum Qualifications
- Excellent Python coding skills
- Experience with hardware accelerators such as GPUs, TPUs, or AWS Trainium
- Familiarity with PyTorch
- Excellent interpersonal skills
Preferred Qualifications
- ML systems experience: large-scale distributed training and inference
- Familiarity with JAX, XLA, HLO, torch.compile
- Experience implementing and optimizing CUDA kernels