Job Description
Summary
Description
- Lead the team in designing and implementing automation for model training, testing, validation, and deployment
- Collaborate with machine learning engineers to ensure efficient deployment and scaling of ML models
- Implement monitoring and alerting systems to track model performance, system health, and data drift
- Optimize compute resources for cost and performance efficiency
- Manage model versions to ensure traceability and reproducibility
Minimum Qualifications
- 6+ years of experience designing and implementing large-scale ML systems or distributed systems
- Experience with model pipeline and registry tools, detecting and preventing model drift, automating model monitoring, and ensuring model accuracy
- Proficiency in programming languages such as Python, Java or Golang
- Effective communication skills in written and spoken English
- Bachelor's degree or higher in Software Engineering, Computer Science, Machine Learning, or a related field
Preferred Qualifications
- Experience with machine learning frameworks such as TensorFlow, PyTorch, AutoGluon, XGBoost or scikit-learn
- Experience with DevOps tools such as Docker, Jenkins, Ansible, Grafana, Prometheus, Elastic, or Kubernetes
- Familiarity with CI/CD deployment practices
- Experience with SQL and database systems such as PostgreSQL
- Experience building ETL pipelines in data warehouses such as Snowflake
- Experience with inference optimization