Senior SRE: AI/ML GPU HPC Infra
2 weeks ago
A technology company in Toronto is seeking a Senior Site Reliability Engineer to manage and optimize a cutting-edge GPU cluster. The role involves hands-on lifecycle management of HPC infrastructure, troubleshooting, and developing automation for operational efficiency. Candidates should have over 5 years of experience in SRE or HPC and be proficient in Linux and Kubernetes. The position offers a competitive salary of $150,000 - $250,000 a year.
-
Senior SRE: AI/ML GPU HPC Infra
2 weeks ago
Toronto, Canada Boson AI Full time
A technology company in Toronto is seeking a Senior Site Reliability Engineer to manage and optimize a cutting-edge GPU cluster. The role involves hands-on lifecycle management of HPC infrastructure, troubleshooting, and developing automation for operational efficiency. Candidates should have over 5 years of experience in SRE or HPC and be proficient in...
-
Toronto, Canada Boson AI Full time
A leading technology company in Toronto is seeking a Senior High Performance Computing Engineer to manage one of the most advanced GPU clusters. You'll handle the full lifecycle of HPC infrastructure, from planning to deployment, and work closely with engineering teams. Candidates should have 5+ years of experience in HPC operations, proficiency in Linux,...
-
HPC Engineer, AI/ML Infrastructure
3 weeks ago
Toronto, Canada Boson AI Full time
Base pay range: CA$150,000.00/yr - CA$250,000.00/yr
About The Role
We're looking for a Senior High Performance Computing Engineer to help us run one of the most exciting GPU clusters around: our Toronto datacenter packed with NVIDIA H100 and A100 GPUs, over 20PB of Ceph storage, terabit networking, and hundreds of servers. You'll be hands-on with the full...
-
HPC Engineer, AI/ML Infrastructure
4 weeks ago
Toronto, Canada Boson AI Full time
Base pay range: CA$150,000.00/yr - CA$250,000.00/yr
About The Role
We're looking for a Senior High Performance Computing Engineer to help us run one of the most exciting GPU clusters around: our Toronto datacenter packed with NVIDIA H100 and A100 GPUs, over 20PB of Ceph storage, terabit networking, and hundreds of servers. You'll be hands-on with the full...
-
Senior HPC
4 weeks ago
Toronto, Canada Boson AI Full time
A leading tech company in Toronto is seeking a Senior High Performance Computing Engineer to manage a GPU cluster and support ML teams. This role requires 5+ years of HPC operations experience, proficiency in Linux systems, and knowledge of Kubernetes. Candidates will develop automation solutions and optimize infrastructure in a dynamic environment. The...
-
HPC Engineer, AI/ML Infrastructure
1 week ago
Toronto, Ontario, Canada Boson AI Full time US$150,000 - US$250,000
About The Role
We're looking for a Senior High Performance Computing Engineer to help us run one of the most exciting GPU clusters around: our Toronto datacenter packed with NVIDIA H100 and A100 GPUs, over 20PB of Ceph storage, terabit networking, and hundreds of servers. You'll be hands-on with the full lifecycle of HPC infrastructure: planning, building,...
-
HPC Engineer, AI/ML Infrastructure
1 week ago
Toronto, Ontario, Canada Boson AI Full time $120,000 - $180,000 per year
About The Role
We're looking for a Senior High Performance Computing Engineer to help us run one of the most exciting GPU clusters around: our Toronto datacenter packed with NVIDIA H100 and A100 GPUs, over 20PB of Ceph storage, terabit networking, and hundreds of servers. You'll be hands-on with the full lifecycle of HPC infrastructure: planning, building,...
-
Site Reliability Engineer, AI/ML Infrastructure
3 weeks ago
Toronto, Canada Boson AI Full time
About The Role
We're looking for a Senior Site Reliability Engineer to help us run one of the most exciting GPU clusters around: our Toronto datacenter packed with NVIDIA H100 and A100 GPUs, over 20PB of Ceph storage, terabit networking, and hundreds of servers. You'll be hands-on with the full lifecycle of HPC infrastructure: planning, building, testing,...