Senior SRE: AI/ML GPU HPC Infra

2 weeks ago


Toronto, Canada Boson AI Full time

A technology company in Toronto is seeking a Senior Site Reliability Engineer to manage and optimize a cutting-edge GPU cluster. The role involves hands-on lifecycle management of HPC infrastructure, troubleshooting, and developing automation for operational efficiency. Candidates should have over 5 years of experience in SRE or HPC and be proficient in Linux and Kubernetes. The position offers a competitive salary of $150,000 - $250,000 a year.



  • Toronto, Canada Boson AI Full time

    A technology company in Toronto is seeking a Senior Site Reliability Engineer to manage and optimize a cutting-edge GPU cluster. The role involves hands-on lifecycle management of HPC infrastructure, troubleshooting, and developing automation for operational efficiency. Candidates should have over 5 years of experience in SRE or HPC and be proficient in...


  • Toronto, Canada Boson AI Full time

    A leading technology company in Toronto is seeking a Senior High Performance Computing Engineer to manage one of the most advanced GPU clusters. You'll handle the full lifecycle of HPC infrastructure, from planning to deployment, and work closely with engineering teams. Candidates should have 5+ years of experience in HPC operations, proficiency in Linux,...


  • Toronto, Canada Boson AI Full time

    Base pay range CA$150,000.00/yr - CA$250,000.00/yr About The Role We're looking for a Senior High Performance Computing Engineer to help us run one of the most exciting GPU clusters around—our Toronto datacenter packed with NVIDIA H100 and A100 GPUs, over 20PB of Ceph storage, terabit networking, and hundreds of servers. You'll be hands‑on with the full...


  • Senior HPC

    4 weeks ago


    Toronto, Canada Boson AI Full time

    A leading tech company in Toronto is seeking a Senior High Performance Computing Engineer to manage a GPU cluster and support ML teams. This role requires 5+ years of HPC operations experience, proficiency in Linux systems, and knowledge of Kubernetes. Candidates will develop automation solutions and optimize infrastructure in a dynamic environment. The...


  • Toronto, Ontario, Canada Boson AI Full time US$150,000 - US$250,000

    About The Role We're looking for a Senior High Performance Computing Engineer to help us run one of the most exciting GPU clusters around: our Toronto datacenter, packed with NVIDIA H100 and A100 GPUs, over 20PB of Ceph storage, terabit networking, and hundreds of servers. You'll be hands-on with the full lifecycle of HPC infrastructure: planning, building,...


  • Toronto, Ontario, Canada Boson AI Full time $120,000 - $180,000 per year

    About The Role We're looking for a Senior High Performance Computing Engineer to help us run one of the most exciting GPU clusters around: our Toronto datacenter, packed with NVIDIA H100 and A100 GPUs, over 20PB of Ceph storage, terabit networking, and hundreds of servers. You'll be hands-on with the full lifecycle of HPC infrastructure: planning, building,...


  • Toronto, Canada Boson AI Full time

    About The Role We're looking for a Senior Site Reliability Engineer to help us run one of the most exciting GPU clusters around—our Toronto datacenter packed with NVIDIA H100 and A100 GPUs, over 20PB of Ceph storage, terabit networking, and hundreds of servers. You'll be hands‑on with the full lifecycle of HPC infrastructure: planning, building, testing,...