Senior SRE — AI/ML GPU HPC Infrastructure

1 week ago

Toronto, Canada Boson AI Full time

A technology company in Toronto seeks a Senior Site Reliability Engineer to manage and optimize HPC cluster operations in a datacenter equipped with advanced GPUs. The ideal candidate has over 5 years of experience, proficiency in Linux and Kubernetes, and skills in automation tools. Responsibilities include managing infrastructure, supporting ML teams, and developing automation for operational efficiency. The salary range is CA$150,000 to CA$250,000 annually.

Senior SRE: AI/ML HPC Infra

2 weeks ago

Toronto, Canada Boson AI Full time

A technology-driven AI company is seeking a Site Reliability Engineer to manage and optimize their advanced GPU cluster in Toronto. You'll be engaged in planning, deployment, and operation of HPC infrastructure while working closely with engineering teams. Ideal candidates will have a strong foundation in Linux systems, Kubernetes, and significant experience...
Senior SRE — AI/ML GPU HPC Infrastructure

2 weeks ago

Toronto, Canada Boson AI Full time

A technology company in Toronto seeks a Senior Site Reliability Engineer to manage and optimize HPC cluster operations in a datacenter equipped with advanced GPUs. The ideal candidate has over 5 years of experience, proficiency in Linux and Kubernetes, and skills in automation tools. Responsibilities include managing infrastructure, supporting ML teams, and...
Senior SRE for AI/ML HPC Infra

1 week ago

Toronto, Canada Boson AI Full time

A technology company in Toronto is seeking a Senior Site Reliability Engineer to manage and optimize their HPC cluster operations. The role includes deploying infrastructure-as-code solutions and supporting research teams with cluster optimization. The ideal candidate will have over 5 years of experience in SRE or HPC operations, proficiency in Linux and...
Senior SRE for AI/ML HPC Infra

6 days ago

Toronto, Canada Boson AI Full time

A technology company in Toronto is seeking a Senior Site Reliability Engineer to manage and optimize their HPC cluster operations. The role includes deploying infrastructure-as-code solutions and supporting research teams with cluster optimization. The ideal candidate will have over 5 years of experience in SRE or HPC operations, proficiency in Linux and...
Site Reliability Engineer, AI/ML Infrastructure

2 weeks ago

Toronto, Canada Boson AI Full time

Base pay range (provided by Boson AI): CA$150,000.00/yr - CA$250,000.00/yr. Actual pay will be based on your skills and experience; talk with your recruiter to learn more. About the role: We're looking for a Senior Site Reliability Engineer to help us run one of the most exciting GPU clusters...
Site Reliability Engineer, AI/ML Infrastructure

4 weeks ago

Toronto, Canada Boson AI Full time

Base pay range: CA$150,000.00/yr - CA$250,000.00/yr. About the role: We're looking for a Senior Site Reliability Engineer to help us run one of the most exciting GPU clusters around: our Toronto datacenter packed with NVIDIA H100 and A100 GPUs, over 20 PB of Ceph storage, terabit networking, and hundreds of...

