High Performance Computing Engineer

3 weeks ago


Toronto, Ontario, Canada Boson AI Full time
Boson AI is a startup building large language tools for everyone to use. Our founders (Alex Smola and Mu Li) and a team of deep learning, optimization, NLP, AutoML, and statistics scientists and engineers are working on high-quality generative AI models for language, audio, and entertainment.

About The Role

We are looking for a Senior High Performance Computing Engineer to help us operate the GPUs, network, and filesystem in our datacenter deployment in Toronto. The ideal candidate has strong problem-solving skills and the ability to learn new tools. Experience with Slurm, MAAS, Ceph, InfiniBand, NVIDIA DeepOps, Ethernet networking, and related tools is a big plus. You should be comfortable performing some amount of hardware configuration.

You will have the opportunity to work with NVIDIA H100 and A100 GPUs, over 20 PB of storage, terabit networking, and hundreds of computers. You will be responsible for deploying and operating a broad range of infrastructure technologies and hardware systems.

A day in the life:

- Manage large, private, high-end GPU clusters
- Own the full lifecycle of physical systems, including deployment of new hardware, operations, triage, and troubleshooting
- Configure and maintain network switches (Tomahawk Ethernet, Mellanox InfiniBand)
- Configure and maintain MAAS, Ceph, Slurm and Kubernetes
- Configure and automate on-premises Linux-based systems at scale using infrastructure-as-code practices
- Configure and maintain the network, e.g. Layer 3 networking
- Learn about new tools and deploy them

You might be a great fit if you have:

- Strong background in high performance computing
- Experience with on-premises data center operations and technologies
- Experience in managing a large hardware cluster
- Proficiency in at least one programming language (e.g. Python) and ability to write clean, maintainable code
- Experience in designing, deploying, and maintaining production-grade machine learning systems at scale
- Familiarity with GPU utilization for machine learning workloads and optimization techniques
- Experience managing firmware and system updates, e.g. on Supermicro hardware

$150,000 - $250,000 a year

The ability to solve problems and to learn new techniques is key.

