Staff Engineer, HPC Infrastructure
1 week ago
Tenstorrent is leading the industry on cutting-edge AI technology, revolutionizing performance expectations, ease of use, and cost efficiency. With AI redefining the computing paradigm, solutions must evolve to unify innovations in software models, compilers, platforms, networking, and semiconductors. Our diverse team of technologists has developed a high-performance RISC-V CPU from scratch and shares a passion for AI and a deep desire to build the best AI platform possible. We value collaboration, curiosity, and a commitment to solving hard problems. We are growing our team and looking for contributors of all seniorities.
We're seeking a Staff HPC Engineer who thrives on turning hundreds of bare-metal compute nodes into consistent, production-ready clusters through automation and infrastructure-as-code. You'll design and maintain OS deployment pipelines that provision nodes in minutes, use Ansible to eliminate configuration drift across global sites, and ensure RHEL/Ubuntu systems stay performant and reliable as our compute demands scale exponentially. In semiconductor design, where millions of EDA jobs run daily, your automation work directly translates to faster design cycles and higher cluster utilization.
This role is hybrid, based out of Austin, TX; Santa Clara, CA; or Toronto, Canada.
We welcome candidates at various experience levels for this role. During the interview process, candidates will be assessed for the appropriate level, and offers will align with that level, which may differ from the one in this posting.
Who You Are
- Deep experience with IBM Spectrum LSF or similar workload managers.
- Strong background in commercial HPC storage platforms such as Pure Storage FlashBlade, Weka, NetApp, etc.
- Hands-on experience with container technologies (Docker, Singularity, Podman).
- Solid Linux system administration skills.
- Understanding of HPC networking, storage architectures, and job scheduling.
- Ability to diagnose and resolve complex infrastructure issues independently.
- Comfortable working in a startup environment with rapidly changing requirements.
What We Need
- Design and maintain automated bare-metal provisioning pipelines that deploy hundreds of compute nodes globally with consistent configurations.
- Implement infrastructure-as-code practices using Ansible to manage large-scale OS configuration across diverse hardware platforms.
- Own the lifecycle management of RHEL and Ubuntu systems—from initial deployment through patching, upgrades, and performance tuning.
- Build automation and tooling to streamline provisioning, patching, and system updates as the compute environment scales.
- Troubleshoot OS-level issues, optimize kernel parameters, and resolve system performance bottlenecks that impact EDA workflows.
- Work directly with hardware design teams to standardize system configurations, toolchains, and development environments.
- Deploy systems and manage their lifecycle across Tenstorrent's global engineering sites, ensuring consistency and reliability.
Nice to Have
- Experience supporting EDA tools and hardware design workflows in production HPC environments.
- Hands-on expertise with commercial HPC storage platforms (Pure Storage, Weka, NetApp) and workload managers (LSF, Slurm).
- Container technologies (Docker, Singularity, Podman) for reproducible compute environments at scale.
- Advanced provisioning techniques (PXE boot, kickstart, cloud-init) and modern infrastructure automation patterns.
- Cluster monitoring and observability tools (Prometheus, Grafana) for managing thousands of compute nodes.
- Security hardening and compliance frameworks for multi-tenant semiconductor design environments.
- Integration of open-source and commercial tools to improve provisioning efficiency and reliability.
- Work in a deeply technical environment solving infrastructure challenges that directly impact chip design velocity.
Compensation for all engineers at Tenstorrent ranges from $100k - $500k including base and variable compensation targets. Experience, skills, education, background and location all impact the actual offer made.
Tenstorrent offers a highly competitive compensation package and benefits, and we are an equal opportunity employer.
This offer of employment is contingent upon the applicant being eligible to access U.S. export-controlled technology. Due to U.S. export laws, including those codified in the U.S. Export Administration Regulations (EAR), the Company is required to ensure compliance with these laws when transferring technology to nationals of certain countries (such as EAR Country Groups D:1, E1, and E2). These requirements apply to persons located in the U.S. and all countries outside the U.S. As the position offered will have direct and/or indirect access to information, systems, or technologies subject to these laws, the offer may be contingent upon your citizenship/permanent residency status or ability to obtain prior license approval from the U.S. Commerce Department or applicable federal agency. If employment is not possible due to U.S. export laws, any offer of employment will be rescinded.
-
Staff Software Engineer, GPU Infrastructure
4 weeks ago
Toronto, Canada The Rundown AI, Inc. Full time
Who are we? Our mission is to scale intelligence to serve humanity. We’re training and deploying frontier models for developers and enterprises who are building AI systems to power magical experiences like content generation, semantic search, RAG, and agents. We believe that our work is instrumental to the widespread adoption of AI. We obsess over what we...
-
Senior SRE
7 days ago
Toronto, Canada Boson AI Full time
A leading AI technology firm in Toronto is seeking a Senior Site Reliability Engineer to manage and optimize HPC cluster operations. You will deploy infrastructure-as-code solutions and support research teams with cluster optimization. Ideal candidates have over 5 years in SRE or HPC operations, proficiency in Linux, and experience with Kubernetes. The role...
-
HPC Software Engineer
3 weeks ago
Edmonton, Toronto, Montreal, Calgary, Vancouver, Old Toronto, Ottawa, Mississauga, Quebec, Winnipeg, Halifax, Saskatoon, Burnaby, Hamilton, Surrey, Victoria, London, Halton Hills, Regina, Markham, Brampton, Vaughan, Kelowna, Laval, Southwestern Ontario, R, Canada Canonical Full time
Canonical is a leading provider of open source software and operating systems to the global enterprise and technology markets. Our platform, Ubuntu, is very widely used in breakthrough...
-
HPC Software Engineer
3 weeks ago
Sherbrooke, Toronto, Montreal, Calgary, Vancouver, Edmonton, Old Toronto, Ottawa, Mississauga, Quebec, Winnipeg, Halifax, Saskatoon, Burnaby, Hamilton, Surrey, Victoria, London, Halton Hills, Regina, Markham, Brampton, Vaughan, Kelowna, Laval, Southwester, Canada Canonical Full time
Canonical is a leading provider of open source software and operating systems to the global enterprise and technology markets. Our...
-
HPC Software Engineer
3 weeks ago
Winnipeg, Toronto, Montreal, Calgary, Vancouver, Edmonton, Old Toronto, Ottawa, Mississauga, Quebec, Halifax, Saskatoon, Burnaby, Hamilton, Surrey, Victoria, London, Halton Hills, Regina, Markham, Brampton, Vaughan, Kelowna, Laval, Southwestern Ontario, R, Canada Canonical Full time
-
HPC Engineer
2 weeks ago
Toronto, Canada Wyatt Partners Full time
If you are interested in managing large AI clusters equipped with Nvidia AI chips, this role could be a great fit for you. You will lead a team and be responsible for maintaining the environment and ensuring the smooth operation of one of our client's major AI Clusters. We welcome applications from engineers who have not previously led a team or see this...
-
Site Reliability Engineer, AI/ML Infrastructure
4 weeks ago
Toronto, Canada Boson AI Full time
About The Role
We're looking for a Senior Site Reliability Engineer to help us run one of the most exciting GPU clusters around: our Toronto datacenter packed with NVIDIA H100 and A100 GPUs, over 20PB of Ceph storage, terabit networking, and hundreds of servers. You'll be hands-on with the full lifecycle of HPC infrastructure: planning, building, testing,...
-
Toronto, Ontario, Canada Boson AI Full time $120,000 - $180,000 per year
About The Role
We're looking for a Senior Site Reliability Engineer to help us run one of the most exciting GPU clusters around: our Toronto datacenter packed with NVIDIA H100 and A100 GPUs, over 20PB of Ceph storage, terabit networking, and hundreds of servers. You'll be hands-on with the full lifecycle of HPC infrastructure: planning, building, testing,...