Network Engineer, AI/ML Infrastructure
6 days ago
We're seeking an experienced Network Engineer to design, build, and optimize the high-performance networking infrastructure powering our AI/ML operations in Toronto. You'll work at the cutting edge of network technology—managing InfiniBand and ultra-high-speed Ethernet fabrics that connect NVIDIA H100 and A100 GPUs, over 20PB of Ceph storage, and hundreds of servers.
You'll be hands-on with the full lifecycle of our network infrastructure: planning, building, testing, deploying, and keeping everything running at peak performance. That means troubleshooting issues as they arise, monitoring network performance and throughput, developing automation to streamline operations, and working closely with HPC and ML teams to ensure they have the bandwidth they need. You'll also help us plan for future capacity and evaluate emerging network technologies as we scale to meet increasingly demanding workloads.
Responsibilities
- Configure and maintain InfiniBand and high-speed Ethernet fabrics
- Optimize network performance for RDMA and GPU-to-GPU communication
- Manage network switches (Mellanox, NVIDIA, Micas Networks)
- Troubleshoot network bottlenecks and latency issues
- Plan and execute network upgrades and expansions
- Implement network security (firewalls, VLANs, ACLs)
- Collaborate on storage network optimization
- Monitor infrastructure
Qualifications
- 4+ years of network engineering experience in production environments
- Strong understanding of L2/L3 networking protocols (TCP/IP, BGP, OSPF, VLANs)
- Hands-on experience with high-speed networking (100Gb+ Ethernet and InfiniBand)
- Hands-on experience with network security (firewalls, ACLs, network segmentation)
- Knowledge of HPC network topologies
- Experience with RDMA networking, including InfiniBand fabrics, RoCE, and IPoIB
- Strong troubleshooting and problem-solving skills
- Experience in data center environments or AI/ML infrastructure
- Hands-on experience with high-performance Ethernet switches (e.g., Broadcom Tomahawk) and the latest InfiniBand switches (e.g., NVIDIA/Mellanox)
- Experience optimizing networks for GPU-to-GPU communication
- Experience with open-source firewall solutions (OPNsense, pfSense, or similar)
- Experience with network automation tools
- Understanding of distributed storage networking (Ceph cluster networks)
- Familiarity with network monitoring and observability tools (Prometheus, Grafana)
- Knowledge of multi-site network connectivity and WAN optimization
- Familiarity with cloud networking in at least one platform (AWS, GCP, or Azure) including VPC design, site-to-site VPN configuration, Direct Connect/ExpressRoute/Cloud Interconnect, hybrid cloud connectivity, and cloud-to-datacenter network integration
Network Engineer, AI/ML Infrastructure
2 weeks ago
Toronto, Ontario, Canada | Boson AI | Full time | US$150,000 - US$250,000