Senior ML Systems Engineer, Frameworks
3 weeks ago
Role: Senior ML Systems Engineer, Frameworks & Tooling, at Cohere

Who are we?
Our mission is to scale intelligence to serve humanity. We’re training and deploying frontier models for developers and enterprises who are building AI systems to power magical experiences like content generation, semantic search, RAG, and agents. We believe that our work is instrumental to the widespread adoption of AI.

We obsess over what we build. Each one of us is responsible for contributing to increasing the capabilities of our models and the value they drive for our customers. We like to work hard and move fast to do what’s best for our customers.

Cohere is a team of researchers, engineers, designers, and more, who are passionate about their craft. Each person is one of the best in the world at what they do. We believe that a diverse range of perspectives is a requirement for building great products.

Join us on our mission and shape the future.

We’re looking for a senior engineer to help build, maintain and evolve the training framework that powers our frontier-scale language models. This role sits at the intersection of large-scale training, distributed systems, and HPC infrastructure. You will design and maintain the core components that enable fast, reliable, and scalable model training, and build the tooling that connects research ideas to thousands of GPUs.

If you enjoy working across the full stack of ML systems, this role gives you the opportunity and autonomy to have massive impact.

What You’ll Work On
Build and own the training framework responsible for large-scale LLM training.
Design distributed training abstractions (data/tensor/pipeline parallelism, FSDP/ZeRO strategies, memory management, checkpointing).
Improve training throughput and stability on multi-node clusters (e.g., GB200/300, AMD, H200/100).
Develop and maintain tooling for monitoring, logging, debugging, and developer ergonomics.
Collaborate closely with infra teams to ensure Slurm setups, container environments, and hardware configurations support high-performance training.
Investigate and resolve performance bottlenecks across the ML systems stack.
Build robust systems that ensure reproducible, debuggable, large-scale runs.

You Might Be a Good Fit If You Have
Strong engineering experience in large-scale distributed training or HPC systems.
Deep familiarity with JAX internals, distributed training libraries, or custom kernels/fused ops.
Experience with multi-node cluster orchestration (Slurm, Ray, Kubernetes, or similar).
Comfort debugging performance issues across CUDA/NCCL, networking, IO, and data pipelines.
Experience working with containerized environments (Docker, Singularity/Apptainer).
A track record of building tools that increase developer velocity for ML teams.
Excellent judgment around trade-offs: performance vs complexity, research velocity vs maintainability.
Strong collaboration skills: you’ll work closely with infra, research, and deployment teams.

Nice to Have
Experience with training LLMs or other large transformer architectures.
Contributions to ML frameworks (PyTorch, JAX, DeepSpeed, Megatron, xFormers, etc.).
Familiarity with evaluation and serving frameworks (vLLM, TensorRT-LLM, custom KV caches).
Experience with data pipeline optimization, sharded datasets, or caching strategies.
Background in performance engineering, profiling, or low-level systems.
Bonus: paper at top-tier venues (such as NeurIPS, ICML, ICLR, AIStats, MLSys, JMLR, AAAI, Nature, COLING, ACL, EMNLP).

Why Join Us
You’ll work on some of the most challenging and consequential ML systems problems today.
You’ll collaborate with a world-class team working fast and at scale.
You’ll have end-to-end ownership over critical components of the training stack.
You’ll shape the next generation of infrastructure for frontier-scale models.
You’ll build tools and systems that directly accelerate research and model quality.

Sample Projects
Build a high-performance data loading and caching pipeline.
Implement performance profiling across the ML systems stack.
Develop internal metrics and monitoring for training runs.
Build reproducibility and regression testing infrastructure.
Develop a performant fault-tolerant distributed checkpointing system.

We value and celebrate diversity and strive to create an inclusive work environment for all. We welcome applicants from all backgrounds and are committed to providing equal opportunities. Should you require any accommodations during the recruitment process, please submit an Accommodations Request Form, and we will work together to meet your needs.

Full-Time Employees At Cohere Enjoy These Perks
An open and inclusive culture and work environment
Work closely with a team on the cutting edge of AI research
Weekly lunch stipend, in-office lunches & snacks
Full health and dental benefits, including a separate budget to take care of your mental health
100% Parental Leave top-up for up to 6 months
Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
6 weeks of vacation (30 working days)
-
Senior ML Systems Engineer, Frameworks
7 hours ago
Toronto, Canada | Cohere | Full time
Who are we? Our mission is to scale intelligence to serve humanity. We’re training and deploying frontier models for developers and enterprises who are building AI systems to power magical experiences like content generation, semantic search,...
-
Senior ML Systems Engineer: Frameworks
3 weeks ago
Toronto, Canada | Cohere | Full time
A leading AI company in Toronto seeks a Senior ML Systems Engineer, Frameworks & Tooling. In this role, you'll design and maintain the training framework for large-scale language models. You will work on distributed systems and HPC infrastructure, improving training throughput and stability. Ideal candidates have strong experience in large-scale training,...
-
Senior ML Systems Engineer: Frameworks
7 hours ago
Toronto, Canada | Cohere | Full time
A leading AI company in Toronto seeks a Senior ML Systems Engineer, Frameworks & Tooling. In this role, you'll design and maintain the training framework for large-scale language models. You will work on distributed systems and HPC infrastructure, improving training throughput and stability. Ideal candidates have strong experience in large-scale training,...
-
Sr. Inference ML Runtime Engineer
3 weeks ago
Toronto, Canada | Cerebras Systems | Full time
Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. Our novel wafer-scale architecture provides the AI compute power of dozens of GPUs on a single chip, with the programming simplicity of a single device. This approach allows Cerebras to deliver industry-leading training and inference speeds and empowers machine learning users to...
-
Senior AI Engineer
3 weeks ago
Toronto, Canada | Refinitiv | Full time
A leading technology firm is seeking a Senior Software Engineer - AI II to design and implement complex AI systems, ensuring quality and compliance. This role demands collaboration with cross-functional teams to align AI features with product goals. The ideal candidate has extensive experience in software engineering and proficiency in Python and AI/ML...
-
Senior AI/ML Engineer
14 minutes ago
Toronto, Canada | PRAGMATIKE | Full time
Location: Canada
Start date: ASAP
Languages: English (required)
About the Role
Pragmatike is hiring on behalf of a confidential client for a Senior AI/ML Engineer role on a high-impact AI project. This is a production-focused position for engineers who build real AI systems, not research prototypes or demo applications. You will work across the full AI...
-
Senior ML Platform Engineer
7 hours ago
Toronto, Canada | Rakuten Kobo Inc. | Full time
Senior Machine Learning Engineer
Locations: Toronto, Canada
Time type: Full time
Posted on: Posted Yesterday
Job requisition id: 1030578
Job Description: Here at Rakuten Kobo Inc. we offer a casual working start-up environment and a group of friendly and talented individuals. Our...
-
Senior ML Performance Engineer
3 weeks ago
Toronto, Canada | Lemurian Labs Inc. | Full time
About Us
At Lemurian Labs, we're on a mission to bring the power of AI to everyone, without leaving a massive environmental footprint. We care deeply about the impact AI has on our society and planet, and we're building a solid foundation for its future, ensuring AI grows sustainably and responsibly. Innovation should help the world, not harm it. We are...
-
Senior ML Performance Engineer
4 weeks ago
Toronto, Canada | Lemurian Labs Inc. | Full time
About Us
At Lemurian Labs, we're on a mission to bring the power of AI to everyone, without leaving a massive environmental footprint. We care deeply about the impact AI has on our society and planet, and we're building a solid foundation for its future, ensuring AI grows sustainably and responsibly. Innovation should help the world, not harm it. We are...
-
Senior ML Performance Engineer
7 hours ago
Toronto, Canada | Lemurian Labs Inc. | Full time
About Us
At Lemurian Labs, we're on a mission to bring the power of AI to everyone, without leaving a massive environmental footprint. We care deeply about the impact AI has on our society and planet, and we're building a solid foundation for its future, ensuring AI grows sustainably and responsibly. Innovation should help the world, not harm it. We are...