Senior ML Systems Engineer, Frameworks
5 days ago
Who are we?
Our mission is to scale intelligence to serve humanity. We're training and deploying frontier models for developers and enterprises who are building AI systems to power magical experiences like content generation, semantic search, RAG, and agents. We believe that our work is instrumental to the widespread adoption of AI.
We obsess over what we build. Each one of us is responsible for increasing the capabilities of our models and the value they drive for our customers. We like to work hard and move fast to do what's best for our customers.
Cohere is a team of researchers, engineers, designers, and more, who are passionate about their craft. Each person is one of the best in the world at what they do. We believe that a diverse range of perspectives is a requirement for building great products.
Join us on our mission and shape the future
We're looking for a senior engineer to help build, maintain and evolve the training framework that powers our frontier-scale language models. This role sits at the intersection of large-scale training, distributed systems, and HPC infrastructure. You will design and maintain the core components that enable fast, reliable, and scalable model training — and build the tooling that connects research ideas to thousands of GPUs.
If you enjoy working across the full stack of ML systems, this role gives you the opportunity and autonomy to have massive impact.
What You'll Work On
Build and own the training framework responsible for large-scale LLM training.
Design distributed training abstractions (data/tensor/pipeline parallelism, FSDP/ZeRO strategies, memory management, checkpointing).
Improve training throughput and stability on multi-node clusters (e.g., GB200/300, AMD, H200/100).
Develop and maintain tooling for monitoring, logging, debugging, and developer ergonomics.
Collaborate closely with infra teams to ensure Slurm setups, container environments, and hardware configurations support high-performance training.
Investigate and resolve performance bottlenecks across the ML systems stack.
Build robust systems that ensure reproducible, debuggable, large-scale runs.
Strong engineering experience in large-scale distributed training or HPC systems.
Deep familiarity with JAX internals, distributed training libraries, or custom kernels/fused ops.
Experience with multi-node cluster orchestration (Slurm, Ray, Kubernetes, or similar).
Comfort debugging performance issues across CUDA/NCCL, networking, IO, and data pipelines.
Experience working with containerized environments (Docker, Singularity/Apptainer).
A track record of building tools that increase developer velocity for ML teams.
Excellent judgment around trade-offs: performance vs complexity, research velocity vs maintainability.
Strong collaboration skills — you'll work closely with infra, research, and deployment teams.
Experience with training LLMs or other large transformer architectures.
Contributions to ML frameworks (PyTorch, JAX, DeepSpeed, Megatron, xFormers, etc.).
Familiarity with evaluation and serving frameworks (vLLM, TensorRT-LLM, custom KV caches).
Experience with data pipeline optimization, sharded datasets, or caching strategies.
Background in performance engineering, profiling, or low-level systems.
Bonus: papers at top-tier venues (such as NeurIPS, ICML, ICLR, AISTATS, MLSys, JMLR, AAAI, Nature, COLING, ACL, EMNLP).
You'll work on some of the most challenging and consequential ML systems problems today.
You'll collaborate with a world-class team working fast and at scale.
You'll have end-to-end ownership over critical components of the training stack.
You'll shape the next generation of infrastructure for frontier-scale models.
You'll build tools and systems that directly accelerate research and model quality.
Sample Projects:
Build a high-performance data loading and caching pipeline.
Implement performance profiling across the ML systems stack.
Develop internal metrics and monitoring for training runs.
Build reproducibility and regression testing infrastructure.
Develop a performant fault-tolerant distributed checkpointing system.
If some of the above doesn't line up perfectly with your experience, we still encourage you to apply.
We value and celebrate diversity and strive to create an inclusive work environment for all. We welcome applicants from all backgrounds and are committed to providing equal opportunities. Should you require any accommodations during the recruitment process, please submit an Accommodations Request Form, and we will work together to meet your needs.
Full-Time Employees at Cohere enjoy these Perks:
An open and inclusive culture and work environment
Work closely with a team on the cutting edge of AI research
Weekly lunch stipend, in-office lunches & snacks
Full health and dental benefits, including a separate budget to take care of your mental health
100% Parental Leave top-up for up to 6 months
Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
6 weeks of vacation (30 working days)