Site Reliability Engineering Manager

2 days ago


Old Toronto, Canada Tbwa ChiatDay Inc Full time

Automate and Optimize Brick and Mortar Retail

Focal Systems is the industry leader in retail AI solutions, revolutionizing brick and mortar retail with deep learning computer vision. As a Silicon Valley-based startup, we have more than doubled in size every year since inception.

Our Mission

We are looking for smart, creative, and passionate individuals who want to help build a great and enduring company. Our mission is to deploy deep learning to the world and automate and optimize brick and mortar retail using advanced technology.

About Us

We pride ourselves on recruiting exceptional individuals to help us redefine the state-of-the-art. Our team consists of hard-working, fun-loving professionals from renowned universities, research labs, and tech companies. We care deeply about the health, happiness, and wellbeing of all our employees.

Job Description

The Senior Site Reliability Engineer will be responsible for setting up and managing blue/green and canary deployments to ensure smooth launches without downtime. This role also involves managing distributed services, ensuring comprehensive test coverage, tracking logs, and maintaining 99% uptime. Additionally, the successful candidate will work with Backend, Frontend, and Deep Learning teams to write infrastructure automation code for their needs.

Responsibilities

  • Set up and manage blue/green and canary deployments to ensure seamless launches without downtime.
  • Manage various distributed services, ensuring continuous operation and monitoring.
  • Work with cross-functional teams to develop and implement infrastructure automation code.
  • Identify scalability bottlenecks through load testing and plan infrastructure architecture.
  • Create tools for data access and transparency across various geographic locations and data formats.
  • Design, build, and maintain a robust Continuous Integration and Continuous Deployment (CI/CD) pipeline.

Requirements

  • Solid experience in an infrastructure or Site Reliability Engineer (SRE) role.
  • In-depth knowledge of SQL, networking, distributed systems, operating systems (Debian), and software engineering practices.
  • Expertise in Terraform or another Infrastructure as Code (IaC) automation tool.
  • Experience with relational SQL databases and Redis at terabyte scale.
  • Proven track record in setting up monitoring/alerting and reliability engineering.
  • Proficiency in scripting languages such as Python.
  • Able to handle 12-hour on-call rotations.
  • Experience with complex load testing scenarios and automation setup.
  • Experience tuning deep learning pipelines with Python, PyTorch, and multiprocessing.
  • Backend programming skills with Python.

Estimated Salary: $150,000 - $200,000 per annum



  • Old Toronto, Canada TD Full time

    Job Overview: We are seeking a highly skilled Site Reliability Engineering Lead to join our team at TD. As a key member of our technology group, you will be responsible for ensuring the stability, scalability, and reliability of our platforms. About the Role: The ideal candidate will have a minimum of 8 years of experience in site reliability engineering, with a...


  • Toronto, Canada CB Canada Full time

    Site Reliability Engineer. On behalf of our client in the Banking Sector, PROCOM is looking for a Site Reliability Engineer. Site Reliability Engineer – Job Description: Azure cloud; Jira and Confluence; CI/CD; experience with automating (provisioning, configuration management, deployment) and integrating Azure PaaS solutions (Azure App services, Azure...


  • Old Toronto, Canada Street Context Full time

    Are you a Site Reliability Engineer that has a passion for building reliable, resilient and performant systems that scale? We are on a mission to build and strengthen our engineering teams to match the accelerating success of Street Context. We provide a premium Email, Analytics and Broker Relationship platform, purpose-built for capital markets and...


  • Old Toronto, Canada Soda Full time

    Job Description. Job Title: Site Reliability Engineer. Location: Poland - Fully Remote. Salary: 324K PLN or 27.3K monthly. Start: ASAP. Stack: AWS, Docker, Kubernetes, Terraform, Jenkins, Ansible, Linux, JavaScript, and Lambda. Are you a seasoned DevOps/SRE professional passionate about building high-performance, scalable systems? I am working with a Media/IT...


  • Old Toronto, Canada Thomson Reuters Full time

    (Canada) Site Reliability Engineer (Contract). Contract (9 months 4 days). Published 3 days ago. New Relic, Data Dog. Site Reliability Engineer - in the Service Management Organization. Do you have experience in IT Service Management, working with cloud providers, software development, and technology infrastructure? The Site Reliability Engineer will analyze chronic...


  • Old Toronto, Canada Mastech Inc. Full time

    Mastech Digital is an IT Staffing and Digital Transformation Services company. Mastech Digital provides digital and mainstream technology staff as well as Digital Transformation Services for all American Corporations. We are currently seeking a Site Reliability Engineer (GCP) for our client in the Consulting domain. We value our professionals, providing...


  • Old Toronto, Canada Infotree Global Solutions Full time

    About Infotree Global Solutions: Infotree Global Solutions is a leading provider of innovative solutions, and we're seeking an experienced Site Reliability Engineer to lead our team. Your Role: As our Site Reliability Engineering Lead, you will be responsible for supervising a team of skilled engineers and ensuring the reliability and scalability of our global...


  • Old Toronto, Canada Sentry Full time

    The Site Reliability Engineering team is responsible for the deployment, configuration, maintenance, and monitoring of Sentry's hosted platform. We do this by leveraging automation tools to automatically spin up and scale services to meet the traffic demands of 1,000,000+ developers. Sentry receives over a billion events a day and processes terabytes of...


  • Old Toronto, Canada The Home Depot Canada Full time

    With a career at The Home Depot, you can be yourself and also be part of something bigger. Position Overview: The Manager, SRE will lead a team of Site Reliability Engineers to ensure the reliability, performance, and operational support of our eCommerce systems, with a focus on Google Cloud Platform (GCP) environments. This role requires a strong background...


  • Old Toronto, Canada Tecsys Inc. Full time

    Having recognized the advantages of remote work, including employee morale, productivity, reduced commuting on employee wellbeing and the environment, we are proud to be a digital-first company. The technologies and programs in which we invested have provided a fantastic foundation to this end. Our digital-first work environment, together with our...


  • Old Toronto, Canada Olx Full time

    Site Reliability Engineer. Remote Poland, Poland. OLX – Engineering / Full-time / Remote. At OLX, we work together to build a more sustainable world through trade. We make it safe, smart, and convenient to buy and sell cars, find housing, get jobs, buy and sell household goods, and more. Our colleagues around the world help to serve millions of people around...


  • Old Toronto, Canada RBC Full time

    About the Role: We are seeking an experienced Senior Site Reliability Engineer to join our US Cash Management Technology team at RBC. As a key member of our team, you will be responsible for leading the development, implementation, and support of Site Reliability Engineering (SRE) solutions for applications supported by the Commercial, Core Banking, and...


  • Old Toronto, Canada Ascend Fundraising Solutions Full time

    We are currently seeking a full-time Site Reliability Engineer to join our IT team. In this role, you will collaborate closely with the client services team to diagnose, troubleshoot, and resolve issues related to system reliability. Responsibilities: Take ownership of customer-reported issues and see problems through to resolution. Develop preventive measures...


  • Old Toronto, Canada Tecsys Full time

    Tecsys is a fast-growing innovator offering supply chain solutions to industry-leading healthcare systems, hospitals, and pharmacy businesses to distributors, retailers, and 3PLs. As a Cloud Infrastructure Specialist, you will be responsible for ensuring the reliability and uptime of our platform and applications in a data-driven way to support internal and...


  • Old Toronto, Canada https:www.energyjobline.comsitemap.xml Full time

    Product: Global Platform Engineering. Your role: supervise a team of Site Reliability Engineers; report metrics on application performance and incidents; act proactively and responsively to infrastructure and application failures; build and automate failover and recovery workflows; implement observability and monitoring stack for infrastructure and application...


  • Toronto, Ontario, Canada Royal Bank of Canada Full time

    Royal Bank of Canada is seeking a highly skilled Site Reliability Engineering (SRE) leader to join our team in Toronto, Canada. As an SRE leader, you will be responsible for leading the development and implementation of SRE solutions that improve the reliability and performance of our applications. The ideal candidate will have 5+ years of experience as a...