Site Reliability Engineering Manager
2 days ago
Automate and Optimize Brick and Mortar Retail
Focal Systems is the industry leader in retail AI solutions, revolutionizing brick and mortar retail with deep learning computer vision. As a Silicon Valley-based startup, we have more than doubled in size every year since inception.
Our Mission
We are looking for smart, creative, and passionate individuals who want to help build a great and enduring company. Our mission is to deploy deep learning to the world and automate and optimize brick and mortar retail using advanced technology.
About Us
We pride ourselves on recruiting exceptional individuals to help us redefine the state-of-the-art. Our team consists of hard-working, fun-loving professionals from renowned universities, research labs, and tech companies. We care deeply about the health, happiness, and wellbeing of all our employees.
Job Description
The Senior Site Reliability Engineer will be responsible for setting up and managing blue/green and canary deployments to ensure smooth launches without downtime. This role also involves managing distributed services, ensuring comprehensive test coverage, tracking logs, and maintaining 99% uptime. Additionally, the successful candidate will work with Backend, Frontend, and Deep Learning teams to write infrastructure automation code for their needs.
Responsibilities
- Set up and manage blue/green and canary deployments to ensure seamless launches without downtime.
- Manage various distributed services, ensuring continuous operation and monitoring.
- Work with cross-functional teams to develop and implement infrastructure automation code.
- Identify scalability bottlenecks through load testing and plan infrastructure architecture.
- Create tools for data access and transparency across various geographic locations and data formats.
- Design, build, and maintain a robust Continuous Integration and Continuous Deployment (CI/CD) pipeline.
Requirements
- Solid experience in an infrastructure or Site Reliability Engineer (SRE) role.
- In-depth knowledge of SQL, networking, distributed systems, operating systems (Debian), and software engineering practices.
- Expertise in Terraform or another Infrastructure as Code automation solution.
- Experience with relational SQL databases and Redis at terabyte scale.
- Proven track record in setting up monitoring/alerting and reliability engineering.
- Proficiency in scripting languages such as Python.
- Able to handle 12-hour on-call rotations.
- Experience with complex load testing scenarios and automation setup.
- Experience tuning deep learning pipelines with Python, PyTorch, and multiprocessing.
- Backend programming skills with Python.
Estimated Salary: $150,000 - $200,000 per annum
-
Site Reliability Engineering Lead
7 days ago
Old Toronto, Canada TD Full time Job Overview: We are seeking a highly skilled Site Reliability Engineering Lead to join our team at TD. As a key member of our technology group, you will be responsible for ensuring the stability, scalability, and reliability of our platforms. About the Role: The ideal candidate will have a minimum of 8 years of experience in site reliability engineering, with a...
-
Site Reliability Engineer
6 months ago
Toronto, Canada CB Canada Full time Site Reliability Engineer. On behalf of our client in the Banking Sector, PROCOM is looking for a Site Reliability Engineer. Site Reliability Engineer – Job Description: Azure cloud; Jira and Confluence; CI/CD; experience with automating (provisioning, configuration management, deployment) and integrating Azure PaaS solutions (Azure App services, Azure...
-
AWS Site Reliability Engineer
1 month ago
Old Toronto, Canada Street Context Full time Are you a Site Reliability Engineer who has a passion for building reliable, resilient and performant systems that scale? We are on a mission to build and strengthen our engineering teams to match the accelerating success of Street Context. We provide a premium Email, Analytics and Broker Relationship platform, purpose-built for capital markets and...
-
AWS Site Reliability Engineer
1 month ago
Old Toronto, Canada Soda Full time Job Description. Job Title: Site Reliability Engineer. Location: Poland - Fully Remote. Salary: 324K PLN or 27.3K monthly. Start: ASAP. Stack: AWS, Docker, Kubernetes, Terraform, Jenkins, Ansible, Linux, JavaScript, and Lambda. Are you a seasoned DevOps/SRE professional passionate about building high-performance, scalable systems? I am working with a Media/IT...
-
Site Reliability Engineering Linux or Windows
2 months ago
Old Toronto, Canada Thomson Reuters Full time (Canada) Site Reliability Engineer (Contract). Contract (9 months 4 days). Published 3 days ago. New Relic, Datadog. Site Reliability Engineer - in the Service Management Organization. Do you have experience in IT Service Management, working with cloud providers, software development, and technology infrastructure? The Site Reliability Engineer will analyze chronic...
-
Digital Site Reliability Engineer
2 months ago
Old Toronto, Canada Mastech Inc. Full time Mastech Digital is an IT Staffing and Digital Transformation Services company. Mastech Digital provides digital and mainstream technology staff as well as Digital Transformation Services for all American Corporations. We are currently seeking a Site Reliability Engineer (GCP) for our client in the Consulting domain. We value our professionals, providing...
-
Site Reliability Engineering Lead
4 weeks ago
Old Toronto, Canada Infotree Global Solutions Full time About Infotree Global Solutions: Infotree Global Solutions is a leading provider of innovative solutions, and we're seeking an experienced Site Reliability Engineer to lead our team. Your Role: As our Site Reliability Engineering Lead, you will be responsible for supervising a team of skilled engineers and ensuring the reliability and scalability of our global...
-
AWS Site Reliability Engineer
2 months ago
Old Toronto, Canada Sentry Full time The Site Reliability Engineering team is responsible for the deployment, configuration, maintenance, and monitoring of Sentry's hosted platform. We do this by leveraging automation tools to automatically spin up and scale services to meet the traffic demands of 1,000,000+ developers. Sentry receives over a billion events a day and processes terabytes of...
-
Engineering Manager Site Reliability Engineer
2 months ago
Old Toronto, Canada The Home Depot Canada Full time With a career at The Home Depot, you can be yourself and also be part of something bigger. Position Overview: The Manager, SRE will lead a team of Site Reliability Engineers to ensure the reliability, performance, and operational support of our eCommerce systems, with a focus on Google Cloud Platform (GCP) environments. This role requires a strong background...
-
AWS Site Reliability Engineer
2 weeks ago
Old Toronto, Canada Tecsys Inc. Full time Having recognized the advantages of remote work, including employee morale, productivity, reduced commuting on employee wellbeing and the environment, we are proud to be a digital-first company. The technologies and programs in which we invested have provided a fantastic foundation to this end. Our digital-first work environment, together with our...
-
Manager, site reliability engineering
2 months ago
Old Toronto, Canada The Home Depot Canada Full time With a career at The Home Depot, you can be yourself and also be part of something bigger. Position Overview: The Manager, SRE will lead a team of Site Reliability Engineers to ensure the reliability, performance, and operational support of our eCommerce systems, with a focus on Google Cloud Platform (GCP) environments. This role requires a strong background...
-
Engineering Manager Site Reliability Engineer
1 month ago
Old Toronto, Canada The Home Depot Full time With a career at The Home Depot, you can be yourself and also be part of something bigger. Position Overview: The Manager, SRE will lead a team of Site Reliability Engineers to ensure the reliability, performance, and operational support of our eCommerce systems, with a focus on Google Cloud Platform (GCP) environments. This role requires a strong background...
-
AWS Site Reliability Engineer
1 month ago
Old Toronto, Canada Olx Full time Site Reliability Engineer. Remote Poland, Poland. OLX – Engineering / Full-time / Remote. At OLX, we work together to build a more sustainable world through trade. We make it safe, smart, and convenient to buy and sell cars, find housing, get jobs, buy and sell household goods, and more. Our colleagues around the world help to serve millions of people around...
-
Senior Site Reliability Engineer
7 days ago
Old Toronto, Canada RBC Full time About the Role: We are seeking an experienced Senior Site Reliability Engineer to join our US Cash Management Technology team at RBC. As a key member of our team, you will be responsible for leading the development, implementation, and support of Site Reliability Engineering (SRE) solutions for applications supported by the Commercial, Core Banking, and...
-
Site Reliability Engineer- Automation
2 months ago
Old Toronto, Canada Ascend Fundraising Solutions Full time We are currently seeking a full-time Site Reliability Engineer to join our IT team. In this role, you will collaborate closely with the client services team to diagnose, troubleshoot, and resolve issues related to system reliability. RESPONSIBILITIES: Take ownership of customer-reported issues and see problems through to resolution. Develop preventive measures...
-
AWS Site Reliability Engineer
2 weeks ago
Old Toronto, Canada Tecsys Full time Tecsys is a fast-growing innovator offering supply chain solutions to industry-leading healthcare systems, hospitals, and pharmacy businesses to distributors, retailers, and 3PLs. As a Cloud Infrastructure Specialist, you will be responsible for ensuring the reliability and uptime of our platform and applications in a data-driven way to support internal and...
-
AWS Site Reliability Engineer
1 month ago
Old Toronto, Canada Tecsys Full time Having recognized the advantages of remote work, including employee morale, productivity, reduced commuting on employee wellbeing and the environment, we are proud to be a digital-first company. The technologies and programs in which we invested have provided a fantastic foundation to this end. Our digital-first work environment, together with our...
-
Site Reliability Engineer Leader
2 months ago
Old Toronto, Canada https:www.energyjobline.comsitemap.xml Full time Product: Global Platform Engineering. Your role: Supervise a team of Site Reliability Engineers; report metrics on application performance and incidents; act proactively and responsively to infrastructure and application failures; build and automate failover and recovery workflows; implement observability and monitoring stack for infrastructure and application...
-
Site Reliability Engineering Leader
2 weeks ago
Toronto, Ontario, Canada Royal Bank of Canada Full time Royal Bank of Canada is seeking a highly skilled Site Reliability Engineering (SRE) leader to join our team in Toronto, Canada. As an SRE leader, you will be responsible for leading the development and implementation of SRE solutions that improve the reliability and performance of our applications. The ideal candidate will have 5+ years of experience as a...