Manager, Site Reliability Engineering

1 month ago


Old Toronto, Canada The Home Depot Canada Full time

With a career at The Home Depot, you can be yourself and also be part of something bigger.

Position Overview:
The Manager, SRE will lead a team of Site Reliability Engineers to ensure the reliability, performance, and operational support of our eCommerce systems, with a focus on Google Cloud Platform (GCP) environments. This role requires a strong background in reliability reviews, performance engineering practices, production engineering, and operational support, with emphasis on DevOps principles and GCP expertise.

Responsibilities:

  1. Leadership & Management:
    Lead and mentor a team of Site Reliability Engineers.
    Foster a culture of continuous improvement and innovation.
    Collaborate with cross-functional teams to align SRE practices with business objectives.
  2. Reliability & Performance:
    Conduct reliability reviews to identify areas for improvement and implement solutions to enhance system reliability, particularly in GCP environments.
    Implement and promote performance engineering practices to ensure optimal system performance on GCP.
    Develop and maintain service level objectives (SLOs) and error budgets.
  3. Production Engineering & Operational Support:
    Oversee production engineering efforts to ensure systems are designed for operational excellence and reliability, leveraging GCP services and best practices.
    Manage incident response and post-incident reviews to minimize downtime and improve system resilience.
    Implement monitoring, alerting, and observability solutions to proactively identify and address issues.
    Develop and maintain runbooks and playbooks for common operational tasks.
    Coordinate with security teams to ensure compliance with security policies and best practices.
  4. DevOps & Continuous Improvement:
    Drive DevOps initiatives to improve collaboration between development and operations teams, with a focus on GCP-native tools and services.
    Implement and maintain CI/CD pipelines to streamline deployment processes in GCP environments.
    Identify and implement automation opportunities to reduce manual tasks and improve efficiency.
    Promote the use of Infrastructure as Code (IaC) to manage and provision cloud resources.
    Continuously evaluate and integrate new tools and technologies to enhance DevOps practices.
  5. Release Management:
    Implement and maintain release management best practices to minimize disruptions and maximize system stability.
    Collaborate with DevOps teams to integrate release management into CI/CD pipelines.
    Oversee release schedules, ensuring minimal impact on business operations.
    Ensure there is a rigorous release readiness process in place that includes reviews and post-release retrospectives.
    Maintain a release calendar and communicate release plans to stakeholders.
  6. Strategic Planning:
    Create and maintain a strategic roadmap for SRE initiatives, aligning with business goals and technological advancements.
    Refine and standardize Standard Operating Procedures (SOPs) to enhance operational efficiency and consistency.
    Address customer pain points by developing and implementing solutions that improve user experience and system reliability.
    Engage with stakeholders to understand their needs and incorporate feedback into strategic planning and execution.
    Monitor industry trends and best practices to ensure the SRE team remains at the forefront of technology.

Experience:

  1. Bachelor’s degree in Computer Science, Engineering, or a related field.
  2. Strong problem-solving and analytical abilities.
  3. Excellent communication and collaboration skills.
  4. 4-6 years of relevant work experience, including significant experience with GCP.
  5. Extensive experience with cloud infrastructure, GCP services, and architecture.
  6. Proven track record of managing and optimizing large-scale systems on GCP.
  7. Proven ability to effectively communicate with individuals at all levels of the organization.
  8. Ability to maintain relationships and negotiate with vendors.
  9. Ability to operate in and leverage resources in a matrixed environment.
  10. Ability to analyze and present data to support ideas.
  11. Ability to clearly communicate to all levels of the organization.

  • Old Toronto, Canada Lorien Full time

    Hybrid - Manchester We are currently working with a leading gambling company dedicated to providing exceptional gaming experiences. They are looking for an experienced Site Reliability Engineer with a strong skill set in system reliability to join its world-class technology team. This role is ideal for someone who has 4+ years of experience within the...


  • Old Toronto, Canada TD Bank Full time

    Site Reliability Engineer Work Location: Canada Hours: 37.5 Line of Business: Technology Solutions Pay Details: We’re committed to providing fair and equitable compensation to all our colleagues. As a candidate, we encourage you to have an open dialogue with a member of


  • Old Toronto, Canada Street Context Full time

    Are you a Site Reliability Engineer who has a passion for building reliable, resilient and performant systems that scale? Do you command with a steady hand when incidents unfold? Are you motivated by team success? If so, continue reading… We are on a mission to build and strengthen our engineering teams to match the accelerating success of Street...


  • Old Toronto, Canada Street Context Full time

    Are you a Site Reliability Engineer that has a passion for building reliable, resilient and performant systems that scale? We are on a mission to build and strengthen our engineering teams to match the accelerating success of Street Context. We provide a premium Email, Analytics and Broker Relationship platform, purpose-built for capital markets and...


  • Toronto, Canada CB Canada Full time

    Site Reliability Engineer On behalf of our client in the Banking Sector, PROCOM is looking for a Site Reliability Engineer. Site Reliability Engineer – Job Description Azure cloud Jira and confluence CICD Experience with automating (provisioning, configuration management, deployment) and integrating Azure PaaS solutions (Azure App services, Azure...


  • Old Toronto, Canada Sentry Full time

    About the role The Site Reliability Engineering team is responsible for the deployment, configuration, maintenance, and monitoring of Sentry's hosted platform. We do this by leveraging automation tools to automatically spin up and scale services to meet the traffic demands of 1,000,000+ developers.


  • Old Toronto, Canada Soda Full time

    Job Description Job Title: Site Reliability Engineer Location: Poland - Fully Remote Salary: 324K PLN or 27.3K monthly Start: ASAP Stack: AWS, Docker, Kubernetes, Terraform, Jenkins, Ansible, Linux, JavaScript, and Lambda. Are you a seasoned DevOps/SRE professional passionate about building high-performance, scalable systems? I am working with a Media/IT...


  • Old Toronto, Canada Thomson Reuters Full time

    (Canada) Site Reliability Engineer (Contract) Contract (9 months 4 days) Published 3 days ago New Relic Data Dog Site Reliability Engineer - in the Service Management Organization Do you have experience in IT Service Management, working with cloud providers, software development, and technology infrastructure? The Site Reliability Engineer will analyze chronic...


  • Old Toronto, Canada Thomson Reuters Full time

    (Canada) Site Reliability Engineer (Contract) Contract (5 months 29 days) Published 8 months ago CLOSED GCP Site Reliability Engineer - in the Service Management Organization Do you have experience in IT Service Management, working with cloud providers, software development, and technology infrastructure? The Site Reliability Engineer will analyze chronic and...


  • Old Toronto, Canada Mastech Inc. Full time

    Mastech Digital is an IT Staffing and Digital Transformation Services company. Mastech Digital provides digital and mainstream technology staff as well as Digital Transformation Services for all American Corporations. We are currently seeking a Site Reliability Engineer (GCP) for our client in the Consulting domain. We value our professionals, providing...


  • Old Toronto, Canada Infotree Global Solutions Full time

    About Infotree Global Solutions: Infotree Global Solutions is a leading provider of innovative solutions, and we're seeking an experienced Site Reliability Engineer to lead our team. Your Role: As our Site Reliability Engineering Lead, you will be responsible for supervising a team of skilled engineers and ensuring the reliability and scalability of our global...


  • Old Toronto, Canada Sentry Full time

    Bad software is everywhere, and we’re tired of it. Sentry is on a mission to help developers write better software faster, so we can get back to enjoying technology. With more than $217 million in funding and 100,000+ organizations that believe we’re on to something, we're building performance and error monitoring tools that help companies like Disney,...


  • Old Toronto, Canada Sentry Full time

    The Site Reliability Engineering team is responsible for the deployment, configuration, maintenance, and monitoring of Sentry's hosted platform. We do this by leveraging automation tools to automatically spin up and scale services to meet the traffic demands of 1,000,000+ developers. Sentry receives over a billion events a day and processes terabytes of...


  • Old Toronto, Canada CentML Full time

    At CentML, we are seeking a talented Site Reliability Engineer - Automation to join our team. We have a strong founding team that includes experts in AI, compilers, and ML hardware. Our co-founder and CEO, Gennady Pekhimenko, is a world-renowned expert in ML systems who has received multiple academic and industry research awards from top tech companies. As a...


  • Old Toronto, Canada Tecsys Full time

    Having recognized the advantages of remote work, including employee morale, productivity, reduced commuting on employee wellbeing and the environment, we are proud to be a digital-first company. The technologies and programs in which we invested have provided a fantastic foundation to this end. Our digital-first work environment, together with our...


  • Old Toronto, Canada Olx Full time

    Site Reliability Engineer Remote Poland, Poland OLX – Engineering / Full-time / Remote At OLX, we work together to build a more sustainable world through trade. We make it safe, smart, and convenient to buy and sell cars, find housing, get jobs, buy and sell household goods, and more. Our colleagues around the world help to serve millions of people around...