Manager, Site Reliability Engineering

1 week ago


Old Toronto, Canada The Home Depot Full time

With a career at The Home Depot, you can be yourself and also be part of something bigger.

Position Overview:
The Manager, SRE will lead a team of Site Reliability Engineers to ensure the reliability, performance, and operational support of our eCommerce systems, with a focus on Google Cloud Platform (GCP) environments. This role requires a strong background in reliability reviews, performance engineering practices, production engineering, and operational support, with emphasis on DevOps principles and GCP expertise.

Responsibilities:

  • Leadership & Management:
    • Lead and mentor a team of Site Reliability Engineers
    • Foster a culture of continuous improvement and innovation
    • Collaborate with cross-functional teams to align SRE practices with business objectives
  • Reliability & Performance:
    • Conduct reliability reviews to identify areas for improvement and implement solutions to enhance system reliability, particularly in GCP environments
    • Implement and promote performance engineering practices to ensure optimal system performance on GCP
    • Develop and maintain service level objectives (SLOs) and error budgets
  • Production Engineering & Operational Support:
    • Oversee production engineering efforts to ensure systems are designed for operational excellence and reliability, leveraging GCP services and best practices
    • Manage incident response and post-incident reviews to minimize downtime and improve system resilience
    • Implement monitoring, alerting, and observability solutions to proactively identify and address issues
    • Develop and maintain runbooks and playbooks for common operational tasks
    • Coordinate with security teams to ensure compliance with security policies and best practices
  • DevOps & Continuous Improvement:
    • Drive DevOps initiatives to improve collaboration between development and operations teams, with a focus on GCP-native tools and services
    • Implement and maintain CI/CD pipelines to streamline deployment processes in GCP environments
    • Identify and implement automation opportunities to reduce manual tasks and improve efficiency
    • Promote the use of Infrastructure as Code (IaC) to manage and provision cloud resources
    • Continuously evaluate and integrate new tools and technologies to enhance DevOps practices
  • Release Management:
    • Implement and maintain release management best practices to minimize disruptions and maximize system stability
    • Collaborate with DevOps teams to integrate release management into CI/CD pipelines
    • Oversee release schedules, ensuring minimal impact on business operations
    • Ensure a rigorous release readiness process is in place, including reviews and post-release retrospectives
    • Maintain a release calendar and communicate release plans to stakeholders
  • Strategic Planning:
    • Create and maintain a strategic roadmap for SRE initiatives, aligning with business goals and technological advancements
    • Refine and standardize Standard Operating Procedures (SOPs) to enhance operational efficiency and consistency
    • Address customer pain points by developing and implementing solutions that improve user experience and system reliability
    • Engage with stakeholders to understand their needs and incorporate feedback into strategic planning and execution
    • Monitor industry trends and best practices to ensure the SRE team remains at the forefront of technology

Experience:

  • Bachelor’s degree in Computer Science, Engineering, or a related field
  • Strong problem-solving and analytical abilities
  • Excellent communication and collaboration skills
  • 4-6 years of relevant work experience, including significant experience with GCP
  • Extensive experience with cloud infrastructure, GCP services and architecture
  • Proven track record of managing and optimizing large-scale systems on GCP
  • Proven ability to effectively communicate with individuals at all levels of the organization
  • Ability to maintain relationships and negotiate with vendors
  • Ability to operate in and leverage resources in a matrixed environment
  • Ability to analyze and present data to support ideas
  • Ability to clearly communicate to all levels of the organization
