Manager, Site Reliability Engineering
1 week ago
With a career at The Home Depot, you can be yourself and also be part of something bigger.
Position Overview:
The Manager, SRE will lead a team of Site Reliability Engineers to ensure the reliability, performance, and operational support of our eCommerce systems, with a focus on Google Cloud Platform (GCP) environments. This role requires a strong background in reliability reviews, performance engineering practices, production engineering, and operational support, with emphasis on DevOps principles and GCP expertise.
Responsibilities:
- Leadership & Management:
- Lead and mentor a team of Site Reliability Engineers
- Foster a culture of continuous improvement and innovation
- Collaborate with cross-functional teams to align SRE practices with business objectives
- Reliability & Performance:
- Conduct reliability reviews to identify areas for improvement and implement solutions to enhance system reliability, particularly in GCP environments
- Implement and promote performance engineering practices to ensure optimal system performance on GCP
- Develop and maintain service level objectives (SLOs) and error budgets
- Production Engineering & Operational Support:
- Oversee production engineering efforts to ensure systems are designed for operational excellence and reliability, leveraging GCP services and best practices
- Manage incident response and post-incident reviews to minimize downtime and improve system resilience
- Implement monitoring, alerting, and observability solutions to proactively identify and address issues
- Develop and maintain runbooks and playbooks for common operational tasks
- Coordinate with security teams to ensure compliance with security policies and best practices
- DevOps & Continuous Improvement:
- Drive DevOps initiatives to improve collaboration between development and operations teams, with a focus on GCP-native tools and services
- Implement and maintain CI/CD pipelines to streamline deployment processes in GCP environments
- Identify and implement automation opportunities to reduce manual tasks and improve efficiency
- Promote the use of Infrastructure as Code (IaC) to manage and provision cloud resources
- Continuously evaluate and integrate new tools and technologies to enhance DevOps practices
- Release Management:
- Implement and maintain release management best practices to minimize disruptions and maximize system stability
- Collaborate with DevOps teams to integrate release management into CI/CD pipelines
- Oversee release schedules, ensuring minimal impact on business operations
- Ensure a rigorous release readiness process is in place, including pre-release reviews and post-release retrospectives
- Maintain a release calendar and communicate release plans to stakeholders
- Strategic Planning:
- Create and maintain a strategic roadmap for SRE initiatives, aligning with business goals and technological advancements
- Refine and standardize Standard Operating Procedures (SOPs) to enhance operational efficiency and consistency
- Address customer pain points by developing and implementing solutions that improve user experience and system reliability
- Engage with stakeholders to understand their needs and incorporate feedback into strategic planning and execution
- Monitor industry trends and best practices to ensure the SRE team remains at the forefront of technology
Experience:
- Bachelor’s degree in Computer Science, Engineering, or a related field
- Strong problem-solving and analytical abilities
- Excellent communication and collaboration skills
- 4-6 years of relevant work experience, including significant experience with GCP
- Extensive experience with cloud infrastructure, GCP services and architecture
- Proven track record of managing and optimizing large-scale systems on GCP
- Proven ability to effectively communicate with individuals at all levels of the organization
- Ability to maintain relationships and negotiate with vendors
- Ability to operate in and leverage resources in a matrixed environment
- Ability to analyze and present data to support ideas
- Ability to clearly communicate to all levels of the organization
-
Site Reliability Engineer
4 weeks ago
Old Toronto, Canada CB Canada Full time
On behalf of our client in the Banking Sector, PROCOM is looking for a Site Reliability Engineer. Job Description: Azure cloud; Jira and Confluence; CI/CD; experience with automating (provisioning, configuration management, deployment) and integratin
-
Site Reliability Engineer
7 days ago
Old Toronto, Canada Lorien Full time
Hybrid - Manchester. We are currently working with a leading gambling company dedicated to providing exceptional gaming experiences. They are looking for an experienced Site Reliability Engineer with a strong skill set in system reliability to join its world-class technology team. This role is ideal for someone who has 4+ years of experience within the...
-
Site Reliability Engineer
4 weeks ago
Old Toronto, Ontario, Canada Thomson Reuters Full time
We are seeking a highly skilled Site Reliability Engineer to join our team at Thomson Reuters. As a Site Reliability Engineer, you will play a critical role in ensuring the reliability and efficiency of our cloud-based infrastructure. About the Role: In this position, you will be responsible for: Designing and implementing scalable...
-
Site Reliability Engineer
4 weeks ago
Old Toronto, Ontario, Canada Thomson Reuters Full time
We are seeking a highly skilled Site Reliability Engineer to join our team at Thomson Reuters. As a Site Reliability Engineer, you will be responsible for ensuring the reliability and scalability of our cloud-based infrastructure. About the Role: In this role, you will be responsible for: Designing and implementing scalable systems and...
-
Site Reliability Engineer
4 weeks ago
Old Toronto, Ontario, Canada Thomson Reuters Full time
About the Role: We are seeking a skilled Site Reliability Engineer to join our team at Thomson Reuters. As a Site Reliability Engineer, you will be responsible for designing, implementing, and maintaining scalable and reliable systems and services. Key Responsibilities: Design and implement scalable systems and services. Develop and maintain tools and scripts to...
-
Site Reliability Engineer
4 weeks ago
Old Toronto, Ontario, Canada Reperio Human Capital Full time
We are seeking an experienced Site Reliability Engineer to join our team at Reperio Human Capital. As a key member of our infrastructure team, you will be responsible for ensuring the reliability and scalability of our production systems. Key Responsibilities: Design and implement monitoring and automation solutions to ensure system...
-
Site Reliability Engineer
3 weeks ago
Old Toronto, Canada TD Bank Full time
Work Location: Canada. Hours: 37.5. Line of Business: Technology Solutions. Pay Details: We’re committed to providing fair and equitable compensation to all our colleagues. As a candidate, we encourage you to have an open dialogue with a member of
-
Site Reliability Engineer
1 month ago
Old Toronto, Canada Reperio Human Capital Full time
Site Reliability Engineer 100421. Location: Ireland/UK. Salary: €70K+. Type: Permanent, Full-time. We're seeking experienced Site Reliability Engineers who excel at ensuring the reliability and scalability of production systems, and possess extensive experience with monitoring and automation t
-
AWS Site Reliability Engineer
3 weeks ago
Old Toronto, Canada https:www.energyjobline.comsitemap.xml Full time
Site Reliability Engineer 100421. Site Reliability Engineer, SRE, Cloud, Permanent, Remote. Type: Permanent, Full-time. We're seeking experienced Site Reliability Engineers who excel at ensuring the reliability and scalability of production systems and possess extensive experience with monitoring and automation tools. Ensure the reliability, availability,...
-
AWS Site Reliability Engineer
1 week ago
Old Toronto, Canada Lorien Full time
Hybrid - Manchester. We are currently working with a leading gambling company dedicated to providing exceptional gaming experiences. They are looking for an experienced Site Reliability Engineer with a strong skill set in system reliability to join its world-class technology team. This role is ideal for someone who has 4+ years of experience within the...
-
AWS Site Reliability Engineer
2 weeks ago
Old Toronto, Canada https:www.energyjobline.comsitemap.xml Full time
Site Reliability Engineer - 100421. Type: Permanent, Full-time. We're seeking experienced Site Reliability Engineers who excel at ensuring the reliability and scalability of production systems, and possess extensive experience with monitoring and automation tools. Ensure the reliability, availability, and performance of production systems. Design, implement,...
-
AWS Site Reliability Engineer
4 weeks ago
Old Toronto, Canada Reperio Human Capital Full time
Site Reliability Engineer 100421. Type: Permanent, Full-time. We're seeking experienced Site Reliability Engineers who excel at ensuring the reliability and scalability of production systems, and possess extensive experience with monitoring and automation tools. Ensure the reliability, availability, and performance of production systems. Design, implement,...
-
Site Reliability Engineer
3 days ago
Old Toronto, Canada Street Context Full time
Are you a Site Reliability Engineer with a passion for building reliable, resilient and performant systems that scale? Do you command with a steady hand when incidents unfold? Are you motivated by team success? If so, continue reading… We are on a mission to build and strengthen our engineering teams to match the accelerating success of Street...
-
AWS Site Reliability Engineer
3 days ago
Old Toronto, Canada Street Context Full time
Are you a Site Reliability Engineer with a passion for building reliable, resilient and performant systems that scale? We are on a mission to build and strengthen our engineering teams to match the accelerating success of Street Context. We provide a premium Email, Analytics and Broker Relationship platform, purpose-built for capital markets and...