Lead Site Reliability Engineer

3 weeks ago


Old Toronto, Ontario, Canada PagerDuty, Inc. Full time

PagerDuty empowers diverse teams to execute essential tasks that drive business success through the PagerDuty Operations Cloud.

We are looking for a Senior Site Reliability Engineer to become a vital member of our SRE-Platform team. In this capacity, you will play a significant role in developing, sustaining, and enhancing the Kubernetes infrastructure that underpins PagerDuty. Our mission is to create solutions that boost developer efficiency, enhance system reliability, and support PagerDuty's growth for the future. If you have a strong interest in platform engineering, developer experience, and Kubernetes, we would be eager to connect with you.

Key Responsibilities

  • Maintain the overall integrity of the platform, which includes diagnosing and resolving production challenges, overseeing system capacity, and collaborating with other technical teams to ensure compliance with security and best practices.
  • Collaborate with Engineering stakeholders to architect and deliver a platform that is reliable, scalable, secure, and high-performing.
  • Consistently seek to enhance the developer experience through full lifecycle support (creation, development, deployment, retirement), observability, flexible connectivity, and monitoring.
  • Share your knowledge and expertise across the entire Engineering organization.
  • Participate in a 24/7 on-call rotation, utilizing PagerDuty to manage on-call schedules.

Basic Qualifications

  • 5+ years of experience in Platform Engineering, Site Reliability Engineering, or DevOps roles.
  • Experience managing multiple Kubernetes clusters in a production setting.
  • Experience with cloud-native infrastructure (e.g., AWS, GCP, Azure).
  • Experience deploying web applications on Kubernetes (Helm, ArgoCD).
  • Proficiency in infrastructure as code (e.g., Terraform or CloudFormation).
  • Familiarity with a dynamic programming language (e.g., Ruby or Python).

Preferred Qualifications

  • Experience with monitoring, observability, and logging platforms (e.g., DataDog, New Relic, SumoLogic, Splunk).
  • Knowledge of configuration management systems (e.g., Ansible, Chef, Puppet).
  • Experience in automating releases, continuous integration/delivery systems, and relevant tools (e.g., Jenkins, CircleCI, Travis CI, Buildkite).

The base salary range for this position is 152,000 USD. This role may also be eligible for bonuses, commissions, equity, and/or benefits.

PagerDuty is dedicated to fostering a diverse environment and is an equal opportunity employer. We do not discriminate based on race, religion, color, national origin, gender, sexual orientation, age, marital status, parental status, veteran status, or disability status.



  • Toronto, Ontario, Canada Lightspeed Restaurant Full time

    Lead Site Reliability Engineer at Lightspeed Restaurant: We are seeking a skilled Lead Site Reliability Engineer to become a vital part of our Lightspeed Restaurant team. Our mission is to create innovative software solutions that empower restaurants to enhance their operational efficiency and profitability. In the role of Lead Site Reliability Engineer, you...


  • Old Toronto, Ontario, Canada PagerDuty, Inc. Full time

    PagerDuty empowers diverse teams to perform essential tasks that drive business success through the PagerDuty Operations Cloud. We are in search of a Senior Site Reliability Engineer to become a vital member of our SRE-Platform team. In this capacity, you will play a crucial role in developing, sustaining, and scaling the Kubernetes infrastructure that...


  • Old Toronto, Ontario, Canada PagerDuty, Inc. Full time

    PagerDuty empowers diverse teams to drive essential operations that propel business growth through the PagerDuty Operations Cloud. We are in search of a Senior Site Reliability Engineer to become a vital member of our SRE-Platform team. In this capacity, you will play a crucial role in developing, sustaining, and enhancing the Kubernetes infrastructure that...


  • Toronto, Ontario, Canada Thomson Reuters Full time

    About the Role: This is an exciting opportunity to join our team as a Lead Site Reliability Engineer at Thomson Reuters. As a key member of our engineering team, you will be responsible for leading and mentoring a team of SREs, providing technical guidance, coaching, and support to foster a culture of collaboration, innovation, and continuous improvement. Key...


  • Toronto, Ontario, Canada Northbridge Financial Corporation Full time

    Overview of the Senior Site Reliability Engineer Role at Northbridge Financial Corporation: The Senior Site Reliability Engineer is responsible for the development and execution of Service Level Objectives (SLOs). This role involves managing complex service reliability solutions and processes, as well as mentoring and guiding junior SREs. Key...


  • Toronto, Ontario, Canada Northbridge Financial Corporation Full time

    Overview of the Senior Site Reliability Engineer Role at Northbridge Financial Corporation: The Senior Site Reliability Engineer is responsible for the establishment and execution of Service Level Objectives (SLOs). This role involves managing complex service reliability solutions and processes, while also providing mentorship and guidance to junior...


  • Old Toronto, Ontario, Canada Moneris Full time

    Your Moneris Career - The Opportunity: Moneris stands as a leader in payment processing, recognized as Canada's foremost provider and one of the largest in North America. Connect. Impact. Grow. Become part of one of Canada's esteemed employers and leave your mark at Moneris. The Senior Site Reliability Engineer at Moneris works in collaboration with various...


  • Old Toronto, Ontario, Canada Magic Leap - Multiple Locations Full time

    Transforming the Future of Computing: Magic Leap stands at the forefront of spatial computing, innovating advanced augmented reality solutions that integrate digital elements with the physical environment. As a leader in the next generation of computing platforms, our mixed reality devices open up new avenues for interaction and engagement with the world...


  • Toronto, Ontario, Canada Northbridge Financial Corporation Full time

    Overview of the Senior Site Reliability Engineer Role at Northbridge Financial Corporation: The Senior Site Reliability Engineer is responsible for the establishment and execution of Service Level Objectives (SLOs). This role involves managing service reliability solutions and processes of increasing intricacy, along with mentoring and guiding junior...


  • Old Toronto, Ontario, Canada SoundHound Inc Full time

    About SoundHound AI: At SoundHound AI, we envision a world where every individual can seamlessly interact with technology through natural conversation. Our innovative Voice AI solutions cater to various sectors, including automotive and food services, empowering brands to connect with their audiences in meaningful ways. Role Overview: We are seeking a...


  • Toronto, Ontario, Canada CIRCLE Full time

    About Circle: Circle is a pioneering financial technology firm positioned at the forefront of the evolving digital economy, where value can traverse globally, almost instantaneously, and at a lower cost compared to traditional settlement systems. This innovative layer of the internet unveils extraordinary opportunities for transactions, commerce, and...


  • Old Toronto, Ontario, Canada SoundHound Inc Full time

    About SoundHound AI: SoundHound AI is dedicated to enabling seamless interactions between individuals and technology through natural language. Our innovative Voice AI solutions cater to diverse applications, including automotive systems and restaurant services, empowering brands to engage with their customers in meaningful ways. Role Overview: This position...


  • Old Toronto, Ontario, Canada SoundHound Inc Full time

    About SoundHound AI: SoundHound AI is dedicated to enabling seamless interaction with technology through natural language. Our innovative Voice AI solutions cater to various industries, enhancing user experiences and brand engagement. Role Overview: As a vital member of our Site Reliability Engineering (SRE) team, you will be instrumental in developing robust...


  • Toronto, Ontario, Canada Northbridge Financial Corporation Full time

    Join Northbridge Financial Corporation as a Site Reliability Engineering Lead. The Site Reliability Engineering Lead is essential in maintaining the dependability, efficiency, and accessibility of our primary insurance systems. Collaborating closely with both application and infrastructure teams, your focus will be on preventing incidents, managing...


  • Toronto, Ontario, Canada Thomson Reuters Full time

    About the Role: This is an exciting opportunity to lead a team of Site Reliability Engineers (SREs) at Thomson Reuters, a leading provider of news, information, and technology solutions to professionals in the legal, tax, accounting, and compliance markets. Key Responsibilities - Team Leadership: Lead and mentor a team of SREs, providing technical guidance,...


  • Toronto, Ontario, Canada Relay Financial Full time

    About Relay Financial: At Relay, we are revolutionizing the way businesses manage their finances. Traditional banking has often hindered growth for business owners, and we are committed to changing that narrative. Our platform is designed to be an all-in-one, collaborative solution for money management, tailored specifically for small to medium-sized...


  • Old Toronto, Ontario, Canada Akamai Full time

    Are you passionate about technology and teamwork? If you enjoy collaborating with diverse teams to tackle intricate challenges, consider joining our esteemed Nameserver SRE team. The Nameserver SRE team plays a pivotal role in defining, measuring, and optimizing the key performance indicators of Akamai's nameserver platform. We adopt a comprehensive approach...


  • Toronto, Ontario, Canada Alliancesrcare Full time

    About the Role: At Alliancesrcare, we are transforming the landscape of financial services by offering a comprehensive platform for small to medium-sized businesses. We are in search of a Lead Site Reliability Engineer to become a pivotal member of our Trust team and contribute to the evolution of our services. Key Responsibilities: Oversee and manage production...