Principal Site Reliability Engineer

6 days ago

BC Canada Red Hat Full time

About the Job

We’re seeking a Site Reliability Engineer (SRE) with a passion for maintaining highly reliable cloud-based services. In this role, you will support Red Hat’s software manufacturing services on our hybrid cloud infrastructure. You will partner with development, quality engineering and release engineering colleagues to support the health and well‑being of the infrastructure hosting Software Production services. Creating and maintaining service monitoring, improving automation, upholding security best practices and responding to various service situations will be your daily work. You will participate in communities of practice to coordinate and influence the design of our hybrid cloud platform. You will be co‑responsible for defining Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for the services the team supports, and executing remediation plans if the SLOs are not met. In this role, you will be expected to respond in a timely manner during a service outage and participate in learning events to identify improvements that will make our services more resilient.

What You’ll Do

Be part of a globally distributed team, offering 24x7 support through a service model that leverages different time zones to extend coverage with regular on‑call rotations. Resolve service incidents by using existing operating procedures, investigate outage causes and coordinate incident resolution across various service teams. Act as a leader and mentor to less experienced colleagues, bring and drive continuous improvement ideas and help the team benefit from evolving technology, such as the use of AI tools. Collaborate on incident retrospective reviews and the implementation of corrective items. Configure and maintain service infrastructure. Proactively identify and eliminate toil by automating manual, repetitive, and error‑prone processes. Coordinate actions with other Red Hat teams such as IT Platforms, Infrastructure, Storage and Network and ensure the cloud deployment of our services meets quality expectations. Implement monitoring, alerting and escalation plans in the event of an infrastructure outage or performance problem. Work with service owners to co‑define and implement SLIs and SLOs for the services you’ll support, ensure those are met and execute remediation plans if they are not.

What You’ll Bring

Expert knowledge of OpenShift administration and application development. Linux administration expertise. Advanced knowledge of automation services: ArgoCD, Ansible or Terraform. Advanced knowledge of CI/CD platforms: Tekton and Pipelines as Code (optionally GitHub Actions or Jenkins). Advanced knowledge and experience with monitoring platforms and technologies. General knowledge of AWS technologies. Ability to understand graphically represented concepts and architectures in documentation. Experience with creation of Standard Operating Procedures. Knowledge of open source monitoring technologies (Grafana, Prometheus, OpenTelemetry). Excellent written and verbal communication skills in English.

Plus Skills

Previous experience with the SRE model. Experience with software development using Python or GoLang. Experience with automation design and implementation.

About Red Hat

Red Hat is the world’s leading provider of enterprise open source software solutions, using a community‑powered approach to deliver high‑performing Linux, cloud, container, and Kubernetes technologies.

Inclusion at Red Hat

Red Hat’s culture is built on the open source principles of transparency, collaboration, and inclusion, where the best ideas can come from anywhere and anyone.

Equal Opportunity Policy (EEO)

Red Hat is proud to be an equal opportunity workplace and an affirmative action employer. We review applications for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, ancestry, citizenship, age, veteran status, genetic information, physical or mental disability, medical condition, marital status, or any other basis prohibited by law.

Staff Site Reliability Engineer

4 weeks ago

BC, Canada Branch Full time

Overview At Branch, we’re transforming how brands and users interact across digital platforms. Our mobile marketing and deep linking solutions deliver seamless experiences that increase ROI, decrease wasted spend, and eliminate siloed attribution. Our team values ownership, collaboration, and a motto: Build Together, Grow Together, Win Together. As a Staff...

Site Reliability Engineer

3 days ago

BC, Canada Red Hat Full time

About The Job We’re seeking a Site Reliability Engineer (SRE) with a passion for maintaining highly reliable cloud-based services. In this role, you will support Red Hat’s software manufacturing services on our hybrid cloud infrastructure. You will partner with development, quality engineering and release engineering colleagues to support the health and...

Site Reliability Engineer

3 weeks ago

Canada Orion Innovation Full time

Senior Site Reliability Engineer (SRE) with Kubernetes & Rancher Location: Canada - Remote (Working EST hours) Job Type: Full-time About the Role Are you an exceptional Site Reliability Engineer with a passion for building and maintaining highly resilient and secure systems? We are seeking a Senior SRE to join our team and play a critical role in managing...

Senior Site Reliability Engineer

3 weeks ago

BC, Canada GoDaddy Full time

Location and Work Arrangement Location Details: Canada - Remote. This is a remote position, so you’ll be working remotely from your home. You may occasionally visit a GoDaddy office to meet with your team for events or meetings. Join Our Team GoDaddy's Infrastructure Engineering team is looking for a Senior Site Reliability Engineer with a focus on...

Senior Site Reliability Engineer

3 days ago

Canada Thinkific Full time

Are you an experienced Site Reliability Engineer looking for a new challenge? We’re looking for a Senior Site Reliability Engineer to join us at Thinkific. We’re looking for a Senior Site Reliability Engineer...

Systems Reliability Engineer

1 week ago

Canada : Ontario : Toronto Scotiabank Global Site Full time $120,000 - $180,000 per year

Requisition ID: 239640 Join a purpose-driven winning team, committed to results, in an inclusive and high-performing culture. The Role As a member of the Systems Reliability Engineering team, the System Reliability Engineer will collaborate closely with Engineering and development teams, peers, and business partners to continuously improve the stability,...

Principal Site Reliability Engineer

1 week ago

Remote CA BC Red Hat Full time $900,000 - $1,200,000 per year

About the Job We're seeking a Site Reliability Engineer (SRE) with a passion for maintaining highly reliable cloud-based services. In this role, you will support Red Hat's software manufacturing services on our hybrid cloud infrastructure. You will partner with development, quality engineering and release engineering colleagues to support the health and...

Director, Site Reliability Engineering

3 weeks ago

Canada Icon Full time

Helping SaaS companies scale Engineering teams. Director, Site Reliability Engineering We are seeking an accomplished Director of Site Reliability Engineering (SRE) to lead the reliability, scalability, and performance initiatives across multiple enterprise technology domains, including AML, Risk, Finance, Corporate Treasury, and Human Resources systems....

Senior Site Reliability Engineer

3 weeks ago

Canada Orion Innovation Full time

Job Description: Senior Site Reliability Engineer (SRE) with Kubernetes & Rancher Location: Canada - Remote (Working EST hours) Job Type: Full-time About the Role Are you an exceptional Site Reliability Engineer with a passion for building and maintaining highly resilient and secure systems? We are seeking a Senior SRE to join our team and play a critical...

Senior Site Reliability Engineer

3 weeks ago

Canada Akamai Technologies Full time

Senior Site Reliability Engineer Join Akamai Technologies as we build a reliable, secure, and scalable Internet. We are looking for a Senior Site Reliability Engineer to help us solve complex performance and reliability challenges. Job Description Are you passionate about cutting‑edge technology and ready to tackle some of the Internet’s most difficult...
