Sr. Site Reliability Engineer, Chaos Engineering
2 weeks ago
Job Description
What is the Opportunity?
We are seeking an experienced and innovative Lead Site Reliability Engineer (SRE) to spearhead the implementation of Chaos Engineering practices across all Digital Channels. This senior-level role is critical to ensuring the resilience, scalability, and availability of our systems in high-stress environments while driving operational excellence within the organization.
What will you do?
As the Lead SRE specializing in Chaos Engineering, your responsibilities will include:
Chaos Engineering Implementation:
Design and execute chaos experiments using tools such as Gremlin to proactively test systems under stress.
Simulate failure scenarios to identify potential risks and validate system behavior during degradation or outages.
Ensure experiments yield actionable insights for improving resilience.
Assess and Validate Autoscaling:
Analyze and validate autoscaling policies across individual systems to ensure optimal performance under variable loads.
Collaborate with engineering teams to refine and implement dynamic scaling strategies based on experimental outcomes.
Resiliency Reporting:
Develop comprehensive reports that outline system resiliency metrics, findings from chaos experiments, and recommendations for improvement.
Provide insights to leadership and technical teams to guide decision-making on infrastructure updates and architectural enhancements.
High Availability Architectures:
Embed redundancy patterns such as failover mechanisms and active-active configurations to achieve high availability across key systems.
Lead efforts to integrate seamless failover processes during both planned and unplanned downtimes.
Collaboration and Communication:
Act as a subject matter expert and promote Chaos Engineering practices across teams.
Partner with DevOps, infrastructure, and application teams to ensure resilience objectives align with broader organizational goals.
Facilitate training sessions and knowledge-sharing initiatives on Chaos Engineering concepts for technical staff.
What do you need to succeed?
Must Have:
5+ years of experience in Site Reliability Engineering or a related role, with a minimum of 2 years focused on Chaos Engineering.
Hands-on experience in designing and implementing reliable, scalable, and fault-tolerant systems.
Strong proficiency in Chaos Engineering tools like Gremlin, Chaos Monkey, or similar platforms.
Deep understanding of cloud infrastructure (AWS, Azure, GCP) and concepts like load balancing, autoscaling, failover, and high availability.
Proven expertise in monitoring and observability tools like Prometheus, Grafana, or Datadog.
Nice to Have:
Excellent analytical, problem-solving, and decision-making abilities.
Strong collaboration and communication skills to influence cross-functional teams.
A proactive and innovative mindset to drive continuous improvements.
What's in it for you?
We thrive on the challenge to be our best, progressive thinking to keep growing, and working together to deliver trusted advice to help our clients thrive and communities prosper. We care about each other, reaching our potential, making a difference to our communities, and achieving success that is mutual.
A comprehensive Total Rewards Program including bonuses and flexible benefits, competitive compensation, commissions, and stock where applicable
Leaders who support your development through coaching and managing opportunities
Ability to make a difference and lasting impact
Work in a dynamic, collaborative, progressive, and high-performing team
A world-class training program in financial services
Flexible work/life balance options
Opportunities to do challenging work
Opportunities to take on progressively greater accountabilities
Opportunities to build close relationships with clients
Job Skills
Chaos Engineering, Cloud Infrastructure, Site Reliability Engineering
Additional Job Details
Address: RBC WATERPARK PLACE, 88 QUEENS QUAY W:TORONTO
City: Toronto
Country: Canada
Work hours/week:
Employment Type: Full time
Platform: TECHNOLOGY AND OPERATIONS
Job Type: Regular
Pay Type: Salaried
Posted Date:
Application Deadline:
Note: Applications will be accepted until 11:59 PM on the day prior to the application deadline date above
Inclusion and Equal Opportunity Employment
At RBC, we believe an inclusive workplace that has diverse perspectives is core to our continued growth as one of the largest and most successful banks in the world. Maintaining a workplace where our employees feel supported to perform at their best, effectively collaborate, drive innovation, and grow professionally helps to bring our Purpose to life and create value for our clients and communities. RBC strives to deliver this through policies and programs intended to foster a workplace based on respect, belonging and opportunity for all.