Manager, Site Reliability Engineer

1 week ago


Canada Command Alkon Incorporated. Full time

Title: Manager, Site Reliability Engineer (SRE)

Summary of Role

The Site Reliability Engineer (SRE) Manager leads the teams responsible for ensuring the availability, performance, and reliability of mission‑critical systems. This role bridges the gap between software engineering and operations by implementing automation, observability, and scalability practices. The SRE Manager sets the vision for reliability engineering, enabling rapid product delivery while maintaining high service uptime and customer trust. Responsibilities span building resilient infrastructure, driving incident management processes, optimizing system performance, and fostering a culture of continuous improvement. The role also requires staying ahead of industry practices in monitoring, automation, and distributed systems, ensuring the organization delivers secure, reliable, and scalable services.

How You’ll Succeed

  • Ensure Reliability & Uptime: Monitor and manage the reliability of production systems, maintaining high availability and scalability across global environments.
  • Incident Leadership: Lead incident response, root cause analysis, and postmortem reviews while driving systemic improvements.
  • Automation First: Reduce manual work by implementing automation for deployments, monitoring, capacity management, and self‑healing systems.
  • Service Level Ownership: Define, track, and enforce SLAs, SLOs, and SLIs across services, ensuring alignment with business objectives.
  • Operational Excellence: Optimize performance, capacity planning, disaster recovery, and resilience engineering practices.
  • Cross‑Team Collaboration: Partner with product engineering, DevOps, and security to design for reliability from the ground up.
  • Team Leadership: Mentor, coach, and guide SRE teams to drive technical growth and operational maturity.
  • Process Improvement: Continuously refine reliability engineering processes, integrating lessons learned into new standards and SOPs.
  • Customer‑Centric Mindset: Advocate for reliability as a core feature, ensuring customer experience and trust are at the forefront.

What You Bring

  • Strong leadership experience managing SRE or operations‑focused engineering teams.
  • Expertise in distributed systems, cloud‑native architectures, and large‑scale production environments.
  • Proficiency with observability tools, performance monitoring, and incident management frameworks.
  • Deep knowledge of automation, CI/CD, infrastructure as code, and cloud services (AWS, Azure, GCP).
  • Familiarity with chaos engineering, resilience design patterns, and capacity planning.
  • Clear understanding of SLAs, SLOs, and SLIs and how to implement them across services.
  • Solid background in system security, compliance, and risk management in production environments.
  • Ability to balance reliability with speed of delivery by partnering closely with development and product leaders.
  • Proven ability to develop talent, build high‑performing teams, and cultivate collaboration.

Who You Are

  • Manages Complexity – You make sense of complex, high quantity, and sometimes contradictory information to effectively solve problems.
  • Decision Quality – You make good and timely decisions that keep the organization moving forward.
  • Optimizes Work Processes – You know the most effective and efficient processes to get things done, with a focus on continuous improvement.
  • Builds Effective Teams – You build strong‑identity teams that apply their diverse skills and perspectives to achieve common goals.
  • Strategic Mindset – You see ahead to future possibilities and translate them into breakthrough strategies.

All Company Core Competencies

  • Customer Focus: You build strong customer relationships and deliver customer‑centric solutions.
  • Cultivates Innovation: You create new and better ways for the organization to be successful.
  • Collaborates: You build partnerships and work collaboratively with others to meet shared objectives.
  • Instills Trust: You gain the confidence and trust of others through honesty, integrity, and authenticity.
  • Self‑Development: You actively seek new ways to grow and be challenged using both formal and informal development channels.
  • Develops Talent (Mgmt Only): You develop people to meet both their career goals and the organization’s goals.



  • (s): Canada : Ontario : Toronto Scotiabank Global Site Full time

    Requisition ID: 244027. Join a purpose-driven winning team, committed to results, in an inclusive and high-performing culture. Overview: As a Site Reliability Engineer (SRE), you will join the Digital Engineering Operations team, responsible for ensuring the operations and reliability of Scotiabank digital applications. You will have the opportunity to drive...


  • (s): Canada : Ontario : Toronto Scotiabank Global Site Full time

    Requisition ID: 244026. Join a purpose-driven winning team, committed to results, in an inclusive and high-performing culture. Overview: As a Site Reliability Engineer (SRE), you will join the Digital Engineering Operations team, responsible for ensuring the operations and reliability of Scotiabank digital applications. You will have the opportunity to drive...


  • Canada SPECTRAFORCE Full time

    Job Title: DevOps/Site Reliability Engineer. Duration: 12+ months. Core hours of the position: somewhat flexible, but able to attend meetings and collaborate with team members between 8 am Pacific and 3 pm Pacific. Team members are located in Pacific, Mountain, Central, and Eastern time zones. Top 3 items to see on resumes: 5+ years of experience in DevOps, Site...


  • Canada SPECTRAFORCE Full time

    Job Title: DevOps/Site Reliability Engineer. Duration: 12+ months. Locations: Ontario, Toronto, Vancouver, Montreal (100% remote). Core hours of the position: somewhat flexible, but able to attend meetings and collaborate with team members between 8 am Pacific and 3 pm Pacific. Team members are located in Pacific, Mountain, Central, and Eastern time zones. Top 3...


  • Canada Blue Signal Search Full time

    Site Reliability Engineer Location: Remote, Canada Our client is a fast-growing provider of AI-driven edge-computing platforms that keep industrial operations safe, smart, and always on. Their distributed hardware and software suite processes high-volume video and sensor data at the edge, delivering real-time insight for customers who cannot afford downtime....

