Site Reliability Engineer

6 days ago


Canada CENGN - (Centre of Excellence in Next Generation Networks) Full time

Use of AI in Hiring: No, we do not use AI in screening/selection.
Vacancy Status: This posting is for an existing vacancy. We are hiring for 1 position.

About Us

CENGN is Canada’s Centre of Excellence in Next Generation Networks. Our mission is to drive innovation and adoption of advanced networking technologies in Canada through our Living Labs and advanced networking infrastructure, technical expertise, talent development, and partner ecosystem, enabling the digital transformation and competitiveness of Canadian industry and the commercial growth of Canadian digital technology solutions.

With the digital transformation opportunity valued at over $200 billion in Canada, it is clear the new competitive landscape is being driven by digital innovation and the ability to integrate this technology across industries.

Join our team, as we work with our ecosystem of technology, innovation, government, and academic partners to build Living Lab testing infrastructure and deliver services that accelerate the testing, validation, demonstration, commercialization and adoption of digital innovation across Canada.

The CENGN Advantage

  • Work where you work best: Remote environment to suit your individual professional and personal needs
  • Career Development: An agile company in a modern setting where your ideas and opportunities for growth are nurtured and encouraged
  • Canadian Innovation Support: Be part of an organization that drives innovation by providing Canadian start‑ups and scale‑ups as well as tech students and professionals the ability to succeed
  • Great People: The advantage of working with colleagues passionate about their contributions and united under the same mission

Attractive and Competitive Group Benefit Plan

  • Phone plan reimbursement
  • Employer paid RSP contribution with no matching requirement

Wellness and Development

  • Annual fitness allowance
  • Wellness webinars, lunch and learns, and social events

Vacation and Time Off

  • Three weeks vacation plus personal and sick days
  • Annual Christmas shutdown

The Opportunity

We are seeking a Site Reliability Engineer (SRE) with a balanced background in data center operations, DevOps methodologies, and cloud‑native practices. In this role, you will be responsible for maintaining the reliability, scalability, and efficiency of both on‑premises and cloud‑based infrastructure, ensuring seamless service delivery and operational excellence.

Key Responsibilities

Data Center Operations:

  • Manage and maintain hardware infrastructure, including servers, networking, and storage systems.
  • Perform routine maintenance, hardware replacements, and capacity planning.
  • Oversee power and cooling requirements, ensuring optimal environmental conditions.
  • Implement and maintain structured cabling and rack management best practices.
  • Work with vendors for hardware procurement, support contracts, and troubleshooting hardware failures.
  • Ensure proper monitoring is in place for all devices in the DC using SNMP, telemetry, and log aggregation.
  • Establish a robust monitoring funnel to collect, process, and analyze metrics from network switches, storage systems, and compute resources, enabling proactive issue resolution and performance optimization.

Automation & Infrastructure as Code (IaC): Develop automation scripts using Terraform, Ansible, or similar tools to manage infrastructure.

Cloud‑Native & Hybrid Deployments: Design, implement, and maintain Kubernetes‑based workloads across on‑prem and cloud environments (AWS, Azure, or GCP).

Monitoring & Incident Response: Implement observability solutions using tools like Prometheus, Grafana, Datadog, or ELK. Respond to incidents, perform root cause analysis, and drive post‑mortems.

CI/CD Pipeline Management: Develop and optimize CI/CD pipelines using Jenkins, GitLab CI, or ArgoCD to improve software deployment processes.

Security & Compliance: Implement security best practices for both on‑prem and cloud environments, including access controls, patching, and vulnerability management.

Performance Optimization: Analyze and optimize system performance, focusing on scalability and high availability.

Collaboration & Documentation: Work closely with developers, operations teams, and security engineers. Maintain comprehensive documentation for systems, processes, and procedures.

Key Competencies/Qualifications

  • Experience: 3-5 years in SRE, DevOps, or Systems Engineering roles.
  • Data Center Knowledge: Hands‑on experience with racking, cabling, troubleshooting hardware, and managing networking equipment.
  • Linux Administration: Proficiency in managing Linux‑based systems (Ubuntu, RHEL, or CentOS).
  • Cloud Expertise: Experience working with AWS, Azure, or GCP, including networking, compute, and storage services.
  • Container Orchestration: Hands‑on experience with Kubernetes and containerized workloads (Docker, Helm, Istio, etc.).
  • Automation & Scripting: Strong experience with scripting languages (Bash, Python, or Go) and configuration management tools (Ansible, Puppet, or Chef).
  • CI/CD & GitOps: Experience with tools like Jenkins, GitLab CI, or ArgoCD for automated deployments.
  • Observability & Logging: Knowledge of monitoring tools such as Prometheus, Grafana, Loki, ELK, or Datadog.
  • Strong IT operational background with day‑to‑day responsibility for servers, networks, monitoring, and incident response.
  • Hands‑on experience with virtual networking, bonded interfaces, VLAN segmentation, VRFs, and templated VM rollouts.
  • Hands‑on experience administering Proxmox hypervisors for multi‑tenant VM deployments.

Preferred Qualifications

  • Networking Knowledge: Understanding of TCP/IP, VPNs, firewalls, and SDN concepts.
  • Disaster Recovery & Backup Strategies: Experience implementing robust backup and failover strategies.
  • Ansible Advanced Usage: Expertise in developing scalable and reusable Ansible playbooks for infrastructure automation.
  • Advanced Kubernetes Administration: Experience with Kubernetes security policies, RBAC, service mesh implementations, and multi‑cluster management.
  • Practical experience operating Ceph storage clusters, including pool management and basic performance troubleshooting.

Education

  • Minimum 3 years of experience in a progressive similar role
  • Bachelor’s degree or College Diploma in Engineering or Computer Science or related discipline

Languages

  • English (oral, reading, and writing)
  • French (oral, reading, and writing) or any other language would be considered an asset

Job Details

  • Seniority level: Entry level
  • Employment type: Full‑time
  • Job function: Engineering and Information Technology
  • Industries: IT Services and IT Consulting



  • Canada : Ontario : Toronto Scotiabank Global Site Full time $105,000 - $170,000 per year

    Requisition ID: 244026. Join a purpose driven winning team, committed to results, in an inclusive and high-performing culture. Overview: As a Site Reliability Engineer (SRE), you will join the Digital Engineering Operations team, responsible for ensuring the operations and reliability of Scotiabank digital applications. You will have the opportunity to drive...


  • Canada : Ontario : Toronto Scotiabank Global Site Full time US$80,000 - US$140,000 per year

    Requisition ID: 244027. Join a purpose driven winning team, committed to results, in an inclusive and high-performing culture. Overview: As a Site Reliability Engineer (SRE), you will join the Digital Engineering Operations team, responsible for ensuring the operations and reliability of Scotiabank digital applications. You will have the opportunity to drive...


  • Canada Orion Innovation Full time

    Senior Site Reliability Engineer (SRE) with Kubernetes & Rancher Location: Canada - Remote (Working EST hours) Job Type: Full-time About the Role Are you an exceptional Site Reliability Engineer with a passion for building and maintaining highly resilient and secure systems? We are seeking a Senior SRE to join our team and play a critical role in managing...


  • Canada Thinkific Full time

    Are you an experienced Site Reliability Engineer looking for a new challenge? We’re looking for a Senior Site Reliability Engineer to join us at Thinkific. We’re looking for a Senior Site Reliability Engineer...


  • Canada : Ontario : Toronto Scotiabank Global Site Full time $120,000 - $180,000 per year

    Requisition ID: 239640. Join a purpose driven winning team, committed to results, in an inclusive and high-performing culture. The Role: As a member of the Systems Reliability Engineering team, the System Reliability Engineer will collaborate closely with Engineering and development teams, peers, and business partners to continuously improve the stability,...


  • Canada Icon Full time

    Helping SaaS companies scale Engineering teams. Director, Site Reliability Engineering We are seeking an accomplished Director of Site Reliability Engineering (SRE) to lead the reliability, scalability, and performance initiatives across multiple enterprise technology domains, including AML, Risk, Finance, Corporate Treasury, and Human Resources systems....


  • Canada Orion Innovation Full time

    Job Description: Senior Site Reliability Engineer (SRE) with Kubernetes & Rancher Location: Canada - Remote (Working EST hours) Job Type: Full-time About the Role Are you an exceptional Site Reliability Engineer with a passion for building and maintaining highly resilient and secure systems? We are seeking a Senior SRE to join our team and play a critical...


  • Canada Akamai Technologies Full time

    Senior Site Reliability Engineer Join Akamai Technologies as we build a reliable, secure, and scalable Internet. We are looking for a Senior Site Reliability Engineer to help us solve complex performance and reliability challenges. Job Description Are you passionate about cutting‑edge technology and ready to tackle some of the Internet’s most difficult...


  • Canada Targeted Talent Full time

    Overview: We are looking for an experienced Senior Site Reliability Engineer for our client. This is a permanent position that is remote to start, with later relocation to Calgary or Winnipeg. Our client is a global enterprise company with a product that you've likely used. Experience with coding/software development, along with Site Reliability will be the...


  • Canada DuckDuckGo Full time

    Who We Are: Hi, we're DuckDuckGo, the online protection company and remote-first team of 300+ on a mission to raise the standard of trust online. Founded in 2008 and profitable since 2014, our annual revenue now exceeds $100 million USD. Millions use our...