Site Reliability Engineer

5 days ago


Montreal, Canada High Tech Genesis Inc. Full time

WE'RE HIRING

At HTG, you’ll push boundaries with the latest tech and collaborate with a team that loves what they do. Be part of a design services company that is amongst the companies that lead the world in technology and innovation. Your next chapter starts here.

In this role, you will:
• Act as the main technical escalation point for first-level operations analysts across cloud, network, and connected device environments.
• Lead advanced troubleshooting, service restoration, and fault isolation for critical incidents, collaborating with engineering teams when required.
• Own and manage problem records by conducting detailed root cause analyses, documenting preventive actions, and tracking issue resolution through completion.
• Prepare and distribute clear and timely communication for customer-facing incident updates and internal post-incident summaries.
• Identify manual and repetitive operational work and replace it with automated solutions through scripts, scheduled jobs, or self-healing workflows.
• Define operational data requirements and contribute to refining AI and automation models used in incident management.
• Establish and maintain performance metrics and service objectives; improve monitoring and reliability through better instrumentation and observability.
• Implement safeguards and resilience mechanisms within operational systems, while promoting a culture of continuous learning and blameless retrospectives.
• Maintain and enhance monitoring tools, alerting systems, dashboards, and operational documentation supporting 24/7 availability.
• Tune monitoring thresholds and notifications to reduce noise and ensure only meaningful alerts are surfaced for action.
• Ensure complete visibility across systems through metrics, logs, and traces for effective diagnostics and performance tracking.
• Participate in operational readiness reviews and evaluate risk, rollback plans, and change impact before scheduled deployments.
• Coordinate deployments and maintenance windows, performing verification steps before and after updates.
• Track and improve deployment reliability and change success rates through post-release reviews and metrics.
• Manage and operate cloud resources including compute, storage, networking, and identity, following least-privilege and compliance principles.
• Support observability, access control, and governance standards within the cloud environment, including cost visibility and tagging policies.
• Oversee integrations with hybrid infrastructure, including connectivity, certificates, and internal networking components.
• Develop, maintain, and continuously improve operational documentation such as standard procedures, runbooks, and escalation workflows.
• Ensure the accuracy, version control, and completeness of all operational knowledge materials.
• Utilize ticketing and workflow systems for managing incidents, problems, and changes, while maintaining visibility into service performance.
• Collaborate with engineering and DevOps teams to incorporate operational needs into design and deployment processes.
• Provide training and mentorship to junior analysts, improving first-contact resolution rates and technical skill depth.
• Communicate effectively with internal teams and external partners regarding incidents, maintenance updates, and service improvements.
• Uphold security best practices in daily operations, including patch management, credential hygiene, and access reviews.
• Work with compliance and security teams to address vulnerabilities, audits, and control assessments.
• Participate in a shared on-call rotation and scheduled maintenance periods, ensuring smooth handovers and consistent shift documentation.
• The on-call rotation will initially involve 3 to 4 team members, progressing toward full 24/7 coverage as the team expands.

Required skills and experience:
• At least 3 years of experience in network operations, site reliability, or cloud platform support roles managing production systems
• Strong understanding of networking, VPNs, firewalls, load balancers, DNS, and certificate management
• Hands-on experience with cloud services including compute, storage, networking, and identity management
• Practical experience with both Linux and Windows systems administration
• Proficiency in one or more scripting languages such as Python, PowerShell, or Bash, and ability to create dependable automation workflows
• Familiarity with monitoring, alerting, and telemetry systems, including the design of meaningful service-level indicators
• Working knowledge of service management platforms and workflow automation tools
• Proven ability to write accurate operational documentation, including procedures and troubleshooting guides
• Strong communication skills for both technical and customer-facing interactions

Preferred Qualifications:
• Experience with Infrastructure-as-Code tools (e.g., Terraform, Bicep) and CI/CD systems
• Knowledge of IoT or distributed device management at scale
• Understanding of system reliability concepts such as graceful degradation and autoscaling
• Exposure to industrial or energy systems involving telemetry, control, or gateway operations
• Relevant certifications such as Azure Administrator, Azure Network Engineer, ITIL, or CCNA (or equivalents)

High Tech Genesis Inc. is an Equal Opportunity Employer. Diversity and inclusion are at the core of our values. Please advise High Tech Genesis of any accommodation measures you may require.

Please be advised:
1. Applicants must have the legal right to work in Canada.
2. Kindly submit your resume in MS Word format upon application for this position.



  • Montreal, Canada ApTask Full time

    Looking for an intermediate candidate with 2 to 5 years' experience. The Application Infrastructure (AI) department is seeking a Site Reliability Engineer (SRE) to help drive the reliability engineering, operations and customer support services for the client's ServiceNow SaaS implementation. Reporting to a Site Reliability...


  • Montreal, Canada Compunnel Inc. Full time

    Site Reliability Engineer – KUMDC5681698 Long Term Contract The Application Infrastructure (AI) department is seeking a Site Reliability Engineer (SRE) to drive reliability engineering, operations, and customer support services for ServiceNow SaaS implementation. Reporting to the Site Reliability Engineering & Operations Lead, this role involves delivering...


  • Montreal, Canada Compunnel Inc. Full time

    Site Reliability Engineer – KUMDC Long Term Contract The Application Infrastructure (AI) department is seeking a Site Reliability Engineer (SRE) to drive reliability engineering, operations, and customer support services for ServiceNow SaaS implementation. Reporting to the Site Reliability Engineering & Operations Lead, this role involves delivering SRE...


  • Montreal, Canada Open Systems Technologies Full time

    Site Reliability Engineer (SRE), ServiceNow, Application Infrastructure Location: Montreal – Hybrid – 3 days/week The Application Infrastructure (AI) department is seeking a Site Reliability Engineer (SRE) to help drive reliability engineering, operations and customer support services for client’s ServiceNow SaaS implementation. Reporting to a Site...


  • Montreal, Quebec, Canada Open Systems Technologies Full time

    Job Title: Site Reliability Engineer. Location: Montreal – Hybrid – 3 days/week. Term: 12-month contract plus extension. The Application Infrastructure (AI) department is seeking a Site Reliability Engineer (SRE) to help drive reliability engineering, operations and customer support services for client's ServiceNow SaaS implementation. Reporting to a Site...


  • Montreal, Quebec, Canada Roshan Consulting Services Full time

    Company Description: Roshan Consulting empowers businesses to optimize operations and enhance efficiency through innovative strategies and technologies tailored to their unique needs. Our mission is to drive digital transformation and deliver sustainable growth by offering services such as Robotic Process Automation (RPA), business process optimization, and...


  • Montreal, Canada Compunnel, Inc. Full time

    Client is seeking an experienced Site Reliability Engineer (SRE) to support and enhance the reliability, performance, and operational efficiency of our global ServiceNow SaaS platform. As part of the Application Infrastructure (AI) team, you will be instrumental in advancing SRE practices, ensuring seamless integration and stability across on-premise...

