Site Reliability Engineer
3 days ago
WE'RE HIRING

At HTG, you'll push boundaries with the latest tech and collaborate with a team that loves what they do. Be part of a design services company that is among the companies that lead the world in technology and innovation. Your next chapter starts here.

In this role, you will:
• Act as the main technical escalation point for first-level operations analysts across cloud, network, and connected device environments.
• Lead advanced troubleshooting, service restoration, and fault isolation for critical incidents, collaborating with engineering teams when required.
• Own and manage problem records by conducting detailed root cause analyses, documenting preventive actions, and tracking issue resolution through completion.
• Prepare and distribute clear and timely communication for customer-facing incident updates and internal post-incident summaries.
• Identify manual and repetitive operational work and replace it with automated solutions through scripts, scheduled jobs, or self-healing workflows.
• Define operational data requirements and contribute to refining AI and automation models used in incident management.
• Establish and maintain performance metrics and service objectives; improve monitoring and reliability through better instrumentation and observability.
• Implement safeguards and resilience mechanisms within operational systems, while promoting a culture of continuous learning and blameless retrospectives.
• Maintain and enhance monitoring tools, alerting systems, dashboards, and operational documentation supporting 24/7 availability.
• Tune monitoring thresholds and notifications to reduce noise and ensure only meaningful alerts are surfaced for action.
• Ensure complete visibility across systems through metrics, logs, and traces for effective diagnostics and performance tracking.
• Participate in operational readiness reviews and evaluate risk, rollback plans, and change impact before scheduled deployments.
• Coordinate deployments and maintenance windows, performing verification steps before and after updates.
• Track and improve deployment reliability and change success rates through post-release reviews and metrics.
• Manage and operate cloud resources including compute, storage, networking, and identity, following least-privilege and compliance principles.
• Support observability, access control, and governance standards within the cloud environment, including cost visibility and tagging policies.
• Oversee integrations with hybrid infrastructure, including connectivity, certificates, and internal networking components.
• Develop, maintain, and continuously improve operational documentation such as standard procedures, runbooks, and escalation workflows.
• Ensure the accuracy, version control, and completeness of all operational knowledge materials.
• Utilize ticketing and workflow systems for managing incidents, problems, and changes, while maintaining visibility into service performance.
• Collaborate with engineering and DevOps teams to incorporate operational needs into design and deployment processes.
• Provide training and mentorship to junior analysts, improving first-contact resolution rates and technical skill depth.
• Communicate effectively with internal teams and external partners regarding incidents, maintenance updates, and service improvements.
• Uphold security best practices in daily operations, including patch management, credential hygiene, and access reviews.
• Work with compliance and security teams to address vulnerabilities, audits, and control assessments.
• Participate in a shared on-call rotation and scheduled maintenance periods, ensuring smooth handovers and consistent shift documentation. The on-call rotation will initially involve 3 to 4 team members, progressing toward full 24/7 coverage as the team expands.

Required skills and experience:
• At least 3 years of experience in network operations, site reliability, or cloud platform support roles managing production systems
• Strong understanding of networking, VPNs, firewalls, load balancers, DNS, and certificate management
• Hands-on experience with cloud services including compute, storage, networking, and identity management
• Practical experience with both Linux and Windows systems administration
• Proficiency in one or more scripting languages such as Python, PowerShell, or Bash, and the ability to create dependable automation workflows
• Familiarity with monitoring, alerting, and telemetry systems, including the design of meaningful service-level indicators
• Working knowledge of service management platforms and workflow automation tools
• Proven ability to write accurate operational documentation, including procedures and troubleshooting guides
• Strong communication skills for both technical and customer-facing interactions

Preferred qualifications:
• Experience with Infrastructure-as-Code tools (e.g., Terraform, Bicep) and CI/CD systems
• Knowledge of IoT or distributed device management at scale
• Understanding of system reliability concepts such as graceful degradation and autoscaling
• Exposure to industrial or energy systems involving telemetry, control, or gateway operations
• Relevant certifications such as Azure Administrator, Azure Network Engineer, ITIL, or CCNA (or equivalents)

High Tech Genesis Inc. is an Equal Opportunity Employer. Diversity and inclusion are at the core of our values.

Please advise High Tech Genesis of any accommodation measures you may require.

Please be advised:
1. Applicants must have the legal right to work in Canada.
2. Kindly submit your resume in MS Word format upon application for this position.
-
Site Reliability Engineer
4 weeks ago
Quebec (QC), Canada
Compunnel Inc.
Full time
Job Title: Site Reliability Engineer
Location: Montreal (Day 1 onboarding onsite / in-office presence 3x per week)
Required Skills:
• 5 to 10 years of relevant experience
• 3 to 5 years of Linux experience
• Experience in front and back-end development with Golang
• Sound knowledge of server infrastructure, virtualization, cloud computing
• Proven Kubernetes...
-
Site Reliability Engineer
3 weeks ago
Quebec, Canada
ALLTECH CONSULTING SVC INC
Full time
Job Description:
Technology/Role/Department at our Company
Enterprise Technology & Services (ETS) delivers shared technology services for the Firm supporting all business applications and end users. ETS provides capabilities for all stages of the Firm’s software development lifecycle, enabling productive coding, functional and integration testing,...
-
Senior Site Reliability Engineer
3 weeks ago
Quebec, Canada
Orion Innovation
Full time
The Sr. SRE will be responsible for the reliability, scalability, and performance of systems supporting classified government projects in an air-gapped deployment. This role leverages advanced monitoring and DevOps tools to ensure uptime and compliance in a disconnected environment.
Key Responsibilities
Design and maintain highly reliable systems using RKE2,...
-
Senior Site Reliability Engineer
2 weeks ago
Quebec, Canada
Orion Innovation
Full time
The Sr. SRE will be responsible for the reliability, scalability, and performance of systems supporting classified government projects in an air-gapped deployment. This role leverages advanced monitoring and DevOps tools to ensure uptime and compliance in a disconnected environment.
Key Responsibilities
Design and maintain highly reliable systems using RKE2,...
-
Site Reliability Engineer
1 week ago
Montréal, QC HM H, Canada
Atlantis IT Group
Full time
$120,000 - $180,000 per year
Role - Site Reliability Engineer (SRE / GenAI Infrastructure / Kubernetes / IaC)
Location - Montreal, QC
• Production experience in SRE / Infrastructure / ops for large-scale systems
• Strong programming/scripting skills (Python, Go, Java, or equivalent)
• Deep experience with containerization (Docker), orchestration (Kubernetes, etc.)
• Infrastructure-as-code (Terraform,...
-
Site Reliability Engineer
3 weeks ago
Montréal, QC, Canada
LanceSoft, Inc.
Full time
Job Title: Site Reliability Engineer
Experience Level: Level 4 (advanced): 7-15 years
Location: Montreal (Day 1 onboarding onsite / in-office presence 3x per week)
Duration: 12+ months contract
Primary Responsibilities:
• Provide L3 support for ***'s private cloud, including on-call rotation
• Work closely with the internal engineering team and provide input on...