Site Reliability Engineer

4 days ago


Montréal, QC, Canada Atlantis IT Group Full time $120,000 - $180,000 per year

Role - Site Reliability Engineer (SRE / GenAI Infrastructure / Kubernetes / IaC)

Location - Montreal, QC

Production experience in SRE / Infrastructure / ops for large-scale systems

Strong programming/scripting skills (Python, Go, Java, or equivalent)

Deep experience with containerization (Docker), orchestration (Kubernetes, etc.)

Infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, etc.)

Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures

Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)

Networking & systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)

Solid experience in capacity planning, performance tuning, scaling, and incident response

Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements

Experience in regulated environments (financial services, compliance, audit, security) is a strong plus

Excellent communication, documentation, and cross-team collaboration skills

Proven track record of reducing operational toil via automation

Experience: 8+ years as a Site Reliability Engineer or in a similar role, with hands-on experience supporting IaaS platforms and solid networking and systems engineering knowledge.

Roles and Responsibilities:

Operate, monitor, and maintain the infrastructure supporting GenAI applications (training, inference, feature store, data ingestion, model serving)

Design and build automation for core platform capabilities, reducing manual toil

Develop and maintain infrastructure-as-code (IaC) for provisioning and managing compute, storage, network, GPU clusters, Kubernetes / container orchestration, etc.

Establish, monitor, and enforce SLOs/SLIs/SLAs, error budgets, alerting, and dashboards

Lead incident response, root cause analysis (RCA), postmortems, and systemic remediation

Perform capacity planning, scaling strategies, workload scheduling, and resource forecasting

Optimize cost vs. performance tradeoffs in large-scale compute environments

Harden systems for security, compliance, auditability, and data governance

Collaborate across teams (cloud engineers, data engineers, infrastructure, security) to ensure safe deployment, rollout, rollback, and integration of new systems

Define disaster recovery (DR) strategies, backup/restore practices, and fault tolerance mechanisms

Maintain runbooks, operational playbooks, documentation, and training materials

Participate in on-call rotations and respond to production incidents 24/7 as needed

Continuously evaluate and integrate new tools, frameworks, or technologies to enhance platform reliability



  • Montréal, QC, Canada Compunnel Inc. Full time

    Job Title: Site Reliability Engineer (SRE) Experience: 7-15 years Location: Montreal (Day 1 onboarding onsite / in office presence 3x week) Skills required: • The ideal candidate would have at least one of: Software development skills in one or more programming languages, e.g. Python, ServiceNow administration or development experience, 7+ years of...


  • Montréal, QC, Canada LanceSoft, Inc. Full time

    Job Title: Site Reliability Engineer Experience Level: Level 4 (advanced): 7-15 years Location: Montreal (Day 1 onboarding onsite / in office presence 3x week) Duration: 12+ months contract Primary Responsibilities: Provide L3 support for ***'s private cloud, including on-call rotation Work closely with the internal engineering team and provide input on...


  • Montréal, QC, Canada Compunnel Inc. Full time

    Job Title: Site Reliability Engineer Location: Montreal (Day 1 onboarding onsite / in office presence 3x week) Required Skills: • 5 to 10 years of relevant experience • 3 to 5 years of Linux experience • Experience in front and back-end development with Golang • Sound knowledge of server infrastructure, virtualization, cloud computing • Proven...

