Site Reliability Engineer

9 hours ago


Toronto, Ontario, Canada Maneva Full time US$80,000 - US$120,000 per year
About Maneva

Maneva builds and deploys edge AI solutions powering real-time intelligence for industrial environments. Our systems run on distributed edge compute devices (NVIDIA Jetson platforms), integrate with local network cameras, PLCs, sensors, and other on-premise equipment, and securely communicate with cloud services via client- or site-based VPNs. Our customers rely on our systems around the clock, and we take reliability seriously.

We're seeking a Site Reliability Engineer (SRE) who enjoys solving complex operational challenges, improving observability and automation, and supporting mission-critical workloads in production.

About the Role

As a Site Reliability Engineer at Maneva, you will ensure the reliability, availability, and performance of our edge AI deployments at customer sites. This includes gaining deep familiarity with Maneva's hardware platform, networking configurations, and application stack so that you can rapidly diagnose and resolve issues as they arise.

The role includes participating in an on-call rotation for 24/7 incident response, including off-hours coverage as part of a structured global support model. When not responding to incidents, you will contribute to long-term engineering initiatives around monitoring, automation, reliability, and documentation.

Responsibilities

Operational Support & Incident Response

  •  Serve as a first responder for production issues, alarms, and system outages (24/7 rotation required).
  •  Troubleshoot Linux system issues, hardware problems, networking connectivity, and edge-device performance.
  •  Perform root-cause analysis (RCA) and implement corrective and preventive solutions.
  •  Document incidents, contributing to a culture of transparency and process improvement.

Proactive Monitoring & Observability

  •  Build and maintain robust monitoring dashboards and alerts using Prometheus, Grafana, and similar tools.
  •  Continuously improve observability, including metrics, logs, traces, and health checks.
  •  Analyze trends to proactively identify reliability risks before incidents occur.
  •  Develop automation to reduce alert noise and make alerts more actionable.

Systems Reliability & DevOps Engineering

  •  Improve deployment workflows, CI/CD pipelines, configuration management, and automated provisioning.
  •  Create tools and scripts in Python/Bash to streamline operational processes.
  •  Contribute to load testing, system validation, and network health verification for edge deployments.
  •  Implement best practices for secure, scalable, and maintainable infrastructure.

Infrastructure & Application Ownership

  •  Understand and operate Maneva's end-to-end edge AI stack:
      •  Jetson/embedded Linux systems
      •  GPU-accelerated workloads for computer vision
      •  Video pipelines (RTSP, camera interfaces, data ingestion)
      •  Local integrations (PLCs, industrial hardware, APIs, network resources)
      •  VPN-based connectivity (client-based or site-to-site)
  •  Maintain visibility into device health and fleet-wide system performance.

Documentation & Process Development

  •  Create and maintain SOPs for on-site customer teams and internal engineering workflows.
  •  Produce detailed incident reports and reliability documentation.
  •  Maintain internal knowledge bases, troubleshooting guides, and playbooks.

Requirements

Technical Skills

  •  Strong Linux systems administration experience (Ubuntu, embedded Linux, ARM systems).
  •  Proficiency in Python and/or Bash for scripting and operations automation.
  •  Solid networking fundamentals: TCP/IP, routing, DNS, DHCP, VPNs, VLANs, firewall rules.
  •  Familiarity with troubleshooting tools: tcpdump, nmap, iftop, netstat, etc.
  •  Hands-on experience with Prometheus, Grafana, or similar monitoring/alerting platforms.
  •  Experience with logging/observability stacks (ELK/EFK, Loki, Fluentd, etc.) is a plus.
  •  Experience with Docker or containerized applications is desirable.
  •  Comfort supporting distributed or remote device fleets.

Soft Skills

  •  Excellent diagnostic and analytical abilities under pressure.
  •  Strong communication skills with both technical and non-technical stakeholders.
  •  Strong sense of ownership and the ability to follow issues through to resolution.
  •  Comfortable working independently in a fully remote environment.
  •  Willingness to participate in on-call rotation, including off-hours and weekends.

Preferred Qualifications

  •  Experience supporting machine learning, computer vision, or GPU-accelerated systems.
  •  Familiarity with NVIDIA Jetson or other embedded AI hardware.
  •  Prior SRE/DevOps/Systems Engineer experience in a 24/7 operational environment.
  •  Exposure to industrial IoT, manufacturing systems, or operational technology (OT).
  •  Experience writing customer-facing operational documentation or SOPs.

Benefits

What We Offer

  •  Fully remote work environment with flexibility (within on-call requirements).
  •  Opportunities to work with cutting-edge edge compute and AI deployments.
  •  A high-impact role shaping reliability practices from early stages.
  •  Contract or full-time options, with competitive compensation.
  •  A collaborative team committed to transparency, improvement, and excellence.

