Senior Site Reliability Engineer
3 weeks ago
We are seeking a highly specialized and experienced Senior Site Reliability Engineer (SRE) to drive the reliability, performance, and automation of our core platform. This role requires an exceptional blend of deep programming expertise in both Ruby and Go, coupled with hands-on mastery of Linux systems, advanced networking concepts (specifically IPSec), and modern container orchestration. You will be instrumental in bridging development velocity with operational stability, creating self-healing systems, and eliminating toil across the organization.

What You Will Do
Engineering Reliability: Design, implement, and maintain highly available, scalable, and secure production infrastructure using Infrastructure as Code (IaC) principles.
System & Tooling Development: Lead the development of new internal services, automation tools, and critical performance enhancements, leveraging Go for high-performance systems and Ruby for platform-specific tooling and automation.
Networking & Security: Configure, manage, and troubleshoot advanced network connectivity, with a focus on IPSec tunnels and complex Linux networking for secure data transfer between environments.
Observability: Build and optimize comprehensive monitoring and alerting systems using Prometheus and Grafana to define and meet critical Service Level Objectives (SLOs).
Container Management: Own the deployment, lifecycle, and configuration of containerized applications using Helm to standardize releases across Kubernetes clusters.
Automation: Drastically reduce operational toil by writing sophisticated scripts in Python and advanced Shell Scripting/PowerShell for operational tasks, CI/CD pipelines, and automated incident response.
Incident Response: Serve as a technical escalation point in on-call rotations, leading root cause analyses (RCAs) and implementing preventative measures to ensure issues never recur.
GitOps: Utilize Git workflows for all infrastructure and application configuration changes, promoting a systematic and auditable approach to system management.

Required Skills & Experience

Deep Technical Expertise (Must Haves)
Expert Proficiency in Ruby (5+ years): Demonstrated ability to write, debug, and optimize complex services and high-level automation/frameworks in Ruby.
Expert Proficiency in Go (Golang) (3+ years): Proven experience building production-grade infrastructure tools, microservices, or platform components in Go.
Networking Security (IPSec): Deep, hands-on experience configuring, troubleshooting, and securing VPNs/site-to-site connectivity using IPSec in a production environment.
Container Orchestration: Expertise with Helm for packaging and deploying applications on Kubernetes (K8s).
Observability: Extensive experience setting up, administering, and utilizing the Prometheus time-series database and Grafana for data visualization and dashboard creation.
Operating Systems: Mastery of the Linux operating system and its internals.
Education: Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent practical experience.

Core SRE Skills - Preferred
Automation & Scripting: Strong proficiency in Python and robust Shell Scripting (Bash/Zsh) or PowerShell for system automation and administration.
Infrastructure as Code & Version Control: Excellent knowledge of Git for source control and significant experience with IaC tools (e.g., Terraform, Ansible) for infrastructure automation.
Troubleshooting: Elite-level debugging and performance tuning skills across the entire stack (kernel, network, application, database).

Seniority level: Mid-Senior level
Employment type: Full-time
Job function: Engineering and Information Technology
Industries: IT Services and IT Consulting
Site Reliability Engineer (SRE) - Platform Infrastructure team (100% Remote - Canada)
-
Senior Site Reliability Engineer
2 days ago
Canada, Thinkific, Full time
Are you an experienced Site Reliability Engineer looking for a new challenge? We’re looking for a Senior Site Reliability Engineer to join us at Thinkific...
-
Senior Site Reliability Engineer
3 weeks ago
Canada, Akamai Technologies, Full time
Join Akamai Technologies as we build a reliable, secure, and scalable Internet. We are looking for a Senior Site Reliability Engineer to help us solve complex performance and reliability challenges.
Job Description: Are you passionate about cutting-edge technology and ready to tackle some of the Internet’s most difficult...
-
Senior Site Reliability Engineer
2 days ago
Canada, DuckDuckGo, Full time
Who We Are: Hi, we're DuckDuckGo, the online protection company and remote-first team of 300+ on a mission to raise the standard of trust online. Founded in 2008 and profitable since 2014, our annual revenue now exceeds $100 million USD. Millions use our...
-
Senior Site Reliability Engineer
3 weeks ago
Canada, Orion Innovation, Full time
Job Description: Senior Site Reliability Engineer (SRE) with Kubernetes & Rancher
Location: Canada - Remote (Working EST hours)
Job Type: Full-time
About the Role: Are you an exceptional Site Reliability Engineer with a passion for building and maintaining highly resilient and secure systems? We are seeking a Senior SRE to join our team and play a critical...
-
Senior Site Reliability Engineer
3 weeks ago
Canada, Targeted Talent, Full time
Overview: We are looking for an experienced Senior Site Reliability Engineer for our client. This is a permanent position that is remote to start, with later relocation to Calgary or Winnipeg. Our client is a global enterprise company with a product that you've likely used. Experience with coding/software development, along with Site Reliability will be the...
-
Senior Site Reliability Engineer
2 days ago
Canada, TextNow, Full time
Base pay range: CA$113,400.00/yr - CA$162,000.00/yr. This range is provided by TextNow. Your actual pay will be based on your skills and experience; talk with your recruiter to learn more.
We believe communication belongs to everyone. We exist to democratize phone service. TextNow is evolving the way the world connects and that's because we're made up of...
-
Site Reliability Engineer
3 weeks ago
Canada, Orion Innovation, Full time
Senior Site Reliability Engineer (SRE) with Kubernetes & Rancher
Location: Canada - Remote (Working EST hours)
Job Type: Full-time
About the Role: Are you an exceptional Site Reliability Engineer with a passion for building and maintaining highly resilient and secure systems? We are seeking a Senior SRE to join our team and play a critical role in managing...
-
Senior Site Reliability Engineer
3 weeks ago
Canada, TekRek, Full time
Base pay range: CA$90.00/hr - CA$120.00/hr. This range is provided by TekRek. Your actual pay will be based on your skills and experience; talk with your recruiter to learn more.
Senior Site Reliability Engineer – Distributed Systems, Kubernetes, AWS/GCP
The Company: TekRek has partnered with a fast-scaling AI infrastructure company building one of the...
-
Senior Site Reliability Engineer
3 weeks ago
Canada, D-Wave, Full time
D-Wave (NYSE: QBTS) is a leader in the development and delivery of quantum computing systems, software, and services. We are the world’s first commercial supplier of quantum computers, and the only company building both annealing and gate-model quantum computers. Our mission is...
-
Site Reliability Engineer
2 days ago
Canada, Bitcomplete, Full time
Join us as a Senior Site Reliability Engineer to help us run an industry-scale GPU cluster via Kubernetes. Together with senior members of our team, you will combine your strong understanding of system scaling and security practices with your cloud-native expertise to stand up and maintain Kubernetes clusters from scratch. Your role will also be pivotal in...