Head of Site Reliability Engineering
13 hours ago
At Shakudo, we are building the world's first operating system for data and AI. We use the term operating system in the truest sense of the word. Like iOS, Windows and Linux, Shakudo's end-to-end OS offers ever-evolving, automatically operated, best-of-breed open-source components tailored to each business's unique needs.
The Role
We are hiring a Head of Site Reliability Engineering to lead the reliability, availability, and performance strategy of our platform. This role is ideal for someone who thrives on solving infrastructure challenges, scaling cloud-native systems, and building high-performance teams. You will work cross-functionally with engineering, product, and customer success to make Shakudo's platform rock-solid and resilient for our customers around the world.
What You'll Do
- Build and lead the SRE function at Shakudo, setting goals and technical direction and shaping team culture
- Own uptime, reliability, and incident response for our platform
- Architect scalable infrastructure using Kubernetes, cloud-native tooling, and automation frameworks
- Lead the design of observability, monitoring, and alerting systems to proactively detect and prevent issues
- Create and enforce best practices for CI/CD, disaster recovery, and service-level objectives (SLOs)
- Partner closely with engineering and product to ensure new features are reliable and production-ready
- Mentor engineers and help instill a culture of operational excellence
- 8+ years of experience in infrastructure, DevOps, or SRE roles with increasing responsibility
- Proven experience scaling distributed systems in a high-availability, production environment
- Expertise with Kubernetes, Terraform, containerization, and at least one major cloud provider (AWS preferred)
- Strong knowledge of system design, networking, and reliability principles
- Experience with observability tools (e.g., Prometheus, Grafana, Datadog) and incident response practices
- Strong leadership and communication skills, with a hands-on, collaborative approach
- Experience supporting data pipelines, ML workloads, or complex orchestration systems
- Familiarity with the data/ML tooling ecosystem (e.g., Airflow, dbt, Spark, Dremio)
- Previous experience in a startup or high-growth environment
-
Site Reliability Engineer
21 hours ago
Toronto, Ontario, Canada Procom Full time $80,000 - $120,000 per year Site Reliability Engineer (SRE) / Ingénieur Fiabilité des Sites. On behalf of our banking client, Procom is seeking a Site Reliability Engineer (SRE) for a 12-month contract role. This position is a hybrid role, 3 days a week onsite at our client's Montréal, Quebec office. Site Reliability Engineer - Job Description: The Site Reliability Engineer is...
-
Site Reliability Engineer
21 hours ago
Toronto, Ontario, Canada Maneva Full time US$80,000 - US$120,000 per year About Maneva: Maneva builds and deploys edge AI solutions powering real-time intelligence for industrial environments. Our systems run on distributed edge compute devices (NVIDIA Jetson platforms), integrate with local network cameras, PLCs, sensors, and other on-premise equipment, and securely communicate with cloud services via client- or site-based VPNs....
-
Site Reliability Engineer
6 days ago
Toronto, Ontario, Canada Tecsys Inc. Full time $85,000 - $130,000 per year Having recognized the advantages of remote work, including employee morale, productivity, reduced commuting on employee wellbeing and the environment, we are proud to be a digital-first company. The technologies and programs in which we invested have provided a fantastic foundation to this end. Our digital-first work environment, together with our...
-
Site Reliability Engineer
19 hours ago
Toronto, Ontario, Canada Apptoza Inc. Full time $30,000 - $120,000 per year Hi, I hope you are doing well. If the job description below suits you, please share your updated resume as soon as possible. Site Reliability Engineer. Location: Toronto (onsite). Duration: 6 months. Experience required: 10 years. Job Description: Job Title: SRE. Technical/Functional Skills: • 8+ years of overall IT experience. • Advanced Linux / Unix support experience required. • Strong shell...
-
Site Reliability Engineer
19 hours ago
Toronto, Ontario, Canada Xplor Full time $125,000 - $150,000 Company Description: Take a seat on the Xplor rocketship and join us as Site Reliability Engineer to help people succeed across the world. From dropping your kids off at childcare, getting something at home repaired, going to the gym or a fitness studio, to picking up your dry cleaning, our software, payments, and commerce-enabling solutions help everyday...
-
Site Reliability Engineer
1 week ago
Toronto, Ontario, Canada Pixomondo Full time $120,000 - $180,000 per year We're seeking an experienced Site Reliability Engineer to join our team and lead infrastructure automation, CI/CD workflows, and deployment operations for a custom web platform. You'll be working with a modern DevOps stack including GitHub Actions, GCP, Kubernetes, Terraform, PostgreSQL, CodeDeploy, and Cloudflare to ensure our platform is robust, scalable,...
-
Site Reliability Engineer
6 days ago
Toronto, Ontario, Canada Kablamo Full time $90,000 - $120,000 per year Reports to: Technical Support Manager. Location: Toronto (Hybrid). Role Type: Full time. Level: Intermediate/Mid. Introduction: Kablamo is a fast-growing cloud digital product development company. Founded in 2017 in Australia, the business has grown quickly over the last several years, including the expansion of the team to Canada in 2021. We are proud to have...
-
Site Reliability Engineer
18 hours ago
Toronto, Ontario, Canada McCain Foods Full time $102,700 - $137,000 per year Position Title: Site Reliability Engineer. Position Type: Regular - Full-Time. Position Location: Toronto HQ. Requisition ID: 36904. Our Global Technology team's goal is to leverage technology and data to drive profitable growth, focus on enhancing customer experience and to further our purpose of 'Celebrating real connections through delicious, planet-friendly food'....
-
Lead Site Reliability Engineer
1 week ago
Toronto, Ontario, Canada AceStack Full time $120,000 - $200,000 per year Job Title: Lead Site Reliability Engineer – Banking Domain (Wealth Management Preferred). Location: Toronto Downtown, ON (Onsite – 5 Days/Week). Duration: Contract. Experience: 14+ Years. About the Role: We are looking for a highly skilled Site Reliability Engineering (SRE) Lead with a strong background in the Banking domain, ideally within Wealth Management. The...
-
Site Reliability Engineer
6 days ago
Toronto, Ontario, Canada AstraNorth Full time $90,000 - $120,000 per year Site Reliability Engineer (SRE) with expertise in Dynatrace monitoring, log investigation, and observability practices. The ideal candidate will have a deep understanding of business processes, upstream-downstream dependencies, and the ability to design and implement dashboards with SLOs and SLAs that align with business objectives. Key...