Senior Site Reliability Engineer

1 week ago


Toronto Canada Hybrid Tubi Full time $120,000 - $180,000 per year
About Tubi:

Boldly built for every fandom, Tubi is a free streaming service that entertains over 100 million monthly active users. Tubi offers the world's largest collection of Hollywood movies and TV shows, thousands of creator-led stories and hundreds of Tubi Originals made for the most passionate fans. Headquartered in San Francisco and founded in 2014, Tubi is part of Tubi Media Group, a division of Fox Corporation.

About the Role:

Site Reliability Engineering (SRE) at Tubi is not a traditional operations team. We are a software engineering organization that applies a developer's mindset and toolkit to the challenges of building and running large-scale, distributed systems. Our mission is to engineer resilience from the ground up, enabling our product teams to innovate rapidly while ensuring our users have a stellar experience. We own the availability, latency, performance, and capacity of our platform, and we achieve our goals through a culture of data-driven decision-making, blameless learning, and relentless automation.

As a Senior Site Reliability Engineer, you are a hands-on engineer who blends deep software development expertise with a passion for operational excellence. You will be responsible for designing, building, and running the resilient, scalable, and increasingly self-healing systems that power our products. You will apply sound engineering principles to solve our most complex reliability challenges, with a mandate to automate everything, eliminate toil, and write robust, maintainable code. You will be a force multiplier, mentoring other engineers and elevating the site reliability bar for the entire organization.

What You'll Do:

  • System Architecture & Design: Design, build, and maintain scalable, highly available, and fault-tolerant distributed systems. Partner with development teams as a reliability consultant, reviewing designs and influencing architectural decisions to ensure new services are built with reliability, observability, and performance as core principles, not afterthoughts.
  • Automation & Software Development: Write robust, performant, and maintainable code to automate operational tasks and CI/CD pipelines. Build the internal tools, libraries, and frameworks that enable engineering teams to self-service their observability needs, reducing cognitive load and increasing their velocity.
  • Incident Response & Post-Mortem Analysis: Participate in a 24/7 on-call rotation, acting as a key technical leader and incident commander during critical service disruptions. Conduct deep, blameless root cause analyses (RCAs) that go beyond immediate fixes to identify and address systemic issues. Drive the implementation of corrective actions to prevent the recurrence of incidents.
  • Performance & Capacity Planning: Proactively monitor, measure, and optimize system performance to ensure low latency and high efficiency. Gather and analyze metrics from operating systems and applications to assist in performance tuning and fault finding. Analyze usage patterns and historical data to forecast capacity needs, ensuring our platform stays ahead of customer demand.

Your Background:

  • Bachelor's degree in Computer Science, a related technical field, or equivalent practical experience.
  • 5+ years of professional experience in a Site Reliability Engineering, DevOps, or Software Engineering role with a focus on infrastructure and operations.
  • Strong programming proficiency in one or more high-level languages such as Rust, Go, Python, or TypeScript. You should be comfortable writing, testing, and deploying production-grade code.
  • Deep knowledge of AWS services (especially networking, IAM, EKS, ALBs/NLBs, Route 53, CloudWatch). 
  • Proven experience with Kubernetes in production (EKS preferred), including service exposure, networking, and availability engineering.
  • A solid understanding of Linux/Unix operating systems, networking fundamentals (TCP/IP, DNS, HTTP), and the architecture of modern distributed systems.

Preferred Qualifications (Nice-to-Haves):

  • Experience building and managing large-scale monitoring and observability systems using tools such as Datadog, Prometheus, and Grafana.
  • Expertise in designing and implementing CI/CD pipelines using tools such as GitHub Actions and Argo CD.
  • Experience with distributed storage technologies (e.g., Amazon S3) and databases (e.g., PostgreSQL, ScyllaDB, ClickHouse).
  • Contributions to open-source projects in the SRE, DevOps, or cloud-native ecosystem.

The AI Mandate: Building the Future of Observability with AI

As a Senior SRE, you will be at the forefront of applying AI to solve our most critical reliability challenges. This is a hands-on software development role where the "product" you build is an intelligent, automated reliability platform. Your responsibilities will include:

  • Building AI-Driven Automation: Building and integrating solutions that leverage our AIOps platform. This involves writing the code that consumes signals from the AI system, correlates disparate data sources, automates responses to AI-detected anomalies, and builds self-healing systems triggered by predictive alerts. You will transform AI insights into concrete reliability improvements.
  • Leveraging AI for Code Development: Utilizing AI-assisted coding tools (e.g., Claude Code, Cursor) as a force multiplier in your daily workflow. You will leverage these assistants to write high-quality automation scripts, Terraform modules, Kubernetes manifests, and observability dashboards faster and more efficiently, while applying your expertise to validate and refine their output.
  • Enriching our AI Knowledge Base: Developing and enriching our observability platform's internal knowledge base. You will be responsible for creating and documenting high-quality runbooks and procedural guides that can be ingested and used by AI assistants to provide context-aware troubleshooting guidance to the on-call engineer during an incident.
  • Applying Data Science to Reliability: Treating reliability as a data science problem. You will analyze vast sets of telemetry data to identify trends, build predictive models for system capacity, and proactively identify performance bottlenecks and potential failure modes before they can impact our users.

#LI-Hybrid


Tubi is a division of Fox Corporation, and the FOX Employee Benefits summarized here cover the majority of US employee benefits. The distinctions below outline the differences between the Tubi and FOX benefits:

  • For US-based non-exempt Tubi employees, the FOX Employee Benefits summary accurately captures the Vacation and Sick Time.
  • For all salaried/exempt employees, in lieu of the FOX Vacation policy, Tubi offers a Flexible Time off Policy to manage all personal matters.
  • For all full-time, regular employees, in lieu of FOX Paid Parental Leave, Tubi offers a generous Parental Leave Program, which allows parents twelve (12) weeks of paid bonding leave within the first year of birth, adoption, surrogacy, or foster placement of a child in addition to applicable government leave program(s) and FOX's short-term disability policy. This time is 100% paid through a combination of any applicable state, city, and federal leaves and wage-replacement programs in addition to contributions made by Tubi.
  • For all full-time, regular employees, Tubi offers a monthly wellness reimbursement.

We are an equal opportunity employer and all qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, gender identity, disability, protected veteran status, or any other characteristic protected by law. We will consider for employment qualified applicants with criminal histories consistent with applicable law.



  • Toronto - Hybrid, Canada Acquird Full time $120,000 - $160,000 per year

    A Few Notes: Profitable B2B SaaS company; teams are based out of North America. Role is 95% remote in Toronto (we meet up 1x a month). Must be able to legally work in Canada (visa or sponsorship won't be provided). Our Platform is growing and we are looking to hire a Senior Site Reliability Engineer (SRE) / Cloud Engineer. Our main Cloud Platform is Azure (those...


  • Canada Thinkific Full time

    Are you an experienced Site Reliability Engineer looking for a new challenge? We’re looking for a Senior Site Reliability Engineer to join us at Thinkific. We’re looking for a Senior Site Reliability Engineer...


  • Canada Akamai Technologies Full time

    Senior Site Reliability Engineer Join Akamai Technologies as we build a reliable, secure, and scalable Internet. We are looking for a Senior Site Reliability Engineer to help us solve complex performance and reliability challenges. Job Description Are you passionate about cutting‑edge technology and ready to tackle some of the Internet’s most difficult...


  • Canada DuckDuckGo Full time

    Who We Are: Hi, we're DuckDuckGo, the online protection company and remote-first team of 300+ on a mission to raise the standard of trust online. Founded in 2008 and profitable since 2014, our annual revenue now exceeds $100 million USD. Millions use our...


  • Canada Orion Innovation Full time

    Job Description: Senior Site Reliability Engineer (SRE) with Kubernetes & Rancher Location: Canada - Remote (Working EST hours) Job Type: Full-time About the Role Are you an exceptional Site Reliability Engineer with a passion for building and maintaining highly resilient and secure systems? We are seeking a Senior SRE to join our team and play a critical...


  • (s): Canada : Ontario : Toronto Scotiabank Global Site Full time US$80,000 - US$140,000 per year

    Requisition ID: 244027. Join a purpose-driven winning team, committed to results, in an inclusive and high-performing culture. Overview: As a Site Reliability Engineer (SRE), you will join the Digital Engineering Operations team, responsible for ensuring the operations and reliability of Scotiabank digital applications. You will have the opportunity to drive...


  • (s): Canada : Ontario : Toronto Scotiabank Global Site Full time $105,000 - $170,000 per year

    Requisition ID: 244026. Join a purpose-driven winning team, committed to results, in an inclusive and high-performing culture. Overview: As a Site Reliability Engineer (SRE), you will join the Digital Engineering Operations team, responsible for ensuring the operations and reliability of Scotiabank digital applications. You will have the opportunity to drive...


  • Canada Targeted Talent Full time

    Overview: We are looking for an experienced Senior Site Reliability Engineer for our client. This is a permanent position that is remote to start, with later relocation to Calgary or Winnipeg. Our client is a global enterprise company with a product that you've likely used. Experience with coding/software development, along with Site Reliability will be the...


  • Canada TextNow Full time

    This range is provided by TextNow. Your actual pay will be based on your skills and experience; talk with your recruiter to learn more. Base pay range: CA$113,400.00/yr - CA$162,000.00/yr. We believe communication belongs to everyone. We exist to democratize phone service. TextNow is evolving the way the world connects and that's because we're made up of...