Senior Site Reliability Engineer - Remote

4 weeks ago


Old Toronto, Canada ClickHouse Full time

We are committed to providing our customers with reliable and secure services, so we are building out our newly formed Site Reliability Engineering team. As one of the first members of the Reliability Engineering team at ClickHouse, you will be responsible for building and leading processes to ensure the reliability, availability, scalability, and performance of the cloud infrastructure that runs ClickHouse databases. You will collaborate with teams such as Control Plane, Dataplane, Core, Security, Support, and Operations and guide them in designing and implementing scalable, secure, highly available, and fault-tolerant distributed systems. You will also own incident management and response, post-mortem analysis (including running blameless post-mortems), and continuous improvement of our ClickHouse services. You will leverage your software engineering expertise to develop software platforms and tools that improve the operational and engineering efficiency of ClickHouse Cloud. This role is a unique opportunity to make a significant impact on our elastic, limitless-scale, high-performance, serverless ClickHouse Cloud.

What will you do?

  • Collaborate with various engineering teams in ClickHouse to design and implement scalable, secure, and highly available systems for ClickHouse.
  • Establish and manage service level objectives (SLOs) and service level agreements (SLAs) for ClickHouse Cloud.
  • Ensure all infrastructure components in ClickHouse Cloud (including Dataplane, Control Plane, and ClickHouse Core) have monitoring and alerting in place for timely detection and resolution of incidents.
  • Enhance and refine incident response processes and post-mortem analysis for any outages in ClickHouse Cloud, including working with the support team to communicate with impacted customers.
  • Continuously improve the reliability and performance of our ClickHouse services.
  • Plan, enable, and drive Chaos initiatives across Engineering teams, based upon internal priorities.
  • Manage on-call processes to respond to performance and reliability issues, and establish best practices for coordinating escalation to resolve issues and minimize downtime.

About you:

  • Bachelor’s or Master’s degree in Computer Science or a related field.
  • At least 8 years of experience in Site Reliability Engineering or a related field.
  • Previous experience using ClickHouse in production.
  • Hands-on experience with Go and/or Python.
  • Strong knowledge of cloud computing platforms such as AWS, Azure, or Google Cloud Platform.
  • Excellent understanding of distributed databases and SQL; experience with ClickHouse in particular is a major plus.
  • Hands-on experience with container orchestration tools such as Kubernetes or Docker Swarm.
  • Strong experience with automation and configuration management tools such as Ansible, Terraform, or Puppet.
  • You are a strong problem solver and have solid production debugging skills.
  • You are passionate about efficiency, availability, scalability, and data governance.
  • You thrive in a fast-paced environment and see yourself as a partner to the business, with the shared goal of moving it forward.
  • You have a high level of responsibility, ownership, and accountability.
  • Excellent communication and interpersonal skills.

