Senior Site Reliability Engineer - Remote

4 weeks ago


Old Toronto, Canada ClickHouse Full time

We are committed to providing our customers with reliable and secure services, so we are building out our newly formed Site Reliability Engineering team. As one of the first members of the Reliability Engineering team at ClickHouse, you will be responsible for building and leading processes that ensure the reliability, availability, scalability, and performance of the cloud infrastructure that runs ClickHouse databases. You will collaborate with teams such as Control Plane, Dataplane, Core, Security, Support, and Operations, and guide them in designing and implementing scalable, secure, highly available, and fault-tolerant distributed systems. You will also own incident management and response, post-mortem analysis (including running blameless post-mortems), and continuous improvement of our ClickHouse services. You will leverage your software engineering expertise to develop platforms and tools that improve the operational and engineering efficiency of ClickHouse Cloud. This role is a unique opportunity to make a significant impact on our elastic, limitlessly scalable, high-performance, serverless ClickHouse Cloud.

What will you do?

  • Collaborate with various engineering teams in ClickHouse to design and implement scalable, secure, and highly available systems for ClickHouse.
  • Establish and manage service level objectives (SLOs) and service level agreements (SLAs) for ClickHouse Cloud.
  • Ensure all infrastructure components in ClickHouse Cloud (including Dataplane, Control Plane, and ClickHouse Core) have monitoring and alerting in place so that incidents are detected and resolved promptly.
  • Enhance and refine incident response processes and post-mortem analysis for outages in ClickHouse Cloud, including working with the support team to communicate with impacted customers.
  • Continuously improve the reliability and performance of our ClickHouse services.
  • Plan, enable, and drive Chaos initiatives across Engineering teams, based upon internal priorities.
  • Manage on-call processes to respond to performance and reliability issues, and establish best practices for coordinating escalation to resolve issues and minimize downtime.

About you:

  • Bachelor’s or Master’s degree in Computer Science or a related field.
  • At least 8 years of experience in Site Reliability Engineering or a related field.
  • Previous experience using ClickHouse in production.
  • Hands-on experience with Go and/or Python.
  • Strong knowledge of cloud computing platforms such as AWS, Azure, or Google Cloud Platform.
  • Excellent understanding of distributed databases and SQL; experience with ClickHouse in particular is a major plus.
  • Hands-on experience with container orchestration tools such as Kubernetes or Docker Swarm.
  • Strong experience with automation and configuration management tools such as Ansible, Terraform, or Puppet.
  • You are a strong problem solver and have solid production debugging skills.
  • You are passionate about efficiency, availability, scalability, and data governance.
  • You thrive in a fast-paced environment and see yourself as a partner to the business, with the shared goal of moving it forward.
  • You have a high level of responsibility, ownership, and accountability.
  • Excellent communication and interpersonal skills.


  • Old Toronto, Canada Akamai Full time

    Are you passionate about cutting edge technology? Do solving some of the Internet's most difficult content delivery challenges interest you? Join our Compute Site Reliability team! Our team is responsible for monitoring and measuring the reliability of our suite of Compute products and platform. In collaboration with Engineering and Product teams, we focus...


  • Old Toronto, Canada Akamai Full time

    Do you have a passion for cutting edge technologies and tackling system problems? Are you a self-starting professional who thrives in a dynamic environment? Join our Site Reliability team. Our Team builds and delivers highly secure network security frameworks to protect our customers. We collaborate to create next-generation initiatives supporting...


  • Toronto, ON, Canada Akamai Full time

    Are you passionate about cutting edge technology? Do solving some of the Internet's most difficult content delivery challenges interest you? Join our Compute Site Reliability team! Our team is responsible for monitoring and measuring the reliability of our suite of Compute products and platform. In collaboration with Engineering and Product teams, we focus...


  • Toronto, ON, Canada Akamai Full time

    Are you passionate about cutting edge technology? Do solving some of the Internet's most difficult content delivery challenges interest you? Join our Compute Site Reliability team! Our team is responsible for monitoring and measuring the reliability of our suite of Compute products and platform. In collaboration with Engineering and Product teams, we...


  • Old Toronto, Canada eTeam Full time

    Remote Work Duration 4 months - Preference is to find candidates who are willing to be converted to full-time employees. The conversion decision will be made based on performance. Job Description Role Description: Defining and measuring reliability goals—SLIs, SLOs, and error budgets for user journey. Designing for and implementing observability (ELK,...


  • Toronto, ON, Canada Akamai Full time

    Do you have a passion for cutting edge technologies and tackling system problems? Join our Site Reliability team. Our Team builds and delivers highly secure network security frameworks to protect our customers. We collaborate to create next-generation initiatives supporting automation, deployment, and monitoring of 3rd party cloud infrastructure. Help us...


  • Old Toronto, Canada Akamai Full time

    Are you passionate about cutting edge technology? Does building next generation cloud computing technology excite you? Join our Compute Site Reliability team! Our team is responsible for monitoring and measuring the reliability of our suite of Compute products and platform. In collaboration with Engineering and Product teams, we focus on improving the...


  • Old Toronto, Canada Autodesk Full time

    Position Overview Virtual and augmented reality are transforming design and creation through new immersive and collaborative experiences to improve how major segments like entertainment, architecture, engineering, construction, and manufacturing converge. Many industries are being transformed by the growth of XR technology, creating new ways of working to...


  • Old Toronto, Canada Thomson Reuters Full time

    (Canada) Site Reliability Engineer (Contract) Contract (9 months 4 days) Published 3 days ago New Relic Data Dog Site Reliability Engineer - in the Service Management Organization Do you have experience in IT Service Management, working with cloud providers, software development, and technology infrastructure? The Site Reliability Engineer will...


  • Old Toronto, Canada Autodesk Full time

    Position Overview Autodesk, the leading Design and Make Software Company, is looking for a Principal Site Reliability Engineer to join the Autodesk Platform Services Engineering team in Toronto, Canada. In this role, you will help build trusted services of APS (Autodesk Platform Services) measured by Service Level Objectives (SLOs) and Mean Time to Recovery...


  • Old Toronto, Canada CB Canada Full time

    Site Reliability Engineer On behalf of our client in the Banking Sector, PROCOM is looking for a Site Reliability Engineer. Site Reliability Engineer – Job Description Azure cloud Jira and confluence CICD Experience with automating (provisioning, configuration management, deployment) and integrating Azure PaaS solutions (Azure App services, Azure...


  • Toronto, ON, Canada Autodesk Full time

    Position Overview Virtual and augmented reality are transforming design and creation through new immersive and collaborative experiences to improve how major segments like entertainment, architecture, engineering, construction, and manufacturing converge. Many industries are being transformed by the growth of XR technology, creating new ways of working to...


  • Old Toronto, Canada Akamai Full time

    Are you intrigued by planetary scale, distributed, intelligent systems? Do you like collaborating across teams to solve complex problems? Join our highly skilled Site Reliability Engineering team. Our team designs, develops, and manages applications and infrastructure that support Akamai's Compute products and services. We do this while maintaining Akamai's...

