Senior Site Reliability Engineering

3 months ago


Old Toronto, Ontario, Canada NVIDIA Full time

Site Reliability Engineering (SRE) at NVIDIA is an engineering discipline for designing, building and maintaining large-scale production systems with high efficiency and availability, using a combination of software and systems engineering practices. This is a highly specialized discipline that demands knowledge across systems, networking, coding, databases, capacity management, continuous delivery and deployment, and open-source cloud-enabling technologies like Kubernetes and OpenStack.

SRE at NVIDIA ensures that our internal and external-facing GPU cloud services run with the maximum reliability and uptime promised to users, while at the same time enabling developers to make changes to the existing system through careful preparation and planning, keeping an eye on capacity, latency and performance. SRE is also a mindset and a set of engineering approaches to running and optimizing production systems.

Much of our software development focuses on eliminating manual work through automation, performance tuning and improving the efficiency of production systems. As SREs are responsible for the big picture of how our systems relate to each other, we use a breadth of tools and approaches to tackle a broad spectrum of problems. Practices such as limiting time spent on reactive operational work, blameless postmortems and proactive identification of potential outages factor into the iterative improvement that is key to both product quality and interesting, dynamic day-to-day work.

SRE's culture of diversity, intellectual curiosity, problem solving and openness is important to our success. Our organization brings together people with a wide variety of backgrounds, experiences and perspectives. We encourage them to collaborate, think big and take risks in a blame-free environment. We promote self-direction to work on meaningful projects, while we also strive to build an environment that provides the support and mentorship needed to learn and grow.

What you'll be doing:

  1. Design, implement and support the operational and reliability aspects of large-scale Kubernetes clusters, with a focus on performance at scale, real-time monitoring, logging and alerting
  2. Engage in and improve the whole lifecycle of services—from inception and design through deployment, operation and refinement
  3. Support services before they go live through activities such as system design consulting, developing software tools, platforms and frameworks, capacity management and launch reviews
  4. Maintain services once they are live by measuring and monitoring availability, latency and overall system health
  5. Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity
  6. Practice sustainable incident response and blameless postmortems
  7. Be part of an on-call rotation to support production systems

What we need to see:

  • BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience
  • 5+ years of experience with infrastructure automation and distributed systems design, including designing and developing tools for running large-scale private or public cloud systems in production
  • Experience in one or more of the following: Python, Go, Perl or Ruby
  • In-depth knowledge of Linux, networking and containers

Ways to stand out from the crowd:

  • Interest in crafting, analyzing and fixing large-scale distributed systems
  • Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive
  • Ability to debug and optimize code and automate routine tasks
  • Experience in using or running large private and public cloud systems based on Kubernetes, OpenStack and Docker

NVIDIA is widely considered to be one of the technology world's most desirable employers. We have some of the most forward-thinking and hard-working people in the world working for us. Are you creative and autonomous? Do you love a challenge? If so, we want to hear from you.

