System Reliability Engineer

3 weeks ago


Toronto, Canada CGI Full time

Position Description:

We are Canada's largest independent information technology services firm, and after 40 years, we're still growing. Innovation, technology, and service delivery are our focus. Our goal is to ensure our clients remain ahead of the competition. We provide a full spectrum of managed services, from IT and business process outsourcing to systems integration and consulting, that are transforming our clients’ operations and helping them to succeed.

Do you enjoy working with a highly motivated and talented team to deliver mission-critical developer tooling? We are currently expanding our System Reliability Engineering team, which helps one of our key clients deploy, manage, troubleshoot, and enhance their developer tooling platform, servicing over developers.

As a System Reliability Engineer, you will be responsible for designing, implementing, and supporting a variety of developer productivity tools that include Ansible Tower, GitLab, Artifactory, and SonarQube. The technology stack used to manage the platform includes Ansible, Terraform, Python, Prometheus, Splunk, and ELK.

You will build automation solutions to provision and validate infrastructure and help debug and resolve problems. You will help to improve operational performance by focusing on user experience, effectively assessing and managing risk, and minimizing the impact of failures.

Responsibilities

•Keeping all components of the developer productivity platform up and running
•Working closely with internal partners and platform users to ensure that all services meet security, SLA, and performance requirements
•Writing, updating, and using documentation, including runbooks and playbooks
•Automating infrastructure deployment, testing, application failover, failure mitigation, user self-service functions, and more
•Debugging complex problems across the entire stack
•Participating in various meetings with the Operations and Delivery teams
•Leading daily and weekly meetings to discuss the overall health of the systems
•Leading Root Cause Analysis calls
•Proposing and implementing monitoring improvements, optimizations, and automation opportunities
•Taking part in PI (Program Increment) Planning sessions

Key Skills and Attributes

•5 years of experience with software engineering, software development, or system operations
•Experience working with Linux, including writing shell scripts and an understanding of Linux internals and performance tuning
•Strong understanding of networking principles
•Experience debugging large-scale, complex systems in production
•Experience in building, implementing, and supporting highly available production systems
•Experience automating infrastructure and deployments using Terraform, Ansible, and Python or equivalent technologies
•Understanding of DevOps engineering, CI/CD, and software deployment
•Working knowledge of developer tooling such as Artifactory, GitLab, SonarQube, and Ansible Tower
•Experience with various monitoring and observability tools
•Experience deploying and managing workloads on one of the major public cloud platforms or on private clouds such as OpenStack
•Experience deploying and managing workloads on one of the major container management platforms like Kubernetes, OpenShift, PCF or Rancher
•A curiosity about how complex socio-technical systems operate and what happens during failure

It’s not expected that any single candidate would have experience across all these areas – we are looking for someone who is strong in a few areas and has interest and curiosity in others.

Skills:

DevOps Engineering GitHub OpenShift Linux