Site Reliability Engineer

2 weeks ago

Montreal, Canada Experience AI Solutions Full time

Senior Systems Administrator

Start Date: as soon as possible

Type of employment: permanent

Location: Montreal, QC (hybrid model for working in the office)

Number of Positions: 1

Language skills: Excellent English language skills

Perks: Work for a multinational, award-winning, socially responsible company that has been in business for over 75 years and has an operational presence in many countries. It is a culturally diverse environment employing thousands of people around the world, with beautiful downtown Montreal offices, bonuses, flexible benefits, a pension plan, and access to world-class learning.

As a Senior Systems Administrator, you will solve compelling technical challenges by analyzing, troubleshooting, and architecting vital services, platforms, and infrastructure, always with reliability, scalability, resilience, security, and performance in mind. To do so, you will need to understand the end-to-end configuration, technical dependencies, and general behavioral characteristics of the production services you support. You will also help maintain uptime and 24x7 availability of mission-critical, customer-facing production cloud services distributed across multiple regions. You will help create more consistent, automated, push-button environments at all levels; proactively test and tune all aspects of the infrastructure; streamline CI/CD processes; monitor and respond to system notifications and alerts; and continuously work to optimize and improve the performance, security, and reliability of our systems.

Principal Duties and Responsibilities:

• Contribute to creating a culture of Site Reliability Engineering across the organisation by sharing best practices, approaches, documentation, and code with other engineering teams.

• Apply automation and software to tasks or parts of the system that are performed manually or would otherwise benefit from it.

• Troubleshoot complex, cross-platform issues while managing operating systems in cloud-based SaaS and on-premises environments; handle live production incidents, debugging, and infrastructure issues, following and applying best practices.

• Conduct system analysis and configuration management, and develop enhancements to the performance, availability, and reliability of system software.

• Design, write, ship, and drive the creation of software and systems to increase observability, product reliability, and organizational efficiency.

• Work closely with software engineers and testers to ensure that the system correctly addresses non-functional requirements such as performance, security, and availability.

• Document system knowledge as it is acquired over time, create run books, and ensure that critical system information is readily available to those who need it.

• Maintain and oversee deployment, orchestration, servers, and the overall backend infrastructure.

Education: B.Tech./B.E. degree in Electronics & Telecommunications or Computer Science.

Required Skills:

• Hands-on experience managing Windows Server 2012, 2016, and 2019; Active Directory and Group Policy design and configuration.

• Significant experience in cloud computing infrastructure and the Microsoft Azure platform.

• Capacity to provide advice, best practices, and recommendations for the operation and deployment of Microsoft Azure.

• Extensive experience in the support and management of hypervisor-based products and infrastructure (VMware, Hyper-V).

• Previous experience as an administrator of Linux systems (e.g., CentOS, Red Hat) and with command-line tools such as Bash, Vim, and SSH.

• Expertise in infrastructure performance monitoring and analysis using standard performance monitoring tools (Nagios, Azure monitoring).

• Strong knowledge of Internet protocols and applications such as SMTP, DNS, HTTP, SSH, SNMP, etc.

• Hands-on experience in server farm configuration management (using tools such as Ansible, Terraform, etc.).

• Demonstrated knowledge of ITIL methodologies; ITIL v3 or v4 certification.

Site Reliability Engineer

4 weeks ago

Montreal, Canada Soho Square Solutions Full time

Site Reliability Engineer (SRE) - ServiceNow, Application Infrastructure. The Application Infrastructure (AI) department is seeking a Site Reliability Engineer (SRE) to drive reliability engineering, operations, and customer support services for a ServiceNow SaaS implementation. Reporting to a Site Reliability Engineering & Operations Lead, this role involves...
Site Reliability Engineer

2 weeks ago

Montreal, Canada Soho Square Solutions Full time

Site Reliability Engineer (SRE) - ServiceNow, Application Infrastructure. The Application Infrastructure (AI) department is seeking a Site Reliability Engineer (SRE) to drive reliability engineering, operations, and customer support services for a ServiceNow SaaS implementation. Reporting to a Site Reliability Engineering & Operations Lead, this role involves...
Site Reliability Engineer

1 month ago

Montreal, Quebec, Québec, Canada Soho Square Solutions Full time

Site Reliability Engineer (SRE) - ServiceNow, Application Infrastructure. The Application Infrastructure (AI) department is seeking a Site Reliability Engineer (SRE) to drive reliability engineering, operations, and customer support services for a ServiceNow SaaS implementation. Reporting to a Site Reliability Engineering & Operations Lead, this role involves...
Site Reliability Engineer

7 months ago

Montreal, Canada Lyft Full time

At Lyft, our mission is to improve people’s lives with the world’s best transportation. Imagine cities where streets are safe, communities thrive, and personal cars are a thing of the past. We envision a future where shared and active transportation modes are the norm, fostering vibrant, connected neighborhoods. As a leader in micromobility, Lyft powers...
Site Reliability Engineer

5 hours ago

Montreal, Canada LanceSoft, Inc. Full time

Site Reliability Engineer. Montreal, Quebec, Canada (Hybrid). Duration: 12+ months. Responsibilities: • Are interested in distributed systems and working with highly scalable and reliable services. • Like to work in a fast-moving environment and you aren't afraid to change things to make them better. • Enjoy new technological challenges and solving hard...
AWS Site Reliability Engineer

4 weeks ago

Montreal, Canada SAP SE Full time

We help the world run better. At SAP, we enable you to bring out your best. Our company culture is focused on collaboration and a shared passion to help the world run better. The Reliability Engineering organization provides a multitude of products and services related to operations and continuity of business delivery. The Site Reliability Engineering teams...
Site Reliability Engineer

1 day ago

Montreal, Quebec, Canada LanceSoft, Inc. Full time

Unlock a career as a Site Reliability Engineer at LanceSoft, Inc., a cutting-edge technology company based in Montreal, Quebec, Canada. We are seeking an experienced and highly motivated individual to join our team. Job Type: Full-time. Duration: 12+ months. Company Overview: LanceSoft, Inc. is a leading technology firm dedicated to delivering innovative solutions...
Site Reliability Engineering Leader

2 days ago

Montreal, Quebec, Canada Royal Bank of Canada Full time

Transform Your Career with a Leadership Role in Site Reliability Engineering. We are seeking an experienced Senior Site Reliability Engineer to join our team at the Royal Bank of Canada. As a key member of our Digital Branch SRE organization, you will play a critical role in developing, implementing, and supporting SRE solutions for applications supported by...
AWS Site Reliability Engineer

3 weeks ago

Montreal, Canada SAP SE Full time

We help the world run better. At SAP, we enable you to bring out your best. Our company culture is focused on collaboration and a shared passion to help the world run better. We focus every day on building the foundation for tomorrow and creating a workplace that embraces differences, values flexibility, and is aligned to our purpose-driven and...
AWS Site Reliability Engineer

4 months ago

Montreal, Canada Alltech Consulting Services Full time

Job Description: Level 4. The Application Infrastructure (AI) department is seeking a Site Reliability Engineer (SRE) to help drive the reliability engineering, operations, and customer support services for Company’s ServiceNow SaaS implementation. Reporting to a Site Reliability Engineering & Operations Lead, this role requires delivering a range of SRE...
Site Reliability Engineer

1 day ago

Montreal, Quebec, G4F, CA LanceSoft, Inc. Full time

Site Reliability Engineer. Montreal, Quebec, Canada (Hybrid). Duration: 12+ months. Responsibilities: • Are interested in distributed systems and working with highly scalable and reliable services. • Like to work in a fast-moving environment and you aren't afraid to change things to make them better. • Enjoy new technological challenges and solving hard...
Site Reliability Engineer for Global ServiceNow Implementation

4 weeks ago

Montreal, Quebec, Canada Alltech Consulting Services Full time

We are seeking an experienced Site Reliability Engineer to join our team at Alltech Consulting Services. As a key member of our Application Infrastructure department, you will play a vital role in driving the reliability engineering, operations, and customer support services for our ServiceNow SaaS implementation. The ideal candidate will have experience in...
Site Reliability Engineer

4 weeks ago

Montreal, Canada LanceSoft, Inc. Full time

Location: Montreal (Hybrid, 3 days). Duration: 12+ months. Job Profile: Systems Reliability Engineering (SRE) is a discipline focused on improving system service availability, observability, scalability, performance, and resilience across *** by applying sound software engineering principles and adopting the latest technology and tooling. Responsibilities: Are...
Site Reliability Engineer

4 weeks ago

Montreal, Quebec, Québec, Canada LanceSoft, Inc. Full time

Location: Montreal (Hybrid, 3 days). Duration: 12+ months. Job Profile: Systems Reliability Engineering (SRE) is a discipline focused on improving system service availability, observability, scalability, performance, and resilience across *** by applying sound software engineering principles and adopting the latest technology and tooling. Responsibilities: Are...
Technical Site Reliability Engineering

1 month ago

Montreal, Canada Ubisoft Entertainment Full time

Technical Site Reliability Engineering (SRE) Lead. Full-time. Contract: Permanent. Flexible Working Organization: Hybrid. Ubisoft’s 19,000 team members, working across more than 30 countries around the world, are bound by a common mission to enrich players’ lives with original and memorable gaming experiences. If you are excited about solving game-changing...
Site Reliability Engineer

2 months ago

Montreal, Canada National Bank Full time

As a Specialist in site reliability engineering on the National Bank Data Protection team, you will ensure the operational reliability of data protection assets. With your experience and knowledge in the operational management of high-availability (HA) assets, you will have a positive impact on the Bank's stability and reputation with its internal and...
Site Reliability Specialist

4 weeks ago

Montreal, Quebec, Canada LanceSoft, Inc. Full time

Job Summary: We are seeking a skilled Site Reliability Engineer to join our team at LanceSoft, Inc. in Montreal (Hybrid, 3 days). This is a long-term contract position with a duration of 12+ months. About the Role: In this role, you will be responsible for improving system service availability, observability, scalability, performance, and resilience across various...
