Site Reliability Expert


Montreal, Quebec, Canada | Sony PlayStation | Full time

Why PlayStation?

PlayStation isn't just the Best Place to Play; it's also the Best Place to Work. Today, we're recognized as a global leader in entertainment, producing the PlayStation family of products and services, including PlayStation 5, PlayStation 4, PlayStation VR, PlayStation Plus, acclaimed PlayStation software titles from PlayStation Studios, and more.

PlayStation also strives to create an inclusive environment that empowers employees and embraces diversity. We welcome and encourage everyone who has a passion and curiosity for innovation, technology, and play to explore our open positions and join our growing global team.

The PlayStation brand falls under Sony Interactive Entertainment, a wholly-owned subsidiary of Sony Corporation.


In May 2021, we embarked on a journey to start Haven Studios with a small team and big ambitions. Our goal was to build a studio where we could make the kind of games we've always wanted to create – and games we've longed to play.

We've made amazing progress in a short time thanks to our talented, passionate team and their exceptional contributions. We established a culture at Haven grounded in kindness, adaptability and courage that unlocks creativity. Our first new IP for PlayStation is on track to deliver a AAA multiplayer experience with a vision to build a systemic and evolving world focused on freedom, thrill, and playfulness that will keep players entertained and engaged for years.

Haven joined the PlayStation Studios family in 2022, and we are on track to build an exclusive new IP for PlayStation and grow the first Sony game development team in Canada.

About the role

We are seeking a skilled and experienced Site Reliability Expert to join our Infrastructure and Operations SRE team and play a key role in ensuring the reliability, scalability, and performance of the cloud-based systems that support our studio's game production.

What you will do
  1. System Architecture and Design:
    • Collaborate with development teams to design, implement, and maintain a robust and scalable cloud core infrastructure.
    • Work on the architecture and deployment of critical services to ensure high availability and fault tolerance.
  2. Infrastructure as Code (IaC):
    • Utilize Infrastructure as Code principles to automate the provisioning, configuration, and management of cloud infrastructure components.
    • Implement best practices for IaC tools such as Terraform or similar technologies.
  3. Monitoring and Incident Response:
    • Develop and maintain comprehensive monitoring solutions to proactively identify and address potential issues.
    • Participate in on-call rotations and respond to incidents promptly, ensuring minimal downtime and impact on users.
  4. Performance and Resource Optimization:
    • Continuously optimize system performance and resource utilization, identifying areas for improvement and implementing solutions.
    • Conduct regular performance testing and capacity planning to meet growing business needs.
  5. Security and Compliance:
    • Collaborate with security teams to implement and enforce security best practices in the cloud infrastructure.
    • Ensure compliance with industry standards and regulatory requirements.
  6. Collaboration and Documentation:
    • Work closely with development teams to streamline the deployment process and improve overall system reliability.
    • Document system configurations, procedures, and best practices for knowledge sharing and training.
    • Participate in and contribute to sprints with the team.
    • Assess and size the effort associated with the work backlog and participate in backlog grooming.
    • Communicate effectively with team members, production and management to ensure that project goals and deadlines are met.
What you bring
  • 8+ years of experience as a Site Reliability Specialist or Engineer, or in a similar role.
  • Professional experience with GCP as a public cloud provider.
  • In-depth knowledge of Infrastructure as Code principles and tools (e.g., Terraform).
  • Expert knowledge of configuration management tools (e.g., Ansible, SaltStack).
  • Experience implementing CI/CD and observability toolchains.
  • Experience with containerization and orchestration technologies (e.g., Docker, Kubernetes).
  • Experience with version control systems (e.g., Perforce, Git).
  • Familiarity with monitoring and logging tools (e.g., Prometheus, ELK stack).
  • Substantial knowledge of Linux administration.
  • Strong problem-solving skills and the ability to troubleshoot complex issues.
  • Self-driven, dedicated to advancing your craft, and eager to learn new techniques and software.
  • Excellent communication and collaboration skills.
  • Ability to accept feedback and adapt to change.
Bonus Qualifications
  • Experience with AWS as a public cloud provider.
  • GCP/AWS certifications or any additional related professional certifications.
  • Bilingual in French and English.
  • Contributions to open-source software.
  • Understanding of Games as a Service technical requirements.
  • Knowledge of the Rust programming language.

Equal Opportunity Statement:

Sony is an Equal Opportunity Employer. All persons will receive consideration for employment without regard to gender (including gender identity, gender expression and gender reassignment), race (including colour, nationality, ethnic or national origin), religion or belief, marital or civil partnership status, disability, age, sexual orientation, pregnancy or maternity, trade union membership or membership in any other legally protected category.

We strive to create an inclusive environment, empower employees and embrace diversity. We encourage everyone to respond.

PlayStation is a Fair Chance employer and qualified applicants with arrest and conviction records will be considered for employment.

