Sr Eng Manager, SRE

4 weeks ago


Toronto, Canada mccainfood Full time

 

 

Position Title: Sr Eng Manager, SRE & Observability 

Position Type: Regular - Full-Time

Position Location: Toronto HQ 

Requisition ID: 31044 

 

 

JOB PURPOSE:

Reporting to the Director, Infrastructure Operations, the Sr Engineering Manager, SRE & Observability will be responsible for designing, implementing and monitoring enterprise-grade, secure, fault-tolerant SRE and Observability infrastructure.

 

The Senior Manager is an engineering leader who will lead members of the engineering staff working across the organization to provide a frictionless experience to our customers and maintain the highest standards of reliability and availability. Our team thrives and succeeds in delivering high-quality technology products and services in a hyper-growth environment where priorities shift quickly. The ideal candidate has broad and deep technical knowledge and experience to improve application performance and capacity benchmarking; improve availability, security and reliability; design and evolve cloud/infrastructure architecture; and leverage engineering solutions to solve operational problems. The candidate should also have deep technical expertise in software engineering, Kubernetes, Metrics, Logs, Traces, Synthetics, Digital Experience Monitoring, DevOps, Big Data processing, and open-source Observability platforms.

 

JOB RESPONSIBILITIES:

  • Develop and implement an Observability and SRE strategy.
  • Collaborate with the Infrastructure, applications and Data teams to understand their pain points around monitoring, performance, efficiency, reliability, availability, and formulate strategies to address recurring issues in a sustainable way.
  • Influence and build vision with application owners to ship quality products at a faster pace.
  • Ownership of the end-to-end delivery of team strategy and execution
  • Develop and motivate teams to solve complex problems and be a strong advocate for open-source technologies and solutions.
  • Be technically hands-on in coding as well as building highly available systems.
  • Be responsible for building and mentoring a new team of software engineers
  • Drive the team to build solutions that serve long-term goals while ensuring that high-priority tech debt is resolved efficiently.
  • Be a strong thought leader in Site Reliability engineering, Observability, Operational excellence, Big Data processing, and DevOps Principles.
  • Consistently share best practices and improve processes within and across teams.
  • Hands-on Software engineering manager with strong understanding of Site Reliability Engineering, Big Data processing, Observability and DevOps principles.
  • Fluency with at least one modern language such as Python, Java, or Go; experience with open-source software is a big plus.
  • Hands-on experience in managing infrastructure components through Infrastructure as Code using Terraform, Ansible
  • Strong technical acumen in Cloud Architecture, Observability, Performance Benchmarking, Capacity planning and Reliability tools.
  • Expert in Container orchestration (e.g., Kubernetes), container runtimes and OS (Operating System) optimization.
  • Experience in Observability platforms, application monitoring tools and performance analysis techniques.
  • Experience managing & growing technical leaders and teams.
  • In-depth knowledge of data structures and algorithms.
  • Expert in Open-source observability software like Grafana, Prometheus, and OTEL
  • Knowledge in ML and AI technologies
  • Develop and improve instrumentation for monitoring and logging the health and availability of services.
  • Proactively monitor systems, networks, and applications to provide input in improving the stability, security, efficiency, and scalability of systems.
  • Develop and maintain Monitoring and Logging Frameworks for all of ITX.
  • Take personal responsibility for the quality, reliability and availability of global IT corporate infrastructure.
  • Own operations documentation of monitoring and logging for global IT production infrastructure.
  • Participate in rotating on-call incident response on weekdays and weekends.
  • Improve operational efficiencies via scripting, bots and integrations.
  • Work cross-functionally with vendors and other IT engineering teams to ensure smooth service delivery.
  • Network and systems troubleshooting, fault analysis, and resolution.
  • Collaborate with Incident and Problem Management to reduce MTTR and Incident volume.
  • Design, implement, and maintain AIOps solutions to monitor and analyze IT systems, applications, and networks.
  • Deploy machine learning algorithms for anomaly detection, root cause analysis, and incident prediction.
  • Configure and manage observability tools and platforms to gain real-time visibility into system health and performance.
  • Develop monitoring dashboards, alerts, and reports to provide comprehensive insights into the IT environment.
  • Conduct root cause analysis for incidents using data from AIOps and observability tools to identify underlying issues.
  • Work closely with software engineers to instrument applications with appropriate logging, metrics, and tracing capabilities
  • Continuously analyze monitoring data to identify trends, anomalies, and opportunities for optimization.
  • Stay updated with industry trends and advancements in AIOps and observability practices, and recommend new tools or methodologies for adoption
  • Designing, developing, and implementing AI models and algorithms utilizing state-of-the-art techniques such as GPT, VAE, and GANs.
  • Collaborating with cross-functional teams to define AI project requirements and objectives, ensuring alignment with overall business goals.
  • Conducting research to stay up-to-date with the latest advancements in generative AI, machine learning, and deep learning techniques and identify opportunities to integrate them into our products and services.
  • Optimizing existing generative AI models for improved performance, scalability, and efficiency.
  • Developing and maintaining AI pipelines, including data preprocessing, feature extraction, model training, and evaluation.
  • Developing clear and concise documentation, including technical specifications, user guides, and presentations, to communicate complex AI concepts to both technical and non-technical stakeholders.
  • Contributing to the establishment of best practices and standards for generative AI development within the organization.
  • Providing technical mentorship and guidance to junior team members.
  • Apply trusted AI practices to ensure fairness, transparency, and accountability in AI models and systems
  • Drive DevOps and MLOps practices, covering continuous integration, deployment, and monitoring of AI
  • Utilize tools such as Docker, Kubernetes, and Git to build and manage AI pipelines
  • Implement monitoring and logging tools to ensure AI model performance and reliability
  • Collaborate seamlessly with software engineering and operations teams for efficient AI model integration and deployment.
  • Familiarity with DevOps and MLOps practices, including continuous integration, deployment, and monitoring of AI models.

 

KEY QUALIFICATION & EXPERIENCES:

  • Minimum 10 years of experience in Observability/Monitoring tools
  • Bachelor's or Master's degree in Computer Science, Computer Engineering, or a related field.
  • 5+ years of industry experience in software development.
  • In-depth experience designing at-scale monitoring and logging for corporate infrastructure services.
  • Expert-level experience in monitoring and logging technologies, both open source and closed source (e.g. AppDynamics, New Relic, Datadog, Prometheus, Grafana, LogicMonitor, Sumo Logic, ELK).
  • Experience in implementing Metrics, Logs and Tracing for E2E observability
  • Experience in RBAC and user-based security services such as ISE, RADIUS, LDAP, and AD.
  • Must have strong automation/scripting skills - proficiency in Python or Golang is a plus.
  • Proficient in developing and maintaining technical documentation, runbooks, and procedures.
  • A working knowledge of networking is needed. Fundamental knowledge of the TCP/IP stack, application protocols (DHCP/DNS/HTTPS) and networking concepts (HSRP/NAT/VPN/VLANs/802.1x/Wireless/Clustering/High Availability/Load Balancing).
  • Understanding of enterprise networks using Cisco IOS/NXOS with a working knowledge of IP Protocols (TCP/UDP/ICMP) and Routing Protocols (BGP/OSPF/IS-IS).
  • Technical understanding of Cisco and cloud-native firewalls, including Firewall Policy Rules, URL-Filtering, App-ID, User-ID, etc.
  • Experience interacting with Telco and Global ISPs (WAN/DIA) and the monitoring of those services.
  • A working knowledge of systems is needed. Fundamental knowledge of Configuration Management and Automation tools, with experience in:
    * Terraform, Ansible, Chef, Puppet, Jenkins
    * Designing and implementing CI/CD pipelines
    * Infrastructure provisioning and management
  • Strong at troubleshooting incidents in production environments.
  • A strong ownership attitude and a track record of taking responsibility for problems and pushing through to resolution.
  • Ability to communicate and coordinate with cross-functional engineering teams across multiple geographic regions.
  • Experience with AIOps and machine learning is highly desirable.
  • Knowledge of OpenTelemetry is an added advantage.
  • Experience with other monitoring tools like Prometheus, Grafana, etc.
  • Experience with Observability solutions like Dynatrace, Datadog, Instana, etc. is highly desirable.
  • Experience working with mainframe systems is a plus (willingness to learn is also acceptable).
  • Excellent problem-solving and analytical skills.
  • Strong communication and collaboration skills.
  • Ability to work independently and manage multiple projects simultaneously.
  • Passion for learning new technologies and continuous improvement.
  • In-depth knowledge of machine learning, deep learning, and generative AI techniques
  • Knowledge and experience in Generative AI
  • Proficiency in programming languages such as Python, R, and frameworks like TensorFlow or PyTorch
  • Strong understanding of NLP techniques and frameworks such as BERT, GPT, or Transformer models
  • Familiarity with computer vision techniques for image recognition, object detection, or image generation
  • Experience with cloud platforms such as Azure or AWS
  • Knowledge of IT operations concepts and processes, such as monitoring, incident management, root cause analysis, remediation.
  • Strong problem solving and analytical skills.
  • Strong interpersonal and written and verbal communication skills.
  • Highly adaptable to changing circumstances. Interest in continuously learning new skills and technologies. 
  • Experience with programming and scripting languages (e.g. Java, C#, C++, Python, Bash, PowerShell).
  • Experience with incident and response management.

 

Qualifications

  • Bachelor’s degree (or equivalent years of experience).
  • 5+ years of relevant work experience. SRE experience required.
  • Background in Manufacturing or Platform/Tech companies is preferred.
  • Must have Public Cloud provider certifications (Azure, GCP or AWS).
  • A CNCF certification is a plus.

 

OTHER INFORMATION

  • Travel: as required.
  • Job is primarily performed in a Hybrid office environment.

 

Key SRE and Observability Overview and Boundaries

Infrastructure Design: Requires knowledge of: Software architecture; Distributed systems; Scalability; Design patterns; Disaster Recovery; Tech Stacks; Non-Functional Requirements; Security standards, frameworks, and methodologies (System Security Plan - SSP, Security Risk and Compliance Review - SRCR, etc.). To assist in the creation of simple, modular, extensible and functional design for the product/solution in adherence to the requirements. Evaluate trade-offs while designing across multiple components in a system based on the business requirements. Convert HLD to create detailed design for specific modules/components of a product/system. Understand the nuances of designing for disaster recovery. Undertake infrastructure coding automation.

Performance and Optimization : Requires knowledge of: Unix/Linux performance optimization tuning; Java/NodeJS/Tomcat/Apache tuning and optimization; open-source Chaos tools (for example, Openblade, Chaos Monkey, Pumba, Chaos Mesh, Litmus, Chaos Toolkit, ToxiProxy). To select appropriate reliability models to evaluate and estimate complex reliability parameters. Designs and develops a reliability program plan for a complex site environment. Facilitates reliability testing procedures. Ensures reliability testing procedures align with site environment changes.

Integrates the business goals of site reliability engineering and site safety engineering. Trains team members on the development and implementation of tools and applications for reliability predictions and improvements. Decides criteria selection and evaluation for site reliability analysis and assessment. Facilitates open-source Chaos experiments to test and validate the resiliency of applications.

 

Solution Design : Requires knowledge of: Software architecture; Distributed systems; Scalability; Design patterns; Disaster Recovery; Tech Stacks; Minimum Viable Product - MVP; Non-Functional Requirements; Telemetry. To create simple, modular, extensible and functional design in adherence to the requirements for multiple products/solutions within a domain. Understand customer requirements and analyze the gaps between existing architecture and customer requirements. Analyze system performance impacting the complete product for non-functional requirements like reliability, operability, performance efficiency and security. Create detailed design using mock screens, pseudocode and detailed functional logic of the modules for an entire product. Finalize the tech stack (for example, MEAN, LAMP, etc.) for products/systems based on the business needs. Review the MVP to uncover risks and check for performance and usability; guide the team during MVP creation. Drive design of software, production and preproduction environments and deployment pipeline to continuously generate records for telemetry.

 

Coding : Requires knowledge of: Coding standards and guidelines; Coding languages (e.g. JavaScript, Python, C#, etc.), frameworks (e.g. ActiveX, .NET, Cocoa, Android application framework, etc.), tools (e.g. Monday.com, Linx, Embold, etc.) and platforms (e.g. Microsoft Azure, AWS, Apple iOS, etc.); Quality, Safety and Security (PCI, etc.) standards; Emerging tools and technologies; Telemetry. To create/configure minimalistic code for an entire component/application and ensure the components meet business/technical requirements, non-functional requirements, and low-maintenance, high-availability and high-scalability needs. Assist in the selection of appropriate languages (e.g. JavaScript, Python, C#, etc.), development standards and tools (e.g. Monday.com, Linx, Embold, etc.) for software coding/configuration. Take initiative to learn the fundamentals of different coding languages and frameworks that would be useful for future scope of work. Build scripts for automation of repetitive and routine tasks in CI/CD (Continuous Integration/Continuous Delivery), Testing or any other process (as applicable). Implement telemetry features as required independently. Ensure security policy requirements are properly applied to components/application during code development/configuration.

 

Triaging and Troubleshooting : Requires knowledge of: Regression testing; Root cause analysis (RCA); Root cause corrective action (RCCA). To analyze defects from past projects/solutions to avoid recurrence. Troubleshoots performance and availability bottlenecks for assigned applications independently. Triages to distinguish symptoms from causes of defects. Actively provides data for and participates in RCA.

 

Disaster Recovery Planning : Requires knowledge of: Disaster recovery procedures and processes; Enterprise disaster recovery systems. To work with business partners to identify and document critical applications. Interprets and follows procedures in contingency plans. Explains the contingency and disaster recovery plans for assigned environment. Executes established procedures necessary to continue operations in an emergency. Participates in the design of a minimum operating environment for a computer-based facility.

 

Monitoring and Alerting : Requires knowledge of: Monitoring and alerting tools; Monitoring metrics and key performance indicators (for example, availability, MTBF, MTTR); SLIs and SLOs (for example, request latency, availability, error rates, saturation); Distributed tracing; Alerting logic. To suggest metrics to monitor software or system performance. Monitors current performance data to ensure compliance with defined SLOs for multiple applications/systems. Determines thresholds for monitoring metrics and triggers alerts based on thresholds. Supervises specific procedures to proactively check the health of applications and infrastructure, including a variety of operating systems, hardware, and software. Makes recommendations regarding situational awareness, instrumentation gaps, and alerting logic.

 

Drives the execution of multiple business plans and projects by identifying customer and operational needs; developing and communicating business plans and priorities; removing barriers and obstacles that impact performance; providing resources; identifying performance standards; measuring progress and adjusting performance accordingly; developing contingency plans; and demonstrating adaptability and supporting continuous learning.

Provides supervision and development opportunities for associates by selecting and training; mentoring; assigning duties; building a team-based work environment; establishing performance expectations and conducting regular performance evaluations; providing recognition and rewards; coaching for success and improvement; and ensuring diversity awareness.

 

Promotes and supports company policies, procedures, mission, values, and standards of ethics and integrity by training and providing direction to others in their use and application; ensuring compliance with them; and utilizing and supporting the Open Door Policy.

 

Ensures business needs are being met by evaluating the ongoing effectiveness of current plans, programs, and initiatives; consulting with business partners, managers, co-workers, or other key stakeholders; soliciting, evaluating, and applying suggestions for improving efficiency and cost effectiveness; and participating in and supporting community outreach events. 

 

The above information indicates the general nature and level of work performed by employees within this classification.  It is not a comprehensive inventory of all duties, responsibilities and qualifications required of employees assigned to this job.

McCain Foods is an equal opportunity employer. We see value in ensuring we have a diverse, antiracist, inclusive, merit-based, and equitable workplace.  As a global family-owned company we are proud to reflect the diverse communities around the world in which we live and work. We recognize that diversity drives our creativity, resilience, and success and makes our business stronger. 
 
McCain is an accessible employer. If you require an accommodation throughout the recruitment process (including alternate formats of materials or accessible meeting rooms), please let us know and we will work with you to meet your needs.

The health and safety of McCain employees and their families has been our number one priority since the start of the COVID-19 pandemic. With vaccination restrictions easing across the globe, we do not currently require employees to be vaccinated, but we reserve the right to change this mandate in line with health guidance and regulations in each country.
 
Your privacy is important to us. By submitting personal data or information to us, you agree this will be handled in accordance with the Global Privacy Policy.

 

Job Family: Information Technology      
Division: Global Digital Technology 
Department: Infrastructure and Operations
Location(s): CA - Canada : Ontario : Toronto || US - United States of America : Illinois : Oakbrook Terrace 

Company: McCain Foods (Canada) 

