AI SRE Engineer

3 days ago


Montreal, Quebec, Canada · Tata Consultancy Services · Full time · $100,000 - $150,000 per year

Inclusion without Exception:

Tata Consultancy Services (TCS) is an equal opportunity employer, and embraces diversity in race, nationality, ethnicity, gender, age, physical ability, neurodiversity, and sexual orientation, to create a workforce that reflects the societies we operate in. Our continued commitment to Culture and Diversity is reflected in our people stories across our workforce and implemented through equitable workplace policies and processes.

About TCS:

TCS is an IT services, consulting, and business solutions organization that has been partnering with many of the world's largest businesses in their transformation journeys for over 55 years. Its consulting-led, cognitive-powered portfolio of business, technology, and engineering services and solutions is delivered through its unique Location Independent Agile delivery model, recognized as a benchmark of excellence in software development. A part of the Tata group, India's largest multinational business group, TCS operates in 55 countries and employs over 607,000 highly skilled individuals, including more than 10,000 in Canada. The company generated consolidated revenues of US $30 billion in the fiscal year ended March 31, 2025, and is listed on the BSE and the NSE in India. TCS' proactive stance on climate change and award-winning work with communities across the world have earned it a place in leading sustainability indices such as the MSCI Global Sustainability Index and the FTSE4Good Emerging Index.

Technical Skills:

  • Production experience in SRE / Infrastructure / ops for large-scale systems
  • Strong programming/scripting skills (Python, Go, Java, or equivalent)
  • Deep experience with containerization (Docker), orchestration (Kubernetes, etc.)
  • Infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, etc.)
  • Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures
  • Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)
  • Networking & systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)
  • Solid experience in capacity planning, performance tuning, scaling, and incident response
  • Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements
  • Experience in regulated environments (financial services, compliance, audit, security) is a strong plus
  • Excellent communication, documentation, and cross-team collaboration skills
  • Proven track record of reducing operational toil via automation

Skills and Responsibilities:

  • Operate, monitor, and maintain the infrastructure supporting GenAI applications (training, inference, feature store, data ingestion, model serving)
  • Design and build automation for core platform capabilities, reducing manual toil
  • Develop and maintain infrastructure-as-code (IaC) for provisioning and managing compute, storage, network, GPU clusters, Kubernetes / container orchestration, etc.
  • Establish, monitor, and enforce SLOs/SLIs/SLAs, error budgets, alerting, and dashboards
  • Lead incident response, root cause analysis (RCA), postmortems, and systemic remediation
  • Perform capacity planning, scaling strategies, workload scheduling, and resource forecasting
  • Optimize cost vs. performance tradeoffs in large-scale compute environments
  • Harden systems for security, compliance, auditability, and data governance
  • Collaborate across teams (cloud engineers, data engineers, infrastructure, security) to ensure safe deployment, rollout, rollback, and integration of new systems
  • Define disaster recovery (DR) strategies, backup/restore practices, fault tolerance mechanisms
  • Maintain runbooks, operational playbooks, documentation, and training materials
  • Participate in on-call rotations and respond to production incidents 24/7 as needed
  • Continuously evaluate and integrate new tools, frameworks, or technologies to enhance platform reliability

Tata Consultancy Services Canada Inc. is committed to meeting the accessibility needs of all individuals in accordance with the Accessibility for Ontarians with Disabilities Act (AODA) and the Ontario Human Rights Code (OHRC). Should you require accommodation during the recruitment and selection process, please inform Human Resources.

Thank you for your interest in TCS. Candidates who meet the qualifications for this position will be contacted within two weeks. We invite you to continue to apply for other opportunities that match your profile.

