AI SRE Engineer
5 days ago
Inclusion without Exception:
Tata Consultancy Services (TCS) is an equal opportunity employer, and embraces diversity in race, nationality, ethnicity, gender, age, physical ability, neurodiversity, and sexual orientation, to create a workforce that reflects the societies we operate in. Our continued commitment to Culture and Diversity is reflected in our people stories across our workforce and implemented through equitable workplace policies and processes.

About TCS:
TCS is an IT services, consulting, and business solutions organization that has been partnering with many of the world's largest businesses in their transformation journeys for over 55 years. Its consulting-led, cognitive-powered portfolio of business, technology, and engineering services and solutions is delivered through its unique Location Independent Agile delivery model, recognized as a benchmark of excellence in software development. A part of the Tata group, India's largest multinational business group, TCS operates in 55 countries and employs over 607,000 highly skilled individuals, including more than 10,000 in Canada. The company generated consolidated revenues of US $30 billion in the fiscal year ended March 31, 2025, and is listed on the BSE and the NSE in India. TCS' proactive stance on climate change and award-winning work with communities across the world have earned it a place in leading sustainability indices such as the MSCI Global Sustainability Index and the FTSE4Good Emerging Index.

Technical Skills:
- Production experience in SRE / Infrastructure / ops for large-scale systems
- Strong programming/scripting skills (Python, Go, Java, or equivalent)
- Deep experience with containerization (Docker), orchestration (Kubernetes, etc.)
- Infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, etc.)
- Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures
- Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)
- Networking & systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)
- Solid experience in capacity planning, performance tuning, scaling, and incident response
- Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements
- Experience in regulated environments (financial services, compliance, audit, security) is a strong plus
- Excellent communication, documentation, and cross-team collaboration skills
- Proven track record of reducing operational toil via automation

Skills and Responsibilities:
- Operate, monitor, and maintain the infrastructure supporting GenAI applications (training, inference, feature store, data ingestion, model serving)
- Design and build automation for core platform capabilities, reducing manual toil
- Develop and maintain infrastructure-as-code (IaC) for provisioning and managing compute, storage, network, GPU clusters, Kubernetes / container orchestration, etc.
- Establish, monitor, and enforce SLOs/SLIs/SLAs, error budgets, alerting, and dashboards
- Lead incident response, root cause analysis (RCA), postmortems, and systemic remediation
- Perform capacity planning, scaling strategies, workload scheduling, and resource forecasting
- Optimize cost vs. performance tradeoffs in large-scale compute environments
- Harden systems for security, compliance, auditability, and data governance
- Collaborate across teams (cloud engineers, data engineers, infrastructure, security) to ensure safe deployment, rollout, rollback, and integration of new systems
- Define disaster recovery (DR) strategies, backup/restore practices, fault tolerance mechanisms
- Maintain runbooks, operational playbooks, documentation, and training materials
- Participate in on-call rotations and respond to production incidents 24/7 as needed
- Continuously evaluate and integrate new tools, frameworks, or technologies to enhance platform reliability

Tata Consultancy Services Canada Inc. is committed to meeting the accessibility needs of all individuals in accordance with the Accessibility for Ontarians with Disabilities Act (AODA) and the Ontario Human Rights Code (OHRC). Should you require accommodation during the recruitment and selection process, please inform Human Resources.

Thank you for your interest in TCS. Candidates that meet the qualifications for this position will be contacted within a 2-week period. We invite you to continue to apply for other opportunities that match your profile.
-
AI SRE Engineer
2 weeks ago
Montréal, QC, Canada Tata Consultancy Services Full time
Inclusion without Exception: Tata Consultancy Services (TCS) is an equal opportunity employer, and embraces diversity in race, nationality, ethnicity, gender, age, physical ability, neurodiversity, and sexual orientation, to create a workforce that reflects the societies we operate in. Our continued commitment to Culture and Diversity is reflected in our...
-
AI SRE Engineer
4 days ago
Montréal, QC, Canada Tata Consultancy Services Full time
Inclusion without Exception: Tata Consultancy Services (TCS) is an equal opportunity employer, and embraces diversity in race, nationality, ethnicity, gender, age, physical ability, neurodiversity, and sexual orientation, to create a workforce that reflects the societies we operate in. Our continued commitment to Culture and Diversity is reflected in our...
-
AI SRE Engineer
3 days ago
Montréal, QC, Canada Tata Consultancy Services Full time
Inclusion without Exception: Tata Consultancy Services (TCS) is an equal opportunity employer, and embraces diversity in race, nationality, ethnicity, gender, age, physical ability, neurodiversity, and sexual orientation, to create a workforce that reflects the societies we operate in. Our continued commitment to Culture and Diversity is reflected in our...
-
Site Reliability Engineer
4 weeks ago
Quebec, Canada ALLTECH CONSULTING SVC INC Full time
Job Description Level 4 Overview The Application Infrastructure (AI) department is seeking a Site Reliability Engineer (SRE) to help drive the reliability engineering, operations and customer support services for ServiceNow SaaS implementation. Reporting to a Site Reliability Engineering & Operations Lead. This role requires delivering a range of SRE...
-
Site Reliability Engineer
2 weeks ago
Montréal, QC, Canada Open Systems Technologies Full time
The Application Infrastructure (AI) department is seeking a Site Reliability Engineer (SRE) to help drive the reliability engineering, operations and customer support services for Morgan Stanley's ServiceNow SaaS implementation. Reporting to a Site Reliability Engineering & Operations Lead. This role requires delivering a range of SRE practices within a...
-
SRE Engineer
2 weeks ago
Markham, Toronto, Montreal, Calgary, Vancouver, Edmonton, Old Toronto, Ottawa, Mississauga, Quebec, Winnipeg, Halifax, Saskatoon, Burnaby, Hamilton, Surrey, Victoria, London, Halton Hills, Regina, Brampton, Vaughan, Kelowna, Laval, Southwestern Ontario, R, Canada kloia Full time
Kloia is a recognized AWS Premier Consulting Partner and CNCF member with a focus on Application Modernization and Digital Transition projects. Our teams are growing rapidly, and we’re hiring a Site Reliability Engineer primarily for our managed services provided to customers, as well as for...
-
Lead SRE/DevOps Engineer
3 weeks ago
Vancouver, Toronto, Montreal, Calgary, Edmonton, Old Toronto, Ottawa, Mississauga, Quebec, Winnipeg, Halifax, Saskatoon, Burnaby, Hamilton, Surrey, Victoria, London, Halton Hills, Regina, Markham, Brampton, Vaughan, Kelowna, Laval, Southwestern Ontario, R, Canada JobGet Full time
About JobGet
As the #1 app focused on everyday workers, JobGet is redefining the future of hiring. Founded in 2019, JobGet began as the only mobile-first hiring platform for everyday workers. Since then, we’ve grown by joining forces with Snagajob, the largest hourly job board in the U.S., followed by Seasoned, the leading platform for restaurant hiring....
-
Senior Site Reliability Engineer
2 weeks ago
Quebec, Canada Orion Innovation Full time
The Sr. SRE will be responsible for the reliability, scalability, and performance of systems supporting classified government projects in an air-gapped deployment. This role leverages advanced monitoring and DevOps tools to ensure uptime and compliance in a disconnected environment.
Key Responsibilities
Design and maintain highly reliable systems using RKE2,...