Site Reliability Engineer
2 weeks ago
Join us as a Senior Site Reliability Engineer to help us run an industry-scale GPU cluster via Kubernetes. Together with senior members of our team, you will combine your strong understanding of system scaling and security practices with your cloud-native expertise to stand up and maintain Kubernetes clusters from scratch. Your role will also be pivotal in supporting our other service offerings, from full-stack development to AI integration, ensuring they are robust, scalable, and secure.
We need engineers on our team to be versatile, display leadership qualities, and be enthusiastic about taking on new problems across the stack as we solve new and interesting technology problems. As a senior member of the team, you will be relied upon to design robust solutions that solve client problems, drive consensus around technical solutions, and ultimately own the success of projects. In return, you can expect latitude in the way you choose to run projects and design systems, while receiving direct support, guidance, and coaching from Bit Complete’s management team.
What you'll be doing
Develop and implement comprehensive infrastructure strategies that emphasize reliability, flexibility, and security.
Manage and scale our cloud-native environments, including Kubernetes clusters and container orchestration.
Oversee the deployment and maintenance of infrastructure tools.
Lead initiatives on stateless architectures to enhance scalability and maintainability of our systems.
Utilize your expertise in distributed systems using technologies like Kafka, Postgres, Redis, and Elasticsearch.
Design and monitor CI/CD pipelines to streamline deployment processes using tools like Spinnaker.
Implement and manage monitoring solutions using OpenTSDB, Prometheus, Grafana, and Envoy to ensure optimal performance and reliability.
Provide leadership and direction to the infrastructure team, fostering a culture of continuous learning and improvement.
Your Background
Relevant industry experience, specifically in Site Reliability Engineering or a similar role, with a proven track record in technical leadership and setting the direction for scalable systems.
Strong background in managing and deploying infrastructure in cloud-native environments (AWS and GCP).
Experience with container orchestration (Docker, Kubernetes) and infrastructure as code (Terraform, Pulumi).
Experience with monitoring and logging tools, and a solid understanding of network metrics.
Familiarity with Linux, and excellent problem-solving, debugging, and troubleshooting skills.
Proficiency in system design and a solid understanding of distributed systems and DevOps tools and practices, particularly in developing and maintaining CI/CD systems for fully automated deployment, testing, and monitoring of applications.
Familiarity with MLOps practices, including automation and orchestration of machine learning models.
Experience with database technologies and designing infrastructure to support both traditional and AI-driven applications.
Excellent communication skills, with the ability to engage and influence both technical and non-technical stakeholders.
About Us
CAD $150,644 - $200,644 annually. Our ranges include base salary and a conservative bonus target.
Interested? We're excited about working with you, so get in touch. Submit your application here.
The world of work today is overflowing with systems, processes, tools, and assumptions that are flawed and that can push directly against our ability to express what is unique about each of us in the work we do every day. We believe people from diverse backgrounds, with different identities and experiences, make our company better. No matter your background, we'd love to hear from you. Alignment with our values is just as important as experience. Also, please let us know if there are ways we can make our interview process better for you. We're always happy to listen and accommodate where possible.
-
Site Reliability Engineer
6 days ago
Canada : Ontario : Toronto. Scotiabank Global Site. Full time. Requisition ID: 245210. Join a purpose driven winning team, committed to results, in an inclusive and high-performing culture. The Team: Global Banking and Markets Engineering (GBME) is the fast-moving, award-winning technology engine that powers Scotiabank's Corporate, Investment Banking and Capital Markets businesses. The Role: GBME is searching for a Site...
-
Site Reliability Engineer
1 week ago
Canada : Ontario : Toronto. Scotiabank Global Site. Full time. Requisition ID: 244026. Join a purpose driven winning team, committed to results, in an inclusive and high-performing culture. Overview: As a Site Reliability Engineer (SRE), you will join the Digital Engineering Operations team, responsible for ensuring the operations and reliability of Scotiabank digital applications. You will have the opportunity to drive...
-
Site Reliability Engineer
1 week ago
Canada : Ontario : Toronto. Scotiabank Global Site. Full time. Requisition ID: 247129. Join a purpose driven winning team, committed to results, in an inclusive and high-performing culture. As an SRE, you will implement, measure and gather insights from Operational Level Indicators, identifying areas for service improvements covering availability, performance, resilience, incidents and chronic problems. You will implement...
-
Site Reliability Engineer
1 week ago
Canada. Dayforce. Full time. About the Opportunity: As a Site Reliability Engineer at Dayforce, you will be part of a pioneering team responsible for ensuring our industry-leading HCM platform delivers exceptional scalability, availability, and reliability. Dayforce is a global HCM technology company with operations across North America, EMEA, and APJ, and our award-winning cloud platform...
-
Senior Site Reliability Engineer
2 weeks ago
Canada. Thinkific. Full time. Are you an experienced Site Reliability Engineer looking for a new challenge? We’re looking for a Senior Site Reliability Engineer to join us at Thinkific. We’re looking for a Senior Site Reliability Engineer...
-
Senior Site Reliability Engineer
2 weeks ago
Canada. DuckDuckGo. Full time. Who We Are: Hi, we're DuckDuckGo, the online protection company and remote-first team of 300+ on a mission to raise the standard of trust online. Founded in 2008 and profitable since 2014, our annual revenue now exceeds $100 million USD. Millions use our...
-
Site Reliability Engineer
2 weeks ago
Canada. Dayforce. Full time. Base pay range: CA$67,700.00/yr - CA$120,900.00/yr. Dayforce is a global human capital management (HCM) company headquartered in Toronto, Ontario, and Minneapolis, Minnesota, with operations across North America, Europe, Middle East, Africa (EMEA), and the Asia Pacific Japan (APJ) region. Our award‑winning Cloud HCM platform offers a unified solution...
-
Site Reliability Engineer
2 weeks ago
Canada. mthree Recruiting Portal. Full time. A market-leading investment bank requires a Site Reliability Engineer to join its Technology Operations Management department. The team is responsible for enabling the Firm to manage its technology and data-related risks. The department is entrusted with the responsibility of protecting the financial interests of millions worldwide; they are required to ensure...
-
Senior Site Reliability Engineer
2 weeks ago
Canada. TextNow. Full time. This range is provided by TextNow. Your actual pay will be based on your skills and experience. Talk with your recruiter to learn more. Base pay range: CA$113,400.00/yr - CA$162,000.00/yr. We believe communication belongs to everyone. We exist to democratize phone service. TextNow is evolving the way the world connects and that's because we're made up of...
-
Manager, Site Reliability Engineer
4 weeks ago
Canada. Command Alkon Incorporated. Full time. Title: Manager, Site Reliability Engineer (SRE). Summary of Role: The Site Reliability Engineer (SRE) Manager leads the teams responsible for ensuring the availability, performance, and reliability of mission‑critical systems. This role bridges the gap between software engineering and operations by implementing automation, observability, and scalability...