Site Reliability Engineer
4 weeks ago
Join us as a Senior Site Reliability Engineer to help us run an industry-scale GPU cluster via Kubernetes. Together with senior members of our team, you will combine your strong understanding of system scaling and security practices with your cloud-native expertise to stand up and maintain Kubernetes clusters from scratch. Your role will also be pivotal in supporting our other service offerings, from full-stack development to AI integration, ensuring they are robust, scalable, and secure.
We need engineers on our team to be versatile, display leadership qualities, and be enthusiastic about taking on new problems across the stack as we solve new and interesting technology problems. As a senior member of the team, you will be relied upon to design robust solutions that solve client problems, drive consensus around technical solutions, and ultimately own the success of projects. In return, you can expect latitude in the way you choose to run projects and design systems, while receiving direct support, guidance, and coaching from Bit Complete’s management team.

What you'll be doing
Develop and implement comprehensive infrastructure strategies that emphasize reliability, flexibility, and security.
Manage and scale our cloud-native environments, including Kubernetes clusters and container orchestration.
Oversee the deployment and maintenance of infrastructure tools.
Lead initiatives on stateless architectures to enhance the scalability and maintainability of our systems.
Apply your expertise in distributed systems using technologies like Kafka, Postgres, Redis, and Elasticsearch.
Design and monitor CI/CD pipelines to streamline deployment processes using tools like Spinnaker.
Implement and manage monitoring solutions using OpenTSDB, Prometheus, Grafana, and Envoy to ensure optimal performance and reliability.
Provide leadership and direction to the infrastructure team, fostering a culture of continuous learning and improvement.

Your Background
Relevant industry experience, specifically in Site Reliability Engineering or a similar role, with a proven track record in technical leadership and setting the direction for scalable systems.
Strong background in managing and deploying infrastructure in cloud-native environments (AWS and GCP).
Experience with container orchestration (Docker, Kubernetes) and infrastructure as code (Terraform, Pulumi).
Experience with monitoring and logging tools, and a solid understanding of network metrics.
Familiarity with Linux, and excellent problem-solving, debugging, and troubleshooting skills.
Proficiency in system design and a solid understanding of distributed systems and of DevOps tools and practices, particularly in developing and maintaining CI/CD systems for fully automated deployment, testing, and monitoring of applications.
Familiarity with MLOps practices, including automation and orchestration of machine learning models.
Experience with database technologies and designing infrastructure to support both traditional and AI-driven applications.
Excellent communication skills, with the ability to engage and influence both technical and non-technical stakeholders.

About Us
CAD $150,644 - $200,644 annually. Our ranges include base salary and a conservative bonus target.
Interested? We're excited about working with you, so get in touch. Submit your application here.
The world of work today is overflowing with systems, processes, tools, and assumptions that are flawed and that can push directly against our ability to express what is unique about each of us in the work we do every day. We believe people from diverse backgrounds, with different identities and experiences, make our company better. No matter your background, we'd love to hear from you. Alignment with our values is just as important as experience. Also, please let us know if there are ways we can make our interview process better for you; we're always happy to listen and accommodate where possible.
-
Site Reliability Engineer
5 days ago
(s): Canada : Ontario : Toronto Scotiabank Global Site Full time Requisition ID: 244027 Join a purpose-driven winning team, committed to results, in an inclusive and high-performing culture. Overview: As a Site Reliability Engineer (SRE), you will join the Digital Engineering Operations team, responsible for ensuring the operations and reliability of Scotiabank digital applications. You will have the opportunity to drive...
-
Site Reliability Engineer
5 days ago
(s): Canada : Ontario : Toronto Scotiabank Global Site Full time Requisition ID: 244026 Join a purpose-driven winning team, committed to results, in an inclusive and high-performing culture. Overview: As a Site Reliability Engineer (SRE), you will join the Digital Engineering Operations team, responsible for ensuring the operations and reliability of Scotiabank digital applications. You will have the opportunity to drive...
-
Site Reliability Engineer
3 weeks ago
Canada SPECTRAFORCE Full time Job Title: DevOps/Site Reliability Engineer Duration: 12+ months Core hours of the position: somewhat flexible, but able to attend meetings and collaborate with team members between 8 am Pacific and 3 pm Pacific. Team members are located in Pacific, Mountain, Central, and East time zones Top 3 items to see on resumes 5+ years of experience in DevOps, Site...
-
Site Reliability Engineer
3 weeks ago
Canada SPECTRAFORCE Full time Job Title: DevOps/Site Reliability Engineer Duration: 12+ months Locations: Ontario, Toronto, Vancouver, Montreal (100% remote) Core hours of the position: somewhat flexible, but able to attend meetings and collaborate with team members between 8 am Pacific and 3 pm Pacific. Team members are located in Pacific, Mountain, Central, and East time zones Top 3...
-
Site Reliability Engineer
3 weeks ago
Canada SPECTRAFORCE Full time Job Title: DevOps/Site Reliability Engineer Duration: 12+ months Locations: Ontario, Toronto, Vancouver, Montreal (100% remote) Core hours of the position: somewhat flexible, but able to attend meetings and collaborate with team members between 8 am Pacific and 3 pm Pacific. Team members are located in Pacific, Mountain, Central, and East time zones Top...
-
Site Reliability Engineer
3 weeks ago
Canada SPECTRAFORCE Full time Job Title: DevOps/Site Reliability Engineer Duration: 12+ months Locations: Ontario, Toronto, Vancouver, Montreal (100% remote) Core hours of the position: somewhat flexible, but able to attend meetings and collaborate with team members between 8 am Pacific and 3 pm Pacific. Team members are located in Pacific, Mountain, Central, and East time zones Top 3...
-
Site Reliability Engineer
3 weeks ago
Canada SPECTRAFORCE Full time Job Title: DevOps/Site Reliability Engineer Duration: 12+ months Locations: Ontario, Toronto, Vancouver, Montreal (100% remote) Core hours of the position: somewhat flexible, but able to attend meetings and collaborate with team members between 8 am Pacific and 3 pm Pacific. Team members are located in Pacific, Mountain, Central, and East time zones Top 3...
-
Site Reliability Engineer
3 weeks ago
Canada Blue Signal Search Full time Site Reliability Engineer Location: Remote, Canada Our client is a fast-growing provider of AI-driven edge-computing platforms that keep industrial operations safe, smart, and always on. Their distributed hardware and software suite processes high-volume video and sensor data at the edge, delivering real-time insight for customers who cannot afford...
-
Site Reliability Engineer
3 weeks ago
Canada Blue Signal Search Full time Site Reliability Engineer Location: Remote, Canada Our client is a fast-growing provider of AI-driven edge-computing platforms that keep industrial operations safe, smart, and always on. Their distributed hardware and software suite processes high-volume video and sensor data at the edge, delivering real-time insight for customers who cannot afford downtime....
-
Site Reliability Engineer
3 weeks ago
Canada Blue Signal Search Full time Site Reliability Engineer Location: Remote, Canada Our client is a fast-growing provider of AI-driven edge-computing platforms that keep industrial operations safe, smart, and always on. Their distributed hardware and software suite processes high-volume video and sensor data at the edge, delivering real-time insight for customers who cannot afford downtime....