Staff Infrastructure Site Reliability Engineer
1 week ago
Posted: 04/05/2025 | Anywhere in the world | Remote | Senior

About the Team:
Netlify’s SRE team is scaling to meet the demands of our rapidly growing platform and user base. Our SRE team is responsible for ensuring the reliability, scalability, and efficiency of Netlify’s infrastructure while maintaining a focus on innovation and operational excellence.

As a Staff Site Reliability Engineer, you will be at the forefront of driving organizational-level reliability strategies, shaping the direction of Netlify’s systems, and tackling complex, systemic challenges. You will collaborate across teams to build a culture of operational excellence and deliver impactful solutions that support our mission to empower the next generation of web developers.

We are a remote-first, globally distributed group that values asynchronous communication, documentation, and a culture of transparency, empowerment, and collective ownership. Diversity and inclusion are at the heart of what we do, and we welcome team members from all backgrounds to bring their unique perspectives to our mission. Whether you’re launching a new phase of your career or growing an established one, Netlify offers a supportive environment where you can thrive while maintaining a healthy work-life balance.

What You’ll Do:
Lead high-impact reliability and infrastructure initiatives across the platform.
Drive the adoption of Infrastructure-as-Code and champion reliability-focused tooling and frameworks.
Manage all cloud infrastructure components, including instances, networking, DNS, Terraform automation, and Kubernetes.
Define and uphold architectural standards, best practices, and technical strategy for reliability at scale.
Provide mentorship to senior engineers and tech leads, fostering systems thinking and operational excellence.
Partner with Engineering, Product, and Executive teams to embed reliability into company-wide strategy.
Lead architecture reviews and provide oversight for critical infrastructure projects.
Develop and advocate for reliability metrics and SLO frameworks that align with business goals.
Participate in an on-call rotation and occasionally act as Incident Commander, providing technical leadership and system-level decision-making.

What You’ll Bring:
Deep expertise in cloud architecture, with hands-on experience designing and deploying global-scale solutions on AWS, Azure, or GCP.
Strong proficiency with Kafka or similar messaging systems, including deployment, scaling, and maintenance in multi-cloud environments.
Solid experience in database design, performance tuning, and maintenance for both relational and NoSQL systems in high-throughput environments.
Skill in programming and scripting languages such as Go or Python, with a focus on automation and infrastructure tooling.
A proven track record of leading large-scale, cross-team technical initiatives and delivering impactful infrastructure outcomes.
Proficiency in configuration management tools like Ansible, Chef, or Puppet.
Experience in managing CI/CD pipelines using tools such as Jenkins, GitLab CI, CircleCI, or similar.
Excellent communication skills, with the ability to articulate complex technical strategies to executives and build consensus across diverse teams.
Demonstrated success in setting and scaling technical standards and best practices across large engineering organizations.

We welcome candidates based in Spain, Canada, or the UK for this position.

This role is a great fit if:
You think in systems. You’re curious about how infrastructure, networking, observability, and security connect, and you enjoy breaking down complex challenges into clear, actionable strategies.
You’re comfortable writing code (especially in Go) and enjoy automating infrastructure workflows, building tools to reduce manual effort, and supporting reliable operations at scale.
You’ve collaborated on cross-functional initiatives (like operational readiness reviews, cloud migrations, or introducing monitoring standards) and know how to communicate clearly with both technical and non-technical teammates.
You take a thoughtful, methodical approach to troubleshooting. You seek context before jumping to solutions, validate assumptions, and can clearly explain how you navigate production issues or potential incidents.
You work well in a distributed environment and value clear, respectful communication. Whether async or live, you prioritize inclusivity, documentation, and creating space for others to contribute.
You’re energized by helping others grow, whether that’s through mentoring, sharing knowledge, or building systems that support better outcomes across the team.
You approach reliability as a proactive practice, not just a reactive one. You care about preventing issues before they become incidents and building systems that help everyone sleep better at night.
You’re drawn to big, interesting challenges. The idea of helping shape a global CDN, support edge computing innovation, and rethink infrastructure for modern developers is what motivates you.

Applying:
Not sure you meet 100% of our qualifications? Please apply anyway. We value diverse experiences and perspectives.

When applying, please include:
A resume or short listing of your job history and skills (a LinkedIn profile link is fine).
(Optional) A cover letter explaining why you would enjoy this role at Netlify.

Our mission to build a better web relies on a diversity of skill sets, backgrounds, and thoughts. Netlify is an Equal Opportunity Employer, and we are committed to building a team that reflects our values of inclusivity and equity.
If accommodations are needed for the interview process, please email
The platform developers love for building highly performant and dynamic websites, ecommerce stores, and apps.
-
Staff Site Reliability Engineer, Database
2 weeks ago
Canada | Alpaca | Full time
Who We Are: Alpaca is a US-headquartered self-clearing broker-dealer and brokerage infrastructure for stocks, ETFs, options, crypto, fixed income, 24/5 trading, and more. Our recent Series C funding round brought our total investment to over $170 million, fueling our ambitious vision. Amongst our subsidiaries,...
-
Site Reliability Engineer
23 hours ago
Location(s): Canada : Ontario : Toronto | Scotiabank Global Site | Full time
Requisition ID: 244026
Join a purpose-driven winning team, committed to results, in an inclusive and high-performing culture.
Overview: As a Site Reliability Engineer (SRE), you will join the Digital Engineering Operations team, responsible for ensuring the operations and reliability of Scotiabank digital applications. You will have the opportunity to drive...
-
DevOps/Site Reliability Engineer (SRE)
Canada | Oscilar | Full time
Overview: Shape the future of trust in the age of AI. At Oscilar, we're building the most advanced AI Risk Decisioning Platform. Banks, fintechs, and digitally native organizations rely on us to manage their fraud,...
-
Senior Site Reliability Engineer, Infrastructure
3 weeks ago
Canada | Wealthsimple | Full time
Wealthsimple is on a mission to help everyone achieve financial freedom by reimagining what it means to manage your money. Using smart technology, we take financial services that are often confusing, opaque and...
-
Senior Site Reliability Engineer, Infrastructure
3 weeks ago
Canada | Wealthsimple | Full time
Your career is an investment that grows over time! Wealthsimple is on a mission to help everyone achieve financial freedom by reimagining what it means to manage your money. Using smart technology, we take financial services that are often confusing, opaque and expensive and make them transparent and low-cost for everyone. We’re the largest fintech company...
-
Senior Site Reliability Engineer, Infrastructure
3 weeks ago
Canada | Wealthsimple | Full time
A fintech company in Canada seeks a Senior Site Reliability Engineer to enhance system reliability and scalability. The ideal candidate will leverage their experience with Ruby, SQL, AWS, and Kubernetes to improve core infrastructure. Responsibilities include addressing infrastructure gaps and improving system observability. This role offers competitive...
-
Site Reliability Engineer
3 weeks ago
MB, Canada | MongoDB | Full time
Site Reliability Engineer (Senior or Staff), Fabric
The Team: Platform Engineering is the department within SRE that is responsible for a range of critical infrastructure and operational functions that support the broader engineering organization. Among these are our...
-
Senior Site Reliability Engineer
1 week ago
Canada | Thinkific | Full time
Are you an experienced Site Reliability Engineer looking for a new challenge? We’re looking for a Senior Site Reliability Engineer to join us at Thinkific...
-
Site Reliability Engineer
3 days ago
Canada | Dayforce | Full time
About the Opportunity: As a Site Reliability Engineer at Dayforce, you will be part of a pioneering team responsible for ensuring our industry-leading HCM platform delivers exceptional scalability, availability, and reliability. Dayforce is a global HCM technology company with operations across North America, EMEA, and APJ, and our award-winning cloud platform...
-
Senior Site Reliability Engineer
4 days ago
Canada | Paxos | Full time
About Paxos: Today’s financial infrastructure is archaic, expensive, inefficient and risky, supporting a system that leaves out more people than it lets in. So we’re rebuilding it. We’re on a mission to open the world’s financial system to everyone by enabling the instant movement of any asset, any time, in a trustworthy way. For over a decade,...