AWS SRE Engineer

2 weeks ago


Vancouver, Canada Microsoft Corporation Full time
Microsoft is a company where passionate innovators come to collaborate, envision what can be, and take their careers further. This is a world of more possibilities, more innovation, more openness, and sky-is-the-limit thinking in a cloud-enabled world.

Microsoft’s Azure Data engineering team is leading the transformation of analytics in the world of data with products like databases, data integration, big data analytics, messaging & real-time analytics, and business intelligence. The products in our portfolio include Microsoft Fabric, Azure SQL DB, Azure Cosmos DB, Azure PostgreSQL, Azure Data Factory, Azure Synapse Analytics, Azure Service Bus, Azure Event Grid, and Power BI. Our mission is to build the data platform for the age of AI, powering a new class of data-first applications and driving a data culture.

Within Azure Data, the databases team builds and maintains Microsoft's operational Database systems. We store and manage data in a structured way to enable a multitude of applications across various industries. We are on a journey to enable developer-friendly, mission-critical, AI-enabled operational Databases across relational, non-relational and Open Source Software (OSS) offerings.

We believe in making the day in the life of the On-Call Engineer boring while living up to the expectations of a massive cloud service with stringent Service Level Objectives (SLO’s). We do this by thinking differently, stretching ourselves to go all the way to the root of the problem, keeping data front and center in all our decisions, and taking a systems approach to generating outcomes that far exceed expectations. Helping attain the aspirational Service Level Objectives (SLO’s) through pragmatic innovation is what sets the SRE’s in Cosmos DB apart.

Azure Cosmos DB is Microsoft’s next generation of globally distributed, massively scalable, multi-model cloud database service. It is designed to enable developers to build planet-scale applications. Azure Cosmos DB is one of the fastest growing Azure services. Joining the Azure Cosmos DB team is a fantastic opportunity to work with incredibly talented engineers operating like a startup and be at the forefront of building and shaping the Livesite Automation and AI Ops stack in Cosmos DB and lead the path for broader adoption across Microsoft Azure.

Cosmos DB is a database of choice for the spectrum spanning from the hobbyist developer to the largest of Fortune 500 companies. The database provides the data backbone of many critical systems in Health Care, Retail, Telecommunications, IoT, etc., where Service Availability and Latency are paramount. Cosmos DB provides financially backed service level agreements (SLAs) around 99.99% Availability and < 10 ms Latency, and we take pride in holding ourselves to even more stringent Service Level Objectives (SLO) that delight our customers. Other than a resilient and fault-tolerant architecture, a key to attaining the SLO’s is automating the root cause analysis and mitigation of issues, often addressing them proactively before any customer impact. This team prides itself on building systems where a vast majority of Livesite issues are automatically mitigated without the need for human intervention.

We are looking for a self-driven Senior Site Reliability Engineer (SRE) who likes taking a data-driven and systems-based approach to solve Service Reliability problems. You will be responsible for building and optimizing solutions that can analyze massive amounts of telemetry and other Service Health indicators in near real time and perform automated root cause analysis and necessary mitigations to restore SLO’s.

Our team focuses on diversity of all types of candidates for our roles and we strive to hire people with different experiences and perspectives into our team.

Qualifications

  • 6+ years technical experience in software engineering, network engineering, or systems administration
    • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
    • OR Master's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration.

Other Requirements

  • Ability to meet Microsoft, customer and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings:
    • Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.

Preferred/Additional Qualifications

  • Understanding of Observability and MELT implementation patterns for large-scale services.
  • Experience in Logic Apps and authoring Jupyter Notebooks, and experience in analyzing, troubleshooting, and automating root cause analysis and mitigation of incidents impacting large-scale distributed systems.
  • 5+ years of SRE or SWE (Software Engineer) experience running large scale cloud services and 5+ years of hands-on experience in Python/Java/C#.
  • 3+ years of operational experience in improving Service Reliability, Availability and Performance.

Responsibilities

  • Collaborate closely with engineering teams on building and enhancing tooling and automation solutions for faster resolution of issues impacting SLO’s and averting incidents altogether when possible.
  • Communicate on a deeply technical level and be the single point of contact for interfacing with large enterprise customers for handling service escalations and driving the issues to resolution.
  • Design and implement any changes to service telemetry that the automation needs to consume, if that telemetry is not already available.
  • Analyze data and provide operational insights into customer experience to Design and Product teams, so that we can design features with Supportability in mind.

Site Reliability Engineering IC4 - The typical base pay range for this role across Canada is CAD $108,100 - CAD $199,700 per year.

Microsoft will accept applications for the role until November 21, 2024.


  • AWS SRE Engineer

    1 week ago


    Vancouver, Canada Take-Two Interactive Software, Inc. Full time

    Who We Are: Headquartered in New York City, Take-Two Interactive Software, Inc. is a leading developer, publisher, and marketer of interactive entertainment for consumers around the globe. The Company develops and publishes products principally through Rockstar Games, 2K, Private Division, and Zynga. Our products are currently designed for console gaming...

  • AWS SRE Engineer

    2 weeks ago


    Vancouver, Canada Royal Bank of Canada Full time

    The Application Support SRE will be responsible for the support, development, and implementation of Site Reliability Engineering solutions for all applications within City National Bank (CNB), an RBC company. This individual will need advanced knowledge and experience working in an application development, support and/or technology operations organization....

  • AWS SRE Engineer

    6 days ago


    Vancouver, Canada Take-Two Interactive Full time

    Who We Are: Headquartered in New York City, Take-Two Interactive Software, Inc. is a leading developer, publisher, and marketer of interactive entertainment for consumers around the globe. The Company develops and publishes products principally through Rockstar Games, 2K, Private Division, and Zynga. Our products are currently designed for console gaming...

  • Cloud Engineer

    2 weeks ago


    Vancouver, British Columbia, Canada Insight Global Full time

    Insight Global seeks a highly skilled Cloud Engineer to lead the migration of applications to new AWS instances and build scalable CI/CD pipelines. The ideal candidate will have 5+ years of experience as a DevOps Engineer, System Administrator or SRE with hands-on experience in building out CI/CD pipelines, migrating to new AWS instances, and implementing...


  • Vancouver, Canada NetApp Full time

    Title: Site Reliability Engineer (SRE). Location: Bangalore, Karnataka, IN, 560071. Requisition ID: 127074. Job Summary: As a Site Reliability Engineer (SRE) with a specialization in storage, you'll manage and optimize a portfolio of customer-facing cloud services (SaaS/IaaS) on Google Cloud Platform (GCP), ensuring their overall availability, performance,...

  • SRE Product Manager

    1 month ago


    Vancouver, Canada Randstad Canada Full time

    Are you an experienced Product Manager in search of your next contract opportunity? Our high-profile client is seeking to hire an SRE Project Manager to join their talented team on a 6-month contract with a strong probability of extension. Apply for this amazing opportunity if this sounds like a good fit for you! Advantages: What’s in it for you! As a...


  • Vancouver, British Columbia, Canada Insight Global Full time

    Job Description: We are seeking a Senior Cloud Infrastructure Engineer to join our DSE team at Insight Global. As a key member of this team, you will be responsible for building and maintaining our cloud infrastructure, including CI/CD pipelines and migrating applications to new AWS instances. The ideal candidate will have 5+ years of experience in DevOps...


  • Vancouver, British Columbia, Canada Take-Two Interactive Software, Inc. Full time

    We are seeking a highly skilled Senior Site Reliability Engineer to join our Direct to Consumer team at Take-Two Interactive Software, Inc. This role will be based in New York City and will come with a competitive salary of $170,000 - $220,000 per year. About the Role: This is an exciting opportunity for a talented SRE to support our infrastructure, monitoring,...


  • Vancouver, Canada Microsoft Canada Full time

    Are you interested in working for one of the most exciting teams at Microsoft? Then look no further than Microsoft Teams SRE team. You will be building solutions that leverage state-of-the-art technologies to deliver the next evolution in collaboration and teamwork. What is a Site Reliability Engineer (SRE)? SRE is what you get when you treat operations as...


  • Vancouver, Canada Conexiom Full time

    About the Opportunity: Conexiom is seeking a dedicated and experienced Site Reliability Engineering (SRE) Senior Manager to lead our SRE team. The role involves leading the Cloud SRE team in day-to-day operations, which include monitoring, support activities, ensuring customer satisfaction through reliable service, and building and designing cloud...


  • Vancouver, Canada Microsoft Canada Full time

    Are you interested in working for one of the most exciting teams at Microsoft? Then look no further than Microsoft Teams SRE team. You will be building solutions that leverage state-of-the-art technologies to deliver the next evolution in collaboration and teamwork. What is a Site Reliability Engineer (SRE)? SRE is what you get when you treat operations as...

  • DevOps Engineer

    1 week ago


    Vancouver, Canada Insight Global Full time

    Day By Day: Insight Global is looking for a DevOps Engineer to join the DSE team which oversees Local Outreach, Digital Store Pages, Store Service API, and InStore Digital (ISD). Local Outreach is an internal application used by our stores to communicate with Guests in their community about events and opportunities they may be interested in. Digital Store...


  • Vancouver, British Columbia, B6B, British Columbia, Canada Microsoft Canada Full time

    Are you interested in working for one of the most exciting teams at Microsoft? Then look no further than Microsoft Teams SRE team. You will be building solutions that leverage state-of-the-art technologies to deliver the next evolution in collaboration and teamwork. What is a Site Reliability Engineer (SRE)? SRE is what you get when you treat operations as if...

  • DevOps Engineer

    2 weeks ago


    Vancouver, Canada Insight Global Full time

    Day By Day: Insight Global is looking for a DevOps Engineer to join the DSE team which oversees Local Outreach, Digital Store Pages, Store Service API, and InStore Digital (ISD). Local Outreach is an internal application used by our stores to communicate with Guests in their community about events and opportunities they may be interested in. Digital Store...

  • DevOps Engineer

    2 weeks ago


    Vancouver, Canada Insight Global Full time

    Day By Day: Insight Global is looking for a DevOps Engineer to join the DSE team which oversees Local Outreach, Digital Store Pages, Store Service API, and InStore Digital (ISD). Local Outreach is an internal application used by our stores to communicate with Guests in their community about events and opportunities they may be interested in. Digital Store...


  • Vancouver, Canada NearSource Full time

    Make your mark as a Principal DevOps Engineer on a multinational Fortune 500 Project in Canada. Responsibilities: Design and implement new service offerings on top of a strong cloud foundation. Support development teams in leveraging our larger machine learning operations (MLOps). Drive the design, implementation, and management for expanding observability...


  • Vancouver, Canada NearSource Full time

    Make your mark as a Principal DevOps Engineer on a multinational Fortune 500 Project in Canada. Shape innovative solutions and drive technological excellence. Apply now to be a valued member of the dynamic team. Responsibilities: Design and implement new service offerings on top of a strong cloud foundation. Drive the design, implementation, and management for...

  • Chief DevOps Engineer

    23 hours ago


    Vancouver, Canada NearSource Full time

    Make your mark as a Principal DevOps Engineer on a multinational Fortune 500 Project in Canada. Responsibilities: Design and implement new service offerings on top of a strong cloud foundation. Drive the design, implementation, and management for expanding observability infrastructure, keeping up to date with new technologies. Lead sustainable incident...


  • Vancouver, Canada NearSource Full time

    Make your mark as a Principal DevOps Engineer on a multinational Fortune 500 Project in Canada. Shape innovative solutions and drive technological excellence. Apply now to be a valued member of the dynamic team. Responsibilities: Design and implement new service offerings on top of a strong cloud foundation. Support development teams in leveraging our larger...

  • DevOps Engineer

    2 weeks ago


    Vancouver, BC, Canada Insight Global Full time

    Day By Day: Insight Global is looking for a DevOps Engineer to join the DSE team which oversees Local Outreach, Digital Store Pages, Store Service API, and InStore Digital (ISD). Local Outreach is an internal application used by our stores to communicate with Guests in their community about events and opportunities they may be interested in. Digital Store...