Course Catalog
DevOps Institute: Site Reliability Engineering (SRE) Practitioner
Code: DOI SRE Practitio
Duration: 3 Days
$2495 USD

OVERVIEW

Today’s organizations deal with a higher volume of change in a more complex tech environment, which leads to a higher risk of outages and incidents. IT teams must improve service reliability and system resiliency. With automation and observability becoming key factors for more efficient and rapid deployments, the Site Reliability Engineering (SRE) profile has become one of the fastest-growing enterprise roles and sets of operational practices for managing services at scale.

The DevOps Institute SRE Practitioner course provides a practical view of how to successfully implement a flourishing SRE culture in your organization. This 3-day course is a practical progression for DOI SRE Foundation certificate holders.

DELIVERY FORMAT

This course is available in the following formats:

Virtual Classroom

Duration: 3 Days

CLASS SCHEDULE

Delivery Format: Virtual Classroom
Date: Jun 10 2024 - Jun 12 2024 | 08:30 - 16:30 EDT
Location: Online
Course Length: 3 Days

$ 2495

Delivery Format: Virtual Classroom
Date: Aug 05 2024 - Aug 07 2024 | 08:30 - 16:30 EDT
Location: Online
Course Length: 3 Days

$ 2495

Delivery Format: Virtual Classroom
Date: Oct 07 2024 - Oct 09 2024 | 08:30 - 16:30 EDT
Location: Online
Course Length: 3 Days

$ 2495

Delivery Format: Virtual Classroom
Date: Dec 02 2024 - Dec 04 2024 | 09:00 - 17:00 EST
Location: Online
Course Length: 3 Days

$ 2495

GOALS
  • Practical view of how to successfully implement a flourishing SRE culture in your organization
  • The underlying principles of SRE and an understanding of what it is not in terms of antipatterns
  • Organizational impact of introducing SRE
  • SLIs and SLOs in a distributed ecosystem and extending the usage of Error Budgets
  • Building security and resilience by design in a distributed, zero-trust environment
  • Implementing full-stack observability, distributed tracing and Observability-driven development culture
  • Curating data using AI to move from reactive to proactive and predictive incident management
  • Using DataOps to build clean data lineage
  • Why Platform Engineering is important in building consistency and predictability
  • Implementing practical Chaos Engineering
  • Major incident response responsibilities
  • SRE Execution model
OUTLINE


Module 1: SRE Anti-Patterns

  • Break the ice with a recap of DevOps Institute’s SRE Blueprint
  • Discuss how SRE works in a distributed ecosystem
  • Discuss some of the SRE Barriers
  • A few SRE Anti-Patterns (discuss the right patterns too)
  • Discuss the Case Story of how Monzo Bank learned from the causes leading to a SEV1 issue
  • Case Story: Monzo Bank
  • Discussion / Exercise: Good versus Bad Postmortem, Describe a Major Incident, Anti-Patterns of SRE

Module 2: SLO is a Proxy for Customer Happiness

  • What has changed with SLO?
  • Identifying System boundaries for setting SLIs is critical
  • How do you use Error Budgets beyond the velocity versus stability debate?
  • Case Story: Kudos Engineering, Home Depot
  • Discussion / Exercise: Establishing SLOs in Distributed Ecosystems

Module 3: Building Secure and Reliable Systems

  • Building Secure and Reliable systems
  • Non-Abstract Large Scale Design
  • Designing for the changing Architecture and distributed ecosystem
  • Fault tolerant Design
  • Designing for Security
  • Designing for Resiliency
  • Case Story: Chrome Security Team
  • Discussion / Exercise: Non-Abstract Large Scale Design – Capacity

Module 4: Full Stack Observability

  • Modern Apps are Complex & Unpredictable
  • Slow is the New Down
  • Pillars of Observability
  • Using OpenTelemetry
  • Case Story: Planet Labs
  • Discussion / Exercise: How do you bake Observability into your Code?

Module 5: Platform Engineering and AIOps

  • Taking a Platform Centric View
  • How do you use AIOps to improve Resiliency?
  • How can DataOps help you in the journey?
  • A simple recipe to implement AIOps
  • Indicative measurement of AIOps
  • Case Story: FedEx, 3M
  • Discussion / Exercise: Instrumenting AIOps using Prometheus

Module 6: SRE and Incident Response Management

  • SRE Key Responsibilities towards incident response
  • DevOps & SRE and ITSM (new vs. old ways)
  • OODA and SRE Incident Response
  • SRE and CLR (closed loop remediation)
  • Swarming – Food for Thought
  • AI/ML for better Incident Management
  • Case Story: HCL AIOps Journey
  • Discussion / Exercise: Teams discuss Swarming and the Tier Layered Incident Response framework

Module 7: Chaos Engineering

  • Navigating Complexity
  • Chaos Engineering Defined
  • Quick Facts
  • Chaos Monkey Origin Story
  • Who is adopting Chaos Engineering
  • Myths of Chaos
  • Chaos Engineering Experiments
  • GameDay Exercises
  • Security Chaos Engineering
  • Chaos Engineering Resources
  • Discussion / Exercise: Instrumenting Gremlin, Discuss how to conduct a GameDay exercise

Module 8: SRE is the Purest Form of DevOps

  • Key Principles of SRE
  • SREs help increase Reliability across the spectrum
  • Metrics for Success
  • SRE Execution models
  • Culture and Behavioral Skills are key
  • Transformation after implementing SRE practices
  • Case Story: Airbnb
  • Discussion / Exercise: Discuss NALSD learnings from Module, Transformation after implementing SRE practices
LABS

Will Be Updated Soon!
WHO SHOULD ATTEND
  • IT leaders & managers
  • Organizational change leaders and agents
  • SRE engineers
  • System Integrators
  • Business Stakeholders
  • DevOps Practitioners
  • Scrum Masters/Product Owners
  • Software Engineers
PREREQUISITES

It is highly recommended that learners attend the SRE Foundation course and earn the SRE Foundation certification prior to attending the SRE Practitioner course and exam. Familiarity with common SRE terminology, concepts, and principles, along with related work experience, is also recommended.