DevOps Institute: Site Reliability Engineering (SRE) Practitioner
Code: DOI SRE Practitio
Duration: 3 Day | $2495 USD
Today's organizations deal with a higher volume of change in a more complex tech environment, leading to a higher risk of outages and incidents. IT teams must improve service reliability and system resiliency. With automation and observability becoming key factors for more efficient and rapid deployments, the Site Reliability Engineering (SRE) profile has become one of the fastest-growing enterprise roles and sets of operational practices for managing services at scale.
The DevOps Institute SRE Practitioner course provides a practical view of how to successfully implement a flourishing SRE culture in your organization. This 3-day course is a practical progression for DOI SRE Foundation certificate holders.
This course is available in the following formats:
Duration: 3 Day
Delivery Format: Virtual Classroom | $2495
Learning objectives:
- Practical view of how to successfully implement a flourishing SRE culture in your organization
- The underlying principles of SRE and an understanding of what it is not in terms of antipatterns
- Organizational impact of introducing SRE
- SLIs and SLOs in a distributed ecosystem and extending the usage of Error Budgets
- Building security and resilience by design in a distributed, zero-trust environment
- Implementing full-stack observability, distributed tracing and Observability-driven development culture
- Curating data using AI to move from reactive to proactive and predictive incident management
- Using DataOps to build clean data lineage
- Why Platform Engineering is important in building consistency and predictability
- Implementing practical Chaos Engineering
- Major incident response responsibilities
- SRE Execution model
Module 1: SRE Anti-Patterns
- Break the ice with a recap of DevOps Institute's SRE Blueprint
- Discuss how SRE works in a distributed ecosystem
- Discuss some of the SRE Barriers
- A few SRE Anti-Patterns (discuss the right patterns too)
- Discuss the Case Story of how Monzo Bank learned from the causes leading to a SEV1 issue
- Case Story: Monzo Bank
- Discussion / Exercise: Good versus Bad Postmortem, Describe a Major Incident, Anti-Patterns of SRE
Module 2: SLO is a Proxy for Customer Happiness
- What has changed with SLO?
- Identifying System boundaries for setting SLIs is critical
- How do you use Error Budgets beyond the velocity versus stability debate?
- Case Story: Kudos Engineering, Home Depot
- Discussion / Exercise: Establishing SLOs in Distributed Ecosystems
Module 3: Building Secure and Reliable Systems
- Building Secure and Reliable systems
- Non-Abstract Large Scale Design
- Designing for the changing Architecture and distributed ecosystem
- Fault tolerant Design
- Designing for Security
- Designing for Resiliency
- Case Story: Chrome Security Team
- Discussion / Exercise: Non-Abstract Large Scale Design Capacity
Module 4: Full Stack Observability
- Modern Apps are Complex & Unpredictable
- Slow is the New Down
- Pillars of Observability
- Using OpenTelemetry
- Case Story: Planet Labs
- Discussion / Exercise: How do you bake Observability into your Code
Module 5: Platform Engineering and AIOps
- Taking a Platform Centric View
- How do you use AIOps to improve Resiliency
- How can DataOps help you in the journey
- A simple recipe to implement AIOps
- Indicative measurement of AIOps
- Case Story: FedEx, 3M
- Discussion / Exercise: Instrumenting AIOps using Prometheus
Module 6: SRE and Incident Response Management
- SRE Key Responsibilities towards incident response
- DevOps & SRE and ITSM (new vs. old ways)
- OODA and SRE Incident Response
- SRE and CLR (closed loop remediation)
- Swarming Food for Thought
- AI/ML for better Incident Management
- Case Story: HCL AIOps Journey
- Discussion / Exercise: Teams discuss Swarming and the Tier Layered Incident Response framework
Module 7: Chaos Engineering
- Navigating Complexity
- Chaos Engineering Defined
- Quick Facts
- Chaos Monkey Origin Story
- Who is adopting Chaos Engineering
- Myths of Chaos
- Chaos Engineering Experiments
- GameDay Exercises
- Security Chaos Engineering
- Chaos Engineering Resources
- Discussion / Exercise: Instrumenting Gremlin, Discuss how to conduct a GameDay exercise
Module 8: SRE is the Purest Form of DevOps
- Key Principles of SRE
- SREs help increase Reliability across the spectrum
- Metrics for Success
- SRE Execution models
- Culture and Behavioral Skills are key
- Transformation after implementing SRE practices
- Case Story: Airbnb
- Discussion / Exercise: Discuss NALSD learnings from Module, Transformation after implementing SRE practices
Target audience:
- IT leaders & managers
- Organizational change leaders and agents
- SRE engineers
- System Integrators
- Business Stakeholders
- DevOps Practitioners
- Scrum Masters/Product Owners
- Software Engineers
It is highly recommended that learners attend the SRE Foundation course and earn the SRE Foundation certification prior to attending the SRE Practitioner course and exam. An understanding and knowledge of common SRE terminology, concepts, principles and related work experience are recommended.