JAX 2018: An Architect’s Guide to Site Reliability Engineering
Monolithic principles don’t necessarily apply to microservices.
How we work together matters.
Communication is even more important in a complex world.
So what is the history of IT?
Apollo program: Margaret Hamilton, the first SRE.
Wanted to implement error checking to avoid data erasure during takeoff. Management denied this; shit followed in 1968.
Hope Is Not A Strategy!
Monolith + sysop ==> microservices + devops
:. Cambrian explosion of APIs
E.g. Dark Sky API <== weather app
Amazon (Steve Yegge on the Bezos mandate): all data is shared through public service APIs.
Troubleshooting a multilevel microservice architecture is difficult. Who is responsible?
Microservice definition: rewriteable in under 2 weeks.
Everything changing makes sysops sad – how to make them happier?
- Replace manual tasks with automation!
- Focus on engineering
- Helpful to know Unix and network stack
- A CAB (Change Advisory Board) won’t cut it
How to move fast safely?
Ops must be able to support a dynamic environment
Important to prioritize, setting aside time for this – else no automation will get done.
Establish sane SLOs
Manage risk, shit will break.
Risk is a continuum!
How much does catastrophic failure cost? ==> Lost revenue vs cost of redundancy.
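The trade-off above can be put into numbers. The following is a minimal sketch, not something shown in the talk: the function names and all figures are hypothetical, and it assumes outage cost scales linearly with downtime.

```python
def expected_outage_loss(revenue_per_hour: float, outage_hours_per_year: float) -> float:
    """Revenue lost per year, assuming loss is proportional to downtime."""
    return revenue_per_hour * outage_hours_per_year


def redundancy_pays_off(
    revenue_per_hour: float,
    outage_hours_without: float,
    outage_hours_with: float,
    redundancy_cost_per_year: float,
) -> bool:
    """True if the revenue saved by redundancy exceeds what redundancy costs."""
    saved = expected_outage_loss(revenue_per_hour, outage_hours_without) - \
        expected_outage_loss(revenue_per_hour, outage_hours_with)
    return saved > redundancy_cost_per_year


# Hypothetical figures: redundancy cuts downtime from 8 to 1 hours per year
# and costs 50,000 per year to run.
high_revenue = redundancy_pays_off(10_000, 8.0, 1.0, 50_000)
low_revenue = redundancy_pays_off(1_000, 8.0, 1.0, 50_000)
```

With these made-up numbers the same redundancy is worth it for the high-revenue service and not for the low-revenue one, which is the sense in which risk is a continuum rather than a yes/no question.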
Firefighting isn’t a long-term solution. It may be better to accept lowered SLOs in the short term in order to engineer a better long-term solution.
Alerts should require a human. The rest should self-heal.
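One way to read this rule is sketched below, as an assumption rather than anything from the talk: try an automated remediation first and page a human only when none exists or it fails. `handle_alert`, the alert names, and the remediation registry are all hypothetical.

```python
from typing import Callable, Dict


def handle_alert(
    name: str,
    remediations: Dict[str, Callable[[], bool]],
    page: Callable[[str], None],
) -> str:
    """Try to self-heal; page a human only if that is impossible or fails."""
    fix = remediations.get(name)
    if fix is not None:
        try:
            if fix():  # remediation reports success with True
                return "self-healed"
        except Exception:
            pass  # a crashing remediation counts as a failed one
    page(name)  # only now does a human get involved
    return "paged"


# Hypothetical usage: collect pages in a list instead of calling a pager.
pages = []
remediations = {
    "disk_full": lambda: True,       # e.g. log rotation succeeded
    "replica_lagging": lambda: False,  # automated fix did not help
}
results = [
    handle_alert("disk_full", remediations, pages.append),
    handle_alert("replica_lagging", remediations, pages.append),
    handle_alert("unknown_failure", remediations, pages.append),
]
```

Here only the alerts that automation could not resolve end up in `pages`.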
Vital to learn from outages ==> Post Mortem without blame. Consider making a PM template.
Online examples available.
Wheel of Misfortune <== failure role playing
Some services are more equal than others.
If the uptime goal is 99%, the error budget is 1% ==> use it to experiment
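To make the budget concrete, it helps to convert it into minutes. This is a small sketch under two assumptions that are not from the talk: a 30-day measurement window, and availability measured as a fraction of wall-clock time. The function names are hypothetical.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Minutes of budget left; a negative value means the SLO is already missed."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


# A 99% goal over 30 days leaves roughly 432 minutes (about 7.2 hours).
budget = error_budget_minutes(0.99)
# After a hypothetical 120-minute outage, roughly 312 minutes remain for experiments.
left = budget_remaining(0.99, downtime_minutes=120)
```

While `left` is positive there is room to experiment; once it goes negative, reliability work takes priority.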
Draw up the architecture! Make sure everyone understands/shares a common model of the architecture.
Book tip: The Checklist Manifesto
Quantifiable and Measurable
Go through your checklists, does every service fulfill the demands?
Book tip: Building Evolutionary Architectures
Architectural reviews ==>
- Identify failure points
- Failure scenarios
- Chaos engineering
We all need to evolve to succeed!
Book tip: Site Reliability Engineering.