JAX 2018: An Architect’s Guide to Site Reliability Engineering

By Nathaniel Schutta, Pivotal

Book: Thinking Architecturally

JAX direct link

Monolith principles don’t necessarily apply to microservices.

How we work together matters.

Communication is even more important in a complex world.

So what is the history of IT?

Apollo program: Margaret Hamilton, the first SRE.

She wanted to implement error checking to avoid data erasure during takeoff. Management denied this, and things went wrong in 1968.

Hope Is Not A Strategy!

Monolith + sysop ==> microservices + devops

CORBA ==> EJB ==> SOA ==> API first

==> Cambrian explosion of APIs

E.g. Dark Sky API <== weather app

Amazon: the Bezos mandate, as recounted by Steve Yegge: all data must be made available through public service APIs.

Present day

Troubleshooting a multilevel microservice architecture is difficult. Who is responsible?

Domain-Driven Design

Microservice definition: rewritable in under 2 weeks.

Call graph ==> in the limit, a death star

Everything changing makes sysop sad – how to make them happier?

  • Replace manual tasks with automation!
  • Focus on engineering
  • Helpful to know Unix and network stack
  • A CAB (change advisory board) won’t cut it

How to move fast safely?

Ops must be able to support a dynamic environment

Important to prioritize and set aside time for this, or else no automation will get done.

Establish sane SLOs

Manage risk, shit will break.

Risk is a continuum!

How much does catastrophic failure cost? ==> Lost revenue vs cost of redundancy.
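The trade-off can be put into rough numbers. A minimal sketch, assuming lost revenue scales linearly with downtime; the function names and all figures are hypothetical and not from the talk:

```python
HOURS_PER_YEAR = 365 * 24


def expected_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year at a given availability (0..1)."""
    return (1.0 - availability) * HOURS_PER_YEAR


def expected_lost_revenue(availability: float, revenue_per_hour: float) -> float:
    """Assumption: every hour of downtime costs the same amount of revenue."""
    return expected_downtime_hours(availability) * revenue_per_hour


def redundancy_pays_off(current: float, improved: float,
                        revenue_per_hour: float,
                        redundancy_cost_per_year: float) -> bool:
    """True if the revenue saved by better availability exceeds the yearly cost."""
    saved = (expected_lost_revenue(current, revenue_per_hour)
             - expected_lost_revenue(improved, revenue_per_hour))
    return saved > redundancy_cost_per_year


if __name__ == "__main__":
    # Hypothetical numbers: going from 99% to 99.9% for 50,000 per year.
    print(redundancy_pays_off(0.99, 0.999, 1_000, 50_000))
```

A real model would also have to price in reputation damage and peak-hour traffic, which this sketch ignores.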

Firefighting isn’t a long-term solution. It may be better to accept lower SLOs in the short term in order to engineer a better long-term solution.

Archilochus: “We don’t rise to the level of our expectations, we fall to the level of our training” [1]

MEANINGFUL monitoring

Alerts should require a human. The rest should self-heal.
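One way to picture this rule is a small dispatcher. A minimal sketch, assuming a hypothetical registry of automated remediations; the event names and handlers are invented for illustration and do not belong to any real monitoring tool:

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Event:
    service: str
    problem: str


def restart_service(event: Event) -> str:
    # Placeholder: a real handler would call the orchestrator here.
    return f"restarted {event.service}"


def rotate_logs(event: Event) -> str:
    # Placeholder: a real handler would free up disk space here.
    return f"rotated logs on {event.service}"


# Hypothetical registry: problems that are known to be safe to fix automatically.
REMEDIATIONS: Dict[str, Callable[[Event], str]] = {
    "process_crashed": restart_service,
    "disk_full": rotate_logs,
}


def handle(event: Event) -> Tuple[str, str]:
    """Self-heal when a remediation is known; otherwise alert a human."""
    remediation = REMEDIATIONS.get(event.problem)
    if remediation is not None:
        return ("self-healed", remediation(event))
    return ("page-human", f"{event.service}: {event.problem}")
```

Anything that falls through to "page-human" is a candidate for the next automation task.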

Less grunt work!

Vital to learn from outages ==> blameless postmortems. Consider making a postmortem (PM) template.

Identify

  • Action items
  • Timeline
  • Root causes [1] [2]

Online examples available.

Wheel of Misfortune <== failure role playing

Some services are more equal than others.

If the uptime goal is 99%, the error budget is 1% ==> use it to experiment
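The arithmetic behind the error budget, as a minimal sketch. The 30-day window and the function names are assumptions made for illustration:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an SLO given as a fraction (e.g. 0.99)."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Minutes of error budget left after the downtime already incurred."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


def can_experiment(slo: float, downtime_minutes: float,
                   window_days: int = 30) -> bool:
    """Only spend budget on risky changes while some of it is left."""
    return budget_remaining(slo, downtime_minutes, window_days) > 0


if __name__ == "__main__":
    print(error_budget_minutes(0.99))
```

With a 99% goal over 30 days the budget comes to roughly 432 minutes. Once that is spent, the sketch says to stop experimenting until the window rolls over.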

Draw up the architecture! Make sure everyone understands/shares a common model of the architecture.

Book tip: The Checklist Manifesto

Quantifiable and Measurable

Go through your checklists: does every service fulfill the requirements?
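A checklist becomes quantifiable once each item is a yes/no check that can be run against every service. A minimal sketch; the checklist items and the service fields are hypothetical examples, not taken from the talk:

```python
from typing import Callable, Dict, List

Service = Dict[str, object]

# Hypothetical checklist: each entry maps a name to a measurable check.
CHECKLIST: Dict[str, Callable[[Service], bool]] = {
    "has_slo": lambda s: s.get("slo") is not None,
    "has_runbook": lambda s: bool(s.get("runbook_url")),
    "has_owner": lambda s: bool(s.get("owner")),
}


def failed_checks(service: Service) -> List[str]:
    """Names of the checklist items this service does not fulfill."""
    return [name for name, check in CHECKLIST.items() if not check(service)]


def audit(services: List[Service]) -> Dict[str, List[str]]:
    """Map each service name to its failed checks, skipping compliant ones."""
    report = {}
    for service in services:
        failures = failed_checks(service)
        if failures:
            report[str(service["name"])] = failures
    return report
```

Running `audit` over the full service inventory gives a list of gaps to prioritize, rather than a yes/no feeling about readiness.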

Book tip: Building Evolutionary Architectures

Architectural reviews ==>

  • Identify failure points
  • Failure scenarios
  • Chaos engineering

We all need to evolve to succeed!

Book tip: Site Reliability Engineering.

 
