Reliability in a world of endless surprise (EN)
Our systems are too big and too complex, and they change too quickly, for us to understand them completely. Emergent phenomena elude prediction, even with perfect knowledge. Determinism is an appealing but fundamentally flawed way of thinking about complex systems, and the natural sciences are progressively rejecting it as well. Instead of chasing an ever-more-perfect mental model, we are better served by embracing ambiguity and surprise. This means investing in isolation, ad-hoc debugging, and recovery rather than in error prevention or curated dashboards. It also means shifting our language from “root causes” to a more nuanced view of contributory factors.
In this talk, I will tell the story of how thinking about knowledge and cause and effect has evolved in the philosophy of science over time, and how our field’s developing notions of reliability mirror this evolution but lag quite a bit behind. By learning from other fields, we can advance our profession without having to rediscover the same truths from scratch. Finally, I will show that shifting to rapid recovery for reliability purposes also happens to be the most economical approach to balancing innovation with reliability.
Aaron has been building, breaking, and fixing systems for over a decade, from tiny startups to Netflix, which serves over 139M subscribers. He is presently applying his passion for empiricism and system design to multi-region high-availability architecture and operations while managing the Demand Engineering team at Netflix. Previously, Aaron co-authored Chaos Engineering (O’Reilly, 2017).
Christo Erasmus & Ameet Sarvaiya
Ian Buchanan & Antonia Verdi
David Taberno Sánchez & Manuel Grädel
Damla Simsek & Dea Noe Leimbacher
Bhushan Bagi & Ameur Djaffri
Prof. Dr. Lutz Jäncke
Torben Hoeft & Martin Fisch