Comments (6)
I think there are several buckets here:
GameDay firedrills: A one-time, manual test in which you simulate a failure and attempt to resolve it, and then process the results of that into potential system improvements. Best resource I've found so far on this is: Learning to Embrace Failure (Limoncelli et al.). Would love to find others though.
Fault injection: An ongoing, programmatic test regime in which software injects faults into the system at runtime. The most obvious resource here is the post Chaos Monkey Released Into The Wild (Bennett and Tseitlin), but again would be great to see more.
Before/after failure testing: An important pattern when fixing system problems is: 1) observe the problem (say in production), 2) reproduce the problem (say in a dev environment) in the form of a "failing" simulation, 3) develop the fix, and 4) test the fix against the same simulation that previously failed, to ensure it's actually been fixed. This full flow is often reduced to just 1+3, which is problematic. Would be great to have a resource specifically speaking to this discipline. I don't know of one though.
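To make the four steps concrete, here's a minimal sketch of the before/after flow in Python. Everything in it is hypothetical and only for illustration: `FlakyBackend` stands in for whatever dependency misbehaved in production, `fetch_original` for the code as it was observed failing, and `fetch_fixed` for one possible fix (a bounded retry). The point is only that the same simulation runs against both versions.

```python
class FlakyBackend:
    """Hypothetical dependency that times out on its first `failures` calls."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def fetch(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise TimeoutError("simulated backend timeout")
        return "payload"


def fetch_original(backend):
    """Step 1: the code as observed failing in production (no retry)."""
    return backend.fetch()


def fetch_fixed(backend, attempts=3):
    """Step 3: a candidate fix that retries a bounded number of times."""
    for attempt in range(attempts):
        try:
            return backend.fetch()
        except TimeoutError:
            # Re-raise once the retry budget is spent.
            if attempt == attempts - 1:
                raise


def simulation_passes(fetch_fn):
    """Step 2: the reproduction. Two timeouts, then a healthy response."""
    backend = FlakyBackend(failures=2)
    try:
        return fetch_fn(backend) == "payload"
    except TimeoutError:
        return False


before = simulation_passes(fetch_original)  # should fail, proving the repro works
after = simulation_passes(fetch_fixed)      # step 4: same simulation, fixed code
print(f"before fix: {before}, after fix: {after}")
```

If `before` came back passing, the simulation would not actually be reproducing the problem, so a passing `after` would prove nothing. That is the part that gets lost when the flow is reduced to 1+3.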
from services-engineering.
A paper on production fault injection from Berkeley:
Failure as a Service (FaaS): A Cloud Service for Large-Scale, Online Failure Drills (Gunawi et al.)
GameDay firedrills:
I think we'll find a lot more literature for GameDay firedrills by searching for Disaster Recovery. Indeed, a quick search on ACM yields Weathering the Unexpected (Limoncelli again), a paper about how Google performs routine disaster recovery exercises. DR has a ton of academic research around it as well, along with its close cousin business continuity.
Fault injection: What about Allspaw? This is one of his oft-mentioned topics as well. Here's a link to his paper Fault Injection in Production: Making the case for resilience testing. As the title suggests, this paper is about justifying doing such a thing and doesn't really include how it's done in practice.
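Since that paper doesn't get into mechanics, here's a rough sketch of the basic idea in Python. This is not how Chaos Monkey or any particular tool works; it only shows the pattern of wrapping a call so that it fails at a configured rate, and then checking that the caller degrades gracefully. All of the names (`with_fault_injection`, `InjectedFault`, `lookup_user`, `lookup_with_fallback`) are made up for the example.

```python
import random


class InjectedFault(Exception):
    """Raised by the injector in place of calling the real function."""


def with_fault_injection(func, failure_rate, rng):
    """Wrap `func` so that roughly `failure_rate` of calls raise InjectedFault."""

    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected fault in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def lookup_user(user_id):
    """Hypothetical stand-in for a call to a real dependency."""
    return {"id": user_id}


def lookup_with_fallback(lookup, user_id):
    """The code under test: it should degrade instead of crashing."""
    try:
        return lookup(user_id)
    except InjectedFault:
        return {"id": user_id, "degraded": True}


# A seeded generator keeps the run repeatable.
rng = random.Random(42)
faulty_lookup = with_fault_injection(lookup_user, failure_rate=0.3, rng=rng)

results = [lookup_with_fallback(faulty_lookup, i) for i in range(1000)]
degraded = sum(1 for r in results if r.get("degraded"))
print(f"{degraded} of {len(results)} calls hit an injected fault and fell back")
```

A real regime would inject at the infrastructure level (killing instances, dropping packets) rather than inside the process, but the test question is the same: does every call still come back with a usable answer.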
Wow, great suggestions @chooper, I'll definitely check these out!
Added a chunk of links, including the two you suggested, @chooper.
I think this provides pretty good coverage, except for the "before/after failure testing" case mentioned above.
Will keep this open as we continue to noodle on it / search.
Nice, I'll see what I can dig up for the last category.