Screaming in the Cloud

Episode 34: Slack and the Safety Dance of Chaos Engineering

In the early days, angry nerd corners of the Internet dismissed Slack and some of its predecessors as, “Oh, it’s just IRC. Now, you pay someone for it.” Many fell into the trap of wondering what value such systems offered. The big differentiator? Slack is built as a collaborative business tool.

Today, we’re talking to Holly Allen, who helped make government software better while serving as the director of engineering at 18F. Now, she’s a senior engineering manager at Slack, a collaborative chat program where you can do most of your work through a rich platform of integrations. Holly enjoys taking a weird set of skills that make a computer do things and convincing people who know how to make computers do things to do things.

Some of the highlights of the show include:

  • Safety engineering brings chaos and resilience engineering, incident management, and post-mortem processes together for resiliency and reliability
  • Slack strives to move really fast while being in complete control
  • Slack is primarily on AWS, but is working on a multi-Cloud strategy because if AWS is down, Slack still needs to work
  • Slack has a close relationship with AWS and is a collaborative company; it has immediate access to AWS staff anytime there’s a problem
  • Slack uses Terraform and Chef and is working to determine whether running its production workflows in Kubernetes would be worthwhile
  • Disasterpiece Theater: Take a real scenario that might happen and surmise what will happen; the goal is not to cause production issues, but to teach Slack employees
  • Slack hires collaborative, empathetic people to create a collaborative environment where everyone works together toward a goal
  • Slack was firmly in a centralized operations model, but is shifting responsibility toward development teams to increase service ownership
  • Slack doesn’t encourage remote work because it’s not in a position to put in that investment; day-to-day work happens in hallways and between desks
  • Slack sees itself as an enterprise software company; an enterprise software company must have enterprise software reliability, stability, and processes
  • Slack has thousands of servers, so events and disruptions happen more often; the system needs to respond, react, and repair itself without human intervention


Brought to you by Corey Quinn of Screaming in the Cloud