Euan is part of the Operations & Reliability team at the Financial Times, managing incidents across the globe. Before that, he led a distributed team responsible for Go microservices, Docker containers in Kubernetes, and the backend APIs powering the website.
On the Ops-ier side of DevOps, he has occasionally admitted to being a sysadmin in public.
More and more developers are expected to provide out-of-hours support and respond to production issues. If you don't have much experience handling incidents, it can be scary and intimidating, but it doesn't have to be that way!
On the FT's Content team, we transformed our incident response: we went from a number of terrifying multi-hour outages to a stable platform where team members feel comfortable being on call.
I'll provide practical tips and advice on:
- planning for the next failure
- what to do when Everything Is On Fire™
- improving things afterwards
- and some horror stories of our own...