Lessons Learned from Running Infrastructure at Scale: Human Factors

Chris Travers

DeliveryHero SE

Abstract

We all know that the people who run infrastructure at scale are critical to many organizations’ success. While frameworks like Google’s Site Reliability Engineering (SRE) framework have become popular in recent years, there is still a lack of focus on the human factors. This talk attempts to change that.

For the last half decade, I have been running infrastructure at high, even massive, scale. While Google’s SRE framework helps to merge systemic and business needs, what of the human needs? What can we do to help ensure that people are successful and well supported? This is extremely important because high-velocity systems tend to be extremely difficult to reason about at scale, and individuals may have difficulty determining how to react during emergencies. And yet it is the human factor that keeps things running.

In this talk we will cover:
- Limitations of the SRE system at scale
- The need for human factors training of operational staff
- Collaboration at the heart of incident management
- Importance of Crew Resource Management in operations at scale

Along the way we will cover several low-hanging fruit that you can take back to your own operational environments. These include writing standard operating procedures for late-night incident response, standardizing emergency communication, and separating incident command from troubleshooting.

When I was heading the IT Operations department at Adjust, I looked at what we lacked and concluded it was human factors training. We brought in aviation-grade training in this area, and it was a massive help. Many of the lessons the aviation industry learned at the cost of lives can be applied, in adapted form, to our industry. I am here to pass on the problems I have seen, along with the solutions that this training and training from other fields have led me to implement.

The talk was revoked