What SOC Teams Can Learn from the Aviation Industry

The cybersecurity industry has spent a lot of time talking about improving the analyst experience without making significant improvements. Much of the effort has been too focused on trying to find a silver bullet solution. Combine that with a global pandemic and things are just getting worse.

A survey of cybersecurity professionals published in the 2021 Devo SOC Performance Report™ found that on a 10-point scale, where 10 indicates SOC staff have a “very painful” experience performing their jobs, 72% of respondents rated the pain of SOC analysts at a 7 or above.

Instead of thinking about a silver bullet to alleviate SOC pain, let’s focus on one of the top causes of the problem — alert fatigue — and how the cybersecurity industry could follow the lead of another field to find a solution.

In the SOC Performance Report, a whopping 61% of respondents said a cause of SOC pain was that there are too many alerts to chase. I think it’s safe to draw the connection that “alert fatigue” will expand to “posture fatigue” and “policy fatigue,” as it adversely affects both recruitment and the all-too-critical retention of experienced SOC professionals.

Alert Fatigue May Exit the Aircraft

So, if we can’t figure it out within the security industry, let’s learn from others. Many non-cyber industries and professions also suffer from alert fatigue, so perhaps the cybersecurity industry can apply what those groups have learned. If we ask those in other industries noted for alert fatigue “how do alarms, warnings, and alerts differ?” I think we’ll find much similarity and overlap in answers — in both the theory and practice of how human operators are supposed to respond and how they do in reality.

For this post, let’s compare the aviation industry to the SOC. Aviation companies and professionals have navigated many of the problems SOC operators face today and have made the most progress in governing and managing the ergonomics of sensory overload and automation. When you think about it, the inside of a cockpit with all its knobs, buttons, lights and alerts is a lot like the combined dashboards SOC analysts have to navigate when triaging, investigating and responding to threats.

As far back as 1988, the Washington Post reported on “glass cockpit” syndrome in the aviation industry, which is quite similar to what many say or think about the SOC today. Researchers from the American Psychological Association noted that pilots would “fall victim to information overload and ignore the many bits of data pouring from myriad technical systems,” and their studies of airline crashes found that “black box recordings showed that the crews talked about ‘how the systems sure were screwed up’ but did not verify what was wrong. In both cases, the systems worked but crews failed to check the information and crashed.”

In another example, research published in 2001 by the Royal Institute of Technology examined “the alarm problem” in aviation, meaning, “in the most critical situations with the highest cognitive load for the pilots, the technology lets you down.” The report noted that “the warning system of the modern cockpits are not always easy to use and understand. The tendency is to overload the display with warnings, cautions and inoperative system information accompanied by various audio warnings.” It identified one of the main problems caused by this overload as “a cognitive problem of understanding and evaluating from the displayed information which is the original fault and which are the consecutive faults.” Sound familiar? You would likely hear something eerily similar from a SOC analyst.

In the decades since, aircraft cockpit design has progressively applied new learnings and automation to dynamically manage alert volume and direct the pilot’s attention to priorities. In the Royal Institute of Technology’s report, researchers identified accident simulation as an effective tool for improving cockpit alert systems and for finding more associable ways to present alerts, such as differentiated sounds and added context, which would allow pilots to “immediately understand what part or function of the aircraft is suffering a malfunction.” More context also would include guidance on what to do next. In its conclusion the study noted:

“Such simulations would hopefully result in less cognitive stress on behalf of the pilots: they would know that they have started to solve the right problem. They would not have to worry that they have entered the checklist at the wrong place. With a less stressful situation even during malfunctions there is greater hope for correct actions being taken, leading to increased flight safety.”

SOC systems need to embrace and apply many of these same learnings that have spanned decades for aviation. The majority of the cybersecurity industry seems to have only gotten as far as color-coding alert and warning significance, leaving the analyst with a hundred flashing red priorities, even after triage. It’s no surprise that SOC analysts are both overwhelmed and unable to respond to complex threats across a broadening attack surface.
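To make the “original fault versus consecutive faults” idea concrete in SOC terms, here is a minimal Python sketch. It assumes a hypothetical dependency map (which asset relies on which) and a list of assets that are currently alerting; neither comes from the report or from any particular product. An alerting asset is treated as a likely consecutive fault if anything upstream of it is also alerting, and as a likely original fault otherwise.

```python
# Hypothetical dependency map: asset -> assets it directly depends on.
DEPENDS_ON = {
    "web-frontend": ["auth-service"],
    "auth-service": ["identity-db"],
    "identity-db": [],
    "mail-gateway": [],
}


def upstream_assets(asset, depends_on):
    """Return every asset that `asset` depends on, directly or transitively."""
    seen = set()
    stack = list(depends_on.get(asset, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(depends_on.get(current, []))
    return seen


def split_faults(alerting_assets, depends_on):
    """Separate alerting assets into likely original and consecutive faults."""
    alerting = set(alerting_assets)
    original, consecutive = [], []
    for asset in sorted(alerting):
        # If something upstream is also alerting, this alert is probably a symptom.
        if upstream_assets(asset, depends_on) & alerting:
            consecutive.append(asset)
        else:
            original.append(asset)
    return original, consecutive


original, consecutive = split_faults(
    ["web-frontend", "auth-service", "identity-db", "mail-gateway"], DEPENDS_ON
)
print("Likely original faults:", original)
print("Likely consecutive faults:", consecutive)
```

In this toy data set, four flashing alerts reduce to two places worth starting: the identity database and the mail gateway. A real environment would need a far richer notion of dependency and timing than this sketch assumes.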

Autopilot Isn’t the Automatic Answer

When it comes to solving the issue of alert fatigue in the SOC, automation is typically one of the first things that comes to mind. The same applied to aviation in 1988, when that Washington Post report quoted researchers saying things that could appear in a security trade publication in 2022:

“Research is badly needed to understand just how much automation to introduce — and when to introduce it — in situations where the ultimate control and responsibility must rest with human operators,” said psychologist Richard Pew, manager of the experimental psychology department at BBN Systems and Technologies Corp. in Cambridge, Mass.

“Everywhere we look we see the increasing use of technology,” Pew said. “In those situations where the operator has to remain in control, I think that we have to be very careful about how much automation we add.

“The growing use of high-tech devices in the cockpit or on ships can have two seemingly contradictory effects. One response is to lull crew members into a false sense of security. They ‘regard the computer’s recommendation as more authoritative than is warranted,’ Pew said. ‘They tend to rely on the system and take a less active role in control.’ Sometimes crews are so mesmerized by technological hardware that they are lulled into what University of Texas psychologist Robert Helmreich calls ‘automation complacency.’”

And while automation can play an important part in incident response and investigation — just as it does in modern aircraft cockpit design — there are some things to consider:

  1. Situational awareness is lost. Automation is often brittle, unable to operate outside of the situations for which it is programmed, and it’s subject to inappropriate performance due to faulty sensors or limited knowledge about a situation.
  2. Automation creates high workload spikes (such as when a routine changes or a problem occurs) and long periods of boredom (in which attention wavers and exceptions may be missed). If you’re staffing for automation-level activities, how do you manage capacity for spikes?

The SOC Earns its Wings

As an industry, we should take a page from the aircraft handbook and avoid increasing cognitive demands, workload and distractions on analysts, and make tasks easier to perform. But it’s crucial that we also understand how to better manage automation failure and exceptions. Here are some ideas:

  • Embrace AI and autocomplete: As with the more advanced autocomplete functions in email and word processing applications, the SOC analyst stays in charge of managing an incident, while automation guides and preemptively enriches the threat investigation, increasing the speed and robustness of the response.
  • Distill and prioritize at the incident level, not the alert level: It’s not about filtering, correlating or aggregating alerts; it’s about contextualizing both events and alerts in the background and articulating only the incident, in plain single-sentence language. Analysts can drill down from there.
  • Leverage a community of experts: As attack surfaces grow and vertical technology specialization becomes tougher for in-house SOC analysts to cover (particularly in times of competing incident prioritization), it becomes increasingly important to be able to “phone a friend” and access an on-demand global pool of expert talent. It’s like having several Boeing engineers sitting in the cockpit with the pilot to troubleshoot a problem with the plane.
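As a rough illustration of the incident-level idea in the list above, the Python sketch below rolls raw alerts up into one plain-language sentence per incident while keeping the underlying alerts attached for drill-down. The `Alert` fields, the 1-to-10 severity scale and the choice to group by affected entity are all assumptions made for the example; grouping by entity is a deliberately simple stand-in for the richer background contextualization described above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Alert:
    entity: str    # host or user the alert concerns (hypothetical field)
    kind: str      # short description of what fired
    severity: int  # assumed scale: 1 (low) to 10 (critical)


def build_incidents(alerts):
    """Roll alerts up into incidents, each summarized in a single sentence."""
    by_entity = defaultdict(list)
    for alert in alerts:
        by_entity[alert.entity].append(alert)

    incidents = []
    for entity, group in by_entity.items():
        top = max(a.severity for a in group)
        kinds = sorted({a.kind for a in group})
        summary = (
            f"{entity}: {len(group)} related alert(s) "
            f"({', '.join(kinds)}), highest severity {top}."
        )
        # Keep the raw alerts so the analyst can drill down from the summary.
        incidents.append(
            {"entity": entity, "severity": top, "summary": summary, "alerts": group}
        )

    # Present the most severe incident first.
    return sorted(incidents, key=lambda i: i["severity"], reverse=True)


alerts = [
    Alert("laptop-042", "suspicious PowerShell", 6),
    Alert("laptop-042", "credential dump attempt", 9),
    Alert("laptop-042", "suspicious PowerShell", 6),
    Alert("build-server", "failed logins", 4),
]
incidents = build_incidents(alerts)
for incident in incidents:
    print(incident["summary"])
```

Four alerts become two sentences, ordered by severity, and nothing is discarded: the analyst sees the incident first and the individual alerts only on request.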