7 habits of highly effective SOCs
Before I talk about effective SOCs that run like well-oiled machines, let’s get one thing straight.
SOC isn’t a dirty word.
But I totally understand the negative connotation and that’s exactly why I’m writing this post. Alert fatigue is real, repetition leads to exhaustion and those two things in tandem create an environment ripe for analyst burnout. I get it.
Here’s the thing: When built right, a job working in a SOC can be so much fun, not to mention you get the learning and experience you thought you signed up for. Since launching our 24x7 service almost two years ago we’ve experimented a ton, learned a bunch, and through a lot of iteration landed on some habits — seven, in fact — that we believe help us “SOC” the right way at Expel.
If you’re working in or managing a SOC with a ton of turnover — or just want tips on how to shape an effective and more productive team — here are seven habits to adopt right now.
1. Have a clear mission and guiding principles
Get explicit about the mission and your culture. At Expel, the SOC’s mission is to protect our customers and help them improve. The mission is centered around problem solving and being a strategic partner for our customers. Notice that there are zero mentions of looking at as many security blinky lights as possible. That’s intentional.
Take it a step further and create some guiding principles. Guiding principles define what you as a team believe in and how you operate together. Here are some (but not all) of the guiding principles in the Expel SOC:
- Teamwork makes the dream work.
- Service with passion is our competitive advantage.
- We embrace positive change.
Articulating guiding principles is the first step in creating a SOC culture that you can turn into your competitive advantage. Security tech and process are easily replicated but culture is hard to copy.
2. Prioritize learning
Our analysts love to learn new things; it’s even one of the traits we hire for. One thing that we’ve learned in building out our program is that the best way to foster improvement is to combine this love of learning with a collaborative — not adversarial — approach.
The best example of this is how we use attack simulations to help our team learn new techniques. During these, we have to celebrate progress and opportunities to learn — it doesn’t take much to make someone feel foolish and have that metastasize into a reluctance to try a new thing or stretch a new skill.
If you don’t run attack simulations regularly, start building them into your schedule. But don’t overthink it. You can run one right now in eight simple steps:
- Talk to the team and given them background so they don’t feel ambushed.
- Open a PowerShell console.
- Run wmic /node:localhost process call create “cmd.exe /c notepad” from your PowerShell console to simulate remote process creation using WMI.
- Run winrs:localhost “cmd.exe /c calc” from your PowerShell console to simulate remote process creation using WinRm.
- Finally run schtasks /create /tn legit /sc daily /tr c:\users\ <user>\appdata\legit.exe to simulate the creation of a malicious Windows scheduled task.
- Interrogate your SIEM and EDR.
- Talk about it as a team.
- Find ways to improve.
3. Empower the team
Analysts want to spend time finding new things, pursuing quality leads and working with people to solve complex problems — not chasing the same false positive over and over again. Trust the team to filter out the noise and then enable them to do so.
How did we build this capability at Expel?
We took the DevOps processes used by our engineering teams and adapted them to detection deployment. Here’s a high-level overview of what this looks like:
- We manage our detection rules using GitHub.
- We have unit tests for every detection (just like you would expect of code).
- We use CircleCi to build our detection packages.
- During the CircleCi build process, we apply linting and perform additional error checking.
- If a CircleCi build fails we’ll automatically fail the PR so an analyst knows some additional tweaks are required.
- We create error codes that are easy to understand.
- We use Ansible to deploy new detection packages.
- Now an analyst can deploy a new detection package at any time as long as the content passes automated tests and has been peer-reviewed.
Here’s how this plays out in practice.
@subtee just tweeted about a new remote process execution technique …
- An analyst creates the rule in GitHub and submits a new PR.
- That PR is picked up by CircleCi, linted and checked for errors.
- Assuming all goes well, the PR is marked as “all checks passed.”
- The analyst requests peer review.
- The detection package is deployed using Ansible.
- Everyone’s happy.
Empower the team to tackle false positives and write rules to find new things. Give them control of the end-to-end system and back them up with good error checking. In doing so, your team members will feel more connected to their work and the mission.
SOC work can be repetitive. Automation FTW!
But what should you automate?
Decision support is a great place to start. What’s decision support? In our context, decision support is all of the automation, contextual enrichment and user interface attributes that make our analysts more effective in answering the following question: “Is this a thing?”
How does this play out at Expel?
As part of our integration with Office 365 we collect signal and generate alerts when accounts are compromised or user activity doesn’t seem quite right. Investigating patterns of user authentication behavior can be a tedious task when done manually … but the good news is that it’s a series of repeated steps that can be automated.
Take a look at this example where, with the help of some automation, we’re able to quickly review 30 days of login activity based on IP address and user-agent combinations:
Automate the repetitive tasks so the team can focus their efforts on making important decisions versus clicking buttons.
5. Use a capacity model
Understand your available capacity (AKA analyst hours) and utilization. Are you consistently exceeding your available capacity? Is there always way more work to do than your people can handle?
If so, cue the burnout.
If capacity modeling is new to you, that’s okay. There are plenty of resources available to help get you started.
Bottom line: Know your capacity utilization. If you discover that your team is oversubscribed, you’ll need to act fast.
6. Perform time series analysis
I agree with Yanek’s philosophy here. Effective managers are able to look out into the future and, with reasonable certainty, predict what needs to change today.
I think effective management is centered around asking the right questions and using data to answer them.
You already know that alert fatigue leads to burnout. As a manager, I ask a ton of questions about the alert management process:
- How many alerts did we send to the team last month?
- How many alerts will we send to the team next month?
- What day of the week is the busiest?
- Do we get more alerts during the day or at night?
- How many alerts will we send to the team next year?
All of these questions are centered around time. Time series analysis allows you to analyze data in order to learn what happened in the past, and to inform you on what things will likely look like in the future. By performing time series analysis you can forecast how things will change and react before it’s too late. We perform time series analysis on the historical volume of alerts sent to our team for triage. From this data, we pull out different components including trend, seasonality, and the noise AKA “the residual,” so that we can use patterns from historical behavior to help us predict future behavior. This allows us to not only more deeply analyze what’s already happened, but it’s also a way to look into the future so you can start to react now before it’s too late.
7. Measure quality
I love this tweet.
Quality control doesn’t get in the way. It pushes you forward.
At Expel, we use a quality control (QC) standard, Acceptable Quality Limits (AQL), to tell us how many alerts and incidents we should review each day. We then randomly select a number (based on AQL) of alerts, investigations and incidents and review them using a check sheet.
QC allows us to spot problems, understand them and then fix them. And fast.
I’ll be candid. At one point I thought about rebranding our SOC as a Computer Incident Response Team (CIRT) to distance ourselves from all the general negativity associated with a SOC.
But a SOC can be a great place to work if you solve problems the right way and empower your teams.
As an industry, let’s “SOC” the right way and reshape everyone’s thinking about SOCs.