Fixing Bro/Zeek’s Long Connection Detection Problem
Most threat hunters will search their network for long connections that may be an indication of malicious command and control (C2) activity. But what happens when our tools give us incorrect data? In this blog post, I’m going to discuss how Bro/Zeek can be bypassed to miss long connections, and what you can do to fix the problem.
Following Along at Home
I’ve made a pcap available that helps highlight the problem. You can download a copy of it here: https://random-class.s3.amazonaws.com/longconn.pcap
This will let you follow along with this blog so that you can see the problem firsthand. Let’s start by simply viewing the pcap with tshark. I’m going to use tshark’s “fields” output format (-T fields) to help us focus on the important bits. Run the following as a single command:
tshark -r longconn.pcap -T fields -e frame.time -e ip.src -e tcp.srcport -e ip.dst -e tcp.dstport | less
Some of this output can be seen in Figure 1. Note that this session has an interesting communication characteristic: the session stays open, but the external IP pauses for about 20 minutes between communication exchanges. So this is one long communication session that simply goes quiet for periods of time.
Figure 1: Note that this single session includes long periods of inactivity.
We can use capinfos to verify the length of this session. This is shown in Figure 2. Note that 68,236 seconds is just a bit shy of 19 hours.
Figure 2: Capinfos shows the session in the pcap file is about 19 hours long.
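As a quick sanity check on that figure, the conversion from seconds to hours can be sketched in a few lines of Python. The helper name format_duration is only illustrative; it is not part of capinfos or Bro/Zeek.

```python
from datetime import timedelta


def format_duration(seconds: int) -> str:
    """Render a duration given in whole seconds as H:MM:SS."""
    return str(timedelta(seconds=seconds))


capture_seconds = 68_236  # capture duration reported by capinfos in Figure 2

print(format_duration(capture_seconds))   # 18:57:16
print(round(capture_seconds / 3600, 2))   # 18.95 (hours)
```

So the capture falls just under three minutes short of a full 19 hours.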
Parsing the Data With Bro/Zeek
Now let’s import the pcap into Bro or Zeek to see what it makes of the data. This can be done using the command:
bro -C -r longconn.pcap
zeek -C -r longconn.pcap
I’m using Bro in these examples, so I’ll be using “bro” in my commands. If you are running Zeek, just replace “bro” with “zeek” in your commands. Your results will be the same.
Let’s leverage bro-cut to analyze the contents of the conn.log file. Run the following on a single line:
cat conn.log | bro-cut ts id.orig_h id.orig_p id.resp_h id.resp_p conn_state duration | head
This will produce results similar to Figure 3. Note that instead of a single long session, Bro/Zeek is reporting 42 shorter sessions. This is obviously incorrect. Note the “conn_state” values. “S1” indicates that in the first session, Bro/Zeek saw the TCP three-way handshake at the start of the session, but never saw the session close (FIN/ACK exchange or RST). The “OTH” connection state in the remaining entries indicates that Bro/Zeek saw neither the TCP three-way handshake at the start of the session nor the FIN/ACK exchange to close it. This makes perfect sense, as this is actually just one long session. Bro/Zeek is incorrectly reporting this single session as multiple sessions.
Figure 3: Bro incorrectly breaks up our single long session into multiple smaller sessions.
Bro/Zeek Default Timeout Issues
The big question is, why is Bro/Zeek not reporting this session correctly? The answer is that Bro/Zeek’s TCP inactivity timeout is set too low by default. Bro/Zeek keeps a state table of all active connections. When a connection ends (such as with a FIN/ACK exchange or one side sending a RST), the state entry is written to the logs and then purged. So what happens if both ends of the connection lose power or Internet connectivity at exactly the same time? In this case the systems communicating would not transmit an indication that the session has ended, so Bro/Zeek would maintain state until the system is rebooted or Bro/Zeek is restarted. This is obviously not ideal, so Bro/Zeek also tracks how long it has been since activity was last seen on each connection. When that idle time reaches the inactivity timeout, Bro/Zeek assumes the connection is over, logs it, and purges the entry.
The problem is that Bro/Zeek’s default inactivity timeout for TCP sessions is 5 minutes. This is far too short an interval, as systems communicating via TCP can go quiet for much longer than that. Our sample above is a great example. Further, the problem the timeout is trying to solve, both ends of the connection falling offline at the same time, is really an edge case with modern communications. Most servers and network connections are relatively stable. So we are receiving inaccurate data out of concern for an extreme edge case.
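To make the effect of the timeout concrete, here is a minimal Python sketch of an inactivity-based state table. It is a simplified model, not Bro/Zeek’s actual implementation, and the packet timestamps are synthetic (one-second bursts every 20 minutes) rather than taken from the pcap. The names ConnRecord, split_by_inactivity, and synthetic_session are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ConnRecord:
    start: int      # timestamp of the first packet in the record (seconds)
    duration: int   # last packet time minus first packet time (seconds)


def split_by_inactivity(packet_times, timeout):
    """Return the records a monitor would log for ONE real session if it
    drops connection state after `timeout` seconds without traffic."""
    records = []
    start = last = None
    for t in sorted(packet_times):
        if start is None:
            start = last = t
        elif t - last >= timeout:
            # State expired during the quiet period: log what we had and
            # treat this packet as the start of a "new" connection.
            records.append(ConnRecord(start, last - start))
            start = last = t
        else:
            last = t
    if start is not None:
        records.append(ConnRecord(start, last - start))
    return records


def synthetic_session(bursts=5, gap=20 * 60, burst_len=1):
    """Synthetic session: a short burst of traffic every `gap` seconds."""
    times = []
    for i in range(bursts):
        times += [i * gap, i * gap + burst_len]
    return times


packets = synthetic_session()
default_records = split_by_inactivity(packets, timeout=5 * 60)
extended_records = split_by_inactivity(packets, timeout=60 * 60)

print(len(default_records), sum(r.duration for r in default_records))  # 5 5
print(len(extended_records), extended_records[0].duration)             # 1 4801
```

With the 5 minute timeout, the model logs five one-second records for what is really a single 80 minute session. With a 60 minute timeout, it logs one record covering the whole session.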
Properly Detecting Long Connections With Bro/Zeek
The simplest way to resolve this problem is to extend Bro/Zeek’s TCP inactivity timeout. This can be done in one of two ways: on the command line, or within the local.bro or local.zeek config file.
Let’s walk through an example. First, delete the log files from the previous run, which incorrectly reported the session timing.
Next, we are going to override the default timeout setting when we launch Bro/Zeek with the following command:
bro -C -r longconn.pcap "tcp_inactivity_timeout = 60 min;"
If we rerun the bro-cut command shown above, we’ll see that the results are drastically different. This is shown in Figure 4. Note that the connection duration now exactly matches the duration reported by capinfos earlier in this blog.
Figure 4: Changing Bro/Zeek’s TCP timeout to one hour causes the long connection to be reported accurately.
The second way to fix this problem is to modify the local.bro or local.zeek file and change the default timing within this file. To do so, we would simply add
redef tcp_inactivity_timeout = 60 min;
to the end of the file, and then load that file when we launch Bro/Zeek:
bro -C -r longconn.pcap local
This may be useful if there are other parameters we wish to change as well.
How Bad Is This Problem?
You may be thinking, “Hey, Bro/Zeek reported 42 different sessions. If I add up all of those session times, does that come out to the same 19 hour duration?” Unfortunately, not even close. Let’s run Bro/Zeek using the default timeout value and add up the duration of those 42 sessions. I’ll be using datamash to simplify this process.
First, delete the log files we created using the modified timeout value.
And generate the log files again using the default timeout value:
bro -C -r longconn.pcap
The contents of conn.log will again appear as they did in Figure 3. Now, run the following command on a single line to add up the duration of those 42 sessions:
cat conn.log | bro-cut id.orig_h id.resp_h duration | sort | datamash -g 1,2 sum 3
Your results should be the same as shown in Figure 5. Note that Bro/Zeek is only reporting four seconds of communication time, even though we know the session lasted just under 19 hours.
Figure 5: With the default TCP timeout, Bro/Zeek incorrectly reports the 19 hour session as only lasting 4 seconds.
If I’m checking the network for long communication sessions, 19 hours is a duration I would want to pay attention to. Four seconds would be down in the noise and completely ignored. This may mean I miss analyzing critical activity. What’s interesting about this particular example is that it could potentially be identified as a beacon. This may mean we have a slim chance of detecting the activity, albeit for the wrong reason.
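The gap between the summed durations and the real elapsed time can be illustrated with a short Python sketch. It groups conn.log entries by host pair, as the datamash command does, and reports both the summed duration and the wall-clock span from the first entry’s start to the latest entry’s end. The log excerpt below is hypothetical (made-up addresses, timestamps, and durations, with only four columns), and the summarize function is an illustrative helper, not part of Bro/Zeek. It assumes the tab-separated ASCII log format with a #fields header line.

```python
import io
from collections import defaultdict

# Hypothetical conn.log excerpt; values are made up for illustration.
SAMPLE_LOG = (
    "#fields\tts\tid.orig_h\tid.resp_h\tduration\n"
    "1000.0\t10.0.0.5\t203.0.113.7\t0.5\n"
    "2200.0\t10.0.0.5\t203.0.113.7\t0.25\n"
    "3400.0\t10.0.0.5\t203.0.113.7\t-\n"
    "4600.0\t10.0.0.5\t203.0.113.7\t0.25\n"
)


def summarize(log_file):
    """Map (orig_h, resp_h) -> (summed duration, wall-clock span) in seconds."""
    fields = None
    stats = defaultdict(lambda: {"summed": 0.0, "first": None, "last": None})
    for line in log_file:
        line = line.rstrip("\n")
        if line.startswith("#fields"):
            fields = line.split("\t")[1:]   # column names follow the tag
            continue
        if not line or line.startswith("#"):
            continue                        # skip other header/footer lines
        row = dict(zip(fields, line.split("\t")))
        ts = float(row["ts"])
        # An unset duration is written as "-"; count it as zero.
        dur = 0.0 if row["duration"] == "-" else float(row["duration"])
        s = stats[(row["id.orig_h"], row["id.resp_h"])]
        s["summed"] += dur
        s["first"] = ts if s["first"] is None else min(s["first"], ts)
        s["last"] = ts + dur if s["last"] is None else max(s["last"], ts + dur)
    return {pair: (s["summed"], s["last"] - s["first"]) for pair, s in stats.items()}


for pair, (summed, span) in summarize(io.StringIO(SAMPLE_LOG)).items():
    print(pair, summed, span)   # ('10.0.0.5', '203.0.113.7') 1.0 3600.25
```

A large difference between the two numbers for the same host pair hints that one long session may have been fragmented by the timeout. Keep in mind that the span would also be large for a pair that simply made several genuinely separate connections, so this is a prompt for a closer look rather than proof.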
Is This Just a Bro/Zeek Problem?
While I’ve used Bro/Zeek in this example, the reality is that any network monitoring tool that uses state table timeout values can be vulnerable to this exact same issue. If you are using a different tool, you can leverage the process identified here to test the accuracy of the tool. While in this example I changed the state table timeout value to one hour, an argument could be made that four hours, or even eight hours, would be preferable, as it would better cover connections that are even burstier than the one in this example. This becomes a judgment call on the part of your security team. You can, of course, monitor system performance to ensure that your changes have had no negative impact.
A special thank you to Keith Chew, Logan Lembke, and William Stearns who were invaluable in running down proper solutions to this problem.
Chris has been a leader in the IT and security industry for over 20 years. He’s a published author of multiple security books and the primary author of the Cloud Security Alliance’s online training material. As a Fellow Instructor, Chris developed and delivered multiple courses for the SANS Institute. As an alumnus of Y-Combinator, Chris has assisted multiple startups, helping them to improve their product security through continuous development and identifying their product market fit.