pcap-parser is an offline processing and analysis tool with
deep packet inspection capabilities that collects statistics
and other processing results from a set of network capture files.

In this post I'll talk a bit about the implementation details
and challenges -- both the ones met and solved and the ones still
open -- encountered while implementing the first versions of this tool.



Processing Throughput

The goal was a tool that can process giga- or even terabytes of captured network data spread over a series of capture files, all captured without capture filters. Processing throughput therefore had a high priority: 1 TB of data takes a long time to process at 10 MB/s. Currently, the tool processes data at 100..200 MB/s with the input capture files stored on a mid-range SSD. The throughput rate depends on I/O throughput, single-thread processor speed and -- to a small extent -- memory speed.

  • input buffering was very important (of course!)
  • no separate thread for input buffering yet: there's a dependency chain, and the bottleneck is currently in the processing (CPU-bound), even though the original goal was to be I/O-bound; the processing can't easily be optimized by vectorization or similar approaches because network protocols and their formats are not well suited to vectorized processing with SSE(2) or AVX(2) (special x86/amd64 instruction set extensions for vectorized processing)
  • processing packets in multiple threads in a round-robin approach has challenges with respect to updates of statistical counters and data structures; simple locking might not be the right choice because of lock contention, but this has not been experimented with yet
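To make the per-record I/O side concrete, here is a minimal sketch of parsing a classic pcap per-record header out of an in-memory read buffer; the struct and function names are mine, not the tool's actual code:

```cpp
#include <cstdint>
#include <cstddef>
#include <optional>
#include <vector>

// Classic pcap per-record header: 16 bytes in file byte order.
struct PcapRecordHeader {
    uint32_t ts_sec;    // timestamp, seconds
    uint32_t ts_usec;   // timestamp, microseconds
    uint32_t incl_len;  // bytes of packet data stored in the file
    uint32_t orig_len;  // original length of the packet on the wire
};

// Parse one record header from the read buffer at the given offset,
// assuming the common little-endian file layout (magic 0xa1b2c3d4).
// Returns std::nullopt when the buffer does not hold a complete header.
std::optional<PcapRecordHeader> parseRecordHeader(const std::vector<uint8_t>& buf,
                                                  size_t offset)
{
    if (buf.size() < offset + 16)
        return std::nullopt;
    auto le32 = [&](size_t o) {
        return uint32_t(buf[o]) | uint32_t(buf[o + 1]) << 8 |
               uint32_t(buf[o + 2]) << 16 | uint32_t(buf[o + 3]) << 24;
    };
    PcapRecordHeader h;
    h.ts_sec   = le32(offset);
    h.ts_usec  = le32(offset + 4);
    h.incl_len = le32(offset + 8);
    h.orig_len = le32(offset + 12);
    return h;
}
```

A real reader would refill the buffer from the file when a header or its packet data crosses the buffer end, which is exactly where the buffering strategy matters.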


Use of Common Data Structures

The core program makes heavy use of std::unordered_map<> with custom hashing functions on IPv4 and IPv6 addresses, MAC addresses and combinations of these. std::unordered_map<> requires a hashing function which returns the hash value as a size_t, an unsigned 64-bit value on the target platform x64/amd64. A single IPv6 address contains 128 bits, so the hashing function must somehow halve the number of bits while still retaining the full dynamic range of the output type.

This is even worse for combinations of (MAC addr, IPv4 addr) and (MAC addr, IPv6 addr), so currently only parts of the addresses are combined in the hashing function.
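As an illustration, here is a sketch of what a byte-wise hash over a full combined (MAC addr, IPv6 addr) key could look like -- FNV-1a plus a boost-style hash_combine step. All names are hypothetical and the tool's actual hash functions may differ:

```cpp
#include <array>
#include <cstdint>
#include <cstddef>

// FNV-1a: a cheap byte-wise hash that folds any number of key bytes
// into a 64-bit value with full output range.
inline uint64_t fnv1a64(const uint8_t* data, size_t len)
{
    uint64_t h = 14695981039346656037ULL;  // FNV offset basis
    for (size_t i = 0; i < len; ++i) {
        h ^= data[i];
        h *= 1099511628211ULL;             // FNV prime
    }
    return h;
}

// Combined (MAC address, IPv6 address) key; this variant hashes all
// 6 + 16 key bytes instead of only parts of the addresses.
struct MacIpv6Key {
    std::array<uint8_t, 6>  mac;
    std::array<uint8_t, 16> ipv6;

    bool operator==(const MacIpv6Key& o) const
    {
        return mac == o.mac && ipv6 == o.ipv6;
    }
};

struct MacIpv6KeyHash {
    size_t operator()(const MacIpv6Key& k) const
    {
        uint64_t h = fnv1a64(k.mac.data(), k.mac.size());
        // Mix the second hash into the first (boost-style hash_combine).
        h ^= fnv1a64(k.ipv6.data(), k.ipv6.size()) + 0x9e3779b97f4a7c15ULL
             + (h << 6) + (h >> 2);
        return size_t(h);
    }
};
```

This would be used as `std::unordered_map<MacIpv6Key, Stats, MacIpv6KeyHash>`; whether the extra per-byte work is worth it over hashing only parts of the addresses is a throughput trade-off.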

std::unordered_map<> is also used for higher-layer processing, e.g. HTTP request targets, with strings as the key type and currently std::hash<> for hashing the strings.


Port Mappings

Fixed TCP/UDP port mappings to protocol-specific processing modules do not catch cases in which non-standard ports are used. Allowing the user to configure a set of additional port mappings mitigates this problem a little, but it is no ultimate solution because the user would need predetermined knowledge and/or some guessing.
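A minimal sketch of such a configurable mapping, with user-supplied entries layered over fixed defaults (the names and the protocol set are illustrative, not the tool's actual code):

```cpp
#include <cstdint>
#include <unordered_map>

enum class Proto { Unknown, Http, Dns, Tls };

class PortMap {
public:
    PortMap()
    {
        // Fixed standard mappings.
        map_ = { {80, Proto::Http}, {53, Proto::Dns}, {443, Proto::Tls} };
    }

    // User-supplied additional mapping, e.g. HTTP served on 8080.
    void addUserMapping(uint16_t port, Proto p) { map_[port] = p; }

    Proto lookup(uint16_t port) const
    {
        auto it = map_.find(port);
        return it != map_.end() ? it->second : Proto::Unknown;
    }

private:
    std::unordered_map<uint16_t, Proto> map_;
};
```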

There's probably no ultimate solution short of an exhaustive search in a very large search space, which directly competes with the goal of high processing throughput. Most protocols do not directly contain identification information; the text-based ones at least allow searching for some key text strings, but the binary ones are really problematic.

One partial solution would be some kind of statistical model which, applied to a packet payload, would yield a probability distribution over a set of candidate protocols.
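As an illustration of one crude feature such a model could use, here is a sketch computing the fraction of printable ASCII bytes in a payload -- text-based protocols like HTTP score near 1.0, most binary or encrypted payloads lower. This is purely illustrative; a real model would combine many such features:

```cpp
#include <cstdint>
#include <cstddef>

// Fraction of printable ASCII bytes (plus CR/LF/TAB) in a payload.
// Returns a value in [0.0, 1.0]; 0.0 for an empty payload.
double printableRatio(const uint8_t* payload, size_t len)
{
    if (len == 0)
        return 0.0;
    size_t printable = 0;
    for (size_t i = 0; i < len; ++i) {
        uint8_t b = payload[i];
        if ((b >= 0x20 && b < 0x7f) || b == '\r' || b == '\n' || b == '\t')
            ++printable;
    }
    return double(printable) / double(len);
}
```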

Note that this is only relevant for the higher-layer protocols because the lower-layer protocols all have identification fields with a set of standardized values for the next higher protocol.


TCP stream reassembly is harder than it might seem

Captured traffic contains packets in both directions, so programs doing reassembly have to differentiate the directions and reconstruct the original semantics, as the TCP receive windows of the two directions are independent. This led to the need to implement a far more complex state machine than the standard TCP one.
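A sketch of the direction handling, assuming IPv4 for brevity: the 4-tuple is canonicalized so both directions of a connection map to the same key, and the packet's direction falls out of the ordering (all names hypothetical):

```cpp
#include <cstdint>
#include <tuple>
#include <utility>

struct Endpoint {
    uint32_t ip;    // IPv4 address for brevity
    uint16_t port;
};

// Connection key with the two endpoints in a canonical order,
// so both directions of the same connection produce the same key.
struct FlowKey {
    Endpoint a, b;
    bool operator==(const FlowKey& o) const
    {
        return a.ip == o.a.ip && a.port == o.a.port &&
               b.ip == o.b.ip && b.port == o.b.port;
    }
};

enum class Direction { AtoB, BtoA };

// Build the canonical key for a packet and report which direction it
// travels in relative to that key.
std::pair<FlowKey, Direction> canonicalize(Endpoint src, Endpoint dst)
{
    bool srcFirst = std::tie(src.ip, src.port) <= std::tie(dst.ip, dst.port);
    if (srcFirst)
        return { FlowKey{src, dst}, Direction::AtoB };
    return { FlowKey{dst, src}, Direction::BtoA };
}
```

With this, both directions share one connection context while the reassembly state (sequence numbers, windows, buffers) is kept separately per direction inside it.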

The extraction of the TCP three-way handshake is relatively straightforward to implement.
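For illustration, a minimal handshake-tracking state machine over the SYN/ACK flags of successive segments -- a sketch only; the real state machine in the tool is necessarily richer:

```cpp
#include <cstdint>

enum TcpFlags : uint8_t { FIN = 0x01, SYN = 0x02, RST = 0x04, ACK = 0x10 };

enum class HsState { Idle, SynSeen, SynAckSeen, Established };

// Advance the handshake state for one segment; `fromInitiator` tells
// whether the segment was sent by the side that opened the connection.
HsState advance(HsState s, uint8_t flags, bool fromInitiator)
{
    switch (s) {
    case HsState::Idle:       // expect SYN from the initiator
        return (fromInitiator && (flags & SYN) && !(flags & ACK))
                   ? HsState::SynSeen : s;
    case HsState::SynSeen:    // expect SYN+ACK from the responder
        return (!fromInitiator && (flags & SYN) && (flags & ACK))
                   ? HsState::SynAckSeen : s;
    case HsState::SynAckSeen: // expect the final ACK from the initiator
        return (fromInitiator && (flags & ACK) && !(flags & SYN))
                   ? HsState::Established : s;
    default:
        return s;
    }
}
```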

There's always the danger that packets which actually have been sent over the network got dropped by the capturing process. This is an ongoing challenge and isn't limited to TCP-based traffic.

Parts or all of the TCP connection shutdown sequence may have been dropped by the capturing process, so I had to implement an expiration mechanism for TCP connection contexts.
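A sketch of such an expiration mechanism, assuming capture timestamps in microseconds and a periodic sweep over the connection table (all names hypothetical):

```cpp
#include <cstdint>
#include <cstddef>
#include <unordered_map>

// Per-connection context; stores the capture timestamp of the last
// segment seen for this connection.
struct ConnContext {
    uint64_t lastSeenUs;
    // ... reassembly buffers, sequence numbers, handshake state, etc.
};

using ConnId = uint64_t;  // stand-in for a real flow key

// Drop contexts that have been idle longer than `timeoutUs`, covering
// connections whose FIN/RST segments were never captured.
// Returns the number of expired contexts.
size_t expireIdle(std::unordered_map<ConnId, ConnContext>& conns,
                  uint64_t nowUs, uint64_t timeoutUs)
{
    size_t removed = 0;
    for (auto it = conns.begin(); it != conns.end(); ) {
        if (nowUs - it->second.lastSeenUs > timeoutUs) {
            it = conns.erase(it);
            ++removed;
        } else {
            ++it;
        }
    }
    return removed;
}
```

Since the tool works offline, "now" is the capture timestamp of the most recently processed packet rather than wall-clock time.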


User Interface

I experimented a lot with different UI toolkits and frameworks, all of which have some kind of disadvantage.

Displaying thousands of entries (e.g. IP addresses) at once is not practical, due to user interface latency and other issues, so pagination had to be implemented (also in the interface between frontend and backend).

The frontend is implemented in C# using Windows Presentation Foundation (WPF).

As the processing core had to be in native code for performance reasons, all the data displayed in the UI is transported via PInvoke. backend.dll uses the same core processing code as pcap-parser.exe, but adds PInvoke compatible exports which are called by the frontend.

Use of PInvoke

Standard PInvoke marshalling is used currently; the interface exported by backend.dll has been designed around that, but is sometimes cumbersome. It is C-style, i.e. just a flat set of functions with a handle mechanism so the backend can reference its context. Output from these functions usually comes in the form of structures suitable for standard PInvoke marshalling, e.g.

struct BackendEthernetAddressStats
{
    uint32_t magic;
    uint32_t version;
    uint64_t macAddr;
    uint32_t count;
    char mfg[256];
};

with the corresponding declaration in the .NET code:

[StructLayout(LayoutKind.Sequential, Pack = 4)]
public struct BackendEthernetAddressStats
{
    public UInt32 magic;
    public UInt32 version;

    public UInt64 macAddr;
    public UInt32 count;

    [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 256)]
    public string mfg;
}

Because there are quite a few contexts the user can switch to in the frontend, each with a specific set of metrics in mind, the interface between frontend and backend needs to be somewhat fine-granular. The data structures in the core module need a substantial amount of memory, depending on the amount of traffic in the input files, so mirroring them in their entirety into the .NET managed memory area would be a waste of resources.
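To illustrate the handle mechanism together with pagination, here is a hypothetical sketch of what such fine-granular, PInvoke-friendly exports could look like -- the names are mine, not the actual backend.dll interface:

```cpp
#include <cstdint>
#include <vector>

// Opaque backend context referenced by the handle the frontend holds.
struct BackendContext {
    std::vector<uint64_t> macAddrs;  // stand-in for the real statistics
};

extern "C" {

BackendContext* BackendOpen() { return new BackendContext(); }

void BackendClose(BackendContext* h) { delete h; }

// Copy at most `capacity` entries starting at `offset` into a
// caller-owned buffer; returns the number of entries written.
// The frontend requests one page at a time instead of mirroring
// the whole table into managed memory.
uint32_t BackendGetMacPage(BackendContext* h, uint32_t offset,
                           uint64_t* out, uint32_t capacity)
{
    if (!h || !out || offset >= h->macAddrs.size())
        return 0;
    uint32_t n = 0;
    while (n < capacity && offset + n < h->macAddrs.size()) {
        out[n] = h->macAddrs[offset + n];
        ++n;
    }
    return n;
}

} // extern "C"
```

On the .NET side this would pair with a declaration along the lines of `[DllImport("backend.dll")] static extern uint BackendGetMacPage(IntPtr handle, uint offset, ulong[] buffer, uint capacity);`, again assuming standard marshalling.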

There might be a future article with additional details.