David Allen, HRL Laboratories, LLC, dlallen@hrl.com [PRIMARY contact]
Tsai-Ching Lu, HRL Laboratories, LLC, tlu@hrl.com
Dave Huber, HRL Laboratories, LLC, djhuber@hrl.com
The HRL Anomaly Analysis Tool was developed specifically for the VAST 2009 Challenge. It is designed to detect, analyze, and visualize anomalies within the challenge dataset. It includes methods for displaying the raw data, building network visualizations, and outputting its analysis in various formats. Most of the tool is built in MATLAB, although it also takes advantage of other existing tools, including Pajek and Microsoft Excel.
Video:
ANSWERS:
MC1.1: Identify which computer(s) the employee most likely used to send information to his contact, in a tab-delimited table which contains, for each computer identified: when the information was sent, how much information was sent, and where that information was sent.
[Note: All figures are hyperlinked to higher resolution versions; please click a figure to make it more readable.]
MC1.2: Characterize the patterns of behavior of suspicious computer use.
Based on our investigation, we suspect that employee #30 is passing information to a computer outside of the embassy with the destination IP 100.59.151.133. We have identified 18 related data transmissions. The employee has done this every Tuesday and Thursday (except the first week), using various source computers, but always when the actual computer user and their officemate are not in their office. The following is a description of the process by which we came to that conclusion.
The process we used to analyze the data can be broken down into two main steps:
1) Identify suspicious behaviors, patterns, and entities, and
2) Reanalyze the data based on this information and begin building a case against the suspect(s).
Identify Suspicious Behaviors, Patterns, and Entities
We began this step by analyzing the domain, visualizing the data (see video), and defining anomalies that could be automatically detected. We assigned each anomaly type an a priori severity:
· Mild (Green): Not necessarily an anomaly, but could be part of a larger pattern
· Moderate (Yellow): Moderately severe anomalies which are not correct procedure, but are ‘tolerated’ (e.g. piggybacking)
· Severe (Red): Severe anomalies that are strictly against policy
Anomaly Types
· ClassifiedError1 (Severe): person badged out of the classified area, but never badged in
· ClassifiedError2 (Severe): person badged in, but never badged out
· PiggyBack1 (Moderate): person badged into the classified area, but never into the building
· OffHourUse (Mild): person badged in on a weekend or holiday (there were two observed holidays in the dataset)
· NoShow (Mild): person did not show up for work on a specified day
· CompUsageInClass (Severe): network traffic was observed on a user’s computer while they were in the classified area
· CompUsageNotInBldg (varies): network traffic was observed prior to a user badging into the building
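The badge-pairing logic behind the two ClassifiedError types can be sketched as follows. The actual tool is implemented in MATLAB; this is a minimal Python sketch assuming a simplified event format of (employee_id, action) tuples in chronological order, which is an illustrative stand-in for the challenge's proximity-card logs.

```python
from collections import defaultdict

def badge_anomalies(events):
    """Flag classified-area badge errors from a chronological list of
    (employee_id, action) events, where action is 'class_in' or
    'class_out'.  (Hypothetical event format, not the raw log schema.)"""
    inside = defaultdict(bool)  # is each employee currently badged in?
    anomalies = []
    for emp, action in events:
        if action == 'class_in':
            if inside[emp]:  # badged in twice without ever badging out
                anomalies.append((emp, 'ClassifiedError2'))
            inside[emp] = True
        elif action == 'class_out':
            if not inside[emp]:  # badged out, but never badged in
                anomalies.append((emp, 'ClassifiedError1'))
            inside[emp] = False
    # anyone still 'inside' at end of the log never badged out
    anomalies += [(emp, 'ClassifiedError2') for emp, v in inside.items() if v]
    return anomalies
```

The same pass could be extended with building badge events to cover PiggyBack1 and CompUsageNotInBldg.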
Figure 1 depicts a subset of the detected anomalies, colored by their severity, and annotated with additional attributes. In total there were 449 anomalies detected, which we filtered down to 98 for further examination.
Figure 1: Table listing a subset of the anomalies detected in the dataset, colored by their severity, and annotated with additional attributes such as their Date/Time, proximity card number, and source/destination IP address.
As we began analyzing these anomalies we noticed some common attributes. We therefore built a network visualization of the anomalies and these attributes, shown in Figure 2. Our tool automatically builds and outputs the network and then uses Pajek (http://pajek.imfm.si/doku.php) to interactively visualize it. A few node clusters immediately stand out. First, there is a large cluster of ‘mild’ anomalies in the upper left; these are mostly CompUsageNotInBldg and have a common destination IP (37.170.100.200), which appears to be an internal server. However, immediately below that is a cluster of 8 severe anomalies (CompUsageInClass) with a common destination IP (100.59.151.133), but various different source computers. Another interesting cluster, to the right of that one, contains 3 severe ClassifiedError1 anomalies, all from employee 30. Many of the other clusters tend to be isolated; however, employees 80 and 49 have a few anomalies where they are not following proper procedures.
Figure 2: Social network visualization of the detected anomalies (red, yellow, and green nodes) and their related attributes (blue nodes). As discussed in the text, several clusters of anomalies become immediately apparent.
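The export step that hands the anomaly/attribute network to Pajek can be sketched as below. The real tool is MATLAB, and it presumably also emits node colors for the severity coding; this minimal Python sketch writes only the basic Pajek .net structure (*Vertices / *Edges).

```python
def write_pajek(path, nodes, edges):
    """Write a two-mode anomaly/attribute network in the basic Pajek
    .net format.  `nodes` is a list of unique labels (anomalies and
    attributes); `edges` is a list of (label, label) pairs linking an
    anomaly to each of its attributes."""
    index = {label: i + 1 for i, label in enumerate(nodes)}  # Pajek vertices are 1-indexed
    with open(path, 'w') as f:
        f.write(f'*Vertices {len(nodes)}\n')
        for label, i in index.items():
            f.write(f'{i} "{label}"\n')
        f.write('*Edges\n')
        for a, b in edges:
            f.write(f'{index[a]} {index[b]}\n')
```

Pajek then lays the network out interactively, which is how the clusters in Figure 2 were found.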
Reanalyze the Data Based on This Information and Begin Building a Case Against the Suspect(s)
Based on the preliminary information, we decided to look into all data transfers to the suspicious destination. There were a total of 18 packets (see Figure 3). While analyzing these, we also identified where the source computer’s user and their officemate were (see the last 2 columns of the table). The officemate is important because the suspect would not want to be observed using someone else’s computer. It quickly becomes obvious that in most instances both users were either busy (e.g. in the classified area) or were at home. A few are labeled as ‘CompInUse(start)’, which means that shortly after the suspicious transfer the user began using their computer again (e.g. they returned to their desk). It is interesting to note the only instance where the officemate (#30) was actively using their computer; this is the same employee with suspicious badge swipes (all of which occurred on the same days as these data transfers). Another interesting pattern seen in these packets is that they appear only on Tuesdays and Thursdays.
Figure 3: Table showing all network traffic to destination 100.59.151.133, including the location of the source computer’s owner and their officemate. Note that in most instances the user and their officemate were both occupied, and hence were probably not the originator of the network traffic.
We further identified that all of these transfers were ‘single burst traffic’, meaning that there was usually no network activity from the source immediately before or after the transfer. We therefore reanalyzed the data for similar patterns and identified 186 such transfers; however, no other destination had more than 2 packets sent to it using this pattern.
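The single-burst test can be sketched as follows. The (timestamp, source) packet format and the 300-second quiet window are illustrative assumptions, not the parameters the tool actually used.

```python
from collections import defaultdict

def single_bursts(packets, quiet=300):
    """Flag 'single burst' transfers: packets from a source computer
    with no other traffic from that source within `quiet` seconds
    before or after.  `packets` is a list of (timestamp_seconds,
    source) tuples (an assumed, simplified format)."""
    by_src = defaultdict(list)
    for t, src in packets:
        by_src[src].append(t)
    bursts = []
    for src, times in by_src.items():
        times.sort()
        for i, t in enumerate(times):
            quiet_before = i == 0 or t - times[i - 1] > quiet
            quiet_after = i == len(times) - 1 or times[i + 1] - t > quiet
            if quiet_before and quiet_after:
                bursts.append((t, src))
    return bursts
```

Counting how many of the resulting bursts go to each destination would then surface destinations like 100.59.151.133 that receive many such transfers.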
We next analyzed the ratio of request size to response size, since leaking information requires sending more than receiving. In Figure 4, we show the average ratio for specific destination addresses (in this case grouped by the first two parts of the IP address). The destination 100.59.x.x clearly stands out from the rest, in that its ratio is over 250. There are 8 IP addresses in that group (including our anomalous one); however, none of the others had more than 1 packet sent to it, their ratios were much smaller, and they used a different port. Hence these appear to be unrelated.
Figure 4: Visualization of the average ratio of request size to response size for all the destination IP addresses. Note that one destination IP has a significant deviation in the ratio compared with the rest.
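The grouping behind Figure 4 can be sketched as below; the (dest_ip, request_bytes, response_bytes) flow format is an illustrative assumption about the dataset's fields.

```python
from collections import defaultdict

def ratio_by_prefix(flows):
    """Average request-size / response-size ratio per destination
    group, where destinations are grouped by the first two parts of
    the IP address (e.g. '100.59.x.x').  `flows` is a list of
    (dest_ip, request_bytes, response_bytes) tuples."""
    groups = defaultdict(list)
    for ip, req, resp in flows:
        prefix = '.'.join(ip.split('.')[:2]) + '.x.x'
        groups[prefix].append(req / resp)
    return {p: sum(r) / len(r) for p, r in groups.items()}
```

A destination receiving large uploads with small acknowledgments, as in the leak scenario, produces a ratio far above the ~1 typical of request/response traffic.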
At this point we are fairly confident that this IP address is the one accepting the leaked information, and we have not identified other suspicious network activity. Additionally, there is some indication that employee #30 may be involved.
We next built a script to automatically analyze the suspicious network traffic and determine where all users were. If they were not in the building or were in the classified area, we marked them as having an alibi (Figure 5). We totaled up the alibis for each user and color coded these based on how likely the user was to be a suspect. We note that only 1 user (#30) had no alibis and only 1 (#44) had 1 alibi. This further implicates employee #30.
Figure 5: Table showing which users have an ‘alibi’ during the suspicious network traffic. They are assumed to have an ‘alibi’ if they were not in the building or were in the classified area at the time.
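The alibi-counting script can be sketched as follows. The `location(emp, t)` lookup is a hypothetical stand-in for the badge-log queries the real MATLAB script performs; the location names are likewise illustrative.

```python
def alibi_counts(employees, transfer_times, location):
    """For each employee, count the suspicious transfers during which
    they have an alibi: out of the building, or inside the classified
    area (from which they could not have used an office computer).
    `location(emp, t)` is a hypothetical lookup returning 'building',
    'classified', or 'away'."""
    return {
        emp: sum(1 for t in transfer_times
                 if location(emp, t) in ('classified', 'away'))
        for emp in employees
    }
```

An employee with zero alibis, like #30, could have been at a computer for every one of the suspicious transfers.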
Together, this data warrants further investigation of employee #30 and destination 100.59.151.133. The perpetrator has consistently been passing information on Tuesdays and Thursdays; monitoring the suspect’s activities on those days may therefore lead to catching them.