Ryo Sakai, KU Leuven, ryo.sakai@esat.kuleuven.be PRIMARY
Daniel Alcaide, KU Leuven, daniel.alcaide@esat.kuleuven.be
Jan Aerts, KU Leuven, jan.aerts@esat.kuleuven.be
Student Team: YES
Did you use data from both mini-challenges? YES
R (ggplot2, dplyr, igraph, tidyr, RColorBrewer, lubridate)
Processing, to prototype
Approximately how many hours were spent working on this submission in total? 100 hours
May we post your submission in the Visual Analytics Benchmark Repository after VAST Challenge 2015 is complete? YES
Video Download
Video:
ftp://ftp.esat.kuleuven.be/pub/stadius/rsakai/VAST_2015/kuleuven-sakai-mc2-video.mp4 (28MB)
ftp://ftp.esat.kuleuven.be/pub/stadius/rsakai/VAST_2015/kuleuven-sakai-mc2-video.wmv (124MB)
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Questions
MC2.1 – Identify those IDs that stand out for their large volumes of communication. For each of these IDs
a. Characterize the communication patterns you see.
b. Based on these patterns, what do you hypothesize about these IDs?
Limit your response to no more than 4 images and 300 words.
We computed the authority and hub scores for all individuals based on the network of communication for each day, and plotted the results as scatter plots (Figure 1). We then identified three outliers (1278894, 839736, and external) with very large scores. For each outlier, we plotted histograms of communication counts, binned per minute and faceted by direction (incoming or outgoing).
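As an illustration of the scoring step, the following is a minimal plain-Python sketch of the standard hub/authority (HITS) iteration on a directed edge list. It is not the code used for this submission, which lists R with igraph among its tools; the function name `hits`, the fixed iteration count, and the toy edge list are assumptions made only for this example. In practice the edge list would hold one day of (sender, receiver) pairs.

```python
from collections import defaultdict
from math import sqrt

def hits(edges, iterations=50):
    """Return (hub, authority) score dicts for directed (sender, receiver) edges."""
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority of a node: sum of hub scores of the nodes that send to it.
        new_auth = defaultdict(float)
        for src, dst in edges:
            new_auth[dst] += hub[src]
        norm = sqrt(sum(v * v for v in new_auth.values())) or 1.0
        auth = {n: new_auth[n] / norm for n in nodes}
        # Hub score of a node: sum of authority scores of the nodes it sends to.
        new_hub = defaultdict(float)
        for src, dst in edges:
            new_hub[src] += auth[dst]
        norm = sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {n: new_hub[n] / norm for n in nodes}
    return hub, auth

# Hypothetical toy data: "bot" broadcasts to several IDs, a few of which reply.
edges = [("bot", "a"), ("bot", "b"), ("bot", "c"),
         ("a", "bot"), ("b", "bot"), ("a", "b")]
hub, auth = hits(edges)
top_hub = max(hub, key=hub.get)      # ID with the largest hub score
top_auth = max(auth, key=auth.get)   # ID with the largest authority score
```

Plotting each ID's hub score against its authority score, one panel per day, would give a scatter plot in the spirit of Figure 1, where IDs far from the bulk of the points are candidate outliers.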
The communication pattern of 1278894 shows very regular volumes and intervals (Figure 2). Because of this regularity, the volume (more than 1000 outgoing messages a minute), the absence of any GPS track for this ID, and the fact that all of its messages were sent from the Entry Corridor, we suspect that 1278894 is a message bot programmed to send out messages and able to receive replies.
The communication pattern of 839736 shows a very prominent peak shortly after 12:00 on Sunday (Figure 3). Although 839736 sends and receives messages on Friday and Saturday, the volume of messages sent and received after 12:00 on Sunday is on another scale. As with the previous ID, 839736 has no GPS data and all of its outgoing messages are sent from the Entry Corridor, so we suspect that this is also a message bot. Because of the peak on Sunday shortly after 12:00, we believe this message bot is related to security; we elaborate on this in the following sections.
The communication pattern to the external is shown in Figure 4. Although the external may represent many individuals outside the park, the peak before 12:00 on Sunday stands out.
Figure 1. Scatter plots of authority scores (AS) and hub scores (HS) of every individual for each day.
Figure 2. Histograms of communication counts by 1278894.
Figure 3. Histograms of communication counts by 839736.
Figure 4. Histograms of communication counts to the “external”.
MC2.2 – Describe up to 10 communications patterns in the data. Characterize who is communicating, with whom, when and where. If you have more than 10 patterns to report, please prioritize those patterns that are most likely to relate to the crime.
Limit your response to no more than 10 images and 1000 words.
Figure 5. A heatmap of communication counts to “external” between 11:45 and 12:00 on Sunday. We infer the location of communication origin from the GPS data and subset all communication to “external” during the time period of the peak seen in Figure 4. The heatmap shows that most of the communication originates from 3 positions near the Creighton Pavilion.
Figure 6. Heatmaps of incoming and outgoing communication counts to/from 839736 between 12:00 and 13:00 on Sunday. This time interval corresponds to the peak of communication found in Figure 3. The most active location in both plots is (32, 34), just outside the Creighton Pavilion.
Based on the insights from Figures 5 and 6, we first subset those individuals who communicate to the “external” between 11:45 and 12:00 from three coordinate positions: (32,33), (31,34), and (32,34). We then generate a network graph of the communications in which these individuals are involved, either sending or receiving, during this time period.
We draw a directed network graph and color those individuals who communicated to the “external” in green (Figure 7). From this network, we can identify 2 major sub-graphs. The first sub-graph, referred to as “subgraph1”, consists mostly of those who appeared at positions near the pavilion and has a fairly high graph density (0.38). A close-up of subgraph1 is shown in Figure 8. The second sub-graph, referred to as “subgraph2”, is a larger graph (136 nodes), which appears to include multiple community structures. A close-up of this sub-graph is shown in Figure 9.
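The two quantities used above, separate sub-graphs and graph density, can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the code behind Figure 7: it assumes that sub-graphs are weakly connected components and that density is the directed definition (distinct edges divided by n(n-1)); the submission does not say which definition produced the value 0.38. The function names and toy IDs are hypothetical.

```python
from collections import defaultdict

def weak_components(edges):
    """Group nodes into weakly connected components (edge direction ignored)."""
    neighbours = defaultdict(set)
    for src, dst in edges:
        neighbours[src].add(dst)
        neighbours[dst].add(src)
    seen, components = set(), []
    for start in neighbours:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # depth-first traversal from an unvisited node
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        components.append(component)
    return components

def directed_density(nodes, edges):
    """Distinct directed edges among `nodes`, divided by the n*(n-1) possible ones."""
    distinct = {(s, d) for s, d in edges if s != d and s in nodes and d in nodes}
    n = len(nodes)
    return len(distinct) / (n * (n - 1)) if n > 1 else 0.0

# Hypothetical toy messages between made-up IDs.
edges = [("p", "q"), ("q", "p"), ("q", "r"), ("r", "p"), ("x", "y")]
for component in weak_components(edges):
    print(sorted(component), round(directed_density(component, edges), 2))
```

Repeated messages between the same pair count once here; a weighted notion of density would need a different definition.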
For the rest of the analysis, we exclude all communication to or from 839736, 1278894, and external, unless explicitly stated otherwise.
Figure 7. A node-link diagram of the communication pattern among those who communicate to the external from positions near the pavilion between 11:45 and 12:00 on Sunday. The green nodes indicate the group identified, and the white nodes indicate those with whom they communicated.
Figure 8. A close-up node-link diagram of subgraph1. The ID of each individual is indicated.
Figure 9. A close-up node-link diagram of subgraph2.
To examine the temporal communication patterns of subgraph1 and subgraph2, we subset all communications that involve at least one member of a subgraph, aggregate them by origin and destination, and select the 10 most frequent communication paths. As mentioned previously, we infer the origin and destination locations from the closest GPS records of the sender and the receiver of a message. We then plot small multiples of these most frequent communication paths by the hour, encoding the frequency with line width, for subgraph1 (Figure 10) and subgraph2 (Figure 11). Figure 10 provides 2 useful insights. First, communication within subgraph1 peaks at 11:00, near the pavilion. Second, most of their communications are close-range, suggesting that this group does not interact with other groups in the park. In contrast to subgraph1, subgraph2 shows long-range communication paths. There are 2 peaks of close-proximity communication: when they arrive in the park at the north entrance, and at 11:00 near the pavilion. Also at 11:00, there is prominent outward communication from the pavilion to the Rhynasaurus Rampage and the SabreTooth Theatre.
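The location inference and path ranking described above can be sketched as follows. This is a hedged plain-Python illustration, not the code used for Figures 10 and 11: it assumes that "closest GPS record" means closest in time, that timestamps are seconds since the start of the day, and that every sender and receiver has at least one GPS record (which is why the bot-like IDs are excluded). The names `nearest_location` and `top_paths` and all data values are made up.

```python
from bisect import bisect_left
from collections import Counter

def nearest_location(track, t):
    """track: non-empty list of (timestamp, (x, y)) sorted by timestamp.
    Returns the (x, y) of the record closest in time to t."""
    times = [ts for ts, _ in track]
    i = bisect_left(times, t)
    candidates = track[max(i - 1, 0):i + 1]  # the record before and the record after t
    return min(candidates, key=lambda rec: abs(rec[0] - t))[1]

def top_paths(messages, tracks, k=10):
    """Count (hour, origin, destination) triples and return the k most frequent."""
    counts = Counter()
    for t, sender, receiver in messages:
        origin = nearest_location(tracks[sender], t)
        destination = nearest_location(tracks[receiver], t)
        counts[(t // 3600, origin, destination)] += 1
    return counts.most_common(k)

# Hypothetical GPS tracks and messages.
tracks = {
    "a": [(39600, (32, 33)), (39700, (40, 50))],
    "b": [(39610, (31, 34))],
}
messages = [(39605, "a", "b"), (39690, "a", "b"), (39695, "a", "b")]
paths = top_paths(messages, tracks)
```

Each returned item pairs an (hour, origin, destination) key with its count, which maps directly onto one line segment per panel with the count encoded as line width.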
Figure 10. Communication of subgraph1 aggregated by hour. The ten most frequent communication paths are shown, with varying line width to encode the frequency.
Figure 11. Communication of subgraph2 aggregated by hour. The ten most frequent communication paths are shown, with varying line width to encode the frequency. The direction of communication is indicated with an arrowhead.
To obtain an overview of the communication patterns by frequency and location, we generated histograms as stacked bar charts (Figure 12), which annotate the previously identified IDs with high volumes of communication (1278894, 839736, and external). One key insight was the recurring peaks in the Coaster Alley at 11:00 and 16:00. To identify the origin and destination of the communication in these peaks, we generated a multi-panel plot (Figure 13). Figure 13 shows that most of the communication in these peaks is sent from the vicinity of the Grinosaurus Stage to locations all over the park, but more frequently to destinations within the vicinity.
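The counting behind such stacked histograms can be sketched in plain Python. This is an assumption-laden illustration rather than the plotting code used here: it takes the 5-minute bin width stated in the Figure 12 caption, labels each message by the high-volume ID it involves (or "other"), and uses made-up timestamps and visitor IDs. The helper names `bin_start` and `stacked_counts` are hypothetical.

```python
from collections import Counter
from datetime import datetime

HIGH_VOLUME = {"1278894", "839736", "external"}  # IDs identified in MC2.1

def bin_start(ts, minutes=5):
    """Round a datetime down to the start of its bin."""
    return ts.replace(minute=ts.minute - ts.minute % minutes, second=0, microsecond=0)

def stacked_counts(messages, minutes=5):
    """Count messages per (bin start, category); categories stack within a bar."""
    counts = Counter()
    for ts, sender, receiver in messages:
        if receiver in HIGH_VOLUME:
            category = receiver
        elif sender in HIGH_VOLUME:
            category = sender
        else:
            category = "other"
        counts[(bin_start(ts, minutes), category)] += 1
    return counts

# Hypothetical messages; the timestamps and visitor IDs are invented for the example.
messages = [
    (datetime(2014, 6, 8, 11, 46, 10), "v1", "external"),
    (datetime(2014, 6, 8, 11, 49, 59), "v2", "external"),
    (datetime(2014, 6, 8, 11, 50, 0), "v1", "v2"),
    (datetime(2014, 6, 8, 12, 1, 30), "v3", "839736"),
]
counts = stacked_counts(messages)
```

Grouping the same counts by the location attribute of each message would give one facet per park area.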
Figure 12. Histograms of communication count over three days, faceted by the location attribute. The histograms are binned per 5 minutes, and shown as stacked bar charts to highlight those with large volumes of communication.
Figure 13. Communication count peaks from the Coaster Alley at 11:00 and 16:00. The first row contains histograms of communication count from the Coaster Alley, binned per minute. The peaks of communication count are identified at 11:00 and 16:00. The second row shows where the communication at 11:00 and 16:00 originates. Color is used to compare the communication count at 11:00 and 16:00. The third row shows the communication count by destination.
MC2.3 – From this data, can you hypothesize when the crime was discovered? Describe your rationale.
Limit your response to no more than 3 images and 300 words.
We hypothesize that the crime was discovered around 11:45 on Sunday. As seen in Figure 14, there is an outburst of communication to the external between 11:45 and 12:00, and most of this communication originates from within the Creighton Pavilion (Figure 5). This peak perhaps corresponds to calls to the local police outside the park. Then, at 12:00, there is a prominent peak of communication to 839736. Assuming 839736 is an automated message bot that deals with security-related issues, this is when messages are exchanged between the visitors and security to warn about or report the crime. Figure 15 suggests that the preceding peak between 11:20 and 11:45 is due to the high volume of communication within subgraph1, identified in MC2.2. The communication between 11:20 and 11:45 occurs mostly within the (32, 33) coordinate, which is just outside the pavilion (Figure 16).
Figure 14. Histogram of communication count from the Wet Land on Sunday, with a bin width of a 5-minute interval. Each bar is a stacked bar chart categorizing communication to those with large volumes of communication.
Figure 15. Histogram of communication count from the Wet Land on Sunday, with a bin width of a 5-minute interval. Each bar is a stacked bar chart categorizing communication to previously identified groups of IDs.
Figure 16. Heatmaps of communication count in the Wet Land between 11:20 and 11:45 on Sunday.