Entry Name:  "KULEUVEN-Sakai-MC2"

VAST Challenge 2015
Mini-Challenge 2

 

 

Team Members:

Ryo Sakai, KU Leuven, ryo.sakai@esat.kuleuven.be     PRIMARY
Daniel Alcaide, KU Leuven, daniel.alcaide@esat.kuleuven.be

Jan Aerts, KU Leuven, jan.aerts@esat.kuleuven.be

Student Team:  YES

 

Did you use data from both mini-challenges?  YES

 

Analytic Tools Used:

R (ggplot2, dplyr, igraph, tidyr, RColorBrewer, lubridate)

Processing, to prototype

 

Approximately how many hours were spent working on this submission in total?

100 hours

 

May we post your submission in the Visual Analytics Benchmark Repository after VAST Challenge 2015 is complete? YES

 

 

Video Download

Video:

ftp://ftp.esat.kuleuven.be/pub/stadius/rsakai/VAST_2015/kuleuven-sakai-mc2-video.mp4   (28MB)

ftp://ftp.esat.kuleuven.be/pub/stadius/rsakai/VAST_2015/kuleuven-sakai-mc2-video.wmv   (124MB)

 

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Questions

 

MC2.1 Identify those IDs that stand out for their large volumes of communication.  For each of these IDs

 

      a.         Characterize the communication patterns you see.

      b.        Based on these patterns, what do you hypothesize about these IDs?

 

Limit your response to no more than 4 images and 300 words.

 

We computed authority and hub scores for every individual from each day's communication network and plotted the results as scatter plots (Figure 1). We then identified three outliers with very large scores: 1278894, 839736, and external. For each outlier, we plotted histograms of communication counts binned per minute and faceted by direction (incoming or outgoing).
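The scoring step can be illustrated with a minimal Python sketch of hub and authority scores (the HITS power iteration). This is not the authors' code, which is not shown in the submission; the toy edge list, the node names, and the `hits` helper are assumptions made for illustration, and repeated edges simply count as heavier links.

```python
from collections import defaultdict
from math import sqrt

def hits(edges, iterations=50):
    """Return (hub, authority) score dicts for a directed edge list."""
    nodes = {n for edge in edges for n in edge}
    out_links = defaultdict(list)
    in_links = defaultdict(list)
    for sender, receiver in edges:
        out_links[sender].append(receiver)
        in_links[receiver].append(sender)

    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority: sum of the hub scores of everyone who sends to this node.
        auth = {n: sum(hub[s] for s in in_links[n]) for n in nodes}
        norm = sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub: sum of the authority scores of everyone this node sends to.
        hub = {n: sum(auth[r] for r in out_links[n]) for n in nodes}
        norm = sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth

# Hypothetical (sender, receiver) pairs for one day; the IDs are made up.
edges = [("bot", "a"), ("bot", "b"), ("bot", "c"), ("a", "bot"), ("b", "c")]
hub, auth = hits(edges)
top_hub = max(hub, key=hub.get)
top_auth = max(auth, key=auth.get)
```

A node that sends to many well-connected receivers gets a high hub score, and a node that receives from many strong hubs gets a high authority score, which is why mass senders stand out as outliers on such a plot.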

 

The communication pattern of 1278894 shows very regular volumes and intervals (Figure 2). Given this regularity, the volume (more than 1000 outgoing messages per minute), the absence of any GPS track for this ID, and the fact that all its messages were sent from the Entry Corridor, we suspect that 1278894 is a message bot programmed to send out messages and able to receive replies.
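The per-minute binning behind such histograms could look like the following Python sketch. The record layout (timestamp string, sender, receiver), the timestamp format, and the sample rows are assumptions for illustration and are not taken from the submission.

```python
from collections import Counter
from datetime import datetime

def per_minute_counts(records, target_id):
    """Count messages per minute for target_id, split by direction."""
    counts = Counter()
    for timestamp, sender, receiver in records:
        # Truncate the timestamp to the start of its minute (assumed format).
        minute = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").replace(second=0)
        if sender == target_id:
            counts[("outgoing", minute)] += 1
        if receiver == target_id:
            counts[("incoming", minute)] += 1
    return counts

# Hypothetical records; only the ID 1278894 comes from the text above.
records = [
    ("2014-06-06 12:00:05", "1278894", "111"),
    ("2014-06-06 12:00:40", "1278894", "222"),
    ("2014-06-06 12:01:10", "111", "1278894"),
]
counts = per_minute_counts(records, "1278894")
outgoing_at_noon = counts[("outgoing", datetime(2014, 6, 6, 12, 0))]
incoming_next_minute = counts[("incoming", datetime(2014, 6, 6, 12, 1))]
```

A bot-like sender would show near-constant outgoing counts at evenly spaced minutes, whereas human senders produce irregular, bursty counts.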

 

The communication pattern of 839736 shows a very prominent peak shortly after 12:00 on Sunday (Figure 3). Although 839736 also sends and receives messages on Friday and Saturday, the volume sent and received after 12:00 on Sunday is on another scale. Like 1278894, 839736 has no GPS data and sends all its outgoing messages from the Entry Corridor, so we suspect it is also a message bot. Because of the peak shortly after 12:00 on Sunday, we believe this bot is related to park security; we elaborate on this in the following sections.

 

The communication pattern to external is shown in Figure 4. Although external may represent many individuals outside the park, the peak just before 12:00 on Sunday stands out.

 

 

Figure 1. Scatter plots of authority scores (AS) and hub scores (HS) of every individual for each day.

 

Figure 2. Histograms of communication counts by 1278894.

 

 


Figure 3. Histograms of communication counts by 839736.

 


Figure 4. Histograms of communication counts to the “external”.

 

 

 

 

 

MC2.2 Describe up to 10 communications patterns in the data. Characterize who is communicating, with whom, when and where. If you have more than 10 patterns to report, please prioritize those patterns that are most likely to relate to the crime.

 

Limit your response to no more than 10 images and 1000 words.

 


Figure 5. A heatmap of communication counts to “external” between 11:45 and 12:00 on Sunday. We inferred the origin of each communication from the GPS data and subset all communication to “external” during the peak seen in Figure 4. The heatmap shows that most of the communication originates from three positions near the Creighton Pavilion.

 

 


Figure 6. Heatmaps of incoming and outgoing communication counts to/from 839736 between 12:00 and 13:00 on Sunday. This time interval corresponds to the communication peak in Figure 3. The most active location in both plots is (32, 34), just outside the Creighton Pavilion.

 

Based on the insights from Figures 5 and 6, we first subset the individuals who communicated to “external” between 11:45 and 12:00 from three coordinate positions, (32, 33), (31, 34), and (32, 34). We then generated a network of all communications in which these individuals were involved, as sender or receiver, during this time period. We drew the network as a directed graph and colored the individuals who communicated to “external” in green (Figure 7). From this network, we can identify two major sub-graphs. The first, referred to as “subgraph1”, consists mostly of individuals who appeared at positions near the pavilion and has a fairly high graph density (0.38). A close-up of subgraph1 is shown in Figure 8. The second, referred to as “subgraph2”, is a larger graph (136 nodes) that appears to include multiple community structures. A close-up of subgraph2 is shown in Figure 9.
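As a rough Python sketch of how sub-graphs and their density might be extracted from such a directed network: the submission does not say how the sub-graphs were delimited or whether the density was computed on the directed or undirected graph, so weakly connected components and the directed density m / (n(n - 1)) are assumptions here, as are the toy edges and helper names.

```python
from collections import defaultdict

def weak_components(edges):
    """Group nodes into weakly connected components (edge direction ignored)."""
    neighbours = defaultdict(set)
    for sender, receiver in edges:
        neighbours[sender].add(receiver)
        neighbours[receiver].add(sender)
    seen, components = set(), []
    for start in neighbours:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        components.append(component)
    # Largest component first.
    return sorted(components, key=len, reverse=True)

def directed_density(edges, nodes):
    """Distinct directed edges inside `nodes`, divided by n * (n - 1)."""
    inside = {(s, r) for s, r in edges if s in nodes and r in nodes and s != r}
    n = len(nodes)
    return len(inside) / (n * (n - 1)) if n > 1 else 0.0

# Hypothetical (sender, receiver) pairs; the IDs are made up.
edges = [("a", "b"), ("b", "a"), ("b", "c"), ("c", "a"), ("x", "y")]
components = weak_components(edges)
density = directed_density(edges, components[0])
```

A density near 1 means almost every member messages every other member directly; a tight group of acquaintances would be expected to score much higher than a loose crowd.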

 

For the rest of the analysis, we exclude all communication to or from 839736, 1278894, and external unless stated otherwise.

 


Figure 7. A node-link diagram of the communication pattern among those who communicated to external from positions near the pavilion between 11:45 and 12:00 on Sunday. The green nodes indicate the identified group, and the white nodes indicate the individuals they communicated with.

 

 


Figure 8. A close-up node-link diagram of subgraph1. The ID of each individual is indicated.

 


Figure 9. A close-up node-link diagram of subgraph2.

 

To examine the temporal communication patterns of subgraph1 and subgraph2, we subset all communications that involve at least one member of a subgraph, aggregated them by origin and destination, and selected the 10 most frequent communication paths. As mentioned previously, we inferred the origin and destination from the closest GPS records of the sender and the receiver of a message. We then plotted small multiples of these most frequent paths by hour, encoding frequency as line width, for subgraph1 (Figure 10) and subgraph2 (Figure 11). Figure 10 provides two useful insights. First, communication within subgraph1 peaks at 11:00, near the pavilion. Second, most of their communications are short-range, suggesting that this group does not interact with other groups in the park. In contrast, subgraph2 shows long-range communication paths. There are two peaks of close-proximity communication: when the group arrives in the park at the north entrance, and at 11:00 near the pavilion. Also at 11:00, there is prominent outward communication from the pavilion to the Rhynasaurus Rampage and the SabreTooth Theatre.
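The location inference and path aggregation could be sketched in Python as follows. It assumes that "closest" means closest in time, that times are seconds since midnight, and that each GPS track is sorted by time and non-empty; the data layout, sample values, and helper names are hypothetical.

```python
from bisect import bisect_left
from collections import Counter

def nearest_position(track, t):
    """track: non-empty list of (time, x, y) sorted by time; (x, y) closest in time to t."""
    times = [row[0] for row in track]
    i = bisect_left(times, t)
    # Only the records just before and just after t can be the closest.
    candidates = track[max(i - 1, 0):i + 1]
    _, x, y = min(candidates, key=lambda row: abs(row[0] - t))
    return (x, y)

def top_paths_by_hour(messages, tracks, k=10):
    """Count the k most frequent (origin, destination) paths per hour."""
    located = []
    for t, sender, receiver in messages:
        if sender in tracks and receiver in tracks:  # skip IDs without GPS data
            path = (nearest_position(tracks[sender], t),
                    nearest_position(tracks[receiver], t))
            located.append((t // 3600, path))
    top = {path for path, _ in Counter(p for _, p in located).most_common(k)}
    return Counter((hour, path) for hour, path in located if path in top)

# Hypothetical tracks and messages; 39600 seconds is 11:00.
tracks = {
    "a": [(39600, 32, 33), (43200, 60, 40)],
    "b": [(39000, 32, 34)],
}
messages = [(39700, "a", "b"), (39800, "a", "b"), (43100, "b", "a"), (40000, "a", "ghost")]
paths = top_paths_by_hour(messages, tracks)
```

Each key of the resulting counter is an (hour, (origin, destination)) pair, so the count can be mapped directly to the line width of one path in one hourly panel.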

 


Figure 10. Communication of subgraph1 aggregated by hour. The ten most frequent communication paths are shown, with varying line width to encode the frequency.

 


Figure 11. Communication of subgraph2 aggregated by hour. The ten most frequent communication paths are shown, with varying line width to encode the frequency. The direction of communication is indicated with an arrowhead.

 

To get an overview of communication patterns by frequency and location, we generated histograms as stacked bar charts (Figure 12) that annotate the previously identified high-volume IDs (1278894, 839736, and external). One key insight was the recurring peaks in the Coaster Alley at 11:00 and 16:00. To identify the origin and destination of the communication in these peaks, we generated a multi-panel plot (Figure 13). Figure 13 shows that most of the communication in these peaks is sent from the vicinity of the Grinosaurus Stage and is received all over the park, but most frequently within that vicinity.
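The counts behind such stacked histograms might be computed as in this Python sketch. How the stacks were actually categorized is not stated in the submission; here a message is assigned to a high-volume ID if that ID is its sender or receiver, and to "other" otherwise. The record layout and sample rows are assumptions.

```python
from collections import Counter

HIGH_VOLUME = {"1278894", "839736", "external"}

def stacked_counts(records, bin_seconds=300):
    """records: (seconds_since_midnight, sender, receiver, location)."""
    counts = Counter()
    for t, sender, receiver, location in records:
        if sender in HIGH_VOLUME:
            category = sender
        elif receiver in HIGH_VOLUME:
            category = receiver
        else:
            category = "other"
        # Round the time down to the start of its 5-minute bin.
        bin_start = t // bin_seconds * bin_seconds
        counts[(location, bin_start, category)] += 1
    return counts

# Hypothetical records; 39600 seconds is 11:00.
records = [
    (39600, "1", "2", "Coaster Alley"),
    (39720, "3", "4", "Coaster Alley"),
    (39899, "5", "external", "Coaster Alley"),
    (39900, "1278894", "6", "Entry Corridor"),
]
stacks = stacked_counts(records)
```

Keeping the high-volume IDs as separate stack segments prevents them from masking peaks, such as those in the Coaster Alley, that come from ordinary visitors.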

 

 


Figure 12. Histograms of communication counts over the three days, faceted by the location attribute. The histograms are binned in 5-minute intervals and shown as stacked bar charts to highlight the IDs with large volumes of communication.

 


 

Figure 13. Communication count peaks from the Coaster Alley at 11:00 and 16:00. The first row contains histograms of communication counts from the Coaster Alley, binned per minute; the peaks are identified at 11:00 and 16:00. The second row shows where the communication at 11:00 and 16:00 originates, with color used to compare the counts at the two times. The third row shows the communication counts by destination.

 

MC2.3 From this data, can you hypothesize when the crime was discovered?  Describe your rationale.

 

Limit your response to no more than 3 images and 300 words. 

 

We hypothesize that the crime was discovered around 11:45 on Sunday. As seen in Figure 14, there is a burst of communication to external between 11:45 and 12:00, and most of it originates from within the Creighton Pavilion (Figure 5). This peak perhaps corresponds to calls to the local police outside the park. Then, at 12:00, there is a prominent peak of communication to 839736. Assuming 839736 is an automated message bot that handles security-related issues, this is when messages were exchanged between visitors and security to warn about or report the crime. Figure 15 suggests that the preceding peak between 11:20 and 11:45 is due to the high volume of communication within subgraph1, identified in MC2.2. The communication between 11:20 and 11:45 occurs mostly at coordinate (32, 33), just outside the pavilion (Figure 16).

 


Figure 14. Histogram of communication counts from the Wet Land on Sunday, with a 5-minute bin width. Each bar is stacked to separate communication with the high-volume IDs.


Figure 15. Histogram of communication counts from the Wet Land on Sunday, with a 5-minute bin width. Each bar is stacked to separate communication with the previously identified groups of IDs.

 

 

 


 

Figure 16. Heatmaps of communication counts in the Wet Land between 11:20 and 11:45 on Sunday.