Ryo Sakai, KU Leuven, ryo.sakai@esat.kuleuven.be PRIMARY
Daniel Alcaide, KU Leuven, daniel.alcaide@esat.kuleuven.be
Jan Aerts, KU Leuven, jan.aerts@esat.kuleuven.be
Student Team: YES
Did you use data from both mini-challenges? YES
R (ggplot2, dplyr, igraph, tidyr, RColorBrewer, lubridate, vegan)
Processing, to prototype
Approximately how many hours were spent working on this submission in total?
120 hours
May we post your submission in the Visual Analytics Benchmark Repository after VAST Challenge 2015 is complete? YES
Video Download
Video:
ftp://ftp.esat.kuleuven.be/pub/stadius/rsakai/VAST_2015/kuleuven-sakai-mc1-video.mp4 (77MB)
ftp://ftp.esat.kuleuven.be/pub/stadius/rsakai/VAST_2015/kuleuven-sakai-mc1-video.wmv (241MB)
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Questions
MC1.1 – Characterize the attendance at DinoFun World on this weekend. Describe up to twelve different types of groups at the park on this weekend.
a. How big is this type of group?
b. Where does this type of group like to go in the park?
c. How common is this type of group?
d. What are your other observations about this type of group?
e. What can you infer about this type of group?
f. If you were to make one improvement to the park to better meet this group’s needs, what would it be?
Limit your response to no more than 12 images and 1000 words.
We first parsed the GPS data to infer trajectories of individual movements and stationary periods. We also used the provided map and the parsed data to infer the coordinates of each attraction in the park. (For more details on the data transformation, please see our video.) Using this trajectory data, we count the number of times each individual “checks in” at an attraction, or appears at an attraction without a check-in system (such as the Beer Gardens). If two or more individuals have exactly the same pattern of attraction counts, we refer to them as a “group”. Using this logic, we aggregate individuals by their attraction counts and draw histograms of the group sizes (Figure 1). From this figure, we define three types of groups (small, medium, and large). For instance, large groups consist of 29 to 43 people who have exactly the same attraction count pattern.
Figure 1. Histograms of group sizes per day. Blue text gives the count where the bar is hard to read.
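The grouping rule described above can be sketched in a few lines. This is a minimal illustration in Python, not the R code used for the submission; the input format (one `(visitor_id, attraction_id)` pair per visit) and all names are assumptions made for the example.

```python
from collections import Counter, defaultdict

def group_by_attraction_pattern(visits):
    """Group visitors whose per-attraction visit counts are identical.

    visits: iterable of (visitor_id, attraction_id) pairs, one per visit
    (a hypothetical input format chosen for this sketch).
    """
    per_visitor = defaultdict(Counter)
    for visitor_id, attraction_id in visits:
        per_visitor[visitor_id][attraction_id] += 1

    groups = defaultdict(list)
    for visitor_id, counts in per_visitor.items():
        # A frozenset of (attraction, count) pairs is a hashable pattern key.
        groups[frozenset(counts.items())].append(visitor_id)
    return list(groups.values())

def group_size_histogram(groups):
    """Map group size -> number of groups of that size."""
    return Counter(len(members) for members in groups)

# Hypothetical toy data: "a" and "b" share a pattern, "c" does not.
example_visits = [
    ("a", 1), ("a", 1), ("a", 7),
    ("b", 1), ("b", 7), ("b", 1),
    ("c", 7),
]
example_groups = group_by_attraction_pattern(example_visits)
example_histogram = group_size_histogram(example_groups)
```

Note that the pattern key ignores the order of visits, which matches the count-based definition above: two visitors who go to the same attractions the same number of times fall into one group even if their sequences differ.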
We subset the visitors in large groups on Friday and visualize their behavior in the sequence view (Figure 2). In this view, each visitor is represented as a horizontal line, and the attractions they visit are color-coded by attraction type. For example, the “friday_5” group consists of 30 individuals and is the only group that does not go to the show at 15:00; its relatively large gray gaps between attractions suggest that its members spend more time walking between attractions. This group leaves the park around 18:30. This representation also reveals variation within a large group. For instance, about half of the “friday_1” group goes shopping between a beer garden and a kiddie ride around 11:00. Since none of the groups use Information & Assistance, we hypothesize that these groups may have a tour guide or someone very familiar with the park. Examining visits per attraction type characterizes a group further: for example, the “friday_3” group goes to a Beer Garden 7 times throughout the day.
Figure 2. Sequence view of large groups on Friday.
Figure 3 shows the sequence view of large groups on Saturday. On Saturday, every large group goes to the Grinosaurus Stage at 15:00. It also appears that groups that arrive early spend more time at the entrance. The park could perhaps reduce waiting times at the entrance when large groups arrive in the early morning. Another general trend is that people in large groups tend to shop at the end of the day.
Figure 3. Sequence view of large groups on Saturday.
Probably because of the partial closure on Sunday, the large groups on Sunday do not share a distinct common behavior (Figure 4).
Figure 4. Sequence view of large groups on Sunday.
We extend the analysis to the medium-size groups defined in Figure 1. Figure 5 shows the medium groups on Friday, sorted by arrival time. Although movement patterns vary, many sets of groups arrive and leave at the same time. We hypothesize that each such set is one large party that moves around the park in smaller groups of 6 to 11. The gray gaps between attractions are much smaller for medium groups than for large groups, suggesting that medium groups move more efficiently and spend more of their time at attractions. Many groups also appear to spend a few hours shopping at the end of the visit. Other notable groups are those that arrive around 9:00 and leave around 15:00, and those that arrive around 15:00 and leave around 22:00.
Figure 5. Sequence view of medium groups on Friday.
A similar pattern of sets of groups arriving and leaving at the same time is observed on Saturday (Figure 6). In contrast to the large groups on Saturday (Figure 3), medium groups that arrive later tend to spend more time at the entrance.
Figure 6. Sequence view of medium groups on Saturday.
The sequence view of medium groups on Sunday (Figure 7) shows a similar pattern: groups that arrive later spend longer at the entrance. We detected one anomaly: one medium-size group appears to spend a very long time in the restroom. This group is marked with a black triangle in Figure 7.
Figure 7. Sequence view of medium groups on Sunday.
Another analysis approach was to measure the Morisita overlap index between attractions. Using the derived table of attraction counts, we calculate the Morisita overlap index with the “vegan” R package and use the resulting dissimilarity matrix as input for hierarchical clustering with complete linkage. Figure 8 shows the resulting hierarchy as a dendrogram for the Friday data. The black triangle indicates the leaf-node level where overlaps between attractions are observed. For instance, there is a group of people (1) who go to the Tyrannosaurus Restroom, the Mary Anning Beer Garden, and the Alverez Beer Garden. Note that the Alverez Beer Garden is in a different area of the park (Wet Land). Groups (2) and (3) are both overlaps involving rides from Kiddie Land.
Another way to interpret the Morisita overlap index is to compare the three entrances. The West, East, and North entrances are well separated in the hierarchy because they do not overlap; in other words, people enter and exit through the same entrance.
Figure 8. A dendrogram showing the result of hierarchical clustering of attractions.
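For readers unfamiliar with the measure, the following sketch computes Morisita's overlap index from its standard definition and turns it into pairwise dissimilarities (1 minus the index). It is an illustration in Python, not the vegan call used for the submission; the table layout (one count vector per attraction, one entry per visitor) and the attraction labels are assumptions.

```python
import math
from itertools import combinations

def simpson_lambda(counts):
    """Simpson's index: sum x(x-1) / (N(N-1)); NaN if fewer than 2 visits."""
    total = sum(counts)
    if total < 2:
        return math.nan
    return sum(c * (c - 1) for c in counts) / (total * (total - 1))

def morisita_overlap(x, y):
    """Morisita's overlap index between two count vectors of equal length."""
    if len(x) != len(y):
        raise ValueError("count vectors must have the same length")
    denom = (simpson_lambda(x) + simpson_lambda(y)) * sum(x) * sum(y)
    if math.isnan(denom) or denom == 0:
        return math.nan  # undefined, e.g. when no visitor comes more than once
    return 2 * sum(a * b for a, b in zip(x, y)) / denom

def pairwise_dissimilarity(table):
    """table: {attraction_name: [count per visitor]} -> {(a, b): 1 - overlap}."""
    return {
        (a, b): 1 - morisita_overlap(table[a], table[b])
        for a, b in combinations(sorted(table), 2)
    }

# Hypothetical counts for three attractions over four visitors.
example_table = {
    "beer_garden_1": [3, 2, 0, 0],
    "beer_garden_2": [2, 3, 0, 0],
    "west_entrance": [0, 0, 2, 2],
}
example_dissimilarity = pairwise_dissimilarity(example_table)
```

In this toy table the two beer gardens share visitors and so have a low dissimilarity, while the entrance shares none with either and sits at dissimilarity 1, which is the kind of separation the dendrogram shows for the three entrances. The resulting pairwise values are what a complete-linkage clustering would take as input.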
We can use the insights from clustering to study specific groups. For example, if we subset the individuals who go to two beer gardens and the Tyrannosaurus Restroom on Friday, we find 76 individuals, and this subset can be visualized in the sequence view (Figure 9).
Figure 9. Sequence view of the beer garden groups.
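The subsetting step can be expressed as a set-containment filter. The sketch below is in Python for illustration only; the visit format and the attraction labels are hypothetical and do not come from the challenge data.

```python
from collections import defaultdict

def visitors_covering(visits, required):
    """Return the sorted ids of visitors who went to every attraction in `required`.

    visits: iterable of (visitor_id, attraction_name) pairs (assumed format).
    """
    seen = defaultdict(set)
    for visitor_id, attraction in visits:
        seen[visitor_id].add(attraction)
    required = set(required)
    # `required <= s` is True when s contains every required attraction.
    return sorted(v for v, s in seen.items() if required <= s)

# Hypothetical toy data with placeholder attraction labels.
example_visits = [
    (1, "beer_garden_1"), (1, "beer_garden_2"), (1, "restroom"),
    (2, "beer_garden_1"), (2, "restroom"),
    (3, "beer_garden_2"), (3, "beer_garden_1"), (3, "restroom"), (3, "ride"),
]
example_subset = visitors_covering(
    example_visits, ["beer_garden_1", "beer_garden_2", "restroom"]
)
```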
MC1.2 – Are there notable differences in the patterns of activity in the park across the three days? Please describe the notable difference you see.
Limit your response to no more than 3 images and 300 words.
Some notable differences in the behavior of large groups are mentioned in MC1.1. In addition, we compare the activity at each attraction across the three days using small multiples of histograms that show the distribution of attendance counts per attraction (Figure 10). We gained a few insights. First, the Creighton Pavilion and the Grinosaurus Stage close after 12:00 on Sunday. Second, there is a relatively high number of check-ins at the Leggement Fix-Me-Up around 14:00 on Friday. Third, the park appears busier on Saturday and Sunday than on Friday, with Sunday being the busiest.
Figure 10. Histograms of check-in counts per attraction, binned per hour.
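The hourly binning behind such histograms can be sketched as follows. This is an illustrative Python version, not the ggplot2 code used for the figure; the record format (an ISO-style timestamp string plus an attraction label) is an assumption.

```python
from collections import Counter
from datetime import datetime

def hourly_checkins(records):
    """Count check-ins per (attraction, date, hour).

    records: iterable of (timestamp, attraction) pairs, where timestamp is a
    "YYYY-MM-DD HH:MM:SS" string (assumed format for this sketch).
    """
    counts = Counter()
    for timestamp, attraction in records:
        moment = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
        counts[(attraction, moment.date().isoformat(), moment.hour)] += 1
    return counts

# Hypothetical check-in records with placeholder attraction labels.
example_records = [
    ("2014-06-06 14:05:00", "first_aid"),
    ("2014-06-06 14:40:10", "first_aid"),
    ("2014-06-06 15:01:00", "first_aid"),
    ("2014-06-08 09:30:00", "stage"),
]
example_counts = hourly_checkins(example_records)
```

Keeping the date in the key makes it straightforward to lay the three days out as separate panels per attraction.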
In Figure 11, we aggregate the attraction counts by area and by type of attraction. This plot allows us to compare distributions and trends in the context of geographic location and type of activity. For example, the North Entrance in the Entrance Corridor is the most used entrance. Shopping attractions get busier after 18:00, while rides for everyone and thrill rides quiet down. The same anomaly at the Leggement Fix-Me-Up and the closure of the Creighton Pavilion and the Grinosaurus Stage can also be observed.
Figure 11. Histograms of check-in counts, aggregated by areas and types of attraction, binned per hour.
Figure 12 compares the distributions of minutes spent at each attraction across the three days. The duration is estimated from the GPS records. The Wrightiraptor Mountain, TerrorSaur, Firefall, and Flight of the Swingdon have longer waiting periods on Saturday and Sunday. The Auvilotops Express appears to have a longer waiting time only on Sunday.
Figure 12. Histograms of minutes spent at the attractions, binned per minute.
MC1.3 – What anomalies or unusual patterns do you see? Describe no more than 10 anomalies, and prioritize those unusual patterns that you think are most likely to be relevant to the crime.
Limit your response to no more than 10 images and 500 words.
Using the derived trajectory data and the sequence view, we visualize the subset of individuals who appear to be present in the Creighton Pavilion (GPS=32,33) and draw the inferred time periods they spend at this location. We color each line according to whether it was derived from check-in events or only from movement records. By comparing the plots for Friday (Figure 13), Saturday (Figure 14), and Sunday (Figure 15), we identify two suspicious groups on Sunday before the pavilion closed. The first group, “group 1”, consists of 37 individuals who appear to be at the pavilion during hours when it is usually closed, and who do not check in. The second, “group 2”, consists of 3 visitors who also appear to stay in or near the pavilion during hours when it is usually closed.
Figure 13. Sequence view at the Pavilion on Friday.
Figure 14. Sequence view at the Pavilion on Saturday.
Figure 15. Sequence view at the Pavilion on Sunday.
Using the GPS data and the derived trajectory data, we calculate the total count of “movement” GPS records and the distance traveled for each user. The result is shown as a scatter plot (Figure 16), in which we identify 7 individuals as outliers. The right side of Figure 16 shows the hourly movement patterns of these individuals, overlaid. The movement patterns are highly synchronized and regular. These individuals check in only at the East Entrance, at 8:00 and 13:00, and move back and forth to the Grinosaurus Stage. Because of this regularity, we hypothesize that they are security staff working in the park.
Figure 16. Scatter plot of movement count and total distance traveled, and movement pattern of identified outliers.
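The two per-user quantities plotted in the scatter plot can be derived as sketched below. This is a Python illustration under assumed conventions: each record is `(timestamp, visitor_id, record_type, x, y)`, the record type is the string "movement" or "check-in", and distance is the Euclidean distance between consecutive grid positions. None of these details are confirmed by the submission.

```python
import math
from collections import defaultdict

def movement_summary(records):
    """Per visitor: number of "movement" records and total distance traveled.

    records: iterable of (timestamp, visitor_id, record_type, x, y) tuples
    (assumed format); timestamps only need to be sortable.
    """
    ordered = sorted(records, key=lambda r: (r[1], r[0]))
    summary = defaultdict(lambda: {"moves": 0, "distance": 0.0})
    last_position = {}
    for _timestamp, visitor_id, record_type, x, y in ordered:
        entry = summary[visitor_id]
        if record_type == "movement":
            entry["moves"] += 1
        if visitor_id in last_position:
            px, py = last_position[visitor_id]
            # Straight-line distance from the previous recorded position.
            entry["distance"] += math.hypot(x - px, y - py)
        last_position[visitor_id] = (x, y)
    return dict(summary)

# Hypothetical records for two visitors.
example_records = [
    (0, 1, "check-in", 0, 0),
    (1, 1, "movement", 3, 4),
    (2, 1, "movement", 3, 4),
    (0, 2, "check-in", 10, 10),
]
example_summary = movement_summary(example_records)
```

Plotting "moves" against "distance" for every visitor gives a scatter plot of the kind described; how the outliers are then picked out (here, by eye) is a separate step that this sketch does not attempt.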
If we subset the individuals who check in twice, only at the East Entrance, we find one additional individual (1787551). We then compare movement patterns across the three days using the GPS and trajectory data. The GPS trails and trajectory paths are colored by time of day. Figure 17 shows the GPS trails and Figure 18 shows the trajectory paths. The GPS trails show very regular and synchronized patterns, while the trajectory paths reveal one anomaly: individual 1080969 appears to slow down or pause near the Pavilion on Sunday morning. We find this anomaly very suspicious and hypothesize that it is related to the crime.
Figure 17. GPS trails of the presumed security staff.
Figure 18. Trajectory paths of the presumed security staff.
Figure 19. Comparison of start and end positions per visitor. Colored circles indicate visitors who enter and exit through the same entrance, and circle size represents frequency. Arrows show anomalies, that is, visitors who do not end up where they started, and the numbers give the visitor IDs. For example, the last GPS record of visitor 657863 on Friday is near the Scholtz Express, but on Saturday his/her GPS trail starts there and goes to the North Entrance. This could be suspicious, but it could also be a case of a lost mobile phone.
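The start/end comparison can be sketched as follows. This is an illustrative Python version with an assumed record format of `(timestamp, visitor_id, x, y)` for a single day; the `tolerance` parameter and the coordinates in the example are hypothetical.

```python
def start_end_mismatches(records, tolerance=0):
    """Find visitors whose last recorded position differs from their first.

    records: iterable of (timestamp, visitor_id, x, y) tuples (assumed format).
    tolerance: largest per-axis difference still treated as "same place".
    Returns {visitor_id: (first_position, last_position)} for mismatches only.
    """
    first, last = {}, {}
    for _timestamp, visitor_id, x, y in sorted(records, key=lambda r: r[0]):
        first.setdefault(visitor_id, (x, y))  # keeps the earliest position
        last[visitor_id] = (x, y)             # overwritten until the latest
    mismatches = {}
    for visitor_id, (sx, sy) in first.items():
        ex, ey = last[visitor_id]
        if max(abs(sx - ex), abs(sy - ey)) > tolerance:
            mismatches[visitor_id] = ((sx, sy), (ex, ey))
    return mismatches

# Hypothetical records: visitor 10 returns to the start, visitor 11 does not.
example_records = [
    (1, 10, 0, 67), (2, 10, 20, 60), (3, 10, 0, 67),
    (1, 11, 99, 77), (2, 11, 70, 60), (3, 11, 50, 50),
]
example_mismatches = start_end_mismatches(example_records)
```

Running this per day and comparing a visitor's end position on one day with the start position on the next would surface cases like the one described for visitor 657863.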