Christoph Kinkeldey, g2lab, HafenCity University Hamburg,
christoph.kinkeldey@hcu-hamburg.de
Anna-Lena Kornfeld, g2lab, HafenCity University Hamburg,
anna-lena.kornfeld@hcu-hamburg.de
A Java-based integrated tool using the following libraries:
ParVis http://www.mediavirus.org/parvis/
Prefuse library http://prefuse.org/
JFreeChart http://www.jfree.org/jfreechart/
Video:
VAST 2010 Mini-Challenge 2 from g2lab hcu on Vimeo.
ANSWERS:
MC2.1: Analyze the
records you have been given to characterize the spread of the disease. You
should take into consideration symptoms of the disease, mortality rates,
temporal patterns of the onset, peak and recovery of the disease. Health
officials hope that whatever tools are developed to analyze this data might be
available for the next epidemic outbreak. They are looking for
visualization tools that will save them analysis time so they can react
quickly.
We started the analysis of the hospitalization records by examining
the given attributes of the patients. We considered the syndrome descriptions a major
key to revealing significant characteristics of the disease. Furthermore, we
assumed that temporal changes within the syndromes could give further insights.
For this reason, we extracted a number of datasets
from the given tables containing the number of symptoms per day with respect to all
patients and divided them into different groups (mortality, gender, age
classes). We implemented the filters in Java. This gave us
a first impression of the data structure.
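The extraction step described above can be sketched as follows. This is a minimal illustration, not our original filter code: the class name `SyndromeCounts`, the method `countPerDay`, and the row layout (date, syndrome text, gender) are assumptions made for the example.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Predicate;

final class SyndromeCounts {

    // Assumed (hypothetical) row layout: {date as yyyy-MM-dd, syndrome text, gender}.
    // Returns syndrome -> (date -> number of records), restricted to the rows
    // accepted by the group filter (e.g. one gender or one age class).
    static Map<String, Map<String, Integer>> countPerDay(
            List<String[]> rows, Predicate<String[]> group) {
        Map<String, Map<String, Integer>> series = new TreeMap<>();
        for (String[] row : rows) {
            if (!group.test(row)) {
                continue; // record does not belong to the requested group
            }
            String date = row[0];
            // Trim and upper-case so trivially different spellings share one key.
            String syndrome = row[1].trim().toUpperCase(Locale.ROOT);
            series.computeIfAbsent(syndrome, key -> new TreeMap<>())
                  .merge(date, 1, Integer::sum);
        }
        return series;
    }
}
```

Each inner map is one time series, i.e. one line in a chart of syndrome occurrences per day. A call such as `countPerDay(rows, r -> r[2].equals("F"))` would yield the dataset for female patients only.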
The number of unique syndromes is quite high (> 1000)
and their format differs, e.g. in the order of the symptoms within
the syndrome description and in the punctuation. For this reason, we examined the
syndrome descriptions and used our own Java tools to analyze their structure
(How many single symptoms do they contain? How can they be separated?).
In choosing an appropriate visualization method, we had
to consider the high number of different syndromes to be displayed over time. This
resulted in a simple line chart with semi-transparent lines indicating the
absolute number of occurrences of each syndrome per day.
When we visualized the extracted data in ParVis, it
became evident that a subset of the syndromes shows a conspicuous rise and
decline in most of the datasets. This gave us some indication as to how the disease
could be described. For the first part of our visual analytics tool, we brushed
the interesting syndromes in a modified version of the ParVis view. This
turned out to be a valuable step in the analysis process. Outliers included in
some datasets could easily be detected with this visualization method.
The next task was the analysis of the selected
syndrome subsets. It was useful to prepare the syndromes in order to reveal hidden
information. By splitting up the syndrome descriptions (at every space or
punctuation character) we obtained a list of words, including the symptoms. It
was likely that the epidemic could be described by a combination of certain
symptoms. Therefore, our objective was to find characteristic word combinations
from the split-up syndromes. Two questions seemed to be crucial:
1. How often do single words occur?
2. How often do words appear as a combination in the syndrome
descriptions?
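Both questions reduce to simple counting once the descriptions are split. The sketch below is an illustration under assumptions, not our original tool: the class `WordCooccurrence` and its methods are hypothetical names, and it assumes that each word and each word pair is counted at most once per syndrome description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

final class WordCooccurrence {

    // Split a description at every run of whitespace or punctuation characters.
    static List<String> tokenize(String description) {
        List<String> words = new ArrayList<>();
        for (String token : description.toUpperCase(Locale.ROOT).split("[\\s\\p{Punct}]+")) {
            if (!token.isEmpty()) { // a leading separator yields an empty first token
                words.add(token);
            }
        }
        return words;
    }

    // Question 1: in how many descriptions does each single word occur?
    static Map<String, Integer> wordCounts(List<String> descriptions) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String description : descriptions) {
            for (String word : new TreeSet<>(tokenize(description))) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Question 2: in how many descriptions do two words occur together?
    // Keys have the form "A|B" with A before B alphabetically.
    static Map<String, Integer> pairCounts(List<String> descriptions) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String description : descriptions) {
            List<String> words = new ArrayList<>(new TreeSet<>(tokenize(description)));
            for (int i = 0; i < words.size(); i++) {
                for (int j = i + 1; j < words.size(); j++) {
                    counts.merge(words.get(i) + "|" + words.get(j), 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

The single-word counts correspond to the first question and the pair counts to the second. Because `|` is itself a punctuation character, it can never appear inside a token, which makes it a safe separator for the pair keys.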
To provide a visual solution, we chose a graph view
with each node representing a word and each edge connecting two words that occur in
the same syndrome description. The size and opacity of the nodes depict the number
of occurrences, whereas the thickness and opacity of the edges express how often
the respective two words appear in combination. This part of the tool was
implemented using the Prefuse library in Java.
For further analysis we integrated the death rate into
the graph view as an important indicator for the epidemic. We encoded this
attribute as the thickness of each node’s contour in relation to its size. We
added the gender and age distribution information as part of an integrated view
which is displayed by clicking on a node (for a single word) or on an edge (for
the word combination). As a result, we found a significant combination in most
of the datasets:
The word pairs ‘VOMITING’ and
‘DIARRHEA’ as well as ‘ABD’ and ‘PAIN’ showed the highest numbers of
occurrences and a strong correlation within each pair.
High death rates support the assumption that this
combination is significant for the pandemic spread. The age and gender
distribution illustrates that the two pairs seem to behave quite homogeneously.
A further important finding is that the disease affects men and women almost
equally.
MC2.2: Compare the outbreak
across cities. Factors to consider include timing of outbreaks, numbers of
people infected and recovery ability of the individual cities. Identify any
anomalies you found.
Based on our hypothesis, we systematically compared
the situation in different countries. To do so, we filtered each
hospitalization record dataset according to the symptom combination. The objective
was to find out whether the syndrome occurrences over time differ in the given countries.
We decided to compare the qualitative development of
the syndrome over time because we had no information on what share of
the population the given data covers. For the comparison we chose a line
graph showing each curve scaled to its own maximum value. A small circle
marks the peak of each line to facilitate the comparison of the temporal development.
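Scaling each curve to its own maximum and locating its peak can be sketched in a few lines. The class `PeakScaling` and its method names are hypothetical and only illustrate the idea; the input is assumed to be one country's daily counts in chronological order.

```java
final class PeakScaling {

    // Index of the first day with the maximum count, or -1 for an empty series.
    static int peakIndex(int[] dailyCounts) {
        int peak = -1;
        for (int day = 0; day < dailyCounts.length; day++) {
            if (peak < 0 || dailyCounts[day] > dailyCounts[peak]) {
                peak = day;
            }
        }
        return peak;
    }

    // Divide every value by the series maximum so the peak becomes 1.0.
    // A series without any occurrences stays at 0.0 throughout.
    static double[] scaleToOwnMax(int[] dailyCounts) {
        double[] scaled = new double[dailyCounts.length];
        int peak = peakIndex(dailyCounts);
        if (peak < 0 || dailyCounts[peak] == 0) {
            return scaled;
        }
        double max = dailyCounts[peak];
        for (int day = 0; day < dailyCounts.length; day++) {
            scaled[day] = dailyCounts[day] / max;
        }
        return scaled;
    }
}
```

After this scaling, all curves share the range 0 to 1, so only their shape and the timing of the peak remain comparable, not the absolute number of cases.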
Another interesting question was how the death rates develop
in each country. For this purpose, we extended the syndrome comparison chart by
adding a diagram showing the mortality rate for each day and country. Both
charts were implemented using the JFreeChart library which enabled the
integration into our Java-based tool.
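A daily mortality rate can be defined in more than one way. The sketch below assumes one possible definition, the share of patients admitted on a given day who later died, and a hypothetical row layout of admission date plus death date (empty if the patient survived). Neither the class `DailyMortality` nor this definition is taken from our tool; both serve only as an example.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class DailyMortality {

    // Assumed (hypothetical) row layout: {admission date, death date or "" if survived}.
    // Returns admission date -> deaths among that day's admissions / admissions.
    static Map<String, Double> ratePerAdmissionDay(List<String[]> rows) {
        Map<String, int[]> tally = new TreeMap<>(); // value: {admissions, deaths}
        for (String[] row : rows) {
            int[] counts = tally.computeIfAbsent(row[0], key -> new int[2]);
            counts[0]++;
            if (!row[1].isEmpty()) {
                counts[1]++;
            }
        }
        Map<String, Double> rates = new TreeMap<>();
        for (Map.Entry<String, int[]> entry : tally.entrySet()) {
            int[] counts = entry.getValue();
            rates.put(entry.getKey(), (double) counts[1] / counts[0]);
        }
        return rates;
    }
}
```

Computed once per country, the resulting map gives one series per country for a mortality chart alongside the syndrome comparison.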
The visual comparison of the charts led to the
conclusion that most of the countries show similar characteristics. The rise
and decline of the occurring syndromes have a symmetric shape with a clear peak
in the center.
Two of the eleven countries, Turkey and Thailand,
show a different shape. The course of their curves is
irregular and lacks the characteristic peak.
Concentrating on the other countries, we detected
similarities in the temporal course. All graphs display a rise and a decline of
approximately four weeks each. Even the peaks did not differ much in time, ranging
from May 13th (Aleppo, Syria) to May 25th (Colombia).
The overall comparison led to the interpretation that
the high correlation between the different datasets confirms our hypothesis
for the majority of the countries.
Because Turkey and Thailand did not
exhibit the typical peak in the filtered syndromes, we conclude that the epidemic
did not take place in these countries, at least not in the given period of
time.
All the peaks occurred within a period of three weeks.
The geographical distance did not seem to play a role during the spread of the
disease. We assume that it spread out more or less simultaneously and did not
infect one region after the other.