g2lab presents IEEE VAST Challenge 2010

VAST 2010 Challenge
Hospitalization Records - Characterization of Pandemic Spread

Authors and Affiliations:

Christoph Kinkeldey, g2lab, HafenCity University Hamburg, christoph.kinkeldey@hcu-hamburg.de
Anna-Lena Kornfeld, g2lab, HafenCity University Hamburg, anna-lena.kornfeld@hcu-hamburg.de

Tool(s):

A Java-based integrated tool using the following libraries:

ParVis http://www.mediavirus.org/parvis/

Prefuse library http://prefuse.org/

JFreeChart http://www.jfree.org/jfreechart/

Video:

VAST 2010 Mini-Challenge 2 from g2lab hcu on Vimeo.

ANSWERS:

MC2.1: Analyze the records you have been given to characterize the spread of the disease. You should take into consideration symptoms of the disease, mortality rates, temporal patterns of the onset, peak and recovery of the disease. Health officials hope that whatever tools are developed to analyze this data might be available for the next epidemic outbreak. They are looking for visualization tools that will save them analysis time so they can react quickly.

We started analyzing the hospitalization records by examining the given patient attributes. We considered the syndrome descriptions key to revealing significant characteristics of the disease, and we assumed that temporal changes in the syndromes could give further insights.

For this reason, we extracted a number of datasets from the given tables, each containing the number of symptoms per day for all patients, and divided them into different groups (mortality, gender, age classes). We implemented the filters in Java. This gave us a first impression of the data structure.
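The extraction step described above can be sketched roughly as follows. This is a minimal illustration and not the original filter code: the `Admission` class, its field names and the sample rows are assumptions made for the example, and the real challenge tables may be laid out differently.

```java
import java.time.LocalDate;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Predicate;

final class DailySyndromeCounts {

    // Hypothetical record layout; the real challenge tables may use other columns.
    static final class Admission {
        final LocalDate date;
        final String syndrome;
        final String gender;
        final int age;
        final boolean died;

        Admission(LocalDate date, String syndrome, String gender, int age, boolean died) {
            this.date = date;
            this.syndrome = syndrome;
            this.gender = gender;
            this.age = age;
            this.died = died;
        }
    }

    // For every day, counts how often each syndrome occurs among the records kept by the filter.
    static Map<LocalDate, Map<String, Integer>> countPerDay(
            List<Admission> records, Predicate<Admission> filter) {
        Map<LocalDate, Map<String, Integer>> counts = new TreeMap<>();
        for (Admission a : records) {
            if (!filter.test(a)) {
                continue;
            }
            counts.computeIfAbsent(a.date, d -> new TreeMap<>())
                  .merge(a.syndrome, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        LocalDate day = LocalDate.of(2009, 5, 1); // invented sample date
        List<Admission> records = List.of(
                new Admission(day, "VOMITING", "F", 34, false),
                new Admission(day, "VOMITING", "M", 71, true),
                new Admission(day.plusDays(1), "ABD PAIN", "F", 8, false));
        System.out.println(countPerDay(records, a -> true));        // all patients
        System.out.println(countPerDay(records, a -> a.died));      // fatal cases only
        System.out.println(countPerDay(records, a -> a.age >= 65)); // one age class
    }
}
```

Passing the group as a predicate keeps one counting routine for all groups (mortality, gender, age classes) instead of one routine per group.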

The number of unique syndromes is quite high (> 1000) and their format varies, e.g. in the order of the symptoms within the syndrome description and in the punctuation. For this reason, we examined the syndrome descriptions and used self-written Java tools to analyze their structure (how many single symptoms do they contain, and how can they be separated?).
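One way to make descriptions comparable despite differing order and punctuation is to bring them into a canonical form. The sketch below is an assumption about how this could be done, not the tool we describe above: it upper-cases the text, splits at whitespace and punctuation, sorts the parts and joins them again. Sorting at word level is crude, because multi-word symptoms are broken up, but it makes two descriptions with the same words compare as equal.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.stream.Collectors;

final class SyndromeNormalizer {

    // Returns the words of a description in upper case and alphabetical order,
    // separated by single spaces. Whitespace and punctuation act as separators.
    static String normalize(String syndrome) {
        return Arrays.stream(syndrome.toUpperCase(Locale.ROOT).split("[\\s\\p{Punct}]+"))
                .filter(word -> !word.isEmpty()) // a leading separator yields an empty first part
                .sorted()
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Both calls print the same canonical string.
        System.out.println(normalize("Vomiting, diarrhea"));
        System.out.println(normalize("DIARRHEA / VOMITING."));
    }
}
```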

For the visualization we had to account for the high number of different syndromes to be displayed over time. This resulted in a simple line chart with semi-transparent lines indicating the absolute number of occurrences of each syndrome per day.

Visualizing the extracted data in ParVis made it evident that a subset of the syndromes shows a conspicuous rise and decline in most of the datasets. This gave us some indication of how the disease could be described. For the first part of our visual analytics tool we adapted the ParVis view so that the interesting syndromes could be brushed. This turned out to be a valuable step in the analysis process. Outliers in some datasets could also easily be detected with this visualization method.

The next task was the analysis of the selected syndrome subsets. It was useful to preprocess the syndromes in order to reveal hidden information. By splitting the syndrome descriptions at every space or punctuation character we obtained a list of words, including the symptoms. We considered it likely that the epidemic could be described by a combination of certain symptoms. Therefore, our objective was to find characteristic word combinations in the split-up syndromes. Two questions seemed to be crucial:

1. How often do single words occur?

2. How often do words appear as a combination in the syndrome descriptions?
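Both questions can be answered with a single counting pass over the descriptions. The following is a minimal sketch under our own assumptions: the class name and the `+` pair key are hypothetical, and each word is counted once per description, so the numbers are "descriptions containing the word" rather than raw token counts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

final class WordCooccurrence {

    final Map<String, Integer> wordCounts = new TreeMap<>();
    final Map<String, Integer> pairCounts = new TreeMap<>(); // key: "FIRST+SECOND", alphabetical

    // Splits one description at whitespace and punctuation and updates both counters.
    void add(String syndrome) {
        // A sorted set removes duplicates and makes the pair keys independent of word order.
        TreeSet<String> unique = new TreeSet<>();
        for (String word : syndrome.toUpperCase(Locale.ROOT).split("[\\s\\p{Punct}]+")) {
            if (!word.isEmpty()) {
                unique.add(word);
            }
        }
        List<String> words = new ArrayList<>(unique);
        for (int i = 0; i < words.size(); i++) {
            wordCounts.merge(words.get(i), 1, Integer::sum);
            for (int j = i + 1; j < words.size(); j++) {
                pairCounts.merge(words.get(i) + "+" + words.get(j), 1, Integer::sum);
            }
        }
    }

    public static void main(String[] args) {
        WordCooccurrence counter = new WordCooccurrence();
        counter.add("VOMITING, DIARRHEA");       // invented sample descriptions
        counter.add("ABD PAIN");
        counter.add("diarrhea; vomiting / abd pain");
        System.out.println(counter.wordCounts);
        System.out.println(counter.pairCounts);
    }
}
```

In a graph view, `wordCounts` would supply the node weights and `pairCounts` the edge weights.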

To provide a visual solution, we chose a graph view in which each node represents a word and each edge connects two words that occur in the same syndrome description. The size and opacity of a node depict the number of occurrences of its word, whereas the thickness and opacity of an edge express how often the respective two words appear in combination. This part of the tool was implemented using the Prefuse library in Java.

For further analysis we integrated the death rate into the graph view as an important indicator for the epidemic. We encoded this attribute as the thickness of each node’s contour in relation to its size. We added the gender and age distribution information as part of an integrated view which is displayed by clicking on a node (for a single word) or on an edge (for the word combination). As a result, we found a significant combination in most of the datasets:

The word pairs ‘VOMITING’ and ‘DIARRHEA’ as well as ‘ABD’ and ‘PAIN’ had the highest numbers of occurrences, and the two words within each pair were strongly correlated.

High death rates support the assumption that this combination is significant for the pandemic spread. The age and gender distributions show that the two pairs behave quite homogeneously. A further important finding is that the disease affects men and women almost equally.

MC2.2: Compare the outbreak across cities. Factors to consider include timing of outbreaks, numbers of people infected and recovery ability of the individual cities. Identify any anomalies you found.

Based on our hypothesis, we systematically compared the situation in the different countries. To do so, we filtered each hospitalization record dataset by the symptom combination. The objective was to find out whether the syndrome occurrences over time differ between the given countries.

We decided to compare the qualitative development of the syndrome over time because we did not know what share of the population the given data covers. For the comparison we chose a line graph showing each curve scaled to its own maximum value. A small circle marks the peak of each line to facilitate the comparison of the temporal development.
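The scaling and the peak marker can be illustrated with a short sketch. It assumes one array of daily counts per country; the class and method names are hypothetical and the numbers are invented.

```java
import java.util.Arrays;

final class PeakNormalizer {

    // Index of the first maximum, or -1 for an empty series.
    static int peakIndex(double[] series) {
        int peak = -1;
        for (int i = 0; i < series.length; i++) {
            if (peak < 0 || series[i] > series[peak]) {
                peak = i;
            }
        }
        return peak;
    }

    // Divides every value by the series maximum so that the peak becomes 1.0.
    // A series without a positive maximum is returned as an unscaled copy.
    static double[] scaleToMax(double[] series) {
        double[] scaled = series.clone();
        int peak = peakIndex(series);
        if (peak < 0 || series[peak] <= 0) {
            return scaled;
        }
        for (int i = 0; i < scaled.length; i++) {
            scaled[i] = series[i] / series[peak];
        }
        return scaled;
    }

    public static void main(String[] args) {
        double[] countryA = {2, 4, 8, 4};        // invented daily counts
        double[] countryB = {100, 300, 200, 50}; // much larger absolute numbers
        System.out.println(Arrays.toString(scaleToMax(countryA)) + " peak at index " + peakIndex(countryA));
        System.out.println(Arrays.toString(scaleToMax(countryB)) + " peak at index " + peakIndex(countryB));
    }
}
```

After scaling, both curves share the range 0 to 1, so their shapes and peak days can be compared even though the absolute counts differ by orders of magnitude.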

Another interesting question was how the death rates develop in each country. For this purpose, we extended the syndrome comparison chart by adding a diagram showing the mortality rate for each day and country. Both charts were implemented using the JFreeChart library which enabled the integration into our Java-based tool.
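How the daily mortality rate is defined is not spelled out above. The sketch below assumes one possible definition, namely the number of deaths attributed to a day divided by the number of admissions on that day; the class name and the input maps are hypothetical.

```java
import java.time.LocalDate;
import java.util.Map;
import java.util.TreeMap;

final class DailyMortality {

    // Assumed definition: deaths attributed to a day divided by admissions on that day.
    // Days without admissions get a rate of 0.0.
    static Map<LocalDate, Double> rate(Map<LocalDate, Integer> admissions,
                                       Map<LocalDate, Integer> deaths) {
        Map<LocalDate, Double> rates = new TreeMap<>();
        for (Map.Entry<LocalDate, Integer> entry : admissions.entrySet()) {
            int admitted = entry.getValue();
            int died = deaths.getOrDefault(entry.getKey(), 0);
            rates.put(entry.getKey(), admitted == 0 ? 0.0 : (double) died / admitted);
        }
        return rates;
    }

    public static void main(String[] args) {
        LocalDate day = LocalDate.of(2009, 5, 1); // invented sample date
        Map<LocalDate, Integer> admissions = Map.of(day, 4, day.plusDays(1), 10);
        Map<LocalDate, Integer> deaths = Map.of(day, 1);
        System.out.println(rate(admissions, deaths));
    }
}
```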

The visual comparison of the charts led to the conclusion that most of the countries show similar characteristics. The rise and decline of the syndrome occurrences had a symmetric shape with a clear peak in the center.

Two of the eleven countries, Turkey and Thailand, show a different shape. The course of their curves was irregular and did not exhibit the characteristic peak.

Concentrating on the other countries, we detected similarities in the temporal course. All graphs display a rise and decline within approximately four weeks each. Even the peaks did not differ very much in time, ranging from May 13th (Aleppo, Syria) to May 25th (Colombia).

The overall comparison showed a high correlation between the different datasets, which we interpret as confirmation of our hypothesis for the majority of the countries.

Because Turkey and Thailand did not exhibit the typical peak in the filtered syndromes, we conclude that the epidemic did not take place in these countries, at least not in the given period of time.

All the peaks occurred within a period of three weeks. The geographical distance did not seem to play a role in the spread of the disease. We assume that it spread more or less simultaneously rather than infecting one region after the other.
