Periscopic – Aggregate Symptoms Visualization

VAST 2010 Challenge
Hospitalization Records -  Characterization of Pandemic Spread

Authors and Affiliations:

Kim Rees, Periscopic, kim@periscopic.com

 

Tool(s):

Tableau Desktop Software was used for the majority of visual analysis. Tableau is a data visualization and business intelligence tool.

http://tableausoftware.com

 

Microsoft Excel was also used for additional data formatting.

http://office.microsoft.com/en-us/excel/

 

Video:

http://periscopic.com/internal/PeriscopicAggregateMC2.1.mp4

http://periscopic.com/internal/PeriscopicAggregateMC2.2.mp4

 

 

ANSWERS:


MC2.1: Analyze the records you have been given to characterize the spread of the disease.  You should take into consideration symptoms of the disease, mortality rates, temporal patterns of the onset, peak and recovery of the disease.  Health officials hope that whatever tools are developed to analyze this data might be available for the next epidemic outbreak.  They are looking for visualization tools that will save them analysis time so they can react quickly.

ASSESSMENT:

The disease began in Iran on 4/19/2009, where it peaked on 5/27/2009 and was under control by 6/27/2009. At the peak of the disease in Iran, there were 550 deaths per day. The disease then spread to Lebanon (4/20), Saudi Arabia (4/21), Colombia and Venezuela (4/23), Nairobi (4/24), Aleppo (4/28), and Karachi and Yemen (4/29). Turkey and Thailand appeared to be unaffected by the disease. Most deaths occurred six to eight days after hospital admission.
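The admission-to-death interval mentioned above can be tallied with a few lines of code. The following is a minimal sketch, assuming each death record can be paired with its admission date; the record layout and the sample dates are made up for illustration and are not taken from the challenge data.

```python
from collections import Counter
from datetime import date

def lag_histogram(pairs):
    """Count deaths by whole days elapsed between admission and death.

    `pairs` is an iterable of (admission_date, death_date) tuples; this
    record layout is an assumption, not the challenge's file format.
    """
    return Counter((died - admitted).days for admitted, died in pairs)

# Made-up records for illustration only.
sample = [
    (date(2009, 5, 1), date(2009, 5, 7)),   # 6 days
    (date(2009, 5, 1), date(2009, 5, 8)),   # 7 days
    (date(2009, 5, 2), date(2009, 5, 9)),   # 7 days
]
histogram = lag_histogram(sample)
most_common_lag = histogram.most_common(1)[0][0]
```

Plotting such a histogram per city would show whether the six-to-eight-day window holds everywhere or only in aggregate.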

Each outbreak subsequent to Iran was controlled more quickly than the previous one, with the exception of Colombia (which still had a shorter duration than Iran). Iran had a duration of 68 days, while the final outbreak, Yemen, was brought under control within 53 days.

However, despite the significant reduction in the time to control the disease, the death rate increased as the pandemic spread. The first five cities affected had death rates of 0.2% to 2.7% (of total admissions for the pandemic period), while the final four cities' death rates ranged from 3.1% to 4.8%. It is unclear why the death rate increased as the recovery period decreased. Death rates did not seem to correlate with population size, geographic region, or other variables.
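The death rate used here is deaths as a percentage of total admissions for the pandemic period. A minimal sketch of that calculation follows; the city names and counts are placeholders chosen for illustration, not values from the challenge data.

```python
def death_rate(deaths, admissions):
    """Deaths as a percentage of total admissions for the pandemic period."""
    if admissions <= 0:
        raise ValueError("admissions must be positive")
    return 100.0 * deaths / admissions

# Placeholder counts, not values from the challenge data.
example = {"City A": (270, 10_000), "City B": (480, 10_000)}
rates = {city: round(death_rate(d, a), 1) for city, (d, a) in example.items()}
```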

The presenting symptomatology of the disease seemed to consist of nosebleeds, fever, rash, abdominal pain, and vomiting. The vast majority of deaths in every city had the admission symptoms of abdominal pain and vomiting.

PROCESS:

My initial survey of the data revealed numerous inconsistencies. It appeared that a single symptom could be labeled many different ways. For instance, abdominal pain could be denoted as “acute abdominal pain” or “unspecified abdominal pain.” Additionally, these labels often contained abbreviations (“abd pain”) or were combined with other symptoms (“abd pain, fever”). Some symptoms had over 60 variations. Presumably, these inconsistencies were due to the lack of a centralized or prescribed reporting methodology or system. Different individuals or reporting agencies would naturally have diverse approaches to describing symptoms.

Grouping the variations of a symptom eliminates the fragmentation of its incident counts. Without grouping, a symptom that has 40 variations is effectively misrepresented in the data: each variation could appear to have a rate of incidence as low as 1/40th of the symptom's actual rate. By grouping, we can obtain a more accurate tally of each symptom.

When tasked with spotting a disease or epidemic, it was apparent that certain data could be ignored. For instance, admission reports of lacerations, car accidents, and injuries could be excluded, as they cannot be considered indicators of such phenomena.

Additionally, when trying to assess high-level information for a task such as this, it seemed justifiable to treat similar symptoms as one and the same. For example, “abdominal pain” and “upper right abdominal pain” are similar enough to be considered in the same category. This obviously would not be a valid approach for diagnosing a disease, but for spotting a trend it seemed worthwhile to group similar labels.
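The grouping and exclusion described above were done by hand in Tableau and Excel. The same idea can be sketched in code; the following is a minimal illustration, assuming free-text admission labels, and the keyword rules and non-indicator terms are hypothetical examples rather than the list actually used.

```python
import re
from collections import Counter

# Hypothetical keyword rules; a real system would need a curated list.
GROUP_RULES = [
    ("abdominal pain", re.compile(r"\babd(ominal)?\b.*\bpain\b")),
    ("fever", re.compile(r"\bfever\b")),
    ("vomiting", re.compile(r"\bvomit")),
]

# Hypothetical non-indicator terms (injuries, accidents).
NON_INDICATORS = re.compile(r"\b(laceration|car accident|injur|fracture)")

def symptom_groups(label):
    """Return the set of symptom groups a free-text label falls into."""
    text = label.lower()
    if NON_INDICATORS.search(text):
        return set()  # excluded: not an indicator of disease
    return {group for group, pattern in GROUP_RULES if pattern.search(text)}

def tally(labels):
    """Count admissions per symptom group; combined labels count once per group."""
    counts = Counter()
    for label in labels:
        counts.update(symptom_groups(label))
    return counts
```

A combined label such as “abd pain, fever” counts toward both groups, so each symptom's tally reflects every admission that mentioned it. Note that this simple rule drops a label entirely if it mentions any non-indicator term, which may be too aggressive for mixed reports.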

With a massaged dataset (using Venezuela as an example), I first explored visualizing the data as a line chart. This clearly revealed the bell curve one would expect with an epidemic. It also exposed a clear spike of fever, abdominal pain, and vomiting reports prior to the epidemic manifestation. When viewing all symptoms (as opposed to aggregate), the bell curve is apparent, but the preliminary spike is not as evident.

Interestingly, the onset symptom appears to be dizziness and blurred vision when viewing all symptoms, but when grouped the indicators are clearly fever, rash, and abdominal pain. The onset is on 4/23/09 in both views, but the grouped view shows a more dramatic and apparent spike. Fever, however, does not appear to lead to death; perhaps if fever is detected as an early anomaly, that knowledge could be used to prevent an epidemic.
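The onset was identified visually from the line charts. As an illustration of how such a spike might be flagged automatically, here is a minimal sketch that compares each day's count with a trailing baseline; the window length and threshold are assumed values, and this is not the method used in the analysis.

```python
from statistics import mean, pstdev

def first_spike(daily_counts, window=7, threshold=3.0):
    """Return the index of the first day whose count exceeds the trailing
    window's mean by more than `threshold` standard deviations, or None."""
    for day in range(window, len(daily_counts)):
        baseline = daily_counts[day - window:day]
        mu = mean(baseline)
        sigma = pstdev(baseline) or 1.0  # fall back to 1.0 on a flat baseline
        if daily_counts[day] > mu + threshold * sigma:
            return day
    return None
```

Running this on grouped counts rather than on each raw label should make the spike easier to detect, since fragmentation across many label variations dilutes the daily totals.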

OnsetDashboard.jpg

Figure 1: Dashboard of epidemic over time and onsets. Showing all symptoms on the top row and grouped symptoms on the bottom.

DeathsDashboard.jpg

Figure 2: Dashboard showing admissions and deaths by symptom group.

RECOMMENDATIONS:

It is advisable to create a dashboard analytics system or subsystem that is dedicated to detecting outbreaks, epidemics, and pandemics.

In order to quickly identify a disease outbreak, many techniques can be employed. These can be used without changing existing data systems, reporting agency methodologies, or any existing inconsistencies. These solutions work at a high level, merely incorporating the various incoming reporting feeds and data.

1)       Aggregate Symptoms: Group symptoms that are the same but differ in spelling, abbreviation, or other variations.

2)       Ignore non-Indicators: Remove, hide, or exclude symptoms that would not indicate a disease (e.g., broken arm, dog bite).

3)       Prioritize Symptoms: Some conditions are more likely to occur with a contagious disease, such as fever, chills, pallor, vomiting, and weakness. These symptoms should be prioritized over others such as bleeding, stroke, and speech problems. Although less likely to be indicators, those outlying symptoms should not be overlooked.

4)       Dashboard: These indicators should be visualized in various ways over time. By doing this, they can be observed on a daily or hourly basis. Employing different methods of viewing the same data should allow for any anomalies to become readily apparent. Line charts, heat matrices, and sized symbol charts among others can be used in conjunction to synthesize a time-based picture of the data.

5)       Compare: The time-based dashboard should extend at least 12 months in the past. This will enable one to see seasonal fluctuations or other normal occurrences as compared to the current incoming data.

By following these recommendations, one should be able to quickly spot anomalies, outliers, and trends, which could help head off an outbreak or at least make it possible to better manage or react to an epidemic.
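Recommendations 3 and 5 can be combined into a simple score that a dashboard could sort by. The following is a minimal sketch, assuming grouped daily counts are already available; the priority weights and the use of the historical median as a baseline are illustrative assumptions.

```python
from statistics import median

# Hypothetical priority weights for recommendation 3.
PRIORITY = {"fever": 1.0, "vomiting": 1.0, "bleeding": 0.5}

def anomaly_scores(history, today, default_weight=0.5):
    """Score each symptom group by how far today's count sits above its
    historical median, scaled by a priority weight; highest score first."""
    scores = {}
    for group, count in today.items():
        past = history.get(group, [])
        base = median(past) if past else 0
        ratio = count / max(base, 1)  # guard against division by zero
        scores[group] = ratio * PRIORITY.get(group, default_weight)
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```

Here `history` would hold the past 12 months of daily counts per group, so seasonal levels are reflected in the baseline. Lower-priority groups still receive a score and therefore remain visible rather than being dropped.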

heatmatrix.jpg

Figure 3: Identifying early indicators with a heat matrix of admission rates.

 


MC2.2:  Compare the outbreak across cities.  Factors to consider include timing of outbreaks, numbers of people infected and recovery ability of the individual cities.  Identify any anomalies you found.

ASSESSMENT:

The disease began in Iran on 4/19/2009, where it peaked on 5/27/2009 and was under control by 6/27/2009. The disease then spread to Lebanon (4/20), Saudi Arabia (4/21), Colombia and Venezuela (4/23), Nairobi (4/24), Aleppo (4/28), and Karachi and Yemen (4/29). Turkey and Thailand appeared to be unaffected by the disease.

Each outbreak subsequent to Iran was controlled more quickly than the previous one, with the exception of Colombia (which still had a shorter duration than Iran). Iran had a duration of 68 days, while the final outbreak, Yemen, was brought under control within 53 days.

The pandemic seemed to focus on the Greater Middle East, including Lebanon, Aleppo, Saudi Arabia, Yemen, Iran, and Pakistan. However, the disease also spread to parts of South America and Africa. The spread to these areas was rapid, with South America being hit within four days of the initial onset and Africa following a day later.

PROCESS:

Building off the results I found in MC2.1, I was able to make comparisons across all cities. I summarized the onset, peak, and recovery dates, deaths, admissions, and admission symptoms for all cities in an Excel spreadsheet. Bringing this summarized data into Tableau, I could quickly visualize the locations and how they related to one another.

I looked at the recovery rates as a scatterplot comparing onset date, days to recovery, and rate of death per city. It was interesting to note that although the pandemic was more controlled as it progressed, the rate of death increased. Aleppo had the highest rate of death at 4.7% of admissions, yet was the shortest lived, being controlled within 51 days.

recoveryrates.jpg

Figure 4: Recovery rates by city. Size indicates rate of death. X axis is onset date and Y axis is total days of the outbreak.

Additionally, I was able to animate the number of deaths over time for each city. By using a map, it was easy to see where the pandemic started and how it quickly spread over time. This screenshot shows a static image from the animation. It shows Karachi and Lebanon at the peak of their epidemics on 5/25/09.

map.jpg

Figure 5: Size indicates current number of deaths and color shows overall rate of recovery for the city.

The specific process I employed was as follows:

1)       Exported the death records for each city that I had summarized using Tableau.

2)       Recorded summary data about each city that I was able to ascertain from MC2.1.

3)       Created a new Excel file that tabulated the data from steps 1 & 2.

4)       Using a Tableau Add-in for Excel, I was able to re-shape the data into a format that Tableau could understand.

5)       Used Tableau to connect to this Excel spreadsheet with the ability to link summary data and death records by city.

6)       Plotted this data in various ways in Tableau such as line charts, scatterplots, and animated maps.

These were all manual tasks with the exception of step 6. However, these manual tasks were completed in under three hours, including research time to find the tool in step 4. Once the data was formatted, the time to create visualizations using Tableau was minimal. It would be a very easy task to create dashboards and charts that could facilitate assessing data over larger time periods and regions.
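Steps 4 and 5 (reshaping the spreadsheet and linking summary data to death records by city) could also be scripted. The following is a minimal sketch, assuming a wide table with one row per city and one column per date; that layout and the field names are assumptions, since the exact spreadsheet format is not described here.

```python
def wide_to_long(header, rows):
    """Turn rows of [city, count_day1, count_day2, ...] into one
    {"city", "date", "deaths"} record per city and date."""
    dates = header[1:]
    return [
        {"city": row[0], "date": day, "deaths": int(value)}
        for row in rows
        for day, value in zip(dates, row[1:])
    ]

def attach_summary(records, summary_by_city):
    """Link each long record to its city's summary row (a left join on city)."""
    return [{**rec, **summary_by_city.get(rec["city"], {})} for rec in records]
```

The long format, with one row per city and date, is generally easier to plot over time than a table with one column per date.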

In summary, using techniques such as these not only clarifies the characteristics of the pandemic, but may also help detect the onset more rapidly or manage its spread.