Noblis Team - Centrifuge

VAST 2010 Challenge
Hospitalization Records -  Characterization of Pandemic Spread

Authors and Affiliations:

David C. Roberts, PhD, MPH, Noblis, Team Lead, droberts@noblis.org
Casey Henderson, Centrifuge Systems, chenderson@centrifugesystems.com
Peter Kuehl, MD, PhD, MedStar, peter.kuehl@medstar.net
Daniel Lucey, MD, Georgetown University, drl23@georgetown.edu
Michael Pietrzak, MD, MedStar, michael.pietrzak@medstar.net
Ben Pecheux, Noblis, benjamin.pecheux@noblis.org
Virginia Sielen, Noblis, virginia.sielen@noblis.org
Harry Cummins, Noblis, graphic artist, hcummins@noblis.org
Richard P. DiMassimo, Noblis, video producer, rdimassimo@noblis.org
Austin Blanton, Noblis Intern, austin.blanton@noblis.org
Catherine Campbell, PhD, Noblis, catherine.campbell@noblis.org [PRIMARY Contact]

Noblis VAST Webpage:http://www.noblis.org/VAST

Tools:

(1) PostgreSQL
(http://www.postgresql.org) as the underlying database, with phpPGAdmin/SQL for database administration and data manipulation.
(2) Microsoft Office 2007 (http://www.microsoft.com; primarily Access and Excel) for ad-hoc queries and data visualization.
(3) GeoCommons Maker (http://maker.geocommons.com/) for geographic visualization.
(4) Google Motion Chart (http://code.google.com/apis/visualization/documentation/gallery/motionchart.html) for animated charting.
(5) Centrifuge (http://www.centrifugesystems.com), a partner in this project, for exploratory data analysis and visualization.

Video:
 
Noblis_Centrifuge_MC2.mp4


ANSWERS:

MC2.1: Analyze the records you have been given to characterize the spread of the disease.  You should take into consideration symptoms of the disease, mortality rates, temporal patterns of the onset, peak and recovery of the disease.  Health officials hope that whatever tools are developed to analyze this data might be available for the next epidemic outbreak.  They are looking for visualization tools that will save them analysis time so they can react quickly.

Initial Observations.  Data were provided for hospital admissions and deaths for 11 distinct locations in 22 separate tables. These tables were imported into a shared postgreSQL database, and concatenated and joined across patient ID and location to provide a single unified table. (4 staff-hours including DB admin) The data represent a total of 14,543,948 admissions uniquely identified by patient ID and location, over the time period 4/16/09 to 6/29/09. Each admission is characterized by date, age, gender, and syndrome (a text string listing of one or more symptoms). Each death record provides a date of death for a patient ID. Of those patients admitted, 357,469 died. Using directed SQL queries, we determined that no patient was admitted more than once, and each death is associated with a single admission event. We frequently used “group by date” queries throughout the study to see admit and death rates as a function of date. This initial analysis task took approximately 2 hours to complete.
Characterization of Disease. The data were further analyzed using SQL queries and Centrifuge to parse and summarize the syndromes. We identified 1310 distinct syndrome descriptions, each identifying one or more symptoms. Viewing the frequency of deaths by syndrome and date across the entire dataset (Figure 2.1.1; note that all syndromes are shown but only a subset are labeled), only 92 of the syndromes showed a large increase followed by a decline, relative to apparent baseline rates, during the time period covered by the data and in addition were associated with nearly all deaths reported in those locations.  Figure 2.1.1 illustrates that, with one correctable exception (“nosebleeds”), these syndromes are equally represented in the data. These syndromes were designated as “outbreak-related”, and used in subsequent analyses to identify and select only the outbreak-related cases. These 92 syndromes appeared not only during the apparent outbreak period but also at a low frequency throughout the time domain of the data. Ad-hoc database queries revealed that the overall case fatality rate for patients with the outbreak-related disease was 9.87%, and that of the 357,469 deaths reported, only 11,438 (3.2 %) were not outbreak-related. Figure 2.1.2, a Centrifuge stacked bar chart, shows the development and spread of the disease, peaking nearly simultaneously over the span of one week, but in differing magnitude, in 9 of the 11 locations as expressed by outbreak-related deaths on a single timeline. Figure 2.1.3 depicts an animated GeoCommons geographic view of the disease that illustrates the change in mortality over time in the 9 outbreak locations. The hospital admission rates for outbreak related syndromes showed frequency curves that mirror these death rate curves and illustrations, and are not shown here. This characterization analysis was completed in 8 hours.

DeathbySyndrome.
Figure 2.1.1. Deaths by Syndrome and Date

OUtbreak Deaths
Figure 2.1.2. Outbreak-Related Deaths by Date and Location

Map
Figure 2.1.3. Screen Capture of Animated GeoCommons Geographic View of Mortality Timelines

Symptomatology. On reviewing the 92 outbreak related syndromes, we saw that many duplicate symptoms were listed therein. Because all 92 syndrome descriptions were equally represented, frequency of occurrence of a term in the 92 syndrome descriptions could be used to derive its probability of occurrence in a patient with the disease. Manual parsing and semantic analysis of the syndrome descriptions reduced them to 18 symptoms. Probabilities for the top 5 symptoms are provided in Figure 2.1.4; they sum to a value greater than 1 because some patients had multiple symptoms. The relatively low probability of fever among outbreak cases was noted as medically unusual, and the wide variety of symptoms was interpreted as characteristic of a viral disease agent. Ad-hoc queries revealed that the syndromes, and hence, symptoms, associated with outbreak patients that died have the same relative proportion in the data as those associated with survivors. In addition, the distribution of syndromes, and hence symptoms, did not vary significantly by location or date. This segment of analysis was completed in 8 hours.

Probability
Figure 2.1.4. Probability of Occurrence of Outbreak-Related Disease Symptoms

Age and Gender Effects.  The timelines for disease liftoff in all age groups track together for each outbreak (average outbreak-related admit dates for each age group agree within 3 days). The age distribution for outbreak related cases follows general admission rates, but is proportionally higher for the youngest and oldest age groups. There is no apparent difference among locations for age distribution of the disease. No gender effects are observed for any location or syndrome; male and femaie track together across all locations and average admit dates agree. These results were compiled in approximately 3 hours.

Saving time in the analytical process. Several approaches used here saved time in the analysis; other tools used here would allow rapid identification of an outbreak as it lifts off from baseline to facilitate early intervention. A unified table served by a fully functional remote database server provided high performance for these large data sets. Centrifuge software was found to be very useful for exploratory analysis, working seamlessly with the remote database server and making extensive use of mouseover displays of underlying attributes to allow rapid exploration of patterns within the data. The drill chart feature of Centrifuge (see accompanying video) allows exploration of underlying data for an item on one chart, expanding it to a second chart displaying other variables (Setting up Centrifuge used 4 staff-hours for setting up the system and configuration and another 12 staff-hours for learning to use the tool). The Microsoft Office suite was better suited to creating a final presentation, where font size and other aspects could be easily adjusted. Unless otherwise noted, visualizations were generated using this suite of tools. A Microsoft Access front end to the database, coupled to Excel, provided alternate and efficient means to query the data and develop appropriate graphics, and facilitated implementation of custom statistical functions such as CUSUM (see below) which could be used in predictive analysis. GeoCommons and Google Motion Charts were useful for visualizations and sharing information, but did not directly connect to our database.

MC2.2:  Compare the outbreak across cities.  Factors to consider include timing of outbreaks, numbers of people infected and recovery ability of the individual cities.  Identify any anomalies you found.

Comparison of Outbreaks Across Cities. Outbreaks of varying magnitude occurred in 9 of the 11 locations represented in the data. All outbreaks occurred nearly simultaneously, peaking within a 5 day period (see discussion of outbreak timing below), and all followed a similar time course, with a smooth onset and decline of cases. The nearly simultaneous nature of the outbreaks suggests a common origin. The well-behaved decline in cases across all locations is unusual for a transmissible illness, but could occur where the disease is aggressively treated, a rigorous quarantine is imposed, or the population has a high degree of residual immunity due to either previous exposure of the population to either the disease itself or to vaccinations.

Outbreak Timing. To determine the exact timing of each outbreak, we applied a spreadsheet implementation of the CUSUM function (Fricker RD, Hegler BL, Dunfee DA, Statist. Med. 2008, 27:3407-3429) that effectively compares an indicator function such as admits or deaths for a given day to a running baseline derived from previous values. In this case we used a 7-day running baseline. CUSUM is used in disease surveillance systems as a way to detect an outbreak just as it lifts off so that countermeasures can be initiated quickly. The alert level can be adjusted to give a low false positive rate, and the function can be run in reverse to indicate setdown--when the outbreak is officially over. Using hospital admission rates as the indicator, Figure 2.2.1 graphically compares the outbreaks observed in terms of liftoff date (identified by CUSUM on admits with an alert level of 1000), peak date and magnitude (determined by inspection), and setdown date (by reverse CUSUM, alert=0) across all 9 outbreak locations. If viewed in real time, the first liftoff would  occur in Karachi on April 23, and by May 3 all outbreaks are identified. The Aleppo outbreak peaks first on May 14, followed immediately by Karachi and Nairobi. The remaining locations have all peaked by May 20. All outbreaks are over by June 18. CUSUM was also used on the mortality data for an alternative determination of key dates, although this is not so useful as a leading indicator. The average duration of the outbreaks—i.e. the average timespan between liftoff and setdown--was 44.2 days (admits), 43.7 days (deaths), and 48.4 days (from admits liftoff to deaths setdown). Recovery ability, as determined by days from peak to setdown for CUSUM deaths, varied from 12 for Venezuela to 29 for Karachi, and appeared to correlate with total number of cases. Figure 2.2.2 illustrates the CUSUM function applied to mortality data. The cleaner mortality data allowed a nonzero value to be interpreted as a liftoff. Deaths occurred, with very few exceptions (discussed below under anomalies), exactly 8 days after admission. This analysis was completed in approximately 18 hours.

OutbreakTiming
Figure 2.2.1. Detailed Comparison of Outbreak Timing (Admits) Across Locations

CUSUM
Figure 2.2.2. CUSUM Function Applied to Deaths Data Across Outbreak Locations

Disease Severity. As shown in Figure 2.2.3, the differences in the way the disease manifests itself in the 9 locations can be seen by plotting the outbreaks by two calculated attributes that can be taken as measures of disease severity: mortality rates and relative prevalence. Mortality rates (more properly, case fatality rates) are determined as the number of outbreak-related deaths divided by number of outbreak-related illnesses, and ranged from 10.9% and 10.8% for Aleppo and Nairobi, to 8.4 and 8.7% for Saudi and Lebanon outbreaks. In terms of absolute magnitude, Aleppo and Nairobi outbreaks also had the highest relative prevalence (calculated as number of outbreak related cases divided by the number of non-outbreak cases for a given location as a relative measure of the population). Another desirable measure of disease severity, duration of symptoms, cannot be discerned from the available data. Thailand and Turkey did not exhibit a discernible outbreak. These calculations and visualizations were achieved in 8 hours.


Prevalence
Figure 2.2.3. Disease Outbreaks per Location Characterized by Mortality Rate and Relative Prevalence

Anomalies. In looking for anomalies in the data, we took the approach that all the data were potentially important, not just the data for the outbreak-related syndromes and dates. Using Centrifuge to present each syndrome count and date for a given location in a stacked bar (Figure 2.2.4; this is a representation of the Nairobi data set in the form of a scrolling bar chart with only a subset of syndromes visible in the window at any time). In this chart, we could observe the prominence of the outbreak-related syndromes, and irregularities in syndrome counts (overall bar height)--those related to the reproductive system were found to be abnormally underrepresented in all outbreak locations. The chart also gives an example of a syndrome spike anomaly (an anomalously high count-- more than 10 times greater than expected--of admits for certain dates) as can be seen in the irregular color code for the date bands for a given syndrome just to the right of the outbreak associated bars. These are observed for various syndromes, dates, and locations across the data set. Figure 2.2.5 illustrates a subset of the anomalies we observed by location and date. None of these observed anomalies occurs in more than one location, and each one is a spike on a single day of a non-outbreak-related syndrome, which makes them unlikely to be associated with the outbreak.   The significance of these spikes remains unknown. We further observed that all deaths occurred exactly 8 days after admission, except for a small minority (3.1%) of cases distributed normally between 2 and 7; if outbreak related syndromes are excluded, there still remain a very small fraction (0.23%) of these early deaths. Non-outbreak-related deaths did not exhibit any special patterns indicative of an anomaly. These data set analyses were completed in approximately 4 hours.


Anomolies
Figure 2.2.4. Use of Centrifuge to Detect Anomalies (Nairobi example)

Anomolies2
Figure 2.2.5. Summary of Syndrome Spikes in Early Part of Timeline