Periscopic – Aggregate Symptoms Visualization
Hospitalization Records - Characterization of Pandemic Spread
Authors and Affiliations:
Kim Rees, Periscopic, kim@periscopic.com
Tool(s):
Tableau Desktop Software was used
for the majority of visual analysis. Tableau is a data visualization and
business intelligence tool.
Microsoft Excel was also used for
additional data formatting.
http://office.microsoft.com/en-us/excel/
Video:
http://periscopic.com/internal/PeriscopicAggregateMC2.1.mp4
http://periscopic.com/internal/PeriscopicAggregateMC2.2.mp4
ANSWERS:
MC2.1: Analyze the records you
have been given to characterize the spread of the disease. You should take into consideration symptoms
of the disease, mortality rates, temporal patterns of the onset, peak and
recovery of the disease. Health
officials hope that whatever tools are developed to analyze this data might be
available for the next epidemic outbreak.
They are looking for visualization tools that will save them analysis
time so they can react quickly.
ASSESSMENT:
The disease began in Iran
on 4/19/2009, peaked there on 5/27/2009, and was under control by
6/27/2009. At the peak in Iran, there were 550 deaths per day. The
disease then spread to Lebanon (4/20), Saudi Arabia (4/21), Colombia and
Venezuela (4/23), Nairobi (4/24), Aleppo (4/28), and Karachi and Yemen (4/29).
Turkey and Thailand appeared to be unaffected by the disease. Most deaths
occurred six to eight days after hospital admission.
Each outbreak subsequent
to Iran was more quickly controlled than the previous one, with the exception of
Colombia (which still had a shorter duration than Iran). Iran had a duration of
68 days while the final outbreak, Yemen, was brought under control within 53
days.
However, despite the
significant reduction in the time to control the disease, the death rate increased
as the pandemic spread. Death rates in the first five cities affected ranged from 0.2% to 2.7%
(of total admissions for the pandemic period), while those in the final four
cities ranged from 3.1% to 4.8%. It is unclear why the death rate increased
as the recovery period decreased. Death rates did not seem to correlate to
population size, geographic region, or other variables.
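As a minimal sketch of how the death rate quoted above can be computed (deaths as a percentage of total admissions for the pandemic period), the following Python snippet lists the rate per city in order of onset. The city names and counts are invented placeholders, not values from the challenge data.

```python
def death_rate_percent(deaths, admissions):
    """Return deaths as a percentage of total admissions."""
    if admissions <= 0:
        raise ValueError("admissions must be positive")
    return 100.0 * deaths / admissions

# Placeholder records (city, onset order, deaths, admissions); values are invented.
cities = [
    ("City A", 1, 20, 10000),
    ("City B", 2, 270, 10000),
    ("City C", 3, 480, 10000),
]

# Death rate per city, listed in order of onset to expose any trend.
rates_by_onset = [
    (name, round(death_rate_percent(deaths, admissions), 1))
    for name, _order, deaths, admissions in sorted(cities, key=lambda c: c[1])
]
```

Listing the rates in onset order makes it easy to check whether later outbreaks were deadlier, which is the pattern described in the assessment.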
The presenting symptoms
of the disease appeared to be nosebleeds, fever, rash, abdominal pain,
and vomiting. In every city, the vast majority of patients who died had been
admitted with abdominal pain and vomiting.
PROCESS:
My initial survey of the
data revealed numerous inconsistencies. It appeared that a single symptom could
be labeled many different ways. For instance, abdominal pain could be denoted
as “acute abdominal pain” or “unspecified abdominal pain.” Additionally, these
labels often contained abbreviations (“abd pain”) or were combined with other
symptoms (“abd pain, fever”). Some symptoms had over 60 variations. Presumably,
these inconsistencies were due to the lack of a centralized or prescribed reporting
methodology or system. Different individuals or reporting agencies would naturally
take diverse approaches to describing symptoms.
Grouping the symptoms
eliminates the fragmentation of incidents for a given symptom. Without
grouping, a symptom that has 40 variations is effectively misrepresented in the
data: it could appear to have a rate of incidence as low as
1/40th of its actual rate. By grouping, we can obtain a
more accurate tally of a symptom.
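The effect of grouping on the tally can be illustrated with a short Python sketch. The keyword patterns and the admission labels below are hypothetical; in my analysis the grouping was done by hand in Tableau and Excel, so this is only one possible way to automate it.

```python
import re
from collections import Counter

# Hypothetical keyword rules: each group is matched by a regular expression.
# These patterns are illustrative and were not taken from the challenge data.
GROUP_PATTERNS = {
    "abdominal pain": re.compile(r"\babd(ominal)?\b.*\bpain\b"),
    "fever": re.compile(r"\bfever\b"),
    "vomiting": re.compile(r"\bvomit"),
}

def symptom_groups(label):
    """Return every group whose pattern matches the free-text label."""
    text = label.lower()
    return [group for group, pattern in GROUP_PATTERNS.items() if pattern.search(text)]

# Made-up admission labels showing the kinds of variation described above.
labels = [
    "acute abdominal pain",
    "unspecified abdominal pain",
    "abd pain",
    "abd pain, fever",
    "FEVER",
    "vomiting",
]

raw_tally = Counter(labels)  # every variant is counted separately
grouped_tally = Counter(g for label in labels for g in symptom_groups(label))
```

In the raw tally each label appears once, while the grouped tally credits "abdominal pain" with four admissions. A combined label such as "abd pain, fever" counts toward both groups.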
For the task of spotting
a disease or epidemic, it was apparent that certain data could be ignored. For
instance, admission reports of lacerations, car accidents, and injuries could
be excluded, as they cannot be considered indicators of such phenomena.
Additionally, when trying
to assess high-level information for a task such as this, it seemed justifiable
to treat similar symptoms as one and the same. For example,
“abdominal pain” and “upper right abdominal pain” are similar enough to be
considered in the same category. This obviously would not be a valid solution
for diagnosing a disease, but for spotting a trend it seemed worthwhile to
group similarities.
With a massaged dataset
(using Venezuela as an example), I first explored visualizing the data as a
line chart. This clearly revealed the bell curve one would expect with an
epidemic. It also exposed a clear spike of fever, abdominal pain, and vomiting
reports prior to the epidemic manifestation. When viewing all symptoms (as opposed
to the aggregated groups), the bell curve is apparent, but the preliminary spike is not as
evident.
Interestingly, the onset
symptom appears to be dizziness and blurred vision when viewing all symptoms,
but when grouped the indicators are clearly fever, rash, and abdominal pain.
The onset is on 4/23/09 in both views; however, the grouped view shows a more
dramatic and apparent spike. Fever does not appear to lead to death, but
if fever is detected as an early anomaly, that knowledge could perhaps be used
to prevent an epidemic.
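I identified the onset spike visually in Tableau. As a rough sketch of how the same check might be automated, the Python function below flags the first day on which a grouped symptom's count rises well above an initial baseline. The function name, the seven-day baseline, the three-standard-deviation threshold, and the counts are all assumptions made for illustration.

```python
from statistics import mean, stdev

def first_spike(daily_counts, baseline_days=7, threshold_sd=3.0):
    """Return the index of the first day whose count exceeds the baseline
    mean by more than threshold_sd standard deviations, or None."""
    baseline = daily_counts[:baseline_days]
    mu = mean(baseline)
    sigma = stdev(baseline) or 1.0  # fall back to 1.0 when the baseline is flat
    for day in range(baseline_days, len(daily_counts)):
        if daily_counts[day] > mu + threshold_sd * sigma:
            return day
    return None

# Made-up daily admission counts for one grouped symptom (e.g. fever).
fever_counts = [10, 12, 11, 9, 10, 11, 12, 11, 10, 30, 45, 80]
onset_index = first_spike(fever_counts)
```

With these invented counts the first flagged day is index 9, the day the count jumps from 10 to 30. Running the same check on ungrouped labels would likely be less sensitive, since each variant carries only a fraction of the admissions.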
Figure 1: Dashboard of epidemic over time and onsets.
All symptoms are shown on the top row and grouped symptoms on the bottom.
Figure 2: Dashboard showing admissions and deaths by
symptom group.
RECOMMENDATIONS:
It is advisable to create
a dashboard analytics system or subsystem that is dedicated to detecting
outbreaks, epidemics, and pandemics.
In order to quickly identify
a disease outbreak, many techniques can be employed. These can be used without
changing existing data systems, reporting agency methodologies, or any existing
inconsistencies. These solutions work at a high level, merely incorporating the
various incoming reporting feeds and data.
1) Aggregate Symptoms: Group symptoms that are
the same but appear under different spellings, abbreviations, or other variations.
2) Ignore non-Indicators: Remove, hide, or
exclude symptoms that would not indicate a disease (e.g., broken arm, dog
bite).
3) Prioritize Symptoms: Some conditions, such as
fever, chills, pallor, vomiting, and weakness, are more likely to occur with a
contagious disease. These symptoms should be
prioritized over others such as bleeding, stroke, and speech problems.
Although less likely to be indicators, those outlying symptoms should
not be overlooked.
4) Dashboard: These indicators should be
visualized in various ways over time. By doing this, they can be observed on a
daily or hourly basis. Employing different methods of viewing the same data
should allow for any anomalies to become readily apparent. Line charts, heat
matrices, and sized symbol charts among others can be used in conjunction to
synthesize a time-based picture of the data.
5) Compare: The time-based dashboard should
extend at least 12 months into the past. This will enable one to see seasonal
fluctuations or other normal occurrences as compared to the current incoming
data.
By following these
recommendations, one should be able to quickly spot anomalies, outliers, and
trends, which could help head off an outbreak or at least make it possible to
better manage or react to an epidemic.
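Recommendations 3 and 5 could be combined along the following lines. This Python sketch scores each symptom group by how far the current count departs from its historical counts, then weights the score by a priority. The weights, group names, and counts are hypothetical, and a production system would want a more careful statistical model than a simple z-score.

```python
from statistics import mean, stdev

# Hypothetical priority weights: higher for symptoms more typical of a
# contagious disease. Unlisted groups default to 1.0 so they are not ignored.
PRIORITY = {"fever": 2.0, "vomiting": 2.0, "bleeding": 1.0}

def anomaly_scores(history, current):
    """Rank symptom groups by priority-weighted deviation from history.

    history maps group -> list of past daily counts (ideally 12+ months);
    current maps group -> the latest daily count.
    """
    scores = {}
    for group, past in history.items():
        sigma = stdev(past) or 1.0  # avoid dividing by zero on flat history
        z = (current.get(group, 0) - mean(past)) / sigma
        scores[group] = PRIORITY.get(group, 1.0) * z
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Invented counts for illustration only.
history = {"fever": [10, 12, 11, 9, 13], "bleeding": [4, 5, 6, 5, 5]}
current = {"fever": 40, "bleeding": 5}
ranked = anomaly_scores(history, current)
```

Here fever ranks first because its current count is far above its history, while bleeding scores zero because it matches its historical mean. Low-priority groups still appear in the ranking, in keeping with the advice that outlying symptoms should not be overlooked.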
Figure 3: Identifying early indicators with a heat matrix of admission rates.
MC2.2: Compare the outbreak
across cities. Factors to consider
include timing of outbreaks, numbers of people infected and recovery ability of
the individual cities. Identify any
anomalies you found.
ASSESSMENT:
The disease began in Iran
on 4/19/2009, peaked there on 5/27/2009, and was under control by
6/27/2009. The disease then spread to Lebanon (4/20), Saudi Arabia (4/21), Colombia
and Venezuela (4/23), Nairobi (4/24), Aleppo (4/28), and Karachi and Yemen
(4/29). Turkey and Thailand appeared to be unaffected by the disease.
Each outbreak subsequent
to Iran was more quickly controlled than the previous one, with the exception of
Colombia (which still had a shorter duration than Iran). Iran had a duration of
68 days while the final outbreak, Yemen, was brought under control within 53
days.
The pandemic seemed to
focus on the Greater Middle East, including Lebanon, Aleppo, Saudi Arabia,
Yemen, Iran, and Pakistan. However, the disease also spread to parts of South
America and Africa. The pandemic spread to these areas rapidly,
with South America being hit within three days of the initial onset and Africa
following a day later.
PROCESS:
Building off the results
I found in MC2.1, I was able to make comparisons across all cities. I
summarized the onset, peak, and recovery dates, deaths, admissions, and
admission symptoms for all cities in an Excel spreadsheet. Bringing this
summarized data into Tableau, I could quickly visualize the locations and how
they related to one another.
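I compiled the per-city summary by hand. A scripted version might look like the Python sketch below, which derives onset, peak, and control dates from a city's daily admission counts. The threshold-based definitions of onset and control, the function name, and the counts are assumptions for illustration, not the criteria I applied in the analysis.

```python
from datetime import date, timedelta

def summarize_outbreak(start, daily_counts, threshold):
    """Summarize one city's outbreak from its daily admission counts.

    Assumed definitions (not taken from the original analysis):
    onset   = first day with a count at or above threshold
    peak    = day with the highest count
    control = first day after the peak with a count below threshold
    """
    above = [i for i, count in enumerate(daily_counts) if count >= threshold]
    if not above:
        return None  # no outbreak detected
    onset = above[0]
    peak = max(range(len(daily_counts)), key=lambda i: daily_counts[i])
    control = next(
        (i for i in range(peak, len(daily_counts)) if daily_counts[i] < threshold),
        None,
    )
    return {
        "onset": start + timedelta(days=onset),
        "peak": start + timedelta(days=peak),
        "control": None if control is None else start + timedelta(days=control),
        "duration_days": None if control is None else control - onset,
    }

# Invented daily counts starting on an arbitrary date.
counts = [2, 3, 20, 60, 120, 70, 25, 4, 3]
summary = summarize_outbreak(date(2009, 4, 20), counts, threshold=10)
```

Each city's result is one row of the summary table, which can then be loaded into a visualization tool for comparison across cities.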
I looked at the recovery
rates as a scatterplot comparing onset date, days to recovery, and rate of
death per city. It was interesting to note that although the pandemic was more
quickly controlled as it progressed, the rate of death increased. Aleppo had the
highest rate of death at 4.7% of admissions, yet was the shortest lived, being
controlled within 51 days.
Figure 4: Recovery rates by city. Size indicates rate of
death. X axis is onset date and Y axis is total days of the outbreak.
Additionally, I was able
to animate the number of deaths over time for each city. By using a map, it was
easy to see where the pandemic started and how it quickly spread over time.
This screenshot shows a static image from the animation. It shows Karachi and
Lebanon at the peak of their epidemics on 5/25/09.
Figure 5: Size indicates current number of deaths and color
shows overall rate of recovery for the city.
The specific process I
employed was as follows:
1) Exported the death records for each city that
I had summarized using Tableau.
2) Recorded summary data about each city that I
was able to ascertain from MC2.1.
3) Created a new Excel file that tabulated the
data from steps 1 & 2.
4) Used a Tableau Add-in for Excel
to re-shape the data into a format that Tableau could understand.
5) Used Tableau to connect to this Excel
spreadsheet with the ability to link summary data and death records by city.
6) Plotted this data in various ways in Tableau
such as line charts, scatterplots, and animated maps.
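The re-shaping in step 4 was done with the Tableau Add-in for Excel. Assuming that it amounted to turning a wide table (one column per date) into a long table (one row per city and date), a comparable transformation can be sketched in Python as follows. The column layout, function name, and values are hypothetical.

```python
import csv
import io

def wide_to_long(rows, id_field, var_name="date", value_name="deaths"):
    """Turn one-column-per-date rows into one row per (id, date) pair."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_field:
                continue  # the identifier is repeated on every output row
            long_rows.append(
                {id_field: row[id_field], var_name: column, value_name: int(value)}
            )
    return long_rows

# Invented wide-format data: one row per city, one column per date.
wide_csv = """city,2009-05-24,2009-05-25
City A,12,15
City B,7,9
"""
rows = list(csv.DictReader(io.StringIO(wide_csv)))
long_rows = wide_to_long(rows, id_field="city")
```

The long layout gives a single date field and a single measure field, which is generally easier to plot as a time series or to animate by date.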
These were all manual
tasks with the exception of step 6. However, these manual tasks were completed
in under three hours including research time to find the tool in step 4. Once
the data was formatted, the time to create visualizations using Tableau was
minimal. It would be a very easy task to create dashboards and charts that could facilitate
assessing data over larger time periods and regions.
In summary, using techniques
such as these not only clarifies the characteristics of the pandemic, but may
also help detect the onset more rapidly or manage its spread.