Bangor - VASTvis

Video:

ANSWERS:

MC2.1: Analyze the records you have been given to characterize the spread of the disease. You should take into consideration symptoms of the disease, mortality rates, temporal patterns of the onset, peak and recovery of the disease. Health officials hope that whatever tools are developed to analyze this data might be available for the next epidemic outbreak. They are looking for visualization tools that will save them analysis time so they can react quickly.

Headlines

The age profile of all admitted patients is a Normal distribution
Six principal symptoms: abdominal pain, back pain, bleeding nose, diarrhoea, fever and vomiting
95% of the patients who die do so on the eighth day after admission.
The mortality rate is highest over the peak period: not only is the virus more prevalent, it also appears to be more lethal.

Details
Our approach to this analysis was heavily question-driven: collaboration via group meetings and Google Wave resulted in a number of questions, which were used to inform development of a tool to help address them. The focus was on individual cities; here we demonstrate the answers by using Tolima.

Confirm that there is an epidemic. First we looked at the number of patients, then at deaths. A simple graph of the number of patients against age (Figure 1.1) shows a near-perfect Normal distribution. The graph was plotted directly from the raw files and confirmed that we understood the data format. This led to the question 'what is the size of the effect we're looking for?'

Figure 1.1: Number of patients against Age for Tolima, showing an almost perfect Normal distribution.

To look at the size of the effect and confirm the epidemic we plotted 'number of deaths' against 'date of admission' (Figure 1.2). In the absence of a significant disease, we would expect a roughly constant number of deaths per date. However, the graph shows a clear peak, consistent with the rise and fall of an epidemic. In future it may be possible to determine a baseline number of deaths, which could be monitored to highlight when an unusual occurrence (say, an epidemic) is under way. We produced the graph by pre-processing the files to give records for only those patients who died, and then graphing the counts against date of admission. Given the time pre-processing took and the likelihood that many such queries would be required, we moved to a database system.
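
The pre-processing step described above might look something like the following minimal sketch. It assumes each record has already been reduced to an admission date and a died flag; that layout and the sample dates are made up for illustration and are not the challenge file format.

```python
from collections import Counter
from datetime import date

# Hypothetical records: (admission date, whether the patient later died).
records = [
    (date(2009, 5, 12), False),
    (date(2009, 5, 13), True),
    (date(2009, 5, 13), True),
    (date(2009, 5, 14), False),
    (date(2009, 5, 14), True),
]

def deaths_by_admission_date(records):
    """Count patients who died, keyed by their date of admission."""
    return Counter(admitted for admitted, died in records if died)

deaths = deaths_by_admission_date(records)
for day in sorted(deaths):
    # One line per admission date that has at least one death.
    print(day.isoformat(), deaths[day])
```

Re-running a pass like this over the raw files for every new question is slow, which is the motivation for moving to a database.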

Figure 1.2: Number of deaths vs Date of admission for Tolima, showing the size of the epidemic.

Figure 1.3 demonstrates that some syndromes are more prevalent than others. We focused on the question 'what are the most common symptoms for admissions and deaths?'. This data was much less straightforward to handle - patients are admitted with multiple symptoms, there are no clear separators between symptoms, and spelling and data-entry errors are plentiful. Our categorization groups misspelled words and abbreviations together and was developed through much discussion and data processing.
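
A categorization of this kind can be sketched as a lookup from spelling variants to a canonical symptom name. The variant table below is purely illustrative (the spellings are invented examples, not the categorization actually developed for the data), and the function name `categorise` is hypothetical.

```python
import re

# Hypothetical variant table: illustrative spellings only.
CANONICAL = {
    "vomiting": ["vomiting", "vomitting", "vomit"],
    "diarrhoea": ["diarrhoea", "diarrhea", "diarhea"],
    "abdominal pain": ["abdominal pain", "abd pain", "stomach ache"],
    "fever": ["fever", "fevr"],
}

def categorise(free_text):
    """Return the set of canonical symptoms mentioned in a free-text field."""
    # Lower-case, turn punctuation and digits into spaces, collapse whitespace,
    # so that unclear separators between symptoms do not matter.
    text = re.sub(r"[^a-z ]+", " ", free_text.lower())
    text = re.sub(r"\s+", " ", text).strip()
    found = set()
    for canonical, variants in CANONICAL.items():
        # Whole-word match against any known variant of this symptom.
        if any(re.search(r"\b" + re.escape(v) + r"\b", text) for v in variants):
            found.add(canonical)
    return found

print(categorise("VOMITTING/fevr, abd. pain"))
```

Returning a set per patient keeps multiple symptoms on one admission record, which matters for the per-symptom counts that follow.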

There seems to be a double-hump pattern in deaths (and in admissions, when filtering on symptoms). This pattern can be clearly seen in Figure 1.2 for Tolima, and is present to a degree in all the cities with the disease except Aden. We were unable to detect any difference in syndrome, age or sex over the period of the humps compared to the rest of the data set. A double-hump pattern, though, is typically evidence of an additional effect: it is, for example, clearly visible on graphs of flu, where it is due to multiple strains of virus. This would require additional information to investigate.

Figure 1.3: Number of patients against syndrome, with the top nine symptoms labelled.

Six principal symptoms are visible in the deaths graph: abdominal pain (31%), back pain (20%), bleeding nose (10%), diarrhoea (19%), fever (15%) and vomiting (45%), left to right on Figure 1.4. Because patients may be admitted with more than one symptom, these percentages should not be expected to sum to 100. It is significant that, taken together, these six symptoms account for 13,943 deaths out of a total of 16,338 (85%).
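
The six percentages above add up to 140%, while the combined figure is 85%. The toy example below, on invented records with a reduced symptom set, shows why per-symptom shares and the "at least one symptom" share differ when patients carry several symptoms at once.

```python
# Hypothetical deaths, each with the set of symptoms recorded at admission.
deaths = [
    {"vomiting", "fever"},
    {"vomiting", "abdominal pain"},
    {"back pain"},
    {"foot injury"},
]
KEY = {"vomiting", "fever", "abdominal pain", "back pain"}

def share(symptom):
    """Fraction of deaths whose record includes the given symptom."""
    return sum(symptom in d for d in deaths) / len(deaths)

per_symptom = {s: share(s) for s in KEY}

# Fraction of deaths showing at least one key symptom: each patient counts once.
any_key = sum(bool(d & KEY) for d in deaths) / len(deaths)

print(sum(per_symptom.values()), any_key)
```

Here the per-symptom shares sum to more than 1 because two patients are counted under two symptoms each, whereas `any_key` counts every patient at most once.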

Figure 1.4: Deaths against syndrome, showing that there are six clear symptoms present for deaths.

We then became interested in the progression of symptoms: are some of them early-stage, and some of them late-stage? We graphed days to death (day of death - day of admission), and found that practically no one dies before or after eight days, while 15,572 people (95%) die on the eighth day. We can confirm the symptoms in two further ways. (1) We filtered from this graph, showing only symptoms for patients who died on day eight. (2) We also used the deaths/date graph to filter on the peak period. Neither produced any significant change on the syndrome graph. This gives us reasonable confidence that these six symptoms are the major ones of the disease.
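
The days-to-death calculation is a simple date difference followed by a tally. A minimal sketch on invented admission and death dates (the tuple layout is an assumption):

```python
from collections import Counter
from datetime import date

# Hypothetical (admission date, death date) pairs for patients who died.
died = [
    (date(2009, 5, 10), date(2009, 5, 18)),
    (date(2009, 5, 11), date(2009, 5, 19)),
    (date(2009, 5, 12), date(2009, 5, 20)),
    (date(2009, 5, 12), date(2009, 5, 15)),
]

# Days to death = day of death - day of admission, tallied per value.
days_to_death = Counter((dead - admitted).days for admitted, dead in died)
total = sum(days_to_death.values())

for days in sorted(days_to_death):
    count = days_to_death[days]
    print(f"{days} days: {count} ({count / total:.0%})")
```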

To define the terms 'onset', 'peak' and 'recovery' we chose the statistical definitions of quartiles. We define the onset by the first quartile, the peak by the second and third quartiles, and the recovery by the fourth quartile. For Tolima, the onset is the period before 13th May, the peak the period from 13th to 24th May, and the recovery from 24th May onwards. This provides a solid foundation on which to compare cities.
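
One way to turn these definitions into dates is to sort the death records by date and read off the dates at which 25% and 75% of deaths have occurred. The sketch below assumes that reading, uses a simple "first date by which the fraction is reached" convention (other quartile conventions exist), and runs on invented counts; `quantile_date` is a hypothetical helper.

```python
from datetime import date
from math import ceil

def quantile_date(dates, fraction):
    """Date by which at least `fraction` of the events have occurred."""
    ordered = sorted(dates)
    index = max(ceil(fraction * len(ordered)) - 1, 0)
    return ordered[index]

# Hypothetical deaths per day in May: {day of month: number of deaths}.
counts = {10: 1, 11: 1, 12: 2, 13: 4, 14: 4, 15: 2, 16: 1, 17: 1}
death_dates = [date(2009, 5, d) for d, n in counts.items() for _ in range(n)]

onset_end = quantile_date(death_dates, 0.25)  # onset: up to Q1
peak_end = quantile_date(death_dates, 0.75)   # peak: Q1 to Q3; recovery: after Q3

print("onset before", onset_end, "| peak until", peak_end, "| recovery after")
```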

Figure 1.5: Mortality rate against date of admission for Tolima. Mortality rate rises over the peak period between 14th and 24th May

We define mortality rate as the number of patients admitted on a given day who died divided by the total number of patients admitted on a given day. Further, we filter the data using the six symptoms discussed above. We used a query to produce data to graph against both age and date (Figure 1.5), and considered the result.
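
The definition above translates directly into a grouped count. The following is a sketch only: the record layout and sample values are invented, and the original work used a database query rather than this code.

```python
from collections import defaultdict

SIX = {"abdominal pain", "back pain", "bleeding nose",
       "diarrhoea", "fever", "vomiting"}

# Hypothetical records: (admission day, symptoms, died).
records = [
    ("05-13", {"fever"}, True),
    ("05-13", {"vomiting", "fever"}, False),
    ("05-13", {"foot injury"}, False),
    ("05-14", {"back pain"}, True),
    ("05-14", {"diarrhoea"}, True),
    ("05-14", {"bleeding nose"}, True),
    ("05-14", {"vomiting"}, False),
]

def mortality_rate_by_day(records, symptoms=SIX):
    """Deaths among patients admitted on a day / all patients admitted that day."""
    admitted = defaultdict(int)
    died = defaultdict(int)
    for day, patient_symptoms, patient_died in records:
        if not patient_symptoms & symptoms:
            continue  # keep only patients showing at least one key symptom
        admitted[day] += 1
        died[day] += patient_died  # True counts as 1
    return {day: died[day] / admitted[day] for day in admitted}

rates = mortality_rate_by_day(records)
print(rates)
```

Grouping by an age band instead of the admission day would give the corresponding rate against age.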

The mortality rate is highest over the peak period: not only is the virus more prevalent, it also appears to be more lethal. There are several hypotheses that could explain this: first, as the disease becomes more widespread, the number of admissions with other illnesses might be expected to fall as people avoid hospital for minor ailments. The given six symptoms account for 85% of deaths, but only 219,160 out of 705,281 (31%) of admissions. Second, as admissions rise, far greater strain is placed on the medical facilities, which may contribute to the increase in mortality rate. Additional information would be required to test these hypotheses.

MC2.2: Compare the outbreak across cities. Factors to consider include timing of outbreaks, numbers of people infected and recovery ability of the individual cities. Identify any anomalies you found.

Headlines

Not all cities have the epidemic. Mersin and Nonthaburi do not seem to suffer an outbreak.
The spread is fast
The first three to have an outbreak: Nairobi then Beirut and then Aleppo
Infection progression less clear: Aden, Barcelona, Jedda and Tabriz all seem afflicted at roughly the same time, and Tolima is a clear last.
Most people infected in Karachi, followed by Aleppo, Nairobi, Jedda, Tolima, Tabriz, Beirut, Aden and finally Barcelona.
Order of recovery (fastest first): Aleppo, Aden, Karachi, Tabriz, Barcelona, Nairobi, Tolima, Beirut and Jedda.
Jedda has a mortality rate consistently lower than most of the other cities, while that for Nairobi is typically higher.
Mortality rates rise much faster than they decline.

Details

Our approach was the same as in MC2.1: we formulated questions and generated views to answer them, stored provenance information and discussion on Google Wave, and supported group discussions with a whiteboard.

Outbreaks across cities. Not all cities have the epidemic. This can be confirmed through plotting Normalised Cumulative deaths vs Date of admission, filtered on patients admitted with at least one of the six key symptoms identified in MC2.1, see Figure 2.1.
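
Normalising the cumulative curve lets cities of very different size be compared on one axis. A minimal sketch, with invented daily counts and a hypothetical function name:

```python
from itertools import accumulate

def normalised_cumulative(daily_counts):
    """Running total of daily counts, scaled so the final value is 1.0."""
    running = list(accumulate(daily_counts))
    total = running[-1] if running else 0
    if total == 0:
        return [0.0] * len(running)  # no deaths: flat line at zero
    return [value / total for value in running]

# Hypothetical daily death counts for two cities of very different size.
large_city = [10, 30, 40, 20]
small_city = [1, 3, 4, 2]

print(normalised_cumulative(large_city))
print(normalised_cumulative(small_city))
```

Both invented series produce the same curve after normalisation, so only the shape and timing of the accumulation remain, not the absolute numbers.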

Two of the given cities (Mersin and Nonthaburi) do not seem to suffer an outbreak. In MC2.1 we defined the onset as the first quartile, so timing can be compared by looking at the date of Q1 for deaths, filtered on the symptoms of the disease as identified in MC2.1. The first thing we noticed, however, was that on a normalised graph of cumulative deaths against date, each of the other cities compared in Figure 2.1 follows the same cumulative pattern, which Mersin and Nonthaburi do not share.

Figure 2.1: Normalised Cumulative deaths vs Date of admission, filtered on patients admitted with at least one of the six key symptoms identified in MC2.1

Comparing the syndrome graphs for these two cities (Mersin and Nonthaburi) indicates that, for admissions, they have identical profiles. This also gives us a baseline to use in our identification of symptoms: those with substantially higher incidence in cities where the disease is present are likely related to the disease. Figure 2.2 shows a comparison between the two uninfected cities and a selection of others.

Figure 2.2: Syndrome differences on admissions between Mersin and Nonthaburi and a selection of other cities. The differences over the six disease symptoms of abdominal pain, back pain, bleeding nose, diarrhoea, fever and vomiting are clear: all are much more prevalent in cities hit by the disease than in those that aren't.

Timing & the spread to different cities. The spread is fast enough that it's difficult to state the full order of spread with any degree of certainty, although we are certain where the first cases occur. We can compare timing directly by looking at box plots for the cities involved, considering deaths only and filtering on the six symptoms, see Figure 2.3.

The ordering this exposes is that Nairobi shows infected cases first, followed by Beirut and then Aleppo. The ordering thereafter is less clear: Aden, Barcelona, Jedda and Tabriz all seem afflicted at roughly the same time, and Tolima is a clear last. Geographically, it may be that from Karachi (or Aleppo, or Beirut) the disease spreads to both South America (Barcelona) and the Middle-East (Tabriz, Aden, Jedda) concurrently.
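
Ranking cities by the lower edge of their box plots amounts to sorting on the Q1 date of deaths. The sketch below uses Python's `statistics.quantiles` with the inclusive method as one possible quartile convention (an assumption, since the plotting tool's convention is not stated), with placeholder city names and invented dates.

```python
from datetime import date
from statistics import quantiles

def q1_date(death_dates):
    """First-quartile date of a list of death dates (inclusive method)."""
    ordinals = sorted(d.toordinal() for d in death_dates)
    q1, _median, _q3 = quantiles(ordinals, n=4, method="inclusive")
    return date.fromordinal(round(q1))

def may(*days):
    """Helper for building hypothetical dates in May 2009."""
    return [date(2009, 5, d) for d in days]

# Placeholder cities with invented death dates (already symptom-filtered).
deaths_by_city = {
    "city_a": may(6, 7, 8, 9, 10),
    "city_b": may(2, 3, 4, 5, 6),
    "city_c": may(4, 5, 6, 7, 8),
}

# Earliest Q1 first: an estimate of the order in which cities were hit.
spread_order = sorted(deaths_by_city, key=lambda c: q1_date(deaths_by_city[c]))
print(spread_order)
```

Cities whose Q1 dates fall within a day or two of each other should be treated as tied, as the text notes for the middle of the ordering.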

Figure 2.3: Box plots of deaths with one or more of the six key symptoms. The bottom line of the rectangle indicates Q1, and this gives us an ordering for the spread.

Number of people infected. Considering number of people infected is a matter of filtering on symptoms then viewing the cumulative admissions graph, as shown in Figure 2.4. By far the largest number of people infected is in Karachi, followed by Aleppo, then Nairobi, Jedda, Tolima, Tabriz, Beirut, Aden and finally Barcelona.

Figure 2.4: Cumulative admissions for patients with one or more of the six major symptoms.

In considering recovery ability, it is helpful to define a metric. In our case, we chose the gradient of the line joining the 50th percentile (Q2, median) to the 75th percentile (Q3) - that is, we calculate the rate at which deaths decline over the third quartile. The values for this metric are shown in Figure 3: a larger gradient (more negative number) indicates a faster recovery.
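
One possible reading of this metric, sketched below, takes the daily deaths curve, finds the days on which cumulative deaths reach 50% and 75% of the total, and computes the slope of the straight line between those two points. Both this interpretation and the sample series are assumptions made for illustration; the function names are hypothetical.

```python
from itertools import accumulate

def day_reaching(daily_deaths, fraction):
    """Index of the first day on which cumulative deaths reach `fraction` of the total."""
    total = sum(daily_deaths)
    for day, running in enumerate(accumulate(daily_deaths)):
        if running >= fraction * total:
            return day
    raise ValueError("empty series")

def recovery_gradient(daily_deaths):
    """Slope of the line from (Q2 day, deaths) to (Q3 day, deaths), in deaths per day."""
    q2 = day_reaching(daily_deaths, 0.50)
    q3 = day_reaching(daily_deaths, 0.75)
    if q3 == q2:
        raise ValueError("Q2 and Q3 fall on the same day; gradient undefined")
    return (daily_deaths[q3] - daily_deaths[q2]) / (q3 - q2)

# Hypothetical daily deaths for one city (day 0 onwards).
daily = [2, 6, 10, 8, 6, 4, 2, 2]
print(recovery_gradient(daily))
```

A more negative value means deaths fall away more steeply after the median, which under this reading corresponds to a faster recovery.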

Accordingly, the cities can be ranked on rate of recovery: Aleppo, Aden, Karachi, Tabriz, Barcelona, Nairobi, Tolima, Beirut and Jedda. It is also informative to consider the shape of the mortality rate graphs for these cities, and this is shown in Figure 2.5. Jedda has a mortality rate consistently lower than most of the other cities, while that for Nairobi is typically higher. The graphs tend towards asymmetry, with a positive skew: mortality rates rise much faster than they decline.

Figure 2.5: Mortality rate comparisons between cities. While the overall mortality rate curve is similar, local differences are apparent: in particular, cities with a double-hump death against date graph seem to have a stronger skew here.

Anomalies we've found. There exist a number of other anomalies in the data:

There are 948 foot injuries for 12-14 year olds in Aleppo between 16th and 18th May.
Nairobi has an enormous spike in mortality rate on 6th June, a combination of a steep fall in admissions matched with a fall in deaths consistent with previous values.