Given a set of hospital admission records of patients, we develop a toolkit that presents different plots for understanding and analyzing symptom patterns. Where possible, the aim is to use this analysis to predict a possible disease outbreak, with details such as symptoms, time frame and mortality rates. The toolkit is called BODY (Buckets of Disease sYmptoms); we describe its utility and our analysis based on BODY's plots and charts.
Understanding challenge data
We began our analysis by manually going through a sample of the hospital records given for each city. We observed that the symptom description provided in each hospital record does not follow a standard format. A few examples are:
In order to standardize the symptoms, we identified 24 symptom classes by broadly going through the data records. The 24 symptom classes, along with the set of commonly used words associated with each class, were compiled manually by looking at data samples. A symptom class is referred to as a bucket of disease symptoms (or simply a bucket). The table of buckets and their associated words is given below:
Bucket Of Disease sYmptom (class) | Set of associated words |
abdomen_probs | abdomen, abd, genital |
accidents | accident, assault, bite |
bleeding | blood, bleed, bled |
breathing_probs | asthma, wheezing, breath, respira |
bodypain | pain, ache, pn, migr, hurts |
cold_cough | cold, cough, sore throat, sinu |
chest_probs | chest |
diabetic_probs | diab |
diarrhea | diarr, gastro |
eye_probs | eye, vision, visual, conjuctiv |
fatigue | fatigue, seizure, weak, lighthead, passing, letharg, light head, dizzy, dizzi |
fever | fever, temp, ill |
heart_probs | heart, cardia, stroke |
injury | inj |
itch_probs | itch, allerg |
infections | infection |
loss | loss |
pregnancy | pregnancy, csection, spotting, c-section, contraction, labor, miscarria |
skin_related | rash, skin, eczema, hives, pox |
stool_probs | stool |
swelling | swell, swollen |
urinary_probs | urin |
vomitting | nausea, vomit |
others |
For each record, we obtain the symptom description (as a string). We check for possible delimiters like ',' and 'and'. Based on the delimiters present, we obtain the list of symptoms mentioned in the description. For each symptom, we check if any of the words (strings) match with the associated set of words. If it matches a particular bucket's associated words, we classify that record into the corresponding bucket. In case of multiple symptoms in the description, the record might be classified into multiple buckets of disease symptoms. The output of this technique is data records of the format: (date, patient information, a set of buckets).
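The classification step described above can be sketched in Python as follows. This is a minimal sketch rather than the toolkit's actual code: the function names `split_symptoms` and `classify` are hypothetical, only a few buckets from the table are included, matching is assumed to be a case-insensitive substring test, and falling back to the 'others' bucket when nothing matches is also an assumption.

```python
import re

# A few buckets copied from the table above; the rest follow the same shape.
BUCKET_WORDS = {
    "abdomen_probs": ["abdomen", "abd", "genital"],
    "bodypain": ["pain", "ache", "pn", "migr", "hurts"],
    "diarrhea": ["diarr", "gastro"],
    "fever": ["fever", "temp", "ill"],
    "vomitting": ["nausea", "vomit"],
}

# Assumed delimiters: a comma, or the standalone word "and".
DELIMITERS = re.compile(r",|\band\b", flags=re.IGNORECASE)


def split_symptoms(description):
    """Split a free-text description into lower-cased symptom phrases."""
    parts = DELIMITERS.split(description)
    return [part.strip().lower() for part in parts if part.strip()]


def classify(description):
    """Return the set of buckets whose associated words occur in the text."""
    buckets = set()
    for symptom in split_symptoms(description):
        matched = False
        for bucket, words in BUCKET_WORDS.items():
            # Assumption: a bucket matches if any associated word is a substring.
            if any(word in symptom for word in words):
                buckets.add(bucket)
                matched = True
        if not matched:
            # Assumption: unmatched symptoms go to the 'others' bucket.
            buckets.add("others")
    return buckets
```

With substring matching, short stems such as 'ill' or 'pn' can also match inside unrelated words (for example 'chills'), so a stricter word-boundary test may be preferable in practice.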
After obtaining the standardized output, we perform a few datacube-like operations on the records to obtain the following information:
We generated plots from the information obtained to analyse the patterns in the bucket frequencies over time. Figure 1 displays the day-wise frequency of each symptom class against the timeline. These plots are obtained by analyzing the patient records of hospital entries.
Figure 1 contains 24 plots, one for each bucket.
Based on our observations, the plots can be separated into three groups:
1. The plots for the buckets 'abdomen', 'vomitting', 'diarrhea' and 'loss' have similar frequency characteristics. In the 74 days of data given, these buckets reached their peak between days 25 and 35 (i.e., between May 10th and May 20th, 2009).
2. The plots for the buckets 'bleeding', 'bodypain', 'fever' and 'swelling' are also similar to one another. They peak in the same period as the first group of buckets, between days 25 and 35 (May 10th to May 20th, 2009).
3. The rest of the buckets have more or less similar patterns. Plots of this type indicate that these symptoms occurred with no special incidents (no peaks, and no indication of the frequencies increasing or decreasing over a period of time).
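The day-wise counting and the mapping from day indices to calendar dates used above can be sketched as follows. This is only an illustration: the record shape `(day_index, set_of_buckets)` and the function names are hypothetical, and the start date of April 16th, 2009 is an assumption inferred from the statement that day 25 corresponds to May 10th, 2009.

```python
from collections import Counter
from datetime import date, timedelta

# Assumption: day 1 is April 16th, 2009 (inferred from day 25 = May 10th, 2009).
DAY_ONE = date(2009, 4, 16)


def day_to_date(day_index):
    """Convert a 1-based day index into a calendar date."""
    return DAY_ONE + timedelta(days=day_index - 1)


def daily_counts(records, bucket):
    """Count, per day index, the records classified into `bucket`.

    `records` is an iterable of (day_index, set_of_buckets) pairs.
    """
    counts = Counter()
    for day_index, buckets in records:
        if bucket in buckets:
            counts[day_index] += 1
    return counts


def peak_day(counts):
    """Return the day index with the highest count."""
    return max(counts, key=counts.get)
```

Plotting `daily_counts` for each bucket against the day index would give one curve per bucket, and `peak_day` gives a simple way to compare where those curves peak.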
In Figure 2, we display the plots of the frequency of deaths over time, both globally (all 11 cities) and city-wise. The figure has 12 plots, the first one representing the global death frequency pattern. It can be seen that this plot peaks between day 32 and day 38 (i.e., May 17th to May 23rd, 2009).
For all the patient records (across all cities), we obtained the buckets into which each record is classified (more than one bucket when the symptom description lists multiple symptoms) and then ran a frequent pattern algorithm to identify buckets that co-occurred frequently.
The top-5 frequent 1-item buckets are:
1. bodypain, 2. abdomen, 3. fever, 4. vomitting, 5. injury.
The top-3 frequent 2-item buckets are:
1. {vomitting, diarrhea}, 2. {fever, vomitting} and 3. {vomitting, abdomen}.
The top-1 frequent 3-item bucket set is {fever, vomitting, diarrhea}.
From these frequent patterns, we conclude that the major symptoms of the disease outbreak are: fever, vomitting, diarrhea.
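The counting behind these frequent bucket sets can be sketched as below. This sketch counts every k-item combination per record by brute force, which is practical only because each record falls into a handful of buckets; it is not the Apriori procedure named in the summary, although both should yield the same counts. The function name `top_itemsets` is hypothetical.

```python
from collections import Counter
from itertools import combinations


def top_itemsets(bucket_sets, size, n):
    """Return the n most frequent bucket combinations of the given size.

    `bucket_sets` is an iterable of sets, one set of buckets per record.
    """
    counts = Counter()
    for buckets in bucket_sets:
        # Sort so that the same combination always yields the same tuple.
        for combo in combinations(sorted(buckets), size):
            counts[combo] += 1
    return counts.most_common(n)
```

Calling `top_itemsets(records, 1, 5)`, `top_itemsets(records, 2, 3)` and `top_itemsets(records, 3, 1)` would produce lists of the same shape as the three rankings above.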
In the detailed answer for MC2.1, we displayed snapshots of bucket plots for the global disease outbreak patterns. Below, we display a sample of the bucket plots obtained for each city, along with the other plots and the data analysis (frequent symptoms) performed.
In order to compare and contrast the disease outbreak pattern between cities, we display a list of plots (snapshots) created by the BODY toolkit below.
Figure 3 displays the frequency plots of buckets for the city of Karachi. The buckets abdomen, vomitting, diarrhea and loss have similar patterns (both the peak and the shape of the curve). The buckets bleeding, fever and swelling also have similar patterns. We generate such plots for the other cities too; here we display Karachi, as it has the largest number of hospital entries.
Figure 4 displays the frequency plots of buckets for Thailand. The frequency plots of the various symptoms are very different from those of the other cities. One possible explanation is that there was no disease outbreak in Thailand.
Turkey also has frequency patterns that differ from those of the other cities. The snapshot of BODY's compilation of plots for Turkey is displayed in Figure 5.
Figure 6 displays the frequency plots of the bucket 'diarrhea' for all the cities. From the figure, it is evident that all the cities except Thailand and Turkey have an outbreak of this symptom, and that the peaks occur at different times in different cities. For Aleppo and Nairobi, the peak is between day 25 and day 30 (i.e., May 10th to May 15th, 2009). For Colombia, Iran and Venezuela, it is between day 30 and day 40 (May 15th to May 25th). For Karachi, Lebanon, Saudi Arabia and Yemen, the peak occurs around day 30 (May 15th, 2009).
We also noticed that the time from the onset of the disease outbreak to its peak is around 20 days (3 weeks), and the time from the peak to the end of the outbreak is around 4 weeks.
Due to limits on snapshots, we display only a sample of the various types of plots and snapshots from the BODY toolkit. More images and details are available at:
http://research.iiit.ac.in/~soujanya/iiit-gami-mc2/
Frequent Pattern Analysis
We also computed frequent patterns for each city and identified the top-5 singly occurring symptoms, the top-2 frequent symptom pairs and the top-1 frequent set of three symptoms. These are displayed in Table 2 below. We noticed that for all the cities except Thailand and Turkey, the top-1 frequent set of three symptoms is {diarrhea, fever, vomitting}.
We also calculated the average number of days between a person's admission to hospital and that person's death. The numbers are given in the second column of Table 2. On average it is around 8 days. Saudi Arabia is an exception, with an average of 13 days. Thailand and Turkey have the lowest average, 5 days each.
City | Avg. no. of days between admission and death | Top-5 single symptoms (count) | Top-2 symptom pairs (count) | Top-1 set of three symptoms (count) |
Aleppo | 8 | bodypain(386609) abdomen(257998) fever(189822) vomitting(179514) itch_probs(14337) | diarrhea,vomitting(49614) fever,vomitting(39612) | diarrhea,fever,vomitting(19115) |
Colombia | 9 | bodypain(125708) abdomen(63306) fever(57654) vomitting(45682) injury(40042) | diarrhea,vomitting(11632) fever,vomitting(10240) | diarrhea,fever,vomitting(4846) |
Iran | 9 | bodypain(97392) abdomen(47393) fever(44243) vomitting(34475) injury(31255) | diarrhea,vomitting(8914) fever,vomitting(7674) | diarrhea,fever,vomitting(3650) |
Karachi | 8 | bodypain(1277551) abdomen(642283) fever(588097) vomitting(465648) injury(406227) | diarrhea,vomitting(120979) fever,vomitting(104024) | diarrhea,fever,vomitting(49691) |
Lebanon | 8 | bodypain(80013) abdomen(34414) fever(35574) injury(25967) vomitting(25532) | diarrhea,vomitting(6358) fever,vomitting(5694) | diarrhea,fever,vomitting(2681) |
Nairobi | 8 | bodypain(222174) abdomen(146260) fever(109223) vomitting(102382) diarrhea(86568) | diarrhea,vomitting(27962) fever,vomitting(22512) | diarrhea,fever,vomitting(10720) |
Saudi Arabia | 13 | bodypain(242360) fever(106591) abdomen(99520) injury(78226) vomitting(74350) | diarrhea,vomitting(18289) fever,vomitting(16773) | diarrhea,fever,vomitting(7987) |
Thailand | 5 | bodypain(16858) fever(6782) injury(5755) eye_probs(4270) skin_related(3959) | fever,vomitting(786) diarrhea,vomitting(658) | No 3 symptom pair |
Turkey | 5 | bodypain(57971) abdomen(13795) fever(11872) vomitting(9818) diarrhea(8235) | fever,vomitting(2706) diarrhea,vomitting(2231) | No 3 symptom pair |
Venezuela | 8 | bodypain(25605) abdomen(13795) fever(11872) vomitting(9818) diarrhea(8235) | diarrhea,vomitting(2586) fever,vomitting(2158) | diarrhea,fever,vomitting(1032) |
Yemen | 8 | bodypain(53491) abdomen(28716) fever(24868) vomitting(20388) diarrhea(16977) | diarrhea,vomitting(5356) fever,vomitting(4578) | diarrhea,fever,vomitting(2176) |
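The second column of Table 2 could be computed along the following lines. This is a hedged sketch: the record shape `(city, admission_date, death_date)` and the function name `avg_days_to_death` are hypothetical, and the sketch assumes each death has already been matched to the corresponding admission record.

```python
from collections import defaultdict
from datetime import date


def avg_days_to_death(death_records):
    """Average days between admission and death, per city.

    `death_records` is an iterable of (city, admission_date, death_date)
    tuples, where both dates are datetime.date objects.
    """
    totals = defaultdict(lambda: [0, 0])  # city -> [sum of days, record count]
    for city, admitted, died in death_records:
        totals[city][0] += (died - admitted).days
        totals[city][1] += 1
    return {city: total / count for city, (total, count) in totals.items()}
```

Rounding each per-city value to the nearest whole day would give figures in the same form as those in the table.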
Based on the plots displayed above, we notice that the plots for Thailand and Turkey differ from those of the other cities. This could be due to either (i) no disease outbreak in these countries or (ii) misplaced hospital records or tampered data.
1. Proposing a system of buckets for disease symptoms (BODY). Each bucket of disease symptoms is associated with a set of similar words. These sets of words were compiled manually by us after observing samples of the data.
2. Replacing the free-text symptom description with a standard set of buckets derived from that description, as a standardization technique.
3. Compiling and grouping plots by bucket, by city and globally, to understand different perspectives of the data.
4. Using a frequent pattern mining algorithm (the Apriori algorithm) to identify the most frequent symptoms and the most frequently co-occurring symptoms.
5. From our analysis, we identify the major symptoms of the disease outbreak to be {bodypain, fever, vomitting, diarrhea}.
More details at: http://research.iiit.ac.in/~soujanya/iiit-gami-mc2/