Submission ID: 174
Team Members: Hanisha Veeramachaneni, Soujanya Vadapalli, Kamalakar Karlapalem
Institute: International Institute of Information Technology, Hyderabad, INDIA
Primary Email Contact: soujanya@iiit.ac.in


Mini Challenge 2.1: Detailed Answer

BODY - Buckets Of Disease sYmptoms for Analysis

Given a set of hospital entry records of patients, we developed a toolkit that presents a range of plots for understanding and analyzing symptom patterns. Where the analysis allows, the aim is to predict a possible disease outbreak along with details such as its symptoms, time frame and mortality rates. The toolkit is called BODY (Buckets Of Disease sYmptoms); below we describe its utility and the analysis we based on its plots and charts.

Understanding challenge data

We began our analysis by manually going through a sample of hospital records given for each city. We observed that the symptom description (provided in each hospital record) does not follow a standard format of description. A few examples are:

To standardize the symptoms, we identified 24 symptom classes by going broadly through the data records. The 24 classes, along with the set of commonly used words associated with each, were compiled manually from data samples. A symptom class is referred to as a bucket of disease symptoms (or simply a bucket); the table below lists each bucket and its set of associated words:

Bucket Of Disease sYmptom (class) | Set of associated words
abdomen_probs | abdomen, abd, genital
accidents | accident, assault, bite
bleeding | blood, bleed, bled
breathing_probs | asthma, wheezing, breath, respira
bodypain | pain, ache, pn, migr, hurts
cold_cough | cold, cough, sore throat, sinu
chest_probs | chest
diabetic_probs | diab
diarrhea | diarr, gastro
eye_probs | eye, vision, visual, conjuctiv
fatigue | fatigue, seizure, weak, lighthead, passing, letharg, light head, dizzy, dizzi
fever | fever, temp, ill
heart_probs | heart, cardia, stroke
injury | inj
itch_probs | itch, allerg
infections | infection
loss | loss
pregnancy | pregnancy, csection, spotting, c-section, contraction, labor, miscarria
skin_related | rash, skin, eczema, hives, pox
stool_probs | stool
swelling | swell, swollen
urinary_probs | urin
vomitting | nausea, vomit
others | (none)

Data preparation

For each record, we obtain the symptom description as a string and split it on delimiters such as ',' and 'and' to get the list of symptoms it mentions. For each symptom, we check whether any of its words (strings) matches a word in a bucket's associated set; if so, we classify the record into that bucket. A description with multiple symptoms may therefore place a record in multiple buckets. The output of this step is a set of records of the form (date, patient information, set of buckets).
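The classification step described above can be sketched as follows. This is a minimal illustration, not the toolkit's actual code: it assumes substring matching against the associated words (suggested by stems such as 'diarr' and 'respira'), assumes that 'others' is the fallback for records that match nothing, and includes only a few of the 24 buckets. The names BUCKET_KEYWORDS, split_symptoms and classify are hypothetical.

```python
import re

# Hypothetical subset of the bucket table; the full mapping would list all 24 buckets.
BUCKET_KEYWORDS = {
    "abdomen_probs": ["abdomen", "abd", "genital"],
    "bodypain": ["pain", "ache", "pn", "migr", "hurts"],
    "diarrhea": ["diarr", "gastro"],
    "fever": ["fever", "temp", "ill"],
    "vomitting": ["nausea", "vomit"],
}


def split_symptoms(description):
    """Split a free-text description on ',' and on the standalone word 'and'."""
    parts = re.split(r",|\band\b", description.lower())
    return [p.strip() for p in parts if p.strip()]


def classify(description):
    """Return the set of buckets whose associated words occur in any listed symptom."""
    buckets = set()
    for symptom in split_symptoms(description):
        for bucket, keywords in BUCKET_KEYWORDS.items():
            # Assumption: a keyword matches if it is a substring of the symptom text.
            if any(word in symptom for word in keywords):
                buckets.add(bucket)
    # Assumption: records matching no bucket fall into 'others'.
    return buckets or {"others"}


print(classify("Fever, vomiting and diarrhea"))
```

Substring matching keeps the word lists short, but it also means a short stem such as 'ill' can match unrelated words, so the lists need to be checked against real samples.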

Analysis

After obtaining the standardized output, we perform a few datacube-like operations on the records to obtain the following information:

  1. Day-wise frequency of symptoms
  2. Day-wise frequency of deaths
  3. Day-wise frequency of symptoms among patients who eventually died
  4. All the above plots, city-wise (discussed in MC 2.2)
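The aggregations listed above can be sketched with simple counters. The record layout here, (admission date, death date or None, set of buckets), is a hypothetical simplification of the standardized output, and counting symptoms by admission day and deaths by day of death is an assumption.

```python
from collections import Counter
from datetime import date

# Hypothetical standardized records: (admission date, death date or None, buckets).
records = [
    (date(2009, 5, 10), None, {"fever", "vomitting"}),
    (date(2009, 5, 10), date(2009, 5, 18), {"diarrhea", "vomitting"}),
    (date(2009, 5, 11), None, {"fever"}),
]


def symptom_counts(records, fatal_only=False):
    """Count (admission day, bucket) pairs, optionally only for patients who died."""
    counts = Counter()
    for admitted, died, buckets in records:
        if fatal_only and died is None:
            continue
        for bucket in buckets:
            counts[(admitted, bucket)] += 1
    return counts


def death_counts(records):
    """Count deaths per day of death."""
    return Counter(died for _, died, _ in records if died is not None)


print(symptom_counts(records)[(date(2009, 5, 10), "vomitting")])
print(death_counts(records))
```

Adding a city field to each record and to the counter keys would give the city-wise versions of the same counts.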

We generated plots from this information to analyse how the bucket frequencies change over time. Figure 1 shows the day-wise frequency of each symptom class against the timeline, derived from the patient records of hospital entries. It contains 24 plots, one per bucket. Based on our observations, the plots fall into three groups:
1. The plots for the buckets 'abdomen', 'vomitting', 'diarrhea' and 'loss' have similar frequency characteristics. Over the 74 days of data, these buckets peak between days 25 and 35 (i.e., between May 10th and May 20th, 2009).
2. The plots for the buckets 'bleeding', 'bodypain', 'fever' and 'swelling' also resemble one another. Like the first group, they peak between days 25 and 35 (May 10th to May 20th, 2009).
3. The remaining buckets follow more or less the same pattern as one another: no peaks and no sustained rise or fall in frequency over time, which indicates that these symptoms occurred without any unusual incidence.
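The day indices quoted above map to calendar dates by simple offset arithmetic. The sketch below assumes day 1 is April 16th, 2009, a start date inferred from the stated correspondence between day 25 and May 10th, 2009; DAY_ONE and day_to_date are hypothetical names.

```python
from datetime import date, timedelta

# Assumption: day 1 of the data is April 16th, 2009 (inferred from day 25 = May 10th, 2009).
DAY_ONE = date(2009, 4, 16)


def day_to_date(day_index):
    """Convert a 1-based day index into a calendar date."""
    return DAY_ONE + timedelta(days=day_index - 1)


print(day_to_date(25), day_to_date(35))
```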

Figure 1: Day-wise frequency of buckets vs. timeline


Figure 2 shows the frequency of deaths over time, both globally (all 11 cities) and city-wise. It contains 12 plots, the first of which represents the global death frequency pattern. The global plot peaks between day 32 and day 38 (i.e., May 17th to May 23rd, 2009).

Figure 2: Day-wise frequency of deaths (globally and city-wise) vs. timeline

Frequently Occurring Symptoms

For every patient record (across all cities), we took the buckets into which the record was classified (more than one bucket when the description lists multiple symptoms) and ran a frequent pattern mining algorithm to identify buckets that frequently co-occur.

The top-5 frequent 1-item buckets are:
1. bodypain, 2. abdomen, 3. fever, 4. vomitting, 5. injury.

The top-3 frequent 2-item buckets are:
1. {vomitting, diarrhea}, 2. {fever, vomitting} and 3. {vomitting, abdomen}.

The most frequent 3-item bucket set is {fever, vomitting, diarrhea}.

From these frequent patterns, we conclude that the major symptoms of the disease outbreak are: fever, vomitting, diarrhea.
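The co-occurrence counting behind these rankings can be sketched as below. The summary names Apriori as the mining algorithm; this sketch instead counts every bucket combination of a given size exhaustively, which gives the same support counts for small set sizes but skips Apriori's candidate pruning. The transactions and the function name top_itemsets are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical bucket sets, one per patient record.
transactions = [
    {"fever", "vomitting", "diarrhea"},
    {"fever", "vomitting"},
    {"vomitting", "diarrhea"},
    {"bodypain"},
    {"fever", "vomitting", "diarrhea"},
]


def top_itemsets(transactions, size, n):
    """Return the n most frequent bucket sets of the given size with their counts."""
    counts = Counter()
    for buckets in transactions:
        # Sorting gives each set a single canonical tuple regardless of set order.
        for combo in combinations(sorted(buckets), size):
            counts[combo] += 1
    return counts.most_common(n)


print(top_itemsets(transactions, 1, 5))
print(top_itemsets(transactions, 2, 3))
print(top_itemsets(transactions, 3, 1))
```

Because each record belongs to only a handful of buckets, exhaustive counting stays cheap for sets of size one to three.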

Mini Challenge 2.2: Detailed Answer

In the detailed answer for MC 2.1, we displayed snapshots of bucket plots for the global disease outbreak patterns. Below, we display a sample of the bucket plots obtained for each city, along with the other plots and the frequent-symptom analysis we performed.

Analysis

To compare and contrast the disease outbreak patterns between cities, we display below a selection of plots (snapshots) created by the BODY toolkit.

Figure 3 displays the frequency plots of buckets for the city of Karachi. The buckets abdomen, vomitting, diarrhea and loss have similar patterns (peak and curve), as do the buckets bleeding, fever and swelling. We generated such plots for the other cities too; Karachi is shown here because it has the largest number of hospital entries.

Figure 3: Day-wise buckets frequency for Karachi


Figure 4 displays the frequency plots of buckets for Thailand. The frequency plots of the various symptoms are very different from those of the other cities. One possible explanation is that there was no disease outbreak in Thailand.

Turkey also has frequency patterns that differ from those of the other cities. The snapshot of BODY's plots for Turkey is displayed in Figure 5.

Figure 4: Day-wise buckets frequency for Thailand


Figure 5: Day-wise buckets frequency for Turkey


Figure 6 displays the frequency plots of the bucket 'diarrhea' for all the cities. The figure shows that every city except Thailand and Turkey has an outbreak of this symptom, with the peaks occurring at different times in different cities. For Aleppo and Nairobi, the peak is between day 25 and day 30 (i.e., May 10th to May 15th, 2009). For Colombia, Iran and Venezuela, it is between day 30 and day 40 (May 15th to May 25th). For Karachi, Lebanon, Saudi Arabia and Yemen, the peak occurs around day 30 (May 15th, 2009).

We also noticed that the outbreak takes around 20 days (about 3 weeks) to go from onset to peak, and around 4 weeks to go from the peak to its end.
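The peaks above were read off the plots; a programmatic counterpart could look like the sketch below. The city names and counts are invented for illustration, and the outbreak test (peak greater than twice the median daily count) is an assumed heuristic rather than a rule used in the analysis.

```python
from statistics import median

# Hypothetical day-wise counts of the 'diarrhea' bucket, starting at day 1.
diarrhea_counts = {
    "CityA": [3, 5, 9, 14, 8, 4],
    "CityB": [2, 2, 3, 2, 3, 2],
}


def peak_day(counts):
    """Return (1-based day index, count) of the highest daily count."""
    best = max(range(len(counts)), key=lambda i: counts[i])
    return best + 1, counts[best]


def has_outbreak(counts, factor=2.0):
    """Assumed heuristic: flag a series whose peak exceeds factor times its median."""
    return max(counts) > factor * median(counts)


for city, counts in diarrhea_counts.items():
    print(city, peak_day(counts), has_outbreak(counts))
```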

Figure 6: 'diarrhea' patterns in various cities


Due to limits on snapshots, we are only displaying a sample of various types of plots and snapshots of BODY toolkit. More images and details are available at: http://research.iiit.ac.in/~soujanya/iiit-gami-mc2/

Frequent Pattern Analysis

We also computed frequent patterns for each city and identified the top-5 singly occurring symptoms, the top-2 frequent symptom pairs and the most frequent set of three symptoms. These are displayed in Table 2 below. For every city except Thailand and Turkey, the most frequent set of three symptoms is {diarrhea, fever, vomitting}.

We also calculated the average number of days between a patient's admission to hospital and the patient's death, shown in the second column of Table 2. On average it is around 8 days. Saudi Arabia is an exception, with an average of 13 days, while Thailand and Turkey have the lowest average, at 5 days each.
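The per-city average can be computed as in this sketch. The rows are invented, and the layout (city, admission date, death date) is a hypothetical simplification of the fatal-case records.

```python
from datetime import date

# Hypothetical (city, admission date, death date) rows for patients who died.
fatal_cases = [
    ("CityA", date(2009, 5, 10), date(2009, 5, 18)),
    ("CityA", date(2009, 5, 12), date(2009, 5, 22)),
    ("CityB", date(2009, 5, 15), date(2009, 5, 20)),
]


def average_stay_by_city(cases):
    """Average number of days between admission and death, per city."""
    totals = {}
    for city, admitted, died in cases:
        days, n = totals.get(city, (0, 0))
        totals[city] = (days + (died - admitted).days, n + 1)
    return {city: days / n for city, (days, n) in totals.items()}


print(average_stay_by_city(fatal_cases))
```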

Table 2: Summary of buckets occurrence city-wise
City | Avg. no. of days between admittance and death | Top-5 single symptoms | Top-2 symptom pairs | Top-1 set of 3 symptoms
Aleppo | 8 | bodypain(386609) abdomen(257998) fever(189822) vomitting(179514) itch_probs(14337) | diarrhea,vomitting(49614) fever,vomitting(39612) | diarrhea,fever,vomitting(19115)
Colombia | 9 | bodypain(125708) abdomen(63306) fever(57654) vomitting(45682) injury(40042) | diarrhea,vomitting(11632) fever,vomitting(10240) | diarrhea,fever,vomitting(4846)
Iran | 9 | bodypain(97392) abdomen(47393) fever(44243) vomitting(34475) injury(31255) | diarrhea,vomitting(8914) fever,vomitting(7674) | diarrhea,fever,vomitting(3650)
Karachi | 8 | bodypain(1277551) abdomen(642283) fever(588097) vomitting(465648) injury(406227) | diarrhea,vomitting(120979) fever,vomitting(104024) | diarrhea,fever,vomitting(49691)
Lebanon | 8 | bodypain(80013) abdomen(34414) fever(35574) injury(25967) vomitting(25532) | diarrhea,vomitting(6358) fever,vomitting(5694) | diarrhea,fever,vomitting(2681)
Nairobi | 8 | bodypain(222174) abdomen(146260) fever(109223) vomitting(102382) diarrhea(86568) | diarrhea,vomitting(27962) fever,vomitting(22512) | diarrhea,fever,vomitting(10720)
Saudi Arabia | 13 | bodypain(242360) fever(106591) abdomen(99520) injury(78226) vomitting(74350) | diarrhea,vomitting(18289) fever,vomitting(16773) | diarrhea,fever,vomitting(7987)
Thailand | 5 | bodypain(16858) fever(6782) injury(5755) eye_probs(4270) skin_related(3959) | fever,vomitting(786) diarrhea,vomitting(658) | No set of 3 symptoms
Turkey | 5 | bodypain(57971) abdomen(13795) fever(11872) vomitting(9818) diarrhea(8235) | fever,vomitting(2706) diarrhea,vomitting(2231) | No set of 3 symptoms
Venezuela | 8 | bodypain(25605) abdomen(13795) fever(11872) vomitting(9818) diarrhea(8235) | diarrhea,vomitting(2586) fever,vomitting(2158) | diarrhea,fever,vomitting(1032)
Yemen | 8 | bodypain(53491) abdomen(28716) fever(24868) vomitting(20388) diarrhea(16977) | diarrhea,vomitting(5356) fever,vomitting(4578) | diarrhea,fever,vomitting(2176)
Thailand and Turkey

The plots displayed above show that Thailand and Turkey differ from the rest of the cities. This could be due to either (i) no disease outbreak in these countries or (ii) misplaced hospital records or tampered data.

Summary

Our main contributions are:

1. Proposing a system of buckets for disease symptoms (BODY). Each bucket is associated with a set of similar words, which we compiled manually after inspecting samples of the data.

2. Standardizing the data by replacing the free-text symptom description with a standard set of buckets derived from it.

3. Compiling and grouping plots by bucket, by city and globally, to view the data from different perspectives.

4. Using a frequent pattern mining algorithm (the Apriori algorithm) to identify the most frequent symptoms and the most frequently co-occurring symptoms.

5. Identifying, from our analysis, the major symptoms of the disease behind the outbreak as {bodypain, fever, vomitting, diarrhea}.

More details at: http://research.iiit.ac.in/~soujanya/iiit-gami-mc2/