Submission ID: 174
Team Members: Hanisha Veeramachaneni, Soujanya Vadapalli, Kamalakar Karlapalem
Institute: International Institute of Information Technology, Hyderabad, INDIA
Primary Email Contact: soujanya@iiit.ac.in

Mini Challenge 2.1: Detailed Answer

BODY - Buckets Of Disease sYmptoms for Analysis

Given a set of hospital entry records of patients, we developed a toolkit that presents different plots for understanding and analyzing the symptom patterns. Where the analysis allows, the aim is to predict a possible outbreak of a disease, with details such as symptoms, time zones and mortality rates. The toolkit is referred to as BODY (Buckets Of Disease sYmptoms); we describe its utility and our analysis based on BODY's plots and charts.

Understanding challenge data

We began our analysis by manually going through a sample of the hospital records given for each city. We observed that the symptom description provided in each hospital record does not follow a standard format. A few examples are:

A symptom is described using different associated words or different forms of a word.
Example: blurred vision, blurry vision, blurried vision
A symptom is described using a different placement of words within the phrase.
Example: Stuffy nose, Nose Stuffy
Spelling mistakes, repetition of words and typos.
Examples: "abd painvomiting", "feverfever"
Multiple symptoms are not delimited in any standard way.
Examples: "headache blurred vision", "vomitingheadache"

In order to standardize the symptoms, we identified 24 classes of symptoms by broadly going through the data records. The 24 symptom classes, along with their sets of commonly used associated words, were compiled manually by looking at data samples. A symptom class is referred to as a bucket of disease symptoms (or simply a bucket); the table listing each bucket and its set of associated words is given below:

Bucket Of Disease sYmptom (class)	Set of associated words
abdomen_probs	abdomen, abd, genital
accidents	accident, assault, bite
bleeding	blood, bleed, bled
breathing_probs	asthma, wheezing, breath, respira
bodypain	pain, ache, pn, migr, hurts
cold_cough	cold, cough, sore throat, sinu
chest_probs	chest
diabetic_probs	diab
diarrhea	diarr, gastro
eye_probs	eye, vision, visual, conjuctiv
fatigue	fatigue, seizure, weak, lighthead, passing, letharg, light head, dizzy, dizzi
fever	fever, temp, ill
heart_probs	heart, cardia, stroke
injury	inj
itch_probs	itch, allerg
infections	infection
loss	loss
pregnancy	pregnancy, csection, spotting, c-section, contraction, labor, miscarria
skin_related	rash, skin, eczema, hives, pox
stool_probs	stool
swelling	swell, swollen
urinary_probs	urin
vomitting	nausea, vomit
others

Data preparation

For each record, we obtain the symptom description (as a string). We check for possible delimiters like ',' and 'and'. Based on the delimiters present, we obtain the list of symptoms mentioned in the description. For each symptom, we check if any of the words (strings) match with the associated set of words. If it matches a particular bucket's associated words, we classify that record into the corresponding bucket. In case of multiple symptoms in the description, the record might be classified into multiple buckets of disease symptoms. The output of this technique is data records of the format: (date, patient information, a set of buckets).
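The bucketing step described above can be sketched roughly as follows. This is a minimal illustration, not the toolkit's actual code: the function name `classify` is hypothetical, only a few rows of the bucket table are included, and sending unmatched symptoms to 'others' is our assumption about how that bucket is filled.

```python
import re

# Keyword fragments per bucket, copied from a few rows of the bucket table.
# The full table would be filled in the same way.
BUCKETS = {
    "abdomen_probs": ["abdomen", "abd", "genital"],
    "bodypain": ["pain", "ache", "pn", "migr", "hurts"],
    "diarrhea": ["diarr", "gastro"],
    "eye_probs": ["eye", "vision", "visual", "conjuctiv"],
    "fever": ["fever", "temp", "ill"],
    "vomitting": ["nausea", "vomit"],
}

# Delimiters named in the text: a comma, or the word "and".
DELIMITERS = re.compile(r",|\band\b")


def classify(description):
    """Return the set of buckets whose keywords occur in the description."""
    buckets = set()
    for symptom in DELIMITERS.split(description.lower()):
        symptom = symptom.strip()
        if not symptom:
            continue
        # Substring matching copes with fused words such as "painvomiting".
        matched = {name for name, words in BUCKETS.items()
                   if any(word in symptom for word in words)}
        # Assumption: a symptom matching no keyword goes to 'others'.
        buckets |= matched or {"others"}
    return buckets


print(sorted(classify("abd painvomiting")))
# ['abdomen_probs', 'bodypain', 'vomitting']
```

Substring matching is what lets undelimited descriptions such as "abd painvomiting" land in several buckets at once; the cost is that short fragments like "ill" or "pn" can also match unrelated words.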

Analysis

After obtaining the standardized output, we perform a few datacube-like operations on the records to obtain the following information:

Day-wise frequency of symptoms
Day-wise frequency of deaths
Day-wise frequency of symptoms among patients who eventually died
All the above plots, city-wise (discussed in MC 2.2)
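The aggregations listed above amount to grouped counts over the standardized records. The sketch below assumes each record is a dictionary with hypothetical keys `date`, `buckets` and `death_date` (`None` for survivors); the function names and the toy records are ours and are for illustration only.

```python
from collections import Counter
from datetime import date


def daily_bucket_counts(records, only_fatal=False):
    """Count records per (day, bucket); optionally only patients who died."""
    counts = Counter()
    for rec in records:
        if only_fatal and rec["death_date"] is None:
            continue
        for bucket in rec["buckets"]:
            counts[(rec["date"], bucket)] += 1
    return counts


def daily_death_counts(records):
    """Count deaths per day of death."""
    return Counter(rec["death_date"] for rec in records
                   if rec["death_date"] is not None)


# Toy records, invented purely to exercise the functions.
sample = [
    {"date": date(2009, 5, 10), "buckets": {"fever", "vomitting"},
     "death_date": date(2009, 5, 18)},
    {"date": date(2009, 5, 10), "buckets": {"fever"}, "death_date": None},
    {"date": date(2009, 5, 11), "buckets": {"diarrhea"},
     "death_date": date(2009, 5, 18)},
]

print(daily_bucket_counts(sample)[(date(2009, 5, 10), "fever")])
print(daily_death_counts(sample)[date(2009, 5, 18)])
```

City-wise versions follow by adding a city field to the grouping key.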

We generated plots from this information to analyse how the bucket frequencies change over time. Figure 1 shows the day-wise frequency of each symptom class against the timeline, obtained from the patient records of hospital entries. Figure 1 contains 24 plots, one per bucket. Based on our observations, the plots fall into three groups:
1. The plots for buckets 'abdomen', 'vomitting', 'diarrhea' and 'loss' have similar frequency characteristics. In the 74 days of data given, these buckets reached their peak between days 25 and 35 (i.e., between May 10th and May 20th, 2009).
2. The plots for buckets 'bleeding', 'bodypain', 'fever' and 'swelling' are also similar to one another. Their peaks fall in the same range as those of the first group, between day 25 and day 35 (May 10th to May 20th, 2009).
3. The remaining buckets have more or less similar patterns: no peaks and no sustained increase or decrease in frequency over the period, which indicates that these symptoms occurred without any unusual incidence.

Figure 1: Day-wise frequency of buckets vs. timeline

In Figure 2, we display the plots of frequency of deaths over time, both globally (all 11 cities) and city-wise. The figure has 12 plots, the first one representing the global death frequency pattern. The global plot peaks between day 32 and day 38 (i.e., May 17th to May 23rd, 2009).

Figure 2: Day-wise frequency of deaths (globally and city-wise) vs. timeline

Frequently Occurring Symptoms

For all the patient records (across all cities), we obtained the buckets into which each record is classified (more than one bucket when the symptom description lists multiple symptoms) and then ran a frequent pattern mining algorithm to identify the buckets that co-occur frequently.

The top-5 frequent 1-item buckets are:
1. bodypain, 2. abdomen, 3. fever, 4. vomitting, 5. injury.

The top-3 frequent 2-item buckets are:
1. {vomitting, diarrhea}, 2. {fever, vomitting} and 3. {vomitting, abdomen}.

The top-1 frequent 3-item bucket set is {fever, vomitting, diarrhea}.

From these frequent patterns, we conclude that the major symptoms of the disease outbreak are: fever, vomitting, diarrhea.
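A level-wise (Apriori-style) search over the bucket sets can be sketched as below. This is a simplified illustration rather than the implementation we ran: the function names, the `min_support` threshold and the toy transactions are assumptions, and candidate generation is done by brute-force combination of the surviving items followed by subset pruning.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=3):
    """Return {itemset: count} for itemsets in >= min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    frequent = {}
    # Level 1: count single buckets.
    counts = Counter(frozenset([item]) for t in transactions for item in t)
    current = {s: c for s, c in counts.items() if c >= min_support}
    size = 1
    while current:
        frequent.update(current)
        if size == max_size:
            break
        size += 1
        # Candidates: combinations of surviving items whose every
        # (size - 1)-subset is itself frequent (the Apriori pruning rule).
        items = sorted({i for s in current for i in s})
        candidates = [frozenset(c) for c in combinations(items, size)
                      if all(frozenset(sub) in current
                             for sub in combinations(c, size - 1))]
        counts = Counter(c for t in transactions
                         for c in candidates if c <= t)
        current = {s: c for s, c in counts.items() if c >= min_support}
    return frequent


def top_k(frequent, size, k):
    """The k most frequent itemsets of a given size, ties broken by name."""
    ranked = sorted((s for s in frequent if len(s) == size),
                    key=lambda s: (-frequent[s], sorted(s)))
    return [(sorted(s), frequent[s]) for s in ranked[:k]]


# Toy bucket sets, invented for illustration only.
toy = [
    {"fever", "vomitting", "diarrhea"},
    {"fever", "vomitting", "diarrhea"},
    {"vomitting", "diarrhea"},
    {"fever", "vomitting"},
    {"bodypain"},
    {"bodypain", "fever"},
]
print(top_k(frequent_itemsets(toy, min_support=2), size=3, k=1))
# [(['diarrhea', 'fever', 'vomitting'], 2)]
```

Calling `top_k` with sizes 1, 2 and 3 gives rankings of the same shape as the top-5, top-3 and top-1 lists above.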

Mini Challenge 2.2: Detailed Answer

In the detailed answer for MC 2.1, we displayed snapshots of bucket plots for the global disease outbreak patterns. Below, we display a sample of the bucket plots obtained for individual cities, along with other plots and the frequent symptom analysis we performed.

Analysis

In order to compare and contrast the disease outbreak patterns between cities, we display below a set of plots (snapshots) created by the BODY toolkit.

Figure 3 displays the frequency plots of buckets for the city Karachi. The buckets abdomen, vomitting, diarrhea and loss have similar patterns (peak and curve shape), as do the buckets bleeding, fever and swelling. We generated such plots for the other cities too; Karachi is shown here because it has the largest number of hospital entries.

Figure 3: Day-wise buckets frequency for Karachi

Figure 4 displays the frequency plots of buckets for Thailand. The frequency plots of the various symptoms are very different from those of the other cities. One possible explanation is that there was no disease outbreak in Thailand.

Turkey also has frequency patterns that differ from those of the rest of the cities. The snapshot of BODY's compilation of plots for Turkey is displayed in Figure 5.

Figure 4: Day-wise buckets frequency for Thailand

Figure 5: Day-wise buckets frequency for Turkey

Figure 6 displays the frequency plots of the bucket 'diarrhea' for all the cities. From the figure, it is evident that all the cities except Thailand and Turkey have an outbreak of this symptom, and that the peaks occur at different times in different cities. For Aleppo and Nairobi, the peak is between day 25 and day 30 (i.e., May 10th to May 15th, 2009). For Colombia, Iran and Venezuela, it is between day 30 and day 40 (May 15th to May 25th). For Karachi, Lebanon, Saudi Arabia and Yemen, the peak occurs around day 30 (May 15th, 2009).

We also noticed that the time from the onset of the disease outbreak to its peak is around 20 days (about 3 weeks), and the time from the peak to the end of the outbreak is around 4 weeks.

Figure 6: 'diarrhea' patterns in various cities
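The day numbers quoted above map to calendar dates by simple offset arithmetic, and a peak day can be read off a daily count series as its maximum. The sketch below is illustrative only: the start date of April 16th, 2009 is inferred from the statement that day 25 is May 10th, 2009, and the function names are hypothetical.

```python
from datetime import date, timedelta

# Inferred, not given in the data description: day 25 = May 10th, 2009
# puts day 1 on April 16th, 2009.
FIRST_DAY = date(2009, 4, 16)


def day_to_date(day_index):
    """Convert a 1-based day number into a calendar date."""
    return FIRST_DAY + timedelta(days=day_index - 1)


def peak_day(daily_counts):
    """Return the 1-based day with the highest count (first one on ties).

    daily_counts[i] is the frequency on day i + 1.
    """
    best = max(range(len(daily_counts)), key=daily_counts.__getitem__)
    return best + 1


print(day_to_date(25))  # 2009-05-10
print(day_to_date(30))  # 2009-05-15
```

A single maximum is a crude peak estimate; on noisy series, smoothing the counts first (for example with a moving average) would give a more stable answer.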

Due to limits on the number of snapshots, we display only a sample of the various types of plots and snapshots produced by the BODY toolkit. More images and details are available at: http://research.iiit.ac.in/~soujanya/iiit-gami-mc2/

Frequent Pattern Analysis

We also computed frequent patterns for each city and identified the top-5 singly occurring symptoms, the top-2 frequent symptom pairs and the top-1 frequent set of three symptoms. These are displayed in Table 2 below. We noticed that for all the cities except Thailand and Turkey, the top-1 frequent set of three symptoms is {diarrhea, fever, vomitting}.

We also calculated the average number of days between a patient's admission to hospital and the patient's death. The numbers are given in the second column of Table 2. For most cities the average is around 8 days. Saudi Arabia is an exception, with an average of 13 days. Thailand and Turkey have the lowest average, 5 days each.

Table 2: Summary of buckets occurrence city-wise

City	Avg. no. of days between admission and death	Top-5 single symptoms	Top-2 symptom pairs	Top-1 set of three symptoms
Aleppo	8	bodypain(386609) abdomen(257998) fever(189822) vomitting(179514) itch_probs(14337)	diarrhea,vomitting(49614) fever,vomitting(39612)	diarrhea,fever,vomitting(19115)
Colombia	9	bodypain(125708) abdomen(63306) fever(57654) vomitting(45682) injury(40042)	diarrhea,vomitting(11632) fever,vomitting(10240)	diarrhea,fever,vomitting(4846)
Iran	9	bodypain(97392) abdomen(47393) fever(44243) vomitting(34475) injury(31255)	diarrhea,vomitting(8914) fever,vomitting(7674)	diarrhea,fever,vomitting(3650)
Karachi	8	bodypain(1277551) abdomen(642283) fever(588097) vomitting(465648) injury(406227)	diarrhea,vomitting(120979) fever,vomitting(104024)	diarrhea,fever,vomitting(49691)
Lebanon	8	bodypain(80013) abdomen(34414) fever(35574) injury(25967) vomitting(25532)	diarrhea,vomitting(6358) fever,vomitting(5694)	diarrhea,fever,vomitting(2681)
Nairobi	8	bodypain(222174) abdomen(146260) fever(109223) vomitting(102382) diarrhea(86568)	diarrhea,vomitting(27962) fever,vomitting(22512)	diarrhea,fever,vomitting(10720)
Saudi Arabia	13	bodypain(242360) fever(106591) abdomen(99520) injury(78226) vomitting(74350)	diarrhea,vomitting(18289) fever,vomitting(16773)	diarrhea,fever,vomitting(7987)
Thailand	5	bodypain(16858) fever(6782) injury(5755) eye_probs(4270) skin_related(3959)	fever,vomitting(786) diarrhea,vomitting(658)	No 3 symptom pair
Turkey	5	bodypain(57971) abdomen(13795) fever(11872) vomitting(9818) diarrhea(8235)	fever,vomitting(2706) diarrhea,vomitting(2231)	No 3 symptom pair
Venezuela	8	bodypain(25605) abdomen(13795) fever(11872) vomitting(9818) diarrhea(8235)	diarrhea,vomitting(2586) fever,vomitting(2158)	diarrhea,fever,vomitting(1032)
Yemen	8	bodypain(53491) abdomen(28716) fever(24868) vomitting(20388) diarrhea(16977)	diarrhea,vomitting(5356) fever,vomitting(4578)	diarrhea,fever,vomitting(2176)
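The per-city averages in the second column of Table 2 are means of the gap between admission date and death date. A rough sketch follows, assuming records are available as (city, admission date, death date) tuples for patients who died; the function name and the toy records are hypothetical and do not reproduce the challenge data.

```python
from collections import defaultdict
from datetime import date


def average_days_to_death(records):
    """Mean number of days from admission to death, per city.

    records: iterable of (city, admission_date, death_date) tuples,
    restricted to patients who died.
    """
    gaps = defaultdict(list)
    for city, admitted, died in records:
        gaps[city].append((died - admitted).days)
    return {city: sum(g) / len(g) for city, g in gaps.items()}


# Toy records, invented for illustration only.
toy_deaths = [
    ("CityA", date(2009, 5, 1), date(2009, 5, 9)),
    ("CityA", date(2009, 5, 2), date(2009, 5, 8)),
    ("CityB", date(2009, 5, 3), date(2009, 5, 8)),
]
print(average_days_to_death(toy_deaths))
```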

Thailand and Turkey

Based on the plots displayed above, we notice that the plots for Thailand and Turkey are different from those of the rest of the cities. This could be due to either (i) no disease outbreak in these countries or (ii) misplaced hospital records or tampered data.

Summary

Our main contributions are:

1. Proposing a system of buckets for disease symptoms (BODY). Each bucket of disease symptoms is associated with a set of similar words. These sets of words were compiled manually by us after examining samples of the data.

2. Replacing the free-text symptom description with a standard set of buckets derived from it, as a standardization technique.

3. Compiling and grouping plots by bucket, by city and globally, to understand different perspectives of the data.

4. Using a frequent pattern mining algorithm (the Apriori algorithm) to identify the most frequent symptoms and the frequently co-occurring symptoms.

5. From our analysis, we identify the major symptoms of the disease (having an outbreak) to be {bodypain, fever, vomitting, diarrhea}.

More details at: http://research.iiit.ac.in/~soujanya/iiit-gami-mc2/
