CoEP2-Hospitalization Record Analyzer-MC2

VAST 2010 Challenge
Hospitalization Records -  Characterization of Pandemic Spread

Authors and Affiliations:

Prashant Chaudhary, College of Engineering, Pune [Primary Contact] Email : prash.c.29@gmail.com

Sonali Rahagude, College of Engineering, Pune Email : sonalirahagude@gmail.com

Gaurish Chaudhari, College of Engineering, Pune Email : gsc.chaudhari@gmail.com

Mrs. Vahida Attar, College of Engineering, Pune [Faculty Advisor] Email : vahida.comp@coep.ac.in

Tool(s): The Hospitalization Record Analyzer

We developed a tool to analyze the preprocessed data specifically for Mini Challenge 2. It is an open source tool built in Java. It reports the values of various factors after filtering the data set by city and syndrome, and it also plots the processed data. The graphs are drawn using the open source plotting software gnuplot (http://www.gnuplot.info/). The tool analyzes the preprocessed data in a variety of ways. These include :

1.      City-wise analysis

Analysis can be done for each city. It includes :

- Distribution of no. of dead/infected over weeks. (Shows a graph: blue - infected, red - dead)

- Distribution of no. of dead/infected over age groups. (Shows a graph: blue - infected, red - dead)

Furthermore, a particular city may be analyzed for a particular syndrome. This analysis :

*  Shows the graph of syndromes vs no. of people dead/infected. The top 3 most grave syndromes are labeled, and the currently selected syndrome is also labeled.

*  Gives the values of the following terms:

- No. of males infected/dead by the syndrome

- No. of females infected/dead by the syndrome

- Age group peak infected/dead by the syndrome

- Week peak infected/dead by the syndrome

2.      Overall analysis

This includes analysis for the entire data set :

- Distribution of no. of dead/infected over weeks. (Shows a graph: blue - infected, red - dead)

- Distribution of no. of dead/infected over age groups. (Shows a graph: blue - infected, red - dead)

A syndrome-wise classification for the entire data set is also provided. It :

*  Shows the graph of syndromes vs no. of people dead/infected. The top 3 most grave syndromes are labeled, and the currently selected syndrome is also labeled.

*  Gives the values of the following terms:

- Total no. of infected/dead

- Age group peak infected/dead

- Peak week infected/dead

Video:

Here is a short video describing the analysis and the use of our tool.

video

ANSWERS:

MC2.1: Analyze the records you have been given to characterize the spread of the disease.  You should take into consideration symptoms of the disease, mortality rates, temporal patterns of the onset, peak and recovery of the disease.  Health officials hope that whatever tools are developed to analyze this data might be available for the next epidemic outbreak.  They are looking for visualization tools that will save them analysis time so they can react quickly.

1. DATA PREPROCESSING :

1.1 Integration of admittance and death files

To work on a single file rather than two, we first integrated the separate admittance and death record files into a new file, adding the new attributes STAT (dead/alive) and DEATHDATE. Java code (Integrate.java) was written to do this.
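The integration step can be sketched roughly as follows. This is a minimal illustration, not the authors' Integrate.java: the class name, the row layouts (patient ID as the first comma-separated field, death rows of the form ID,DEATHDATE), the DEAD/ALIVE labels and the sample rows are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: merge admittance and death records into one record set.
// Assumed layouts (not taken from the challenge files):
//   admittance row: "ID,<other fields...>"
//   death row:      "ID,DEATHDATE"
class IntegrateSketch {

    // Returns each admittance row with two extra fields appended: STAT and DEATHDATE.
    static List<String> integrate(List<String> admittance, List<String> deaths) {
        Map<String, String> deathDateById = new HashMap<>();
        for (String row : deaths) {
            String[] fields = row.split(",", -1);
            deathDateById.put(fields[0].trim(), fields[1].trim());
        }
        List<String> merged = new ArrayList<>();
        for (String row : admittance) {
            String id = row.split(",", -1)[0].trim();
            String deathDate = deathDateById.get(id);
            if (deathDate != null) {
                merged.add(row + ",DEAD," + deathDate);
            } else {
                merged.add(row + ",ALIVE,"); // DEATHDATE left empty for survivors
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Made-up sample rows, for illustration only.
        List<String> admittance = Arrays.asList("1,34,M,04/16,ABD PAIN", "2,51,F,04/17,VOMITING");
        List<String> deaths = Arrays.asList("2,04/20");
        for (String row : integrate(admittance, deaths)) {
            System.out.println(row);
        }
    }
}
```

Indexing the death records by ID first keeps the merge to a single pass over the much larger admittance file.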

1.2 Preprocessing of Symptoms

The same symptom was represented in different formats, e.g. ABDOMINAL PAIN appeared as ABD PAIN, ABD. PAIN, ABD PX, etc. Our goal was to merge the different representations of the same symptom. This would help to reduce bias during classification.

The following procedure was followed :

STEP 1 :

We wrote Java code (Symtom.java) to extract the distinct symptoms in a given file. Its output was written to a new file which contained only the distinct symptoms.

STEP 2 :

The symptoms were analyzed. We decided on a common representation for the different representations of the same symptom, e.g. ABD PAIN was chosen for all representations of abdominal pain.

STEP 3 :

The common representation was then appended to each of the representations, separated by #. E.g. for abdominal pain the entries looked like :

ABDOMINAL PAIN#ABD PAIN

ABD.PAIN#ABD PAIN

ABD PX#ABD PAIN

STEP 4 :

New Java code (Findreplace.java) was created to read the above entries and update the data accordingly. We then ran Symtom.java again on the new file. The output file now contained fewer distinct symptoms. We repeated this procedure for 10 versions, and finally the number of distinct symptoms was reduced from approx. 1312 to 77. We could now use these 77 distinct syndromes for classification.
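One pass of Steps 1 to 4 can be sketched as below. This is an illustrative sketch under stated assumptions rather than the authors' Symtom.java or Findreplace.java: the class and method names are hypothetical, and it assumes that symptoms are compared case-insensitively and that a symptom with no mapping entry is kept unchanged.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the find-and-replace pass over symptom strings.
class SymptomNormalizeSketch {

    // Parses "VARIANT#CANONICAL" lines (the Step 3 format) into a lookup map.
    static Map<String, String> loadMapping(List<String> lines) {
        Map<String, String> mapping = new HashMap<>();
        for (String line : lines) {
            int hash = line.indexOf('#');
            if (hash < 0) {
                continue; // no canonical form assigned yet
            }
            String variant = line.substring(0, hash).trim().toUpperCase(Locale.ROOT);
            String canonical = line.substring(hash + 1).trim().toUpperCase(Locale.ROOT);
            mapping.put(variant, canonical);
        }
        return mapping;
    }

    // Replaces a symptom by its canonical form; unmapped symptoms pass through.
    static String normalize(String symptom, Map<String, String> mapping) {
        String key = symptom.trim().toUpperCase(Locale.ROOT);
        return mapping.getOrDefault(key, key);
    }

    // Distinct symptoms after normalization (the role Symtom.java plays in Step 1).
    static Set<String> distinct(List<String> symptoms, Map<String, String> mapping) {
        Set<String> result = new TreeSet<>();
        for (String s : symptoms) {
            result.add(normalize(s, mapping));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> mapping = loadMapping(Arrays.asList(
                "ABDOMINAL PAIN#ABD PAIN", "ABD.PAIN#ABD PAIN", "ABD PX#ABD PAIN"));
        List<String> raw = Arrays.asList("ABDOMINAL PAIN", "ABD.PAIN", "ABD PX", "ABD PAIN", "FEVER");
        System.out.println(distinct(raw, mapping));
    }
}
```

Repeating the pass with an extended mapping file after each review of the remaining distinct symptoms corresponds to the successive versions described above.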

 

1.3 INTEGRATING INFORMATION FROM ALL CITIES

The analysis had to be done for the overall data set, while the data set provided contained information city-wise. Hence the corresponding files of all cities were combined into one single file (all.csv). The codes described below were then run on this single file :

CODES :

i) Classify_age.java

Classify the data (infected/dead) according to age groups and store the result in all_age.csv

ii) Classify_week.java

Classify the data (infected/dead) according to weeks and store the result in all_week.csv

iii) Syndrome_classify.java

Classify the data according to syndromes. Store result in all_sym.csv

iv) Syndrome_classify_week.java

Classify the data according to syndromes and weeks. Store result in all_sym_week.csv

v) Syndrome_classify_age.java

Classify the data according to syndromes and age groups. Store result in all_sym_age.csv

This initial classification of the data helped us create files pertaining to various factors such as syndrome, age group, week, etc. These files could now be used directly for analysis, and they are the input to our tool.
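As an illustration of what one of these classification codes does, the age-group case can be sketched as follows. This is not the authors' Classify_age.java: the class name, the in-memory input arrays, the handling of ages of 100 or more, and the convention that every record counts as infected are assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of an age-group classification.
class AgeClassifySketch {

    // Maps an age to a ten-year group label such as "30-39".
    // Assumption: ages are non-negative, and ages of 100 or more go into "90-99".
    static String ageGroup(int age) {
        int lower = Math.min(age / 10, 9) * 10;
        return lower + "-" + (lower + 9);
    }

    // Returns, per age group, a two-element array {infected, dead}.
    // Assumption: every record counts as infected; dead[i] marks the fatal cases.
    static Map<String, int[]> classify(int[] ages, boolean[] dead) {
        Map<String, int[]> counts = new LinkedHashMap<>();
        for (int lower = 0; lower <= 90; lower += 10) {
            counts.put(lower + "-" + (lower + 9), new int[2]);
        }
        for (int i = 0; i < ages.length; i++) {
            int[] c = counts.get(ageGroup(ages[i]));
            c[0]++;
            if (dead[i]) {
                c[1]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample records, for illustration only.
        int[] ages = {5, 34, 37, 45};
        boolean[] dead = {false, true, false, true};
        for (Map.Entry<String, int[]> e : classify(ages, dead).entrySet()) {
            System.out.println(e.getKey() + "," + e.getValue()[0] + "," + e.getValue()[1]);
        }
    }
}
```

The week-wise and syndrome-wise classifications would follow the same pattern with a different grouping key.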

 

2. ANALYSIS FOR CHARACTERIZATION OF SPREAD OF DISEASE

We now used our tool to analyze preprocessed data.

To characterize the spread of the disease :

2.1 We first accounted for classification according to syndromes :

- By studying the graphs, we counted the no. of infected and dead males and females for the particular syndrome.

- We found the age group that had the maximum no. of infected patients and the one that had the maximum no. of dead patients for this syndrome. This will help officials find which age group is more prone to the disease.

- Next, we found the week with the maximum no. of admittances and the week with the maximum no. of deaths for the selected syndrome. This is useful in finding the temporal pattern of the disease for that particular syndrome.

Screenshots 1 & 2: These show the data analyzed for a selected syndrome, along with a plot of the no. of dead/infected against syndromes. The selected syndrome is marked with an arrow. The graph compares the various syndromes of people infected by the disease. From it, we could determine the syndromes with a high no. of dead people, and thus identify the grave symptoms.

Screenshot 1 : Shows the statistics for the syndrome of back pain. High resolution image here.

Screenshot1r.png

Screenshot 2 : Shows no. of infected & dead for various syndromes. High resolution image here.

Screenshot2r.png

List of grave symptoms

SYNDROME              NO. OF DEAD PEOPLE    NO. OF AFFECTED PEOPLE
ABD pain              70987                 943441
Vomiting              63426                 897332
Vomiting, diarrhea    55983                 631516
Back pain             41177                 628392
Vomiting, ABD pain    33393                 339066
Nose problems         18511                 291520
Diarrhea              11336                 170182

 

2.2 Classification was done according to mortality rate :

We calculated mortality rate in 2 ways:

a) no. of deaths per age group

b) no. of deaths per week

A plot of both was drawn. Using these, the mortality figures were found for each age group and for each week of the pandemic, and listed. This gave us an idea of the age groups with high mortality. We could also infer the time in the epidemic period when there was the maximum no. of casualties. Screenshots 3 and 4 depict the results.

List of mortality rates (age-group-wise)

AGE GROUP    NO. OF DEATHS
0-9          397
10-19        4390
20-29        28975
30-39        92569
40-49        128740
50-59        78162
60-69        21197
70-79        2786
80-89        253
90-99        0

List of mortality rates (week-wise)

WEEK             NO. OF DEATHS
04/16 - 04/22    382
04/23 - 04/29    1112
04/30 - 05/06    6141
05/07 - 05/13    37130
05/14 - 05/20    79080
05/21 - 05/27    115790
05/28 - 06/04    84384
06/05 - 06/11    23488
06/12 - 06/18    6785
06/19 - 06/25    2262
06/26 - 07/02    915

 

2.3 The temporal patterns of the disease were calculated :

We calculated the no. of infected/dead people week-wise. This plot gave us an idea of the spread of the disease during the onset, peak and recovery phases, i.e. over the entire duration of the epidemic. This is depicted in Screenshot 4. The list below shows the weeks included in the onset, peak and recovery phases of the pandemic.

PHASE       WEEK             NO. OF INFECTED
ONSET       04/16 - 04/22    1149488
ONSET       04/23 - 04/29    1119045
ONSET       04/30 - 05/06    1404034
PEAK        05/07 - 05/13    1836814
PEAK        05/14 - 05/20    1930903
PEAK        05/21 - 05/27    1630616
PEAK        05/28 - 06/04    1462458
RECOVERY    06/05 - 06/11    1289248
RECOVERY    06/12 - 06/18    990430
RECOVERY    06/19 - 06/25    1099755
RECOVERY    06/26 - 07/02    631157

Screenshot 3 : This screenshot shows the plot of infected/dead vs age group. High resolution image here.

Screenshot3r.png 

Screenshot 4 : This screenshot is a plot of no. of dead/infected vs weeks. High resolution image here.

Screenshot4r.png

 

MC2.2:  Compare the outbreak across cities.  Factors to consider include timing of outbreaks, numbers of people infected and recovery ability of the individual cities.  Identify any anomalies you found.

1. DATA PREPROCESSING :

In this case, data classified on the basis of city was required. Hence, the codes mentioned in the previous answer, such as Classify_age.java and Syndrome_classify.java, were run on the individual city files, and a set of new files pertaining to each city was created in a separate folder. These included :

cityname_sym.csv

It lists the no. of dead and infected, syndrome-wise

cityname_age.csv

It lists the no. of dead and infected, age-group-wise

cityname_week.csv

It lists the no. of dead and infected, week-wise

cityname_sym_week.csv

It lists the no. of infected with a particular syndrome, week-wise

cityname_sym_age.csv

It lists the no. of infected with a particular syndrome, age-group-wise

cities

It lists no. of infected and dead, city-wise

 

2. COMPARING OUTBREAK ACROSS CITIES

We now needed to compare the figures of all cities. We compared the outbreak across cities on the basis of the factors mentioned below :

2.1 Cities were compared on the basis of the time of outbreak

We took the timing of the outbreak in a city to be the week in which the largest number of people were infected. Hence, we calculated the peak infected week for each city. A list is given below.

CITY NAME               PEAK INFECTED WEEK
Aleppo, Syria           05/14 - 05/20
Tolima, Colombia        05/14 - 05/20
Tabriz, Iran            05/21 - 05/27
Karachi, Pakistan       05/14 - 05/20
Beirut, Lebanon         05/28 - 06/04
Nairobi, Kenya          05/14 - 05/20
Jedda, Saudi Arabia     05/07 - 05/13
Nonthaburi, Thailand    05/28 - 06/04
Mersin, Turkey          05/28 - 06/04
Barcelona, Venezuela    05/14 - 05/20
Aden, Yemen             05/14 - 05/20
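Finding the peak infected week amounts to locating the largest weekly count. The sketch below is a hypothetical illustration, not the tool's code. Since the per-city weekly counts are not listed in this report, it is demonstrated on the overall weekly counts from the temporal pattern list in MC2.1; the same method would be applied to each cityname_week.csv. It assumes a non-empty series and returns the earliest week in case of a tie.

```java
// Hypothetical sketch: locate the week with the largest number of infected people.
class PeakWeekSketch {

    // Week labels and overall weekly infected counts, copied from the MC2.1 list.
    static final String[] WEEKS = {
        "04/16 - 04/22", "04/23 - 04/29", "04/30 - 05/06", "05/07 - 05/13",
        "05/14 - 05/20", "05/21 - 05/27", "05/28 - 06/04", "06/05 - 06/11",
        "06/12 - 06/18", "06/19 - 06/25", "06/26 - 07/02"
    };
    static final long[] OVERALL_INFECTED = {
        1149488, 1119045, 1404034, 1836814, 1930903, 1630616,
        1462458, 1289248, 990430, 1099755, 631157
    };

    // Returns the index of the maximum count (the first one if there is a tie).
    // Assumption: the array is non-empty.
    static int peakIndex(long[] weeklyInfected) {
        int peak = 0;
        for (int i = 1; i < weeklyInfected.length; i++) {
            if (weeklyInfected[i] > weeklyInfected[peak]) {
                peak = i;
            }
        }
        return peak;
    }

    public static void main(String[] args) {
        System.out.println("Peak infected week: " + WEEKS[peakIndex(OVERALL_INFECTED)]);
    }
}
```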

 

2.2 Cities were compared on the basis of the no. of infected people

We made a plot of cities versus the no. of infected people. This plot is shown in Screenshot 1. It helped us compare the intensity of the epidemic across cities, i.e. which cities were more affected by the epidemic. A list is provided below.

Screenshot 1 : City-wise distribution of infected/dead people. High resolution image here.

Screenshot2.1r.png

Cities most affected

CITY NAME              NO. OF INFECTED PEOPLE
Karachi, Pakistan      7154925
Aleppo, Syria          2242648
Jedda, Saudi Arabia    1327563

 

2.3 Recovery ability of individual cities was found out

Ability :

We calculated the ratio of the no. of dead to the no. of infected people for individual cities and compared them. This ratio was used as an estimate of the recovery ability of a city, so a city with a lower ratio had recovered better. It helped us find cities that were more immune to the epidemic and cities that were sensitive to it. A list of cities and their ratios is provided below.

CITY                    RATIO (DEAD / INFECTED)
Mersin, Turkey          0.00092
Nonthaburi, Thailand    0.00098
Jedda, Saudi Arabia     0.01622
Beirut, Lebanon         0.01733
Tabriz, Iran            0.02196
Karachi, Pakistan       0.02314
Tolima, Colombia        0.02317
Aden, Yemen             0.02555
Barcelona, Venezuela    0.02585
Nairobi, Kenya          0.03405
Aleppo, Syria           0.03508

List of immune cities

Nonthaburi, Thailand; Mersin, Turkey

List of sensitive cities

Aleppo, Syria; Nairobi, Kenya; Barcelona, Venezuela; Aden, Yemen
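The dead/infected ratio and the resulting ordering of cities can be sketched as follows. This is a hypothetical illustration: the class and method names are assumptions, and the counts in the example are made up, because the per-city death counts are not listed in this report.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: rank cities by the ratio of dead to infected people.
class RecoveryAbilitySketch {

    static double deathRatio(long dead, long infected) {
        if (infected <= 0) {
            throw new IllegalArgumentException("infected must be positive");
        }
        return (double) dead / infected;
    }

    // Returns city names ordered from the lowest ratio (recovered best) to the highest.
    static List<String> rankByRatio(String[] cities, long[] dead, long[] infected) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < cities.length; i++) {
            order.add(i);
        }
        order.sort(Comparator.comparingDouble(i -> deathRatio(dead[i], infected[i])));
        List<String> ranked = new ArrayList<>();
        for (int i : order) {
            ranked.add(cities[i]);
        }
        return ranked;
    }

    public static void main(String[] args) {
        // Made-up counts, for illustration only.
        String[] cities = {"City A", "City B", "City C"};
        long[] dead = {30, 1, 20};
        long[] infected = {1000, 1000, 1000};
        System.out.println(rankByRatio(cities, dead, infected));
    }
}
```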

Recovery Rate

Recovery rate can be calculated from the number of people infected after the peak of the disease. The following table therefore shows the recovery of each city in terms of the mean number of people infected per week after its peak week.

CITY                    NO. INFECTED IN PEAK WEEK    MEAN NO. INFECTED PER WEEK AFTER PEAK    RATIO (col2/col3)
Aleppo, Syria           374022                       163529                                   2.29
Tolima, Colombia        88870                        57782                                    1.55
Tabriz, Iran            77315                        41623                                    1.85
Karachi, Pakistan       944140                       585204                                   1.61
Beirut, Lebanon         51911                        34256                                    1.52
Nairobi, Kenya          176879                       100434                                   1.76
Jedda, Saudi Arabia     154431                       118081                                   1.31
Nonthaburi, Thailand    10490                        7296                                     1.43
Mersin, Turkey          40118                        23778                                    1.69
Barcelona, Venezuela    17748                        11265                                    1.58
Aden, Yemen             48172                        25088                                    1.92

It can be concluded that the cities with low ratios (Jedda, Saudi Arabia and Nonthaburi, Thailand) have a low recovery rate compared to other cities, whereas Aleppo and Aden have a high recovery rate.
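The ratio in the last column can be computed as sketched below. This is a hypothetical illustration rather than the tool's code. Because the per-city weekly series are not reproduced in this report, the example runs on the overall weekly infected counts listed under MC2.1, for which the ratio comes out at about 1.63. It assumes that at least one week follows the peak.

```java
// Hypothetical sketch: ratio of peak-week infections to the mean weekly infections after the peak.
class RecoveryRateSketch {

    // Overall weekly infected counts, copied from the MC2.1 temporal pattern list.
    static final long[] OVERALL_INFECTED = {
        1149488, 1119045, 1404034, 1836814, 1930903, 1630616,
        1462458, 1289248, 990430, 1099755, 631157
    };

    static int peakIndex(long[] weekly) {
        int peak = 0;
        for (int i = 1; i < weekly.length; i++) {
            if (weekly[i] > weekly[peak]) {
                peak = i;
            }
        }
        return peak;
    }

    // A higher value means infections fell further after the peak.
    static double peakToRecoveryRatio(long[] weekly) {
        int peak = peakIndex(weekly);
        int weeksAfter = weekly.length - peak - 1;
        if (weeksAfter <= 0) {
            throw new IllegalArgumentException("no weeks after the peak");
        }
        long sum = 0;
        for (int i = peak + 1; i < weekly.length; i++) {
            sum += weekly[i];
        }
        double meanAfterPeak = (double) sum / weeksAfter;
        return weekly[peak] / meanAfterPeak;
    }

    public static void main(String[] args) {
        System.out.printf("Overall ratio: %.2f%n", peakToRecoveryRatio(OVERALL_INFECTED));
    }
}
```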

 

 Anomalies found :

Anomaly 1:

We studied the graphs of the distribution of the no. of infected over weeks for every city. The plot for Jedda, Saudi Arabia shows an anomaly, as seen in Screenshot 2.

Screenshot 2 : Distribution of infected/dead people over week for Saudi Arabia. High resolution image here.

Screenshot2.2r.png

As seen in the plot, the week with the most infected people is 05/07 - 05/13, and the recovery phase for Jedda starts after this week. However, there is a second sharp rise (of 49645 infections) in the week of 06/19 - 06/25. This is an anomaly because the week shows a substantial no. of new infections during the recovery phase.
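A check of this kind can be sketched as a scan for week-on-week rises after the peak week. This is a hypothetical illustration, not part of the tool, and the threshold parameter minRise is an assumption. Jedda's weekly series is not reproduced in this report, so the example runs on the overall weekly counts listed under MC2.1, where the only rise after the peak also falls in the week of 06/19 - 06/25.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: flag weeks after the peak in which infections rise again.
class ReboundSketch {

    // Overall weekly infected counts, copied from the MC2.1 temporal pattern list.
    static final long[] OVERALL_INFECTED = {
        1149488, 1119045, 1404034, 1836814, 1930903, 1630616,
        1462458, 1289248, 990430, 1099755, 631157
    };

    // Returns the indices of post-peak weeks whose count exceeds the previous
    // week's count by at least minRise (assumed to be positive).
    static List<Integer> reboundWeeks(long[] weekly, long minRise) {
        int peak = 0;
        for (int i = 1; i < weekly.length; i++) {
            if (weekly[i] > weekly[peak]) {
                peak = i;
            }
        }
        List<Integer> flagged = new ArrayList<>();
        for (int i = peak + 1; i < weekly.length; i++) {
            if (weekly[i] - weekly[i - 1] >= minRise) {
                flagged.add(i);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        // Index 9 corresponds to the week 06/19 - 06/25.
        System.out.println(reboundWeeks(OVERALL_INFECTED, 1));
    }
}
```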

Anomaly 2:

For the cities Mersin, Turkey and Nonthaburi, Thailand, the syndromes that led to the highest no. of infected/dead people were different from those in other cities. The grave syndromes for most of the cities were abdominal pain, vomiting and nose problems. But for Thailand and Turkey, the grave syndromes were headache and fever (see Screenshot 3). Hence, these cities did not manifest the grave syndromes of the disease.

Screenshot 3 : Syndrome distribution graph for Mersin, Turkey showing the top 3 grave syndromes and the placement of ABD PAIN, which is the grave syndrome for the other infected cities. High resolution image here.