Prashant Chaudhary, College of Engineering, Pune [Primary Contact] Email : prash.c.29@gmail.com
Sonali Rahagude, College of Engineering, Pune Email : sonalirahagude@gmail.com
Gaurish Chaudhari, College of Engineering, Pune Email : gsc.chaudhari@gmail.com
Mrs. Vahida Attar, College of Engineering, Pune [Faculty Advisor] Email : vahida.comp@coep.ac.in
We developed a tool to analyze the preprocessed data for Mini Challenge 2. The tool is open source and written in Java. It reports the values of various factors, using city and syndrome filters on the data set, and it also plots the processed data. The graphs are drawn with the open source plotting software gnuplot (http://www.gnuplot.info/). The tool analyzes the preprocessed data in a variety of ways. These include :
1. City-wise analysis
Each city can be analyzed individually. This includes :
- Distribution of no. of dead/infected over weeks. (Shows a graph: blue - infected, red - dead)
- Distribution of no. of dead/infected over age groups. (Shows a graph: blue - infected, red - dead)
Furthermore, a particular city may be analyzed for a particular syndrome. This includes :
§ A graph of syndromes vs no. of people dead/infected. The top 3 most grave syndromes are labeled, and the currently selected syndrome is also labeled.
§ The values of the following terms:
- No. of males infected/dead by the syndrome
- No. of females infected/dead by the syndrome
- Age group with the peak no. of infected/dead for the syndrome
- Week with the peak no. of infected/dead for the syndrome
2. Overall analysis
This includes analysis of the entire data set :
- Distribution of no. of dead/infected over weeks. (Shows a graph: blue - infected, red - dead)
- Distribution of no. of dead/infected over age groups. (Shows a graph: blue - infected, red - dead)
A syndrome-wise classification of the entire data set is also provided. It includes :
§ A graph of syndromes vs no. of people dead/infected. The top 3 most grave syndromes are labeled, and the currently selected syndrome is also labeled.
§ The values of the following terms:
- Total no. of infected/dead
- Age group with the peak no. of infected/dead
- Week with the peak no. of infected/dead
Video:
Here is a short video describing the analysis and the use of our tool.
ANSWERS:
MC2.1: Analyze the records you have been given to characterize the spread of the disease. You should take into consideration symptoms of the disease, mortality rates, temporal patterns of the onset, peak and recovery of the disease. Health officials hope that whatever tools are developed to analyze this data might be available for the next epidemic outbreak. They are looking for visualization tools that will save them analysis time so they can react quickly.
1. DATA PREPROCESSING :
1.1 Integration of admittance and death files
To work on a single file rather than two, we first integrated the separate admittance and death record files into a new file, adding two new attributes: STAT (dead/alive) and DEATHDATE. Java code (Integrate.java) was written to do this.
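The integration step can be sketched as a join on the patient ID. This is only an illustration and not the original Integrate.java: the column layouts (admittance rows as ID,AGE,GENDER,DATE,SYNDROME and death rows as ID,DATE_OF_DEATH), the class name and the sample rows are all assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; not the original Integrate.java. */
final class IntegrateSketch {

    /**
     * Joins admittance rows with death rows on the patient ID.
     * Assumed (hypothetical) layouts: admittance "ID,AGE,GENDER,DATE,SYNDROME",
     * death "ID,DATE_OF_DEATH". Fields are assumed to contain no commas.
     */
    static List<String> integrate(List<String> admittance, List<String> deaths) {
        // Index death dates by patient ID for fast lookup.
        Map<String, String> deathDateById = new HashMap<>();
        for (String row : deaths) {
            String[] fields = row.split(",", -1);
            deathDateById.put(fields[0].trim(), fields[1].trim());
        }
        List<String> merged = new ArrayList<>();
        for (String row : admittance) {
            String id = row.split(",", -1)[0].trim();
            String deathDate = deathDateById.get(id);
            // Append the two new attributes: STAT and DEATHDATE (empty if alive).
            merged.add(deathDate == null
                    ? row + ",alive,"
                    : row + ",dead," + deathDate);
        }
        return merged;
    }

    public static void main(String[] args) {
        // Made-up sample rows, for demonstration only.
        List<String> admittance = new ArrayList<>();
        admittance.add("1,34,M,04/20,ABD PAIN");
        admittance.add("2,51,F,04/21,VOMITING");
        List<String> deaths = new ArrayList<>();
        deaths.add("2,04/25");
        for (String line : integrate(admittance, deaths)) {
            System.out.println(line);
        }
    }
}
```

Indexing the death records in a map first keeps the join to a single pass over each file, which matters for data sets with millions of rows.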
1.2 Preprocessing of Symptoms
The same symptom was often recorded in different forms, e.g. ABDOMINAL PAIN appeared as ABD PAIN, ABD. PAIN, ABD PX, etc. Our goal was to merge the different representations of the same symptom. This helps to reduce bias during classification.
The following procedure was followed :
STEP 1 :
We wrote Java code (Symtom.java) to list the distinct symptoms in a given file. Its output was written to a new file that contained only the distinct symptoms.
STEP 2 :
The symptoms were analyzed. We decided on one common representation for the different representations of the same symptom, e.g. ABD PAIN was chosen for all representations of abdominal pain.
STEP 3 :
Next, the common representation was appended to each variant, separated by #. E.g. for abdominal pain the entries looked like :
ABDOMINAL PAIN#ABD PAIN
ABD.PAIN#ABD PAIN
ABD PX#ABD PAIN
STEP 4 :
New Java code (Findreplace.java) was created to read the above entries and update the data accordingly. We then ran Symtom.java again on the new file; the output file now contained fewer distinct symptoms. We repeated this procedure for 10 versions, and the number of distinct symptoms was finally reduced from approx. 1312 to 77. We could now use these 77 distinct syndromes for classification.
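Steps 1 to 4 can be sketched as a lookup table built from the VARIANT#CANONICAL entries. This is a minimal illustration, not the original Symtom.java or Findreplace.java; the class and method names are hypothetical, and the case- and whitespace-insensitive matching is an assumption of the sketch.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative sketch only; not the original Symtom.java or Findreplace.java. */
final class SymptomNormalizerSketch {

    /** Parses lines of the form "VARIANT#CANONICAL" into a lookup table. */
    static Map<String, String> parseMapping(List<String> lines) {
        Map<String, String> mapping = new HashMap<>();
        for (String line : lines) {
            int sep = line.indexOf('#');
            if (sep < 0) {
                continue; // skip lines without a separator
            }
            mapping.put(key(line.substring(0, sep)), line.substring(sep + 1).trim());
        }
        return mapping;
    }

    /** Lookup key: trimmed, upper-cased, inner whitespace collapsed (an assumption). */
    static String key(String symptom) {
        return symptom.trim().toUpperCase().replaceAll("\\s+", " ");
    }

    /** Returns the canonical form, or the cleaned input if no mapping exists. */
    static String normalize(String symptom, Map<String, String> mapping) {
        String k = key(symptom);
        return mapping.getOrDefault(k, k);
    }

    /** Distinct symptoms remaining after normalization (the STEP 1 listing). */
    static Set<String> distinct(List<String> symptoms, Map<String, String> mapping) {
        Set<String> result = new TreeSet<>();
        for (String s : symptoms) {
            result.add(normalize(s, mapping));
        }
        return result;
    }
}
```

Running the distinct listing again after each round of replacements shows which variants are still unmapped, which is what drives the next version of the mapping file.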
1.3 INTEGRATING INFORMATION FROM ALL CITIES
The analysis had to be done for the overall data set, while the data set provided contained information city-wise. Hence the corresponding files of all cities were combined into one single file (all.csv). The codes described below were then run on this single file :
CODES :
i) Classify_age.java
Classify the data(infected/dead) according to the age groups and store it in a particular file. (all_age.csv)
ii) Classify_week.java
Classify the data(infected/dead) according to the weeks and store it in a particular file. (all_week.csv)
iii) Syndrome_classify.java
Classify the data according to syndromes. Store result in all_sym.csv
iv) Syndrome_classify_week.java
Classify the data according to syndromes and weeks. Store result in all_sym_week.csv
v) Syndrome_classify_age.java
Classify the data according to syndromes and age groups. Store result in all_sym_age.csv
This initial classification of the data helped us create files pertaining to various factors such as syndrome, age group and week. These files could then be used directly for analysis, and they are the files used by our tool.
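As an illustration of this kind of classification, the sketch below counts infected and dead patients per ten-year age group, matching the bands used in the age-group tables (0-9, 10-19, ...). It is not the original Classify_age.java; the class name, the in-memory arrays and the rule that every admitted patient counts as infected are assumptions.

```java
import java.util.Map;
import java.util.TreeMap;

/** Illustrative sketch only; not the original Classify_age.java. */
final class ClassifyAgeSketch {

    /** Label of the ten-year band containing the age, e.g. 34 gives "30-39". */
    static String ageGroup(int age) {
        int lower = (age / 10) * 10;
        return lower + "-" + (lower + 9);
    }

    /**
     * Counts patients per age group.
     * ages[i] is the age of patient i; dead[i] is true if STAT is "dead".
     * Returns: lower bound of the band, mapped to {infected, dead}.
     * Assumption: every admitted patient counts as infected.
     */
    static Map<Integer, int[]> countByAgeGroup(int[] ages, boolean[] dead) {
        Map<Integer, int[]> counts = new TreeMap<>(); // sorted by lower bound
        for (int i = 0; i < ages.length; i++) {
            int lower = (ages[i] / 10) * 10;
            int[] c = counts.computeIfAbsent(lower, k -> new int[2]);
            c[0]++;          // infected
            if (dead[i]) {
                c[1]++;      // dead
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] ages = {5, 34, 38, 41};                 // made-up sample patients
        boolean[] dead = {false, true, false, true};
        for (Map.Entry<Integer, int[]> e : countByAgeGroup(ages, dead).entrySet()) {
            int[] c = e.getValue();
            System.out.println(ageGroup(e.getKey()) + "," + c[0] + "," + c[1]);
        }
    }
}
```

The week-wise and syndrome-wise classifications follow the same pattern with a different grouping key.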
2. ANALYSIS FOR CHARACTERIZATION OF SPREAD OF DISEASE
We now used our tool to analyze preprocessed data.
To characterize the spread of the disease :
2.1 We first accounted for classification according to syndromes :
- By studying the graphs, we counted the no. of infected and dead males and females for a particular syndrome.
- We found the age group with the maximum no. of infected patients and the age group with the maximum no. of dead patients for this syndrome. This helps officials find which age group is more prone to the disease.
- Next, we found the week with the maximum no. of admittances and the week with the maximum no. of deaths for the selected syndrome alone. This is useful in finding the temporal pattern of the disease for that particular syndrome.
Screenshots 1 & 2 : These show the data analyzed for the selected syndrome, along with a plot of the no. of dead/infected against syndromes. The selected syndrome is marked with an arrow. The graph compares the various syndromes of the people infected by the disease. From this graph, we could determine the syndromes with a high no. of dead people, and thus identify the grave symptoms.
Screenshot 1 : Shows the statistics for the syndrome of back pain. High resolution image here.
Screenshot 2 : Shows no. of infected & dead for various syndromes. High resolution image here.
List of grave symptoms
SYNDROME | NO. OF DEAD PEOPLE | NO. OF AFFECTED PEOPLE
ABD pain | 70987 | 943441
Vomiting | 63426 | 897332
Vomiting, diarrhea | 55983 | 631516
Back pain | 41177 | 628392
Vomiting, ABD pain | 33393 | 339066
Nose problems | 18511 | 291520
Diarrhea | 11336 | 170182
2.2 Classification was done according to mortality rate :
We calculated the mortality rate, measured as the no. of deaths, in 2 ways: a) no. of deaths per age group
b) no. of deaths per week
Both were plotted. Using these plots, the no. of deaths was found for each age group and for each week of the pandemic, and listed. This gave us an idea of the age groups with a high mortality rate. We could also infer the period of the epidemic with the maximum no. of casualties. Screenshots 3 and 4 depict the results.
List of mortality rates (age-group wise)
AGE GROUP | NO. OF DEATHS
0-9 | 397
10-19 | 4390
20-29 | 28975
30-39 | 92569
40-49 | 128740
50-59 | 78162
60-69 | 21197
70-79 | 2786
80-89 | 253
90-99 | 0
List of mortality rates (week wise)
WEEK | NO. OF DEATHS
04/16 - 04/22 | 382
04/23 - 04/29 | 1112
04/30 - 05/06 | 6141
05/07 - 05/13 | 37130
05/14 - 05/20 | 79080
05/21 - 05/27 | 115790
05/28 - 06/04 | 84384
06/05 - 06/11 | 23488
06/12 - 06/18 | 6785
06/19 - 06/25 | 2262
06/26 - 07/02 | 915
2.3 The temporal patterns for disease were calculated:
We calculated the no. of infected/dead people week-wise. The resulting plot gave us an idea of the spread of the disease through the onset, peak and recovery phases, i.e. the entire duration of the epidemic. This is depicted in Screenshot 4. The list below shows the weeks included in the onset, peak and recovery phases of the pandemic.
PHASE | WEEK | NO. OF INFECTED
ONSET | 04/16 - 04/22 | 1149488
ONSET | 04/23 - 04/29 | 1119045
ONSET | 04/30 - 05/06 | 1404034
PEAK | 05/07 - 05/13 | 1836814
PEAK | 05/14 - 05/20 | 1930903
PEAK | 05/21 - 05/27 | 1630616
PEAK | 05/28 - 06/04 | 1462458
RECOVERY | 06/05 - 06/11 | 1289248
RECOVERY | 06/12 - 06/18 | 990430
RECOVERY | 06/19 - 06/25 | 1099755
RECOVERY | 06/26 - 07/02 | 631157
Screenshot 3 : This screenshot shows the plot of infected/dead vs age group. High resolution image here.
Screenshot 4 : This screenshot is a plot of no. of dead/infected vs weeks. High resolution image here.
MC2.2: Compare the outbreak across cities. Factors to consider include timing of outbreaks, numbers of people infected and recovery ability of the individual cities. Identify any anomalies you found.
1. DATA PREPROCESSING :
In this case, data classified on the basis of city was required. Hence, the codes mentioned in the previous answer such as Classify_age.java, Syndrome_classify.java etc. were run on the individual city files and a set of new files pertaining to each city was formed in a different folder. These included :
cityname_sym.csv
It lists the no. of dead and infected, syndrome-wise
cityname_age.csv
It lists the no. of dead and infected, age-group-wise
cityname_week.csv
It lists the no. of dead and infected, week-wise
cityname_sym_week.csv
It lists the no. of infected with a particular syndrome, week-wise
cityname_sym_age.csv
It lists the no. of infected with a particular syndrome, age-grp wise
cities
It lists no. of infected and dead, city-wise
2. COMPARING OUTBREAK ACROSS CITIES
We now needed to compare the figures of all cities. We compared the outbreak across cities on the basis of the factors mentioned below :
2.1 Cities were compared on the basis of the time of outbreak
We took the time of outbreak of the epidemic in a city to be the week in which the most people were infected with the disease. Hence, we calculated the peak infected week for each city. A list is given below.
CITY NAME | PEAK INFECTED WEEK
Aleppo, Syria | 05/14 - 05/20
Tolima, Colombia | 05/14 - 05/20
Tabriz, Iran | 05/21 - 05/27
Karachi, Pakistan | 05/14 - 05/20
Beirut, Lebanon | 05/28 - 06/04
Nairobi, Kenya | 05/14 - 05/20
Jedda, Saudi Arabia | 05/07 - 05/13
Nonthaburi, Thailand | 05/28 - 06/04
Mersin, Turkey | 05/28 - 06/04
Barcelona, Venezuela | 05/14 - 05/20
Aden, Yemen | 05/14 - 05/20
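Finding the peak infected week amounts to taking the maximum of the weekly counts. The sketch below is illustrative and uses hypothetical class and method names. Since the per-city weekly counts are not listed in this answer, the example runs on the overall weekly totals from the temporal-pattern table in MC2.1.

```java
/** Illustrative sketch only; names are hypothetical. */
final class PeakWeekSketch {

    /** Returns the index of the week with the most infections (first one on ties). */
    static int peakIndex(long[] infectedPerWeek) {
        int best = 0;
        for (int i = 1; i < infectedPerWeek.length; i++) {
            if (infectedPerWeek[i] > infectedPerWeek[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Overall weekly totals as listed in the temporal-pattern table (MC2.1).
        String[] weeks = {
            "04/16 - 04/22", "04/23 - 04/29", "04/30 - 05/06", "05/07 - 05/13",
            "05/14 - 05/20", "05/21 - 05/27", "05/28 - 06/04", "06/05 - 06/11",
            "06/12 - 06/18", "06/19 - 06/25", "06/26 - 07/02"
        };
        long[] infected = {
            1149488, 1119045, 1404034, 1836814, 1930903, 1630616,
            1462458, 1289248, 990430, 1099755, 631157
        };
        System.out.println(weeks[peakIndex(infected)]); // prints "05/14 - 05/20"
    }
}
```

Running the same function on each cityname_week.csv file would yield one row of the table per city.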
2.2 Cities were compared on the basis of the no. of infected people
We made a plot of cities versus the no. of infected people. This plot is shown in Screenshot 1. It helped us compare the intensity of the epidemic across the cities, i.e. which cities were more affected by the epidemic. A list is provided below.
Screenshot 1 : City-wise distribution of infected/dead people. High resolution image here.
Cities most affected
CITY NAME | NO. OF INFECTED PEOPLE
Karachi, Pakistan | 7154925
Aleppo, Syria | 2242648
Jedda, Saudi Arabia | 1327563
2.3 Recovery ability of individual cities was found out
Ability :
We calculated the ratio of the no. of dead to the no. of infected people for individual cities and compared them. This ratio was used as an estimate of the recovery ability of a city: a city with a lower ratio had recovered better. It helped us find cities that were more immune to the epidemic and cities that were sensitive to it. A list of cities and their ratios is provided below.
CITY | RATIO (DEAD / INFECTED)
Mersin, Turkey | 0.00092
Nonthaburi, Thailand | 0.00098
Jedda, Saudi Arabia | 0.01622
Beirut, Lebanon | 0.01733
Tabriz, Iran | 0.02196
Karachi, Pakistan | 0.02314
Tolima, Colombia | 0.02317
Aden, Yemen | 0.02555
Barcelona, Venezuela | 0.02585
Nairobi, Kenya | 0.03405
Aleppo, Syria | 0.03508
List of immune cities
Nonthaburi, Thailand; Mersin, Turkey
List of sensitive cities
Aleppo, Syria; Nairobi, Kenya; Barcelona, Venezuela; Aden, Yemen
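The dead/infected ratio and the resulting ordering of cities can be sketched as follows. The class and method names are hypothetical and the per-city death counts are not listed in this answer, so the ranking is driven directly by ratios. The three ratios used in the usage note are taken from the table above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; names are hypothetical. */
final class CityRatioSketch {

    /** Ratio of dead to infected people for one city. */
    static double deadToInfected(long dead, long infected) {
        if (infected <= 0) {
            throw new IllegalArgumentException("infected must be positive");
        }
        return (double) dead / infected;
    }

    /** City names ordered from the lowest ratio (recovered best) to the highest. */
    static List<String> rankByRatio(Map<String, Double> ratioByCity) {
        List<String> cities = new ArrayList<>(ratioByCity.keySet());
        cities.sort((a, b) -> Double.compare(ratioByCity.get(a), ratioByCity.get(b)));
        return cities;
    }
}
```

For example, ranking Aleppo (0.03508), Mersin (0.00092) and Jedda (0.01622) places Mersin first and Aleppo last, consistent with the immune and sensitive lists.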
Recovery Rate
The recovery rate can be estimated from the number of people infected after the peak of the disease. The following table shows, for each city, the no. of people infected in the peak week, the mean weekly no. of people infected after the peak, and the ratio of the two.
City | No. of people infected in peak week | Mean no. of people infected per week in recovery period (weeks after peak week) | Ratio (col2/col3)
Aleppo, Syria | 374022 | 163529 | 2.29
Tolima, Colombia | 88870 | 57782 | 1.55
Tabriz, Iran | 77315 | 41623 | 1.85
Karachi, Pakistan | 944140 | 585204 | 1.61
Beirut, Lebanon | 51911 | 34256 | 1.52
Nairobi, Kenya | 176879 | 100434 | 1.76
Jedda, Saudi Arabia | 154431 | 118081 | 1.31
Nonthaburi, Thailand | 10490 | 7296 | 1.43
Mersin, Turkey | 40118 | 23778 | 1.69
Barcelona, Venezuela | 17748 | 11265 | 1.58
Aden, Yemen | 48172 | 25088 | 1.92
It can be concluded that the cities with low ratios (Jedda, Saudi Arabia and Nonthaburi, Thailand) have a low recovery rate compared to the other cities, whereas Aleppo and Aden have a high recovery rate.
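The recovery ratio in the table (peak-week infections divided by the mean weekly infections after the peak) can be sketched as below. The class and method names are hypothetical, and the weekly series in the usage note is made up because the per-city weekly counts are not listed here.

```java
/** Illustrative sketch only; names are hypothetical. */
final class RecoverySketch {

    /** Index of the week with the most infections (first one on ties). */
    static int peakIndex(long[] weekly) {
        int peak = 0;
        for (int i = 1; i < weekly.length; i++) {
            if (weekly[i] > weekly[peak]) {
                peak = i;
            }
        }
        return peak;
    }

    /** Mean weekly infections over the weeks that follow the peak week. */
    static double meanAfterPeak(long[] weekly) {
        int peak = peakIndex(weekly);
        int weeksAfter = weekly.length - peak - 1;
        if (weeksAfter == 0) {
            throw new IllegalArgumentException("no weeks after the peak");
        }
        long sum = 0;
        for (int i = peak + 1; i < weekly.length; i++) {
            sum += weekly[i];
        }
        return (double) sum / weeksAfter;
    }

    /** Peak-week infections divided by the mean weekly infections after the peak. */
    static double recoveryRatio(long[] weekly) {
        return weekly[peakIndex(weekly)] / meanAfterPeak(weekly);
    }
}
```

With a made-up series of 10, 40, 20, 10, 30 infections per week, the peak is 40, the mean after the peak is 20, and the ratio is 2.0. As a check against the table, the Aleppo row gives 374022 / 163529, which rounds to the listed 2.29.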
Anomalies found :
Anomaly 1:
We studied the graphs of the distribution of the no. of infected over weeks for every city. The plot for Jedda, Saudi Arabia shows an anomaly, as shown in Screenshot 2.
Screenshot 2 : Distribution of infected/dead people over weeks for Jedda, Saudi Arabia. High resolution image here.
As seen in the plot, the week with the highest no. of infected people is 05/07 - 05/13. The recovery phase for Jedda starts after this week. However, there is again a sharp rise (of 49645 infections) in the week of 06/19 - 06/25. This is an anomaly because the week shows a substantial no. of new infections during the recovery phase.
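We found this anomaly by inspecting the plot, but a rise during the recovery phase can also be flagged automatically. The sketch below is one possible check and not part of our tool: the class name, method name and threshold parameter are hypothetical. Because Jedda's weekly counts are not listed here, the example runs on the overall weekly totals from the temporal-pattern table in MC2.1, where the same week also shows a rise.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; not part of the original tool. */
final class ReboundSketch {

    /**
     * Returns the indices of weeks after the peak week whose infections rose
     * by at least minRise compared to the previous week.
     */
    static List<Integer> risesAfterPeak(long[] weekly, long minRise) {
        int peak = 0;
        for (int i = 1; i < weekly.length; i++) {
            if (weekly[i] > weekly[peak]) {
                peak = i;
            }
        }
        List<Integer> flagged = new ArrayList<>();
        for (int i = peak + 1; i < weekly.length; i++) {
            if (weekly[i] - weekly[i - 1] >= minRise) {
                flagged.add(i);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        // Overall weekly totals as listed in the temporal-pattern table (MC2.1).
        long[] overall = {
            1149488, 1119045, 1404034, 1836814, 1930903, 1630616,
            1462458, 1289248, 990430, 1099755, 631157
        };
        // Index 9 is the week of 06/19 - 06/25.
        System.out.println(risesAfterPeak(overall, 1)); // prints "[9]"
    }
}
```

A larger minRise value filters out small week-to-week fluctuations so that only sharp rises are reported.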
Anomaly 2:
For the cities Mersin, Turkey and Nonthaburi, Thailand, the syndromes that led to the highest no. of infected/dead people were different from those of the other cities. The grave syndromes for most of the cities were abdominal pain, vomiting and nose problems. But for Thailand and Turkey, the grave syndromes were headache and fever (see Screenshot 3). Hence, these cities didn't manifest the grave syndromes of the disease.
Screenshot 3 : Syndrome distribution graph for Mersin, Turkey showing the top 3 grave syndromes and the placement of ABD PAIN, which is the grave syndrome for the other infected cities. High resolution image here.