SFU-SIAT-IMAS-MC3

VAST 2010 Challenge
Genetic Sequences – Tracing the Mutations of a Disease

Authors and Affiliations:

Mahshid Z. Baraghoush, Simon Fraser University, mzeinaly@sfu.ca

Chris D. Shaw, Simon Fraser University, shaw@sfu.ca

Tool(s):

IMAS (The Interactive Multi-genomic Analysis System) is a Visual Analytics system for the discovery of knowledge in genomic information. IMAS was initially developed by Shaw and his students in 2007. IMAS is available on Sourceforge at imas.sourceforge.net under the GPL 3 license.

IMAS enables the user to load various FASTA format files. One or more sequences can then be selected to work with at a time.

IMAS visualizes the output of common bioinformatics tools such as BLAST and ClustalW in a unified framework. In this challenge, BLAST is used for pair-wise nucleotide (NT) sequence alignment, and ClustalW is used to perform NT sequence multi-alignment. Pair-wise alignment visualizes the character-by-character similarities and highlights the differences between sequences with color. IMAS also enables users to select BLAST hits and their corresponding sequences for multi-alignment. Multi-alignment results are displayed such that identical NT letters are given a background color, and NT letters different from the consensus are highlighted with a different color.

IMAS provides horizontal zooming of the sequences. By controlling the level of detail, the user can discover patterns across the whole sequence area; non-conserved regions, for example, can stand out at a glance. The user can then zoom in to a region of interest and observe more detail.

 

Video

 

 

ANSWERS:


MC3.1: What is the region or country of origin for the current outbreak?  Please provide your answer as the name of the native viral strain along with a brief explanation.

 

Nigeria_B

We assume that the native sequence most similar to each of the current outbreak sequences is the ancestor of all the outbreak sequences.  We measured the similarity between two sequences by the number of differing bases: the fewer base substitutions between two sequences, the more similar they are.

We aligned each current outbreak sequence against the native sequences of all the countries to see which country's strain is most similar to that sequence.
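The similarity measure described above can be sketched in a few lines of Python. This is only an illustration, not the BLAST or IMAS implementation: it assumes the sequences are already aligned to equal length with no gaps, and the function names and toy sequences are hypothetical.

```python
def count_substitutions(a: str, b: str) -> int:
    """Count aligned positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def closest_native(outbreak_seq: str, native_strains: dict[str, str]) -> str:
    """Return the name of the native strain with the fewest substitutions."""
    return min(
        native_strains,
        key=lambda name: count_substitutions(outbreak_seq, native_strains[name]),
    )


# Toy data for illustration only; these are not the challenge sequences.
natives = {"strain_X": "ACGAACGT", "strain_Y": "ACGTTCGA", "strain_Z": "TTTTACGT"}
outbreak = "ACGTTCGT"
print(closest_native(outbreak, natives))
```

Repeating `closest_native` for every outbreak sequence and checking that the answers agree mirrors the step of running the alignment for each sequence in turn.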

Figure 1 shows one BLAST run result for sequence 118 in the current outbreak file against all the countries. The light purple areas show the similar regions, and the green rectangles highlight the differences, each indicating that at least one substitution takes place in that area. (We ignored the green color gradients.)

 

Figure1.bmp

Figure 1: A window of 140 Nucleotides of the results of the pair-wise alignment of sequence 118 against each of the countries (fully zoomed-in)

 

Figure 2 shows the same run fully zoomed out. The first sequence is Nigeria_B, which has the fewest substitutions, as it has the fewest highlighted areas.

 

Figure2.bmp

Figure 2:  Full-length results of pair-wise alignments of sequence 118 against each of the countries (fully zoomed-out)

 

We repeated the same steps for all of the outbreak sequences, and saw that all of them are most similar to the Nigeria_B strain.

 


MC3.2:  Over time, the virus spreads and the diversity of the virus increases as it mutates.  Two patients infected with the Drafa virus are in the same hospital as Nicolai.  Nicolai has a strain identified by sequence 583.  One patient has a strain identified by sequence 123 and the other has a strain identified by sequence 51.  Assume only a single viral strain is in each patient.  Which patient likely contracted the illness from Nicolai and why?  Please provide your answer as the sequence number along with a brief explanation.

 

Sequence # 123

 

Our assumption is that the patient whose sequence is most similar to sequence 583 is the most likely to have contracted the illness from Nicolai.

We used the same similarity definition as in MC3.1. This time we selected sequence 583 and aligned it against both 123 and 51.

As Figure 3 shows, sequence 123 has one highlighted area, whereas sequence 51 has three highlighted areas. This shows that sequence 123 is the more similar of the two to sequence 583.
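A comparison of this kind can also be listed position by position. The sketch below is an assumption-laden illustration rather than what IMAS does internally: it expects pre-aligned, equal-length sequences without gaps, and the helper name and toy sequences are hypothetical.

```python
def list_substitutions(source: str, target: str) -> list[tuple[str, str, int]]:
    """Return (old_base, new_base, position) for each differing aligned base.

    Positions are 1-based and counted left to right.
    """
    if len(source) != len(target):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (old, new, pos)
        for pos, (old, new) in enumerate(zip(source, target), start=1)
        if old != new
    ]


# Toy sequences for illustration only; not sequences 583, 123, or 51.
reference = "ACGTACGTAC"
candidates = {"patient_A": "ACGTACCTAC", "patient_B": "TCGTAGGTAA"}
for name, seq in candidates.items():
    subs = list_substitutions(reference, seq)
    print(name, len(subs), [f"{old} -> {new} {pos}" for old, new, pos in subs])
```

The candidate with the shortest list is the one most similar to the reference under the definition used in MC3.1.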

Figure3.bmp

Figure 3: The zoomed-out view of the results of aligning sequence 583 against sequences 123 and 51

 

Figure 4 shows the same results with a fully zoomed-in view.

 

Figure4.bmp

 

 

Figure 4: Detailed view of all the substitutions we found in the pair-wise alignment of sequence 583 with sequences 123 and 51

 


MC3.3:  Signs and symptoms of the Drafa virus are varied and humans react differently to infection.  Some mutant strains from the current outbreak have been reported as being worse than others for the patients that come in contact with them. 

Identify the top 3 mutations that lead to an increase in symptom severity (a disease characteristic).  The mutations involve one or more base substitutions.  For this question, the biological properties of the underlying amino acid sequence patterns are not significant in determining disease characteristics.

For each mutation provide the base substitutions and their position in the sequence (left to right) where the base substitutions occurred. For example,

C → G, 456 (C changed to G at position 456)

G → A, 513 and T → A, 907 (G changed to A at position 513 and T changed to A at position 907)

A → G, 39 (A changed to G at position 39)

 

T -> C 842, A -> T 946

 

A -> C 269

 

G -> C 161, A -> G 223, T -> C 790

 

 

We decided to split the current outbreak sequences into three separate sets based on the severity of their symptoms from the disease characteristics table.

The next step was to multi-align all the sequences of each set together. By multi-aligning the sequences of each set, we can see the mutations that occur within a set. Comparing the aligned sets also makes the mutations common to the three groups evident.

Figure 5 gives an overview of the changes between the three aligned groups.  The highlighted letters show the differences between all the sequences according to their similarity to the consensus.

 

Figure5.bmp

Figure 5: An example window of the three multi-aligned groups

 

If a mutation occurs in more than 10% of the sequences of a set, we treat it as occurring in that set and categorize it by the following rules. Figure 6 shows these substitutions with the intervening parts of the sequences cut out of the figure.

If a substitution occurs only in the severe set, we consider it a cause of a severe symptom. (An increase from the mild and moderate groups)

If a substitution occurs in both the severe and moderate sets, we consider it a moderate mutation. (An increase from the moderate group)

If a substitution occurs in all three sets, we consider it a mild mutation. (No increase happens there)
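The three rules above can be written as a small classifier. This is a hedged sketch: the input format (a per-set table of substitution frequencies), the function name, and the "uncategorized" label for combinations the rules do not mention are all assumptions made for illustration, and the frequencies are made up.

```python
def categorize(freqs: dict[str, dict[str, float]], threshold: float = 0.10) -> dict[str, str]:
    """Label each substitution by the severity sets in which it is frequent.

    freqs maps a set name ("severe", "moderate", "mild") to a table of
    substitution -> fraction of that set's sequences carrying it.
    """
    present = {
        name: {sub for sub, frac in table.items() if frac > threshold}
        for name, table in freqs.items()
    }
    severe, moderate, mild = present["severe"], present["moderate"], present["mild"]
    labels = {}
    for sub in severe | moderate | mild:
        if sub in severe and sub not in moderate and sub not in mild:
            labels[sub] = "severe"
        elif sub in severe and sub in moderate and sub not in mild:
            labels[sub] = "moderate"
        elif sub in severe and sub in moderate and sub in mild:
            labels[sub] = "mild"
        else:
            labels[sub] = "uncategorized"  # combinations the rules do not cover
    return labels


# Made-up frequencies with placeholder substitution names.
toy = {
    "severe": {"s1": 0.40, "s2": 0.30, "s3": 0.50, "s4": 0.05},
    "moderate": {"s2": 0.20, "s3": 0.60, "s4": 0.30},
    "mild": {"s3": 0.70},
}
print(categorize(toy))
```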

 

Figure6.bmp

Figure 6: These base substitutions occur in more than 10% of at least one aligned group

 

 


MC3.4:  Due to the rapid spread of the virus and limited resources, medical personnel would like to focus on treatments and quarantine procedures for the worst of the mutant strains from the current outbreak, not just symptoms as in the previous question.  To find the most dangerous viral mutants, experts are monitoring multiple disease characteristics.

Consider each virulence and drug resistance characteristic as equally important.  Identify the top 3 mutations that lead to the most dangerous viral strains. The mutations involve one or more base substitutions.  In a worst case scenario, a very dangerous strain could cause severe symptoms, have high mortality, cause major complications, exhibit resistance to anti viral drugs, and target high risk groups.  For this question, the biological properties of the underlying amino acid sequence patterns are not significant in determining disease characteristics.

For each mutation provide the base substitutions and their position in the sequence (left to right) where the base substitutions occurred. For example,

C → G, 456 (C changed to G at position 456)

G → A, 513 and T → A, 907 (G changed to A at position 513 and T changed to A at position 907)

A → G, 39 (A changed to G at position 39).

 

 

 

 

Our first step was to split all the current outbreak sequences into different sets based on their degree of dangerousness, so that we could compare the groups together and find their common mutations.

In order to define a measure that could separate mutant strains, we first assigned a number to each level of the disease characteristics, as the table below shows.

 

Symptoms      Mortality     Complications   Drug-Resistance   At-Risk-Vulnerability
Mild      0   Low      0    Minor   0       Susceptible   0   Low      0
Moderate  1   Medium   1                    Intermediate  1   Medium   1
Severe    2   High     2    Major   2       Resistant     2   High     2

 

We then added all the characteristic levels and assigned a number to each of the current outbreak sequences that we believed shows the danger level of that sequence. For example, the characteristic levels of sequence 2 are Mild Symptoms=0, Medium Mortality=1, Minor Complications=0, Intermediate Drug-Resistance=1, and Medium At-Risk=1; we assigned 3 as its level of danger. We assumed that each disease characteristic has equal impact.
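The scoring step can be sketched as a table lookup followed by a sum. The level values come from the table above and the example reproduces sequence 2 from the text; the dictionary layout and function name are our own illustrative choices.

```python
# Numeric value assigned to each level of each disease characteristic.
LEVELS = {
    "Symptoms": {"Mild": 0, "Moderate": 1, "Severe": 2},
    "Mortality": {"Low": 0, "Medium": 1, "High": 2},
    "Complications": {"Minor": 0, "Major": 2},
    "Drug-Resistance": {"Susceptible": 0, "Intermediate": 1, "Resistant": 2},
    "At-Risk-Vulnerability": {"Low": 0, "Medium": 1, "High": 2},
}


def danger_level(traits: dict[str, str]) -> int:
    """Sum the level values, weighting every characteristic equally."""
    return sum(LEVELS[name][traits[name]] for name in LEVELS)


# Sequence 2, as described in the text.
sequence_2 = {
    "Symptoms": "Mild",
    "Mortality": "Medium",
    "Complications": "Minor",
    "Drug-Resistance": "Intermediate",
    "At-Risk-Vulnerability": "Medium",
}
print(danger_level(sequence_2))  # 0 + 1 + 0 + 1 + 1 = 3
```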

This summing of characteristic levels results in danger ratings between 1 and 8 per sequence. Thus we divided all the sequences into 8 groups.  The group with danger level 8 is the most dangerous group, containing sequences 118, 123, 202, 211, 501, and 705.

We then multi-aligned all the sequences of each group. We thus have 8 multi-aligned groups, labeled from Danger Level 8 (the most dangerous mutant strains) down to Danger Level 1 (the least dangerous mutant strains).

The consensus sequence of a multi-alignment displays the most common letter at each multi-alignment position. IMAS highlights the letters that are different from the consensus sequence in each multi-alignment. The dark purple shows that all the sequences have the same characters and that no mutation occurs in that area.
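As a rough illustration of the consensus idea, the sketch below takes the most common letter in each column of an already aligned group and reports which positions contain letters that differ from it. It is not the ClustalW or IMAS code: gap handling and tie-breaking are ignored, and the function names and toy group are hypothetical.

```python
from collections import Counter


def consensus(aligned: list[str]) -> str:
    """Most common letter in each column of equal-length aligned sequences."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*aligned))


def off_consensus(aligned: list[str]) -> dict[int, set[str]]:
    """Map each 1-based position to the letters that differ from the consensus."""
    cons = consensus(aligned)
    result = {}
    for pos, (letter, column) in enumerate(zip(cons, zip(*aligned)), start=1):
        variants = set(column) - {letter}
        if variants:
            result[pos] = variants
    return result


# Toy aligned group for illustration only.
group = ["ACGT", "ACGA", "ACGT", "TCGT"]
print(consensus(group))
print(off_consensus(group))
```

Positions missing from the result correspond to columns where every sequence carries the same letter.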

Figure 7 shows a snapshot of the fully zoomed-out multi-alignments of our 8 groups. The groups are sorted from the most dangerous group to the least dangerous group. The example shown in this figure contains 355 nucleotides, from position 10 to 365. By looking at the fully zoomed-out picture we can see some patterns and get a preliminary idea of the positions that differ. By zooming in, we can go to those positions and get a more detailed view, making interpretation easier. For example, at position 223 there is an increase in the number of changes in the most dangerous groups.

 

Figure7.bmp

Figure 7: Some patterns are evident in a fully zoomed-out view

 

We make the following assumptions about the impact of Nucleotide substitutions:

If a substitution in a particular place happens in more than 10% of the sequences of at least two of the top 3 most dangerous groups, then we record this substitution for further consideration.
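This filter can be expressed directly in code. The sketch assumes the per-group substitution frequencies have already been computed, and it takes the "top 3 most dangerous groups" to be danger levels 8, 7, and 6; the function name, input layout, and numbers are illustrative assumptions.

```python
from collections import Counter


def recorded_substitutions(
    group_freqs: dict[int, dict[str, float]],
    top_levels: tuple[int, ...] = (8, 7, 6),
    threshold: float = 0.10,
    min_groups: int = 2,
) -> set[str]:
    """Substitutions exceeding `threshold` in at least `min_groups` top groups.

    group_freqs maps danger level -> {substitution: fraction of that group}.
    """
    hits = Counter()
    for level in top_levels:
        for sub, fraction in group_freqs.get(level, {}).items():
            if fraction > threshold:
                hits[sub] += 1
    return {sub for sub, count in hits.items() if count >= min_groups}


# Made-up frequencies with placeholder substitution names.
toy = {
    8: {"p": 0.50, "q": 0.20},
    7: {"p": 0.40, "q": 0.05, "r": 0.30},
    6: {"r": 0.15},
    5: {"q": 0.90},
}
print(sorted(recorded_substitutions(toy)))
```

Here "q" is dropped because it passes the threshold in only one of the three top groups; its high frequency at level 5 does not count.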

In Figure 8 we show all substitutions that meet this criterion. The vertical lines and location labels mark these substitutions.

 

Figure8.bmp

Figure 8: These mutations occur in more than 10% of the sequences of at least two of the top three most dangerous groups

 

 

We consider these substitutions in order of importance.

The most important substitutions occur at positions 842 and 946. These substitutions occur more commonly in the most dangerous groups, and they also occur together in those groups. We call this pair of substitutions mutation 1.

However, in dangerous groups where mutation 1 does not occur, there is another pair of substitutions that occur in tandem: positions 161 and 790. We call this substitution pair mutation 2. Each sequence in danger group 8 has either mutation 1 or mutation 2, but not both.
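The "either mutation 1 or mutation 2, but not both" observation is an exclusive-or test on each sequence. The sketch below uses the positions and substituted bases listed in this entry's answers; the helper names are hypothetical, and the toy sequences are placeholders filled with "N", not challenge data.

```python
# Substituted base expected at each 1-based position.
MUTATION_1 = {842: "C", 946: "T"}  # T -> C 842, A -> T 946
MUTATION_2 = {161: "C", 790: "C"}  # G -> C 161, T -> C 790


def carries(seq: str, mutation: dict[int, str]) -> bool:
    """True if seq has every substituted base at its 1-based position."""
    return all(
        len(seq) >= pos and seq[pos - 1] == base for pos, base in mutation.items()
    )


def exactly_one(seq: str) -> bool:
    """True if seq carries mutation 1 or mutation 2, but not both."""
    return carries(seq, MUTATION_1) != carries(seq, MUTATION_2)


def make_toy(changes: dict[int, str], length: int = 1000) -> str:
    """Build a placeholder sequence of 'N' with the given bases set."""
    bases = ["N"] * length
    for pos, base in changes.items():
        bases[pos - 1] = base
    return "".join(bases)


print(exactly_one(make_toy(MUTATION_1)))
print(exactly_one(make_toy({**MUTATION_1, **MUTATION_2})))
```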

Which mutation is more powerful? We looked at the disease characteristics table and categorized the sequences in terms of individual disease characteristics. The danger group 8 sequences with mutation 1 are 118, 123, and 501. These three sequences have four top-level disease characteristics: Symptoms, Mortality, Drug Resistance, and At-Risk Vulnerability. The mutation 2 sequences 202, 211, and 705 have three top-level characteristics: 202 and 705 have high Symptoms, Complications, and Drug Resistance, while 211 has high Mortality, Complications, and Drug Resistance. Thus, mutation 1 is the more dangerous of the two.

To get more evidence we looked at the danger level 7 group. The mutation 1 sequences 197, 248, 583, 876, and 997 have 3 high disease characteristics. The mutation 2 sequences 253 and 895 have 2 high disease characteristics. Of the three remaining sequences in danger group 7, two (418 and 952) have mutation 2 with 3 high disease characteristics. The weight of evidence points to mutation 1 as the most dangerous.

The last sequence in danger group 7 (842) has neither mutation 1 nor mutation 2. We observed that this sequence has a unique substitution at location 542. Since sequence 842 has high Mortality, Drug Resistance, and At-Risk Vulnerability, we take this single base substitution as mutation 3. Sequence 842 also has substitutions at locations 322 and 955, but these substitutions also occur in sequences 209, 79, and 186, which are at danger levels 5, 4, and 3 respectively. This suggests that the substitutions at locations 322 and 955 are less dangerous than the unique substitution at location 542.

The remaining substitutions are associated more strongly with sequences in medium-danger groups, so we deemed these substitutions less important. Figure 9 shows the substitutions that occur in sequence 842.

 

Figure9.bmp

Figure 9: Substitutions that occur in sequence 842

 

Mutation 1: A -> T 946, T -> C 842
Mutation 2: T -> C 790, G -> C 161
Mutation 3: A -> C 542
 
Although Mutations 1 and 2 tend to lead to severe danger, these mutations do not guarantee high danger.  Our conclusions are thus tempered by the range of severities observed for these mutations.