David J. White
UrsaManor_MC3

VAST 2010 Challenge
Genetic Sequences – Tracing the Mutations of a Disease

Authors and Affiliations:

David J. White, Austin, Texas, David.White.US@gmail.com

Tool(s):

The tools used are Eclipse, Java, JGraphX, JGraphT, CamStudio, Avidemux, and OpenOffice.

The Java classes ingest and manipulate the initial data. The analysis is performed manually by examining a spreadsheet, the output of a diff tool, or a generated graph. Very little customization is required to apply the classes to additional data sets.

Video:

 

Video of execution and analysis.

 

 

ANSWERS:


MC3.1: What is the region or country of origin for the current outbreak?  Please provide your answer as the name of the native viral strain along with a brief explanation.

The native viral strain of the current outbreak is Nigeria_B. The SequenceReader class ingested the initial data set, compared every native strain with every current strain, and generated a comma-separated values (CSV) file containing the number of mutations between each pair of strains. This file was then visually inspected in OpenOffice Calc. All of the current strains are within 16 mutations of Nigeria_B.
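The pairwise comparison described above can be sketched as follows. This is a minimal illustration, not the actual SequenceReader code: the class name MutationMatrix, its method names, and the toy sequences are assumptions, and a "mutation" is taken to mean a position where two equal-length sequences differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; the real SequenceReader class is not shown in this submission.
class MutationMatrix {

    // Counts the positions at which two equal-length sequences differ.
    static int mutationCount(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("sequences must have equal length");
        }
        int count = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                count++;
            }
        }
        return count;
    }

    // Builds CSV text with one row per current strain and one column per native strain.
    static String toCsv(Map<String, String> natives, Map<String, String> currents) {
        StringBuilder csv = new StringBuilder("strain");
        for (String nativeName : natives.keySet()) {
            csv.append(',').append(nativeName);
        }
        csv.append('\n');
        for (Map.Entry<String, String> current : currents.entrySet()) {
            csv.append(current.getKey());
            for (String nativeSequence : natives.values()) {
                csv.append(',').append(mutationCount(nativeSequence, current.getValue()));
            }
            csv.append('\n');
        }
        return csv.toString();
    }

    public static void main(String[] args) {
        // Toy data for illustration only; not taken from the challenge files.
        Map<String, String> natives = new LinkedHashMap<>();
        natives.put("NativeA", "ACGTAC");
        natives.put("NativeB", "ACGTTT");
        Map<String, String> currents = new LinkedHashMap<>();
        currents.put("1", "ACGTAA");
        System.out.print(toCsv(natives, currents));
    }
}
```

Opened in a spreadsheet, each column of such a file shows how far every current strain is from one native strain, so the closest native strain can be found by scanning for the column with the smallest values.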





MC3.2:  Over time, the virus spreads and the diversity of the virus increases as it mutates.  Two patients infected with the Drafa virus are in the same hospital as Nicolai.  Nicolai has a strain identified by sequence 583.  One patient has a strain identified by sequence 123 and the other has a strain identified by sequence 51.  Assume only a single viral strain is in each patient.  Which patient likely contracted the illness from Nicolai and why?  Please provide your answer as the sequence number along with a brief explanation.

During initial ingest, the SequenceReader class wrote each strain to a separate text file, with one base per line. I then used KDiff3, an open source file and directory diff and merge tool, to visually compare the sequence files. The patient with strain 123 likely contracted the illness from Nicolai: there is a single mutation between strain 583 and strain 123, but three mutations between strain 583 and strain 51.
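The one-base-per-line layout makes a line-oriented diff report each substitution at its sequence position. The sketch below shows that layout and a programmatic equivalent of the visual comparison. The class BaseDiff, its methods, and the toy sequences are assumptions for illustration and are not part of the submitted code; the analysis itself was done by eye in KDiff3.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating the file layout and the comparison step.
class BaseDiff {

    // Splits a sequence into one base per line, so that line N holds base N.
    static List<String> onePerLine(String sequence) {
        List<String> lines = new ArrayList<>();
        for (char base : sequence.toCharArray()) {
            lines.add(String.valueOf(base));
        }
        return lines;
    }

    // Writes the one-base-per-line form to a text file for use with a diff tool.
    static void writeOnePerLine(String sequence, Path file) throws IOException {
        Files.write(file, onePerLine(sequence));
    }

    // Lists substitutions as "X -> Y, position", with positions 1-based, left to right.
    static List<String> substitutions(String from, String to) {
        if (from.length() != to.length()) {
            throw new IllegalArgumentException("sequences must have equal length");
        }
        List<String> result = new ArrayList<>();
        for (int i = 0; i < from.length(); i++) {
            if (from.charAt(i) != to.charAt(i)) {
                result.add(String.format("%c -> %c, %d", from.charAt(i), to.charAt(i), i + 1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy sequences for illustration only; not taken from the challenge files.
        for (String substitution : substitutions("AAGT", "ACGT")) {
            System.out.println(substitution);
        }
    }
}
```

The size of the list returned by substitutions is the mutation count between two strains, which is the quantity compared when deciding which patient's strain is closer to sequence 583.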





MC3.3:  Signs and symptoms of the Drafa virus are varied and humans react differently to infection.  Some mutant strains from the current outbreak have been reported as being worse than others for the patients that come in contact with them. 

Identify the top 3 mutations that lead to an increase in symptom severity (a disease characteristic).  The mutations involve one or more base substitutions.  For this question, the biological properties of the underlying amino acid sequence patterns are not significant in determining disease characteristics.

For each mutation provide the base substitutions and their position in the sequence (left to right) where the base substitutions occurred. For example,

The VirusSeverityGraph class wrote out a graph with the nodes color-coded by severity. KDiff3 was then used to determine the base substitution and the location of each mutation.

A → C, 269

A → T, 946

A → G, 223







MC3.4:  Due to the rapid spread of the virus and limited resources, medical personnel would like to focus on treatments and quarantine procedures for the worst of the mutant strains from the current outbreak, not just symptoms as in the previous question.  To find the most dangerous viral mutants, experts are monitoring multiple disease characteristics.

Consider each virulence and drug resistance characteristic as equally important.  Identify the top 3 mutations that lead to the most dangerous viral strains. The mutations involve one or more base substitutions.  In a worst case scenario, a very dangerous strain could cause severe symptoms, have high mortality, cause major complications, exhibit resistance to anti viral drugs, and target high risk groups.  For this question, the biological properties of the underlying amino acid sequence patterns are not significant in determining disease characteristics.

For each mutation provide the base substitutions and their position in the sequence (left to right) where the base substitutions occurred. For example,

C → G, 456 (C changed to G at position 456)

G → A, 513 and T → A, 907 (G changed to A at position 513 and T changed to A at position 907)

A → G, 39 (A changed to G at position 39).

My primary goal in this challenge was to write as little original code as possible. I used open source software for all phases of my analysis; for example, I used KDiff3, an open source file diff tool, to visually determine the specific base substitution and location for each mutation. The Java classes are all data driven, so the same approach may be used on different data sets. Implementing the needed classes took under 8 hours.

The virus mutation strains were graphed with a weighting function applied to each strain: the danger score of a strain is incremented once for each characteristic that falls in the highest category. The most dangerous strains are colored red. Visual examination of the graph shows that the 3 mutations that produce the most dangerous strains are:

583 → 123: A → C, 269

123 → 118: T → C, 527

333 → 501: G → C, 848
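The weighting function described above could look roughly like the following sketch. The class DangerScore, the characteristic names, and the integer encoding of categories are assumptions made for illustration; the submission does not show how the characteristics are actually stored.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the weighting function; names and encoding are assumed.
class DangerScore {

    // Adds one point for each characteristic whose level equals the highest category.
    static int score(Map<String, Integer> levels, Map<String, Integer> highestLevels) {
        int score = 0;
        for (Map.Entry<String, Integer> characteristic : levels.entrySet()) {
            Integer top = highestLevels.get(characteristic.getKey());
            if (top != null && top.equals(characteristic.getValue())) {
                score++;
            }
        }
        return score;
    }

    public static void main(String[] args) {
        // Assumed encoding: a larger integer means a worse category.
        Map<String, Integer> highest = new LinkedHashMap<>();
        highest.put("symptoms", 2);
        highest.put("mortality", 2);
        highest.put("drugResistance", 1);

        // Toy strain for illustration only; not taken from the challenge files.
        Map<String, Integer> strain = new LinkedHashMap<>();
        strain.put("symptoms", 2);
        strain.put("mortality", 1);
        strain.put("drugResistance", 1);

        System.out.println(score(strain, highest));
    }
}
```

Because every characteristic contributes at most one point, all characteristics weigh equally, which matches the instruction in the question. Strains with the highest scores would be the ones drawn in red.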