Palantir Technologies -- VAST 2010 Challenge -- MC3

Palantir Technologies

VAST 2010 Challenge
Genetic Sequences – Tracing the Mutations of a Disease

Authors and Affiliations:

Palantir Technologies – VAST10 Team
Brandon Wright, Palantir Technologies, bwright@palantirtech.com

Jesse Rickard, Palantir Technologies

Alex Polit, Palantir Technologies

Jason Payne, Palantir Technologies

Tool(s):

Overview: Palantir is a platform for collaborative, all-source analysis and operations, enabling geospatial, social-network, temporal, statistical, and structured and unstructured analysis. Palantir provides flexible tools to import and model data, intuitive constructs to search against this data, and powerful techniques to iteratively define and test hypotheses. Our platform is most highly valued for:

Quick integration of Enterprise data sources. Why should analysts have to query each data source individually?
Simple, intuitive search and discovery. Why should analysts have to understand schemas and query languages?
Extensibility. Palantir is highly adaptable: new integrations and visualizations can be developed in a matter of hours, not days.
Open and interoperable with other toolkits. Many analysts have an established set of tools they rely on, which is why Palantir was built to interoperate.
Collaboration. Palantir facilitates collaboration across users, groups, and agencies.

Background: Palantir is operational today at many of the most prestigious intelligence, defense, law enforcement, and regulation/oversight organizations in the world. Palantir was put together by the founders of PayPal, capitalizing on the lessons learned by their anti-fraud department. Facing highly coordinated cyber attacks aimed at committing payment fraud and exploiting sensitive consumer information, PayPal needed an entirely new approach. Existing technology was poorly suited to dealing with sparse, cyber-specific data. Defeating the international fraud rings required high-level conceptual access to the data. The analyst-driven intelligence analysis tools that eventually became the Palantir platform were a direct outgrowth of this effort.

Company Web site:
http://www.palantirtech.com

We also used the NIH Basic Local Alignment Search Tool.

Video:

MC3.wmv

ANSWERS:

MC3.1: What is the region or country of origin for the current outbreak? Please provide your answer as the name of the native viral strain along with a brief explanation.

The outbreak originated in Nigeria; the native strain is Nigeria_b. We used the NIH Basic Local Alignment Search Tool to align the native and outbreak sequences and to produce data showing the base substitutions for each sequence. We imported a structured Excel file of these data into Palantir. With nodes representing all 68 sequences on the graph, and with the native and outbreak sequence nodes colored blue and red respectively, we used an entity relationships “link by” search to add edges between the nodes based on genetic relationships. Auto-arranging the nodes creates a force-directed layout of the mutation paths. We then identified the edge linking the blue and red networks together. Nigeria_b was the likely progenitor of outbreak sequence 531 and therefore the source of the outbreak.

Figure MC3.1.1: Force-directed layout of blue native and red outbreak sequences.

Figure MC3.1.2: Zoomed in view showing that Nigeria_b is the most likely progenitor of the outbreak.
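The idea of finding the edge that bridges the native and outbreak networks can be illustrated outside Palantir. The Python sketch below is not the Palantir “link by” search. It assumes the sequences are already aligned to equal length, it uses invented toy sequences rather than the challenge data, and the names `hamming` and `closest_pair` are hypothetical. It reports the native/outbreak pair that differs at the fewest positions.

```python
from itertools import product

def hamming(a, b):
    """Count positions at which two equal-length aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))

def closest_pair(native, outbreak):
    """Return (native_name, outbreak_name, distance) for the closest cross-group pair."""
    return min(
        ((n, o, hamming(native[n], outbreak[o])) for n, o in product(native, outbreak)),
        key=lambda t: t[2],
    )

# Toy sequences invented for illustration (not the challenge data).
native = {"strain_x": "ACGTACGT", "strain_y": "ACGTTTTT"}
outbreak = {"1": "ACGAACGT", "2": "ACGAACGA"}

bridge = closest_pair(native, outbreak)
print(bridge)
```

With real data the same comparison would run over all 68 sequences, and ties or insertions and deletions would need separate handling.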

MC3.2: Over time, the virus spreads and the diversity of the virus increases as it mutates. Two patients infected with the Drafa virus are in the same hospital as Nicolai. Nicolai has a strain identified by sequence 583. One patient has a strain identified by sequence 123 and the other has a strain identified by sequence 51. Assume only a single viral strain is in each patient. Which patient likely contracted the illness from Nicolai and why? Please provide your answer as the sequence number along with a brief explanation.

Patient 123 most likely contracted the illness from Nicolai. From the tree view created for MC3.1, we used “QuickJump” to highlight the node representing sequence 583 on the graph. Zooming in on sequence 583, we immediately saw that it shared an edge with the node representing sequence 123, with 123 being a mutation of 583. Sequence 51, by contrast, sat on a different mutation branch descending from sequence 531, the progenitor of 583. Based on the mutation paths of the virus, Nicolai was the most likely source of the infection for the patient infected with sequence 123.

Figure MC3.2.1: View of Nicolai linked to the viral strain that infected him (583), with directional links indicating that the most likely mutation path of the virus runs from strain 583 to 123.
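The reasoning about mutation paths can be expressed as a walk up the “mutation of” links. The Python sketch below is illustrative only. The links are taken from the relationships stated in the answers, except that treating 51 as a direct child of 531 is a simplifying assumption, and the `ancestors` helper is hypothetical.

```python
# Child -> parent ("mutation of") links from the relationships described in the
# text. Treating 51 as a direct child of 531 is a simplifying assumption.
mutation_of = {
    "531": "Nigeria_b",
    "583": "531",
    "123": "583",
    "51": "531",
}

def ancestors(seq, parents):
    """Walk the "mutation of" links upward and return the chain of progenitors."""
    chain = []
    while seq in parents:
        seq = parents[seq]
        chain.append(seq)
    return chain

for patient in ("123", "51"):
    via_nicolai = "583" in ancestors(patient, mutation_of)
    print(patient, "descends from 583:", via_nicolai)
```

Only sequence 123 has 583 among its progenitors, which matches the conclusion that this patient was infected by Nicolai.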

MC3.3: Signs and symptoms of the Drafa virus are varied and humans react differently to infection. Some mutant strains from the current outbreak have been reported as being worse than others for the patients that come in contact with them.

Identify the top 3 mutations that lead to an increase in symptom severity (a disease characteristic). The mutations involve one or more base substitutions. For this question, the biological properties of the underlying amino acid sequence patterns are not significant in determining disease characteristics.

For each mutation provide the base substitutions and their position in the sequence (left to right) where the base substitutions occurred. For example,

C → G, 456 (C changed to G at position 456)

G → A, 513 and T → A, 907 (G changed to A at position 513 and T changed to A at position 907)

A → G, 39 (A changed to G at position 39)

The top three base substitutions are:

T → C, 842 and A → T, 946

A → G, 223

A → C, 197

Our imported MC3.1 data included disease characteristics and base substitutions resolved to the respective Viral Sequences. With the 58 outbreak sequence nodes on the graph, we used the histogram to separate the nodes into groups by symptom severity. Next we used the histogram to highlight nodes corresponding to shared mutations. We looked for mutations that occurred more often among the nodes with more severe symptoms. We also colored nodes by base substitution to more easily identify selected base substitutions that substantially overlapped with the colored nodes, which we considered to be combination mutations.

Figure MC3.3.1: Outbreak sequences separated into groups (left to right) of mild, moderate, and severe symptoms, with histogram selection of Viral Sequences with the mutation A to T at position 946.
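The histogram comparison amounts to asking which mutations are over-represented among severe cases. The Python sketch below illustrates that idea under stated assumptions: the records and mutation labels (`m1`, `m2`, `m3`) are invented, the `mutation_share` helper is hypothetical, and the ranking by difference in share is one simple choice, not necessarily the criterion applied in Palantir.

```python
from collections import Counter

# Hypothetical records: (sequence id, symptom severity, set of mutations).
records = [
    ("s1", "severe", {"m1", "m2"}),
    ("s2", "severe", {"m1"}),
    ("s3", "moderate", {"m2"}),
    ("s4", "mild", {"m3"}),
    ("s5", "mild", {"m2", "m3"}),
]

def mutation_share(records, severity):
    """Fraction of sequences at the given severity that carry each mutation."""
    group = [muts for _, sev, muts in records if sev == severity]
    counts = Counter(m for muts in group for m in muts)
    return {m: n / len(group) for m, n in counts.items()}

severe = mutation_share(records, "severe")
mild = mutation_share(records, "mild")

# Rank mutations by how much more common they are in severe than in mild cases.
ranked = sorted(severe, key=lambda m: severe[m] - mild.get(m, 0.0), reverse=True)
print(ranked)
```

Mutations that frequently co-occur in the same sequences would then be candidates for treatment as a single combination mutation.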

MC3.4: Due to the rapid spread of the virus and limited resources, medical personnel would like to focus on treatments and quarantine procedures for the worst of the mutant strains from the current outbreak, not just symptoms as in the previous question. To find the most dangerous viral mutants, experts are monitoring multiple disease characteristics.

Consider each virulence and drug resistance characteristic as equally important. Identify the top 3 mutations that lead to the most dangerous viral strains. The mutations involve one or more base substitutions. In a worst case scenario, a very dangerous strain could cause severe symptoms, have high mortality, cause major complications, exhibit resistance to anti viral drugs, and target high risk groups. For this question, the biological properties of the underlying amino acid sequence patterns are not significant in determining disease characteristics.

For each mutation provide the base substitutions and their position in the sequence (left to right) where the base substitutions occurred. For example,

C → G, 456 (C changed to G at position 456)

G → A, 513 and T → A, 907 (G changed to A at position 513 and T changed to A at position 907)

A → G, 39 (A changed to G at position 39).

The three most dangerous strains were outbreak sequences 118, 123, and 501. Among these strains, the top three mutations were:

T → C, 842 and A → T, 946

A → C, 269

A → C, 197 and G → C, 848

To identify these strains and mutations, we first used the National Institutes of Health’s Basic Local Alignment Search Tool (BLAST, http://blast.ncbi.nlm.nih.gov/Blast.cgi) to generate data showing base substitutions and genetic similarity among the strains. Next, we restructured these results into an Excel file for import into Palantir. This restructuring in Excel took an hour of work, which was less time than it would have taken to develop a custom helper in Palantir to interact directly with BLAST. Nevertheless, Palantir’s open APIs enable organizations to easily create helpers for common workflows involving third-party tools such as BLAST.

Figure MC3.4.1: NIH Basic Local Alignment Search Tool web interface.
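The restructuring step, which we performed by hand in Excel, could also be scripted. The Python sketch below is an assumption-laden illustration, not the workflow we used. It takes already aligned, equal-length sequences (toy data here, not BLAST output), ignores insertions and deletions, and uses hypothetical helper names and column headers. It emits one row per base substitution, with positions counted left to right from 1.

```python
import csv
import io

def substitutions(reference, variant):
    """Yield (position, ref_base, var_base) for each mismatch, 1-indexed left to right."""
    if len(reference) != len(variant):
        raise ValueError("sequences must be aligned to the same length")
    for pos, (r, v) in enumerate(zip(reference, variant), start=1):
        if r != v:
            yield pos, r, v

def to_csv(reference, variants):
    """Write one row per substitution: sequence id, position, from base, to base."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["sequence", "position", "from", "to"])
    for seq_id, seq in variants.items():
        for pos, r, v in substitutions(reference, seq):
            writer.writerow([seq_id, pos, r, v])
    return buf.getvalue()

# Toy sequences invented for illustration.
reference = "ACGTAC"
variants = {"v1": "ACGAAC", "v2": "TCGTAG"}

table = to_csv(reference, variants)
print(table)
```

A file in this shape has one column per value to be mapped, so each column can be matched to a single property during import.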

Our structured data for import consisted of the original strain sequences, the disease characteristics, and the mutations by strain, including base position and substitution. We imported these data using Palantir’s front-end import wizard, where users can map columns of imported data to Palantir’s ontology. The import took less than half a minute, as we expected given the small number of rows.

Figure MC3.4.2: Palantir Import Wizard, showing schema mapping of disease characteristics import.

Before the import, we had already used the Palantir Dynamic Ontology Manager (PDOM) to create the necessary ontology elements for this data, a simple process that takes only a few minutes. The ontology elements we added included a “Viral Sequence” object with properties for the five disease characteristics and a composite property for “mutations” consisting of a base position and substitution. We also created two directional link types: a “mutation of” and an “infected by” link.

Figure MC3.4.3: View of link creation in PDOM.
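For readers unfamiliar with the ontology, its shape can be approximated with plain data classes. The Python sketch below is not PDOM or any Palantir API. The class and field names are hypothetical stand-ins for the “Viral Sequence” object, its five disease characteristic properties, the composite “mutations” property, and the two directional link types. The example values come from the answers in this document.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass(frozen=True)
class Mutation:
    """Composite property: a base substitution at a 1-indexed position."""
    position: int
    original: str
    replacement: str

    def __str__(self):
        return f"{self.original} → {self.replacement}, {self.position}"

@dataclass
class ViralSequence:
    """Object type with one property per disease characteristic (names assumed)."""
    sequence_id: str
    symptoms: Optional[str] = None
    mortality: Optional[str] = None
    complications: Optional[str] = None
    drug_resistance: Optional[str] = None
    at_risk_vulnerability: Optional[str] = None
    mutations: List[Mutation] = field(default_factory=list)

@dataclass(frozen=True)
class Link:
    """Directional link; kind is "mutation of" or "infected by"."""
    kind: str
    source: str
    target: str

seq_123 = ViralSequence("123", mutations=[Mutation(842, "T", "C"), Mutation(946, "A", "T")])
links = [Link("mutation of", "123", "583"), Link("infected by", "Nicolai", "583")]
print([str(m) for m in seq_123.mutations])
```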

We used a filter search in Palantir to identify the most dangerous strains based on the five disease characteristic properties. With all 58 outbreak Viral Sequence nodes on the graph, our filter search for the most dangerous rating in all five categories ghosted all 58 nodes, indicating no match. We then turned to the histogram, which bins graph selections based on property counts. Because histogram selections highlight the corresponding nodes on the graph, we could iteratively select and drill down on characteristics. This revealed that strains with major complications have at most medium at-risk vulnerability and mortality.

Figure MC3.4.4: Property filter applied to reveal most dangerous Viral Sequences.

Changing the filter to require minor complications but the most dangerous rating for every other disease characteristic ghosted all but three strains: 118, 123, and 501. A histogram of these three strains displays bins of their mutations. All three had the 842 and 946 mutations, two had the mutation at 269, and the third had the combined 197 and 848 mutation.

Figure MC3.4.5: Histogram selection of shared mutation at position 269 on two of the three most dangerous Viral Sequences.
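The filter-then-relax sequence can be sketched in a few lines. The Python example below is illustrative only. The strain ids, the numeric 0 to 2 rating scale, and the property names are all invented, and `matches` is a hypothetical helper, not a Palantir function. It shows a strict filter returning no match and a relaxed filter on complications returning a result.

```python
CHARACTERISTICS = ("symptoms", "mortality", "complications", "drug_resistance", "at_risk")

# Hypothetical ratings on a 0 (least dangerous) to 2 (most dangerous) scale.
strains = {
    "a": {"symptoms": 2, "mortality": 2, "complications": 0, "drug_resistance": 2, "at_risk": 2},
    "b": {"symptoms": 2, "mortality": 1, "complications": 2, "drug_resistance": 2, "at_risk": 1},
    "c": {"symptoms": 1, "mortality": 2, "complications": 0, "drug_resistance": 2, "at_risk": 2},
}

def matches(strains, required):
    """Return ids whose rating equals the required value for every listed characteristic."""
    return [
        sid for sid, ratings in strains.items()
        if all(ratings[name] == value for name, value in required.items())
    ]

# Strict filter: the most dangerous rating on all five characteristics.
strict = {name: 2 for name in CHARACTERISTICS}
strict_hits = matches(strains, strict)

# Relaxed filter: accept minor complications, keep the other four at the maximum.
relaxed = dict(strict, complications=0)
relaxed_hits = matches(strains, relaxed)

print(strict_hits, relaxed_hits)
```

In this toy data, strain `b` plays the role of a strain with major complications but only medium mortality and at-risk vulnerability, which is why no strain satisfies the strict filter.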

The entire analytic process in Palantir took less than half an hour. To share these results, Palantir’s graph sharing feature makes graphs available to other enterprise users. Alternatively, data can be exported to Excel or Word, and graph screenshots can be exported to PowerPoint or HTML.