Palantir Technologies – VAST10 Team
Brandon Wright, Palantir Technologies, bwright@palantirtech.com
Jesse Rickard, Palantir Technologies
Alex Polit, Palantir Technologies
Jason Payne, Palantir Technologies
Overview: Palantir is a platform for collaborative, all-source analysis and operations, enabling geospatial, social-network, temporal, statistical, and structured and unstructured analysis. Palantir provides flexible tools to import and model data, intuitive constructs to search against this data, and powerful techniques to iteratively define and test hypotheses. Our platform is most highly valued for:
Background: Palantir is operational today at many of the most prestigious intelligence, defense, law enforcement, and regulation/oversight organizations in the world. Palantir was put together by the founders of PayPal, capitalizing on the lessons learned by their anti-fraud department. Facing highly coordinated cyber attacks aimed at committing payment fraud and exploiting sensitive consumer information, PayPal required an entirely new approach. Existing technology was poorly suited to dealing with sparse, cyber-specific data. Defeating the international fraud rings required high-level conceptual access to the data. The analyst-driven intelligence analysis tools that eventually became the Palantir platform were a direct outgrowth of this effort.
Company Web site:
http://www.palantirtech.com
We also used the NIH Basic Local Alignment Search Tool.
Video:
ANSWERS:
MC3.1: What is the region or country of origin for the current outbreak? Please provide your answer as the name of the native viral strain along with a brief explanation.
The outbreak originated from Nigeria. We used the NIH Basic Local Alignment Search Tool to align the native and outbreak sequences and provide data showing base substitutions for each sequence. We imported a structured Excel file of the data into Palantir. With nodes representing all 68 sequences on the graph, and with the native and outbreak sequence nodes colored blue and red respectively, we used an entity relationships “link by” search to add edges between the nodes based on genetic relationships. Auto-arranging the nodes creates a force-directed layout of the mutation paths. We then identified the edge linking the blue- and red-colored networks together. Nigeria_b was the likely progenitor of 531 and the source of the outbreak.
Figure MC3.1.1: Force-directed layout of blue native and red outbreak sequences.
Figure MC3.1.2: Zoomed-in view showing that Nigeria_b is the most likely progenitor of the outbreak.
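The step of finding the edge that joins the native and outbreak networks can be approximated outside Palantir. The sketch below is not the “link by” search itself: it assumes gap-free aligned sequences of equal length and uses the smallest Hamming distance between a native and an outbreak sequence as a stand-in for "most closely related". The function names and the toy sequences are hypothetical and are not the challenge data.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Count positions where two equal-length aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def closest_cross_pair(native: dict, outbreak: dict):
    """Return (native_name, outbreak_name, distance) for the closest pair."""
    pairs = (
        (n_name, o_name, hamming(n_seq, o_seq))
        for (n_name, n_seq), (o_name, o_seq) in product(
            native.items(), outbreak.items()
        )
    )
    return min(pairs, key=lambda pair: pair[2])


# Toy sequences invented for illustration only.
native = {"native_A": "ACGTACGTAC", "native_B": "ACGTTCGTAC"}
outbreak = {"out_1": "ACGTTCGTAA", "out_2": "ACGATCGTAA"}

bridge = closest_cross_pair(native, outbreak)
print(bridge)
```

On real data, ties and alignment gaps would need explicit handling, and the visual check on the force-directed graph remains the method actually used for this answer.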
MC3.2: Over time, the virus spreads and the diversity of the virus increases as it mutates. Two patients infected with the Drafa virus are in the same hospital as Nicolai. Nicolai has a strain identified by sequence 583. One patient has a strain identified by sequence 123 and the other has a strain identified by sequence 51. Assume only a single viral strain is in each patient. Which patient likely contracted the illness from Nicolai and why? Please provide your answer as the sequence number along with a brief explanation.
The patient with sequence 123 most likely contracted the illness from Nicolai. From the tree view created for MC3.1, we used “QuickJump” to highlight the node representing sequence 583 on the graph. Zooming in to sequence 583, we immediately saw that it shared an edge with the node representing sequence 123, with 123 being a mutation of 583. Sequence 51, on the other hand, sat on a different mutation branch descending from 583’s progenitor, sequence 531. Based on the mutation paths of the virus, Nicolai was the most likely source of the infection for the patient infected with sequence 123.
Figure MC3.2.1: View of Nicolai linked to the viral strain that infected him (583), with directional links indicating the most likely mutation path of the virus being from strain 583 to 123.
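The reasoning above amounts to asking which candidate sequence has 583 among its ancestors in the mutation tree. A minimal sketch follows, using only the "mutation of" relationships stated in the answers to MC3.1 and MC3.2. The dictionary layout and function names are assumptions made for illustration, not part of Palantir.

```python
# "mutation of" edges (child -> parent) as described in the answers above.
mutation_of = {
    "531": "Nigeria_b",
    "583": "531",
    "123": "583",
    "51": "531",
}


def ancestors(seq: str, parent_of: dict) -> list:
    """Walk child -> parent links and return the chain of ancestors."""
    chain = []
    while seq in parent_of:
        seq = parent_of[seq]
        chain.append(seq)
    return chain


def descends_from(candidate: str, source: str, parent_of: dict) -> bool:
    """True if `source` appears anywhere above `candidate` in the tree."""
    return source in ancestors(candidate, parent_of)


candidates = ["123", "51"]
from_nicolai = [c for c in candidates if descends_from(c, "583", mutation_of)]
print(from_nicolai)
```

Sequence 51 shares the ancestor 531 with 583 but does not pass through 583, which is why it is excluded.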
MC3.3: Signs and symptoms of the Drafa virus are varied and humans react differently to infection. Some mutant strains from the current outbreak have been reported as being worse than others for the patients that come in contact with them.
Identify the top 3 mutations that lead to an increase in symptom severity (a disease characteristic). The mutations involve one or more base substitutions. For this question, the biological properties of the underlying amino acid sequence patterns are not significant in determining disease characteristics.
For each mutation provide the base substitutions and their position in the sequence (left to right) where the base substitutions occurred. For example,
C → G, 456 (C changed to G at position 456)
G → A, 513 and T → A, 907 (G changed to A at position 513 and T changed to A at position 907)
A → G, 39 (A changed to G at position 39)
The top three base substitutions are:
T → C, 842 and A → T, 946
A → G, 223
A → C, 197
Our imported MC3.1 data included disease characteristics and base substitutions resolved to the respective Viral Sequences. With the 58 outbreak sequence nodes on the graph, we used the histogram to separate nodes into groups by symptom severity. Next, we used the histogram to highlight nodes corresponding to shared mutations. We looked for mutations with more occurrences among the nodes with more severe symptoms. We also colored nodes by base substitution to more easily identify selected base substitutions that substantially overlapped with the colored nodes, which we considered to be combination mutations.
Figure MC3.3.1: Outbreak sequences separated into groups (left to right) of mild, moderate, and severe symptoms, with histogram selection of Viral Sequences with the mutation A to T at position 946.
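The histogram comparison described above can be sketched as a simple count: for each mutation, how many carrier sequences fall in the severe group, and what share of all carriers that represents. The records below are hypothetical toy data that reuse the example substitutions from the question text. They are not our results, and the ranking rule is an assumption chosen for illustration.

```python
from collections import Counter

# Hypothetical toy records: (sequence id, symptom severity, mutations carried).
records = [
    ("s1", "severe", {"C → G, 456", "A → G, 39"}),
    ("s2", "severe", {"C → G, 456", "G → A, 513"}),
    ("s3", "moderate", {"G → A, 513"}),
    ("s4", "mild", {"T → A, 907"}),
]


def rank_by_severity(records, level="severe"):
    """Rank mutations by carrier count at `level`, then by share of carriers."""
    total = Counter()
    at_level = Counter()
    for _seq_id, severity, mutations in records:
        for mutation in mutations:
            total[mutation] += 1
            if severity == level:
                at_level[mutation] += 1
    # Sort descending by count and share; the label breaks remaining ties.
    return sorted(
        total,
        key=lambda m: (-at_level[m], -at_level[m] / total[m], m),
    )


ranked = rank_by_severity(records)
print(ranked[:3])
```

Mutations that always appear together on the same sequences would be reported as a single combination mutation, which this sketch does not attempt to detect.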
MC3.4: Due to the rapid spread of the virus and limited resources, medical personnel would like to focus on treatments and quarantine procedures for the worst of the mutant strains from the current outbreak, not just symptoms as in the previous question. To find the most dangerous viral mutants, experts are monitoring multiple disease characteristics.
Consider each virulence and drug resistance characteristic as equally important. Identify the top 3 mutations that lead to the most dangerous viral strains. The mutations involve one or more base substitutions. In a worst case scenario, a very dangerous strain could cause severe symptoms, have high mortality, cause major complications, exhibit resistance to anti viral drugs, and target high risk groups. For this question, the biological properties of the underlying amino acid sequence patterns are not significant in determining disease characteristics.
For each mutation provide the base substitutions and their position in the sequence (left to right) where the base substitutions occurred. For example,
C → G, 456 (C changed to G at position 456)
G → A, 513 and T → A, 907 (G changed to A at position 513 and T changed to A at position 907)
A → G, 39 (A changed to G at position 39).
The three most dangerous strains were outbreak sequences 118, 123, and 501. Among these strains, the top three mutations were:
T → C, 842 and A → T, 946
A → C, 269
A → C, 197 and G → C, 848
To identify these strains and mutations, we first used the National Institutes of Health’s Basic Local Alignment Search Tool (BLAST - http://blast.ncbi.nlm.nih.gov/Blast.cgi) to generate data showing base substitutions and genetic similarity among the strains. Next, we restructured these results to create an Excel file of the data for import into Palantir. This restructuring in Excel took an hour of work, which was less time than it would have taken to develop a custom helper in Palantir to interact directly with BLAST. Nevertheless, Palantir’s open APIs enable organizations to easily create helpers for common workflows involving third-party tools such as BLAST.
Figure MC3.4.1: NIH Basic Local Alignment Search Tool web interface.
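For readers who want to reproduce the restructuring step without BLAST, base substitutions between two aligned sequences can be listed directly. The sketch below is an assumption-laden simplification: it does not call BLAST, it assumes a gap-free alignment of equal-length sequences, and the function names are hypothetical. Positions are 1-based and counted left to right, matching the answer format requested in the questions.

```python
def base_substitutions(reference: str, variant: str) -> list:
    """Return (old_base, new_base, position) for each differing position."""
    if len(reference) != len(variant):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (old, new, pos)
        for pos, (old, new) in enumerate(zip(reference, variant), start=1)
        if old != new
    ]


def format_substitution(sub: tuple) -> str:
    """Format one substitution the way the challenge answers list them."""
    old, new, pos = sub
    return f"{old} → {new}, {pos}"


# Toy sequences invented for illustration only.
rows = [format_substitution(s) for s in base_substitutions("ACGTAC", "ACGAAG")]
print(rows)
```

Each formatted row corresponds to one mutation entry per strain in the spreadsheet prepared for import.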
Our structured data for import consisted of the original strain sequences, the disease characteristics, and mutations by strain, including base position and substitution. We imported these data using Palantir’s front-end import wizard, where users can map columns of imported data to Palantir’s ontology. The import took less than half a minute, which we expected given the small number of rows.
Figure MC3.4.2: Palantir Import Wizard, showing schema mapping of disease characteristics import.
Before the import, we had already used the Palantir Dynamic Ontology Manager (PDOM) to create the necessary ontology elements for this data, a simple process that takes only a few minutes. The ontology elements we added included a “Viral Sequence” object with properties for the five disease characteristics and a composite property for “mutations” consisting of a base position and substitution. We also created two directional link types: a “mutation of” and an “infected by” link.
Figure MC3.4.3: View of link creation in PDOM.
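The shape of the ontology elements described above can be pictured as plain data types. The sketch below is only an illustration in Python and is not PDOM output or a Palantir API. The class names, the field names for the five disease characteristics, and the string-valued ratings are all assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum


class LinkType(Enum):
    """The two directional link types we created."""
    MUTATION_OF = "mutation of"
    INFECTED_BY = "infected by"


@dataclass(frozen=True)
class Mutation:
    """Composite property: a base position plus the substitution there."""
    position: int
    substitution: str  # e.g. "A → G"


@dataclass
class ViralSequence:
    """Object with one property per disease characteristic (names assumed)."""
    sequence_id: str
    symptoms: str
    mortality: str
    complications: str
    drug_resistance: str
    at_risk_vulnerability: str
    mutations: list = field(default_factory=list)


@dataclass(frozen=True)
class Link:
    """Directional link from `source` to `target`."""
    source: str
    target: str
    link_type: LinkType


# Relationships taken from the MC3.2 answer; the ratings are placeholders.
strain = ViralSequence("example", "mild", "low", "minor", "none", "low",
                       [Mutation(39, "A → G")])
links = [
    Link("123", "583", LinkType.MUTATION_OF),
    Link("Nicolai", "583", LinkType.INFECTED_BY),
]
print(links[0].link_type.value)
```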
We used a filter search in Palantir to identify the most dangerous strains based on the five disease characteristic properties. With all 58 outbreak Viral Sequence nodes on the graph, our filter search for the most dangerous rating in all five categories ghosted all 58 nodes, indicating no match. We turned to the histogram, which bins graph selections based on property counts and highlights the nodes for each selected bin on the graph, to iteratively select and drill down on characteristics. This revealed that strains with major complications have at most medium at-risk vulnerability and mortality.
Figure MC3.4.4: Property filter applied to reveal most dangerous Viral Sequences.
Changing the original filter to require minor complications but the most dangerous rating for all other disease characteristics ghosted all but three strains: 118, 123, and 501. Histogramming the three most dangerous strains displays bins of their mutations. All three had the 842 and 946 mutations, two had a mutation at 269, and the third had a combined mutation at positions 197 and 848.
Figure MC3.4.5: Histogram selection of shared mutation at position 269 on two of the three most dangerous Viral Sequences.
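The filter-then-histogram workflow above can be sketched in a few lines. This is not Palantir's filter or histogram implementation: the numeric rating scale (3 as most dangerous, 1 as least), the property names, and the records are hypothetical toy data, chosen so that the strict filter matches nothing and the relaxed filter leaves a small set whose mutations are then counted.

```python
from collections import Counter

CHARACTERISTICS = (
    "symptoms", "mortality", "complications",
    "drug_resistance", "at_risk_vulnerability",
)


def ratings(**overrides):
    """Build a ratings dict that defaults every characteristic to 3."""
    return {**dict.fromkeys(CHARACTERISTICS, 3), **overrides}


# Hypothetical toy records, not the challenge data.
records = [
    {"id": "a", "ratings": ratings(complications=1), "mutations": {"m1", "m2", "m3"}},
    {"id": "b", "ratings": ratings(complications=1), "mutations": {"m1", "m2", "m3"}},
    {"id": "c", "ratings": ratings(complications=1), "mutations": {"m1", "m2", "m4", "m5"}},
    {"id": "d", "ratings": ratings(mortality=2), "mutations": {"m1"}},
]


def property_filter(records, wanted):
    """Keep records whose ratings equal every value in `wanted`."""
    return [
        r for r in records
        if all(r["ratings"][key] == value for key, value in wanted.items())
    ]


def mutation_histogram(records):
    """Count how many of the given records carry each mutation."""
    return Counter(m for r in records for m in r["mutations"])


strict = property_filter(records, ratings())                  # worst in all five
relaxed = property_filter(records, ratings(complications=1))  # minor complications
histogram = mutation_histogram(relaxed)
print([r["id"] for r in relaxed], histogram.most_common())
```

Reading the histogram bins from largest to smallest gives the shared mutations first, followed by those carried by only a subset of the filtered strains.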
The entire analytic process in Palantir took less than half an hour. To share this data, Palantir has a graph sharing feature to share graphs with other enterprise users. Alternatively, data can be exported to Excel or Word, and graph screenshots can be exported to PowerPoint or HTML.