CancerEvolutionVisualization (CEV) creates customizable, publication-quality plots for representing tumour evolution data. This guide focuses on phylogenetic tree visualization using CEV. For simple plots, the package handles most settings out of the box. More complex plots, however, may require some trial and error to achieve the right arrangement of nodes and branches.
This guide will show best practices for creating plots, as well as examples of common use cases and tips for refining plot settings.
To install the latest version from CRAN, run the following command in R:
There are many methods for determining subpopulations within genomic data, and you should be free to use whatever method you prefer for a given dataset. This package only handles visualization - not analysis. Therefore, data must be prepared and formatted before being passed to any CEV functions.
The input for phylogenetic tree visualization is a data frame where each row defines a parent-child relationship between 2 subclones. To load the data required for this user guide, run the following code:
`simple.example` contains an example of a simple tree with 4 nodes, while `complex.example` contains a more complex tree with 25 nodes. Both objects contain a `tree` data frame, and `simple.example` also contains a `text` data frame. The `tree` component contains the tree data, while the `text` component contains the text annotations.
The `simple.example` tree data frame contains information on the tree structure as well as node-by-node aesthetic customization settings (colours, edge type, etc.). The `text` data frame contains text annotations for each node.
parent | length.1 | length.2 | angle | CP | node.col | node.label.col | border.col | border.type | border.width | edge.col.1 | edge.type.1 | edge.width.2 | edge.col.2 | polygon.col |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
NA | 12 | 850 | NA | 1.00 | white | black | black | NA | NA | NA | NA | 2 | green4 | NA |
1 | 10 | 1000 | NA | 0.40 | blue2 | white | white | dotted | 2 | blue2 | dashed | 2 | green4 | NA |
1 | 15 | 1100 | 90 | 0.23 | NA | NA | NA | NA | NA | NA | NA | 2 | green4 | orange |
2 | 10 | 760 | -70 | 0.31 | NA | lightblue | lightblue | dotted | 3 | lightblue | dotted | 2 | green4 | NA |
name | node | col | fontface |
---|---|---|---|
GENE1 | 2 | red | plain |
GENE2 | 2 | black | plain |
GENE3 | 2 | blue | NA |
GENE4 | 3 | NA | italic |
GENE5 | 3 | red | plain |
The simplest input format is a single column containing the parent node of each individual node. By default, the row index is assigned as the `node.id`. Each node is restricted to one parent. The root node does not have a parent, so a value of `NA` is used. To plot the tree, we can use the `SRCGrob` function. This function returns a `grob` object that can be passed to `grid.draw` to render the plot. Alternatively, the wrapper function `create.phylogenetic.tree` will automatically render the plot or save it to a TIFF, PNG, PDF or SVG file.
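As a minimal sketch of the two routes described above, the following builds a four-node, parent-only tree and draws it both ways. The `parent.only` values are illustrative, and the plotting calls are guarded so the snippet still runs where CEV is not installed.

```r
# Four-node tree: node 1 is the root (parent = NA).
# With no node.id column, the row index serves as the node ID.
parent.only <- data.frame(parent = c(NA, 1, 1, 2));

if (requireNamespace('CancerEvolutionVisualization', quietly = TRUE)) {
    library(CancerEvolutionVisualization);

    # Route 1: build the grob, then render it explicitly with grid
    tree.grob <- SRCGrob(parent.only);
    grid::grid.newpage();
    grid::grid.draw(tree.grob);

    # Route 2: the wrapper renders the plot in a single call
    parent.only.tree <- create.phylogenetic.tree(parent.only);
    }
```

The grob route is convenient when the tree needs to be combined with other `grid` graphics, while the wrapper is the shorter path for a standalone figure.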
`node.id`, `parent` and `label` columns
With the minimal input, the tree is rendered with numeric node labels corresponding to the row index (the default `node.id`). A `node.id` column can be included in the input data frame if the IDs reported in the `parent` column do not correspond to row indexes.
node.id <- data.frame(
node.id = as.character(c(2, 5, 6, 1)),
parent = as.character(c(NA, 2, 2, 5))
);
node.id.tree <- create.phylogenetic.tree(node.id);
By default, the `node.id` will be used to label the nodes. To customize node labels, a `label` column can be included in the input data frame to override the `node.id` values.
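A short sketch of the `label` override, reusing the `node.id` example above. The label strings are made up for illustration, and the `orphans` check is a hypothetical sanity step rather than anything CEV requires.

```r
# Same structure as the node.id example, with custom display labels
labelled <- data.frame(
    node.id = as.character(c(2, 5, 6, 1)),
    parent = as.character(c(NA, 2, 2, 5)),
    label = c('MRCA', 'A', 'B', 'C')
    );

# Every non-NA parent should refer to an existing node.id
orphans <- setdiff(labelled$parent[!is.na(labelled$parent)], labelled$node.id);

if (requireNamespace('CancerEvolutionVisualization', quietly = TRUE)) {
    labelled.tree <- CancerEvolutionVisualization::create.phylogenetic.tree(labelled);
    }
```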
It's common to associate branch lengths with the values of a particular variable (for example, PGA or SNVs). Up to two branch lengths can be specified. Including a `length.1` and/or `length.2` column in the tree data frame enables this branch scaling behaviour and automatically adds a corresponding y-axis. Specifying multiple length columns results in multiple (distinctly coloured) parallel lines. For each branch, the next node is placed at the end of the longest line.
Branches are scaled automatically, but users can further scale each branch with the `scale1` and `scale2` parameters. These values scale each branch proportionally, so `scale1 = 1.5` would make the first set of branch lengths 50% longer.
The y-axes are generated automatically, and lengths of different sizes are scaled to fit the plot. The y-axis labels can be customized with the `yaxis1.label` and `yaxis2.label` parameters.
The default axis tick positions can be overridden with the `yat` parameter. This expects a list of vectors, each corresponding to the ticks on one y-axis.
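The branch-length options above can be combined in one call. The sketch below assumes the length values from the `simple.example` table; the axis labels and tick positions are placeholders chosen for illustration, not recommended settings.

```r
# Two branch lengths per node, taken from the simple.example table
branch.tree <- data.frame(
    parent = c(NA, 1, 1, 2),
    length.1 = c(12, 10, 15, 10),
    length.2 = c(850, 1000, 1100, 760)
    );

branch.args <- list(
    tree = branch.tree,
    scale1 = 1.5,                     # first set of branches drawn 50% longer
    yaxis1.label = 'Length 1',        # placeholder labels
    yaxis2.label = 'Length 2',
    yat = list(                       # one tick vector per y-axis (assumed values)
        seq(0, 30, by = 10),
        seq(0, 2500, by = 500)
        )
    );

if (requireNamespace('CancerEvolutionVisualization', quietly = TRUE)) {
    branch.plot <- do.call(
        CancerEvolutionVisualization::create.phylogenetic.tree,
        branch.args
        );
    }
```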
Alternatively, the y-axis can be replaced with a scale bar. Setting `scale.bar = TRUE` adds a scale bar at the top of the plot and removes the y-axis. To further customize the scale bar position and size, the following parameters are available:
- `scale.bar.coords` specifies the relative x and y coordinates of the scale bar. Both values should range from 0 to 1.
- `scale.size.{1,2}` specifies the size of the scale bar if the default is unsatisfactory.
- `scale.padding` specifies the padding between the scale bars if multiple scale bars are present.
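A hedged sketch of the scale bar options: the coordinates below are an arbitrary illustration of the 0 to 1 range, and the size and padding parameters are left at their defaults because suitable values depend on the data.

```r
scale.bar.tree <- data.frame(
    parent = c(NA, 1, 1, 2),
    length.1 = c(12, 10, 15, 10)
    );

scale.bar.args <- list(
    tree = scale.bar.tree,
    scale.bar = TRUE,                 # replaces the y-axis with a scale bar
    scale.bar.coords = c(0.5, 1)      # relative x and y position (assumed values)
    );

if (requireNamespace('CancerEvolutionVisualization', quietly = TRUE)) {
    scale.bar.plot <- do.call(
        CancerEvolutionVisualization::create.phylogenetic.tree,
        scale.bar.args
        );
    }
```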
A `CP` column containing the cellular prevalence or cancer cell fraction (CCF) of each subclone can be added to the input tree data frame. These values typically range between 0 and 1, and the sum of all child nodes must not be larger than their parent node's value. Whether you are using 'CCF', 'CP' or any other metric, make sure the x-axis label matches the metric used.
CP <- simple.example$tree[, c('parent', 'length.1', 'length.2', 'CP')];
CP.default.tree <- create.phylogenetic.tree(
CP,
xaxis.label = 'CCF'
);
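Because the sum of the child `CP` values must not exceed the parent's value, it can help to check the input before plotting. The helper below, `find.cp.violations`, is a hypothetical base R sketch and not part of CEV; it assumes node IDs are the row indexes.

```r
# Return the IDs of parents whose children's CP values sum to more than
# the parent's own CP. Assumes the row index is the node ID.
find.cp.violations <- function(tree, tolerance = 1e-8) {
    # Sum CP over children, grouped by parent (the root's NA parent is dropped)
    child.sums <- tapply(tree$CP, tree$parent, sum);
    parent.cp <- tree$CP[as.integer(names(child.sums))];

    return(names(child.sums)[child.sums > parent.cp + tolerance]);
    }

# CP values from the simple.example table
valid.tree <- data.frame(
    parent = c(NA, 1, 1, 2),
    CP = c(1.00, 0.40, 0.23, 0.31)
    );

# Children of node 1 sum to 1.1, which exceeds the parent's CP of 1
invalid.tree <- data.frame(
    parent = c(NA, 1, 1, 2),
    CP = c(1.00, 0.60, 0.50, 0.10)
    );

valid.result <- find.cp.violations(valid.tree);
invalid.result <- find.cp.violations(invalid.tree);
```

An empty result means every parent is at least as prevalent as its children combined; any IDs returned point to the parents that need correcting.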
To control the overall scale of the polygons, users can modify the `polygon.scale` parameter. The `polygon.colour.scheme` parameter can be used to specify a colour palette for the polygons. When a single colour is provided, a light-to-dark gradient is generated from it. If multiple colours are provided, the gradient transitions between the given colours. An optional `polygon.col` column can be included in the tree input to override the polygon colour scheme.
Polygon transparency can be specified in the tree input data frame using the `polygon.alpha` column.
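The polygon options could be combined as in the sketch below. The scale factor and palette are arbitrary illustrative values; only the orange override for node 3 comes from the `simple.example` table.

```r
polygon.tree <- data.frame(
    parent = c(NA, 1, 1, 2),
    CP = c(1.00, 0.40, 0.23, 0.31),
    polygon.col = c(NA, NA, 'orange', NA)   # per-node override; NA keeps the scheme
    );

polygon.args <- list(
    tree = polygon.tree,
    xaxis.label = 'CP',
    polygon.scale = 1.5,                            # assumed value
    polygon.colour.scheme = c('white', 'steelblue') # gradient between two colours
    );

if (requireNamespace('CancerEvolutionVisualization', quietly = TRUE)) {
    polygon.plot <- do.call(
        CancerEvolutionVisualization::create.phylogenetic.tree,
        polygon.args
        );
    }
```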
CEV provides several methods for refining the spacing and arrangement of a tree’s nodes. This is especially useful in complex trees, which often require more attention to avoid visual problems such as node collisions and uneven branch/level spacing. Here, we see a tree with many issues.
Consider this example tree.
An optional `spread` column can be included in the input tree `data.frame`. Spread is applied relative to the initially calculated angle, as a proportion of it:
- A `spread` value of 1 or `NA` will leave the spacing unchanged.
- A `spread` value greater than 1 will increase the space between nodes. For example, a `spread` value of 1.25 will spread the nodes 25% more.
- A `spread` value less than 1 will decrease the space between nodes. For example, a `spread` of 0.85 will spread the nodes 15% less.
To create more space for the numerous nodes in the lower levels of our example tree, we can increase the spread of the nodes at the top level. To create even more room, we can decrease the spread of some lower-level nodes where appropriate.
spread.tree.input <- complex.tree.input;
spread.tree.input$spread <- 1;
spread.tree.input$spread[2:5] <- c(2, 2, 3, 2);
spread.tree.input$spread[c(6:7, 17:18, 24:25)] <- 0.5;
spread.tree.input$spread[c(8:10, 18:21)] <- 0.75;
spread.tree.input$spread[c(11:16)] <- 1.75;
spread.tree <- create.phylogenetic.tree(spread.tree.input);
Alternatively, an `angle` column can be specified to manually set the angle of each node. Angles are specified in degrees, where 0 points opposite the parent edge. Angles can be provided in radians when `use.radians = TRUE`.
When `angle` and `spread` are both specified, `angle` takes precedence.
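As a small sketch of the two angle conventions, the following sets angles in degrees and then converts the same tree to radians. The angle values are those shown in the `simple.example` table; nodes left as `NA` are assumed to keep their automatically calculated angle.

```r
# Angles in degrees (the default interpretation)
angle.tree <- data.frame(
    parent = c(NA, 1, 1, 2),
    angle = c(NA, NA, 90, -70)
    );

# The same angles expressed in radians
radian.tree <- angle.tree;
radian.tree$angle <- angle.tree$angle * pi / 180;

if (requireNamespace('CancerEvolutionVisualization', quietly = TRUE)) {
    library(CancerEvolutionVisualization);

    degree.plot <- create.phylogenetic.tree(angle.tree);
    radian.plot <- create.phylogenetic.tree(radian.tree, use.radians = TRUE);
    }
```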
CEV currently supports two modes for visualizing phylogenetic trees: radial and dendrogram. The default mode is radial, but users can switch to dendrogram mode by setting the `mode` column to `dendrogram`.
This mode spreads nodes out radially from the root node. Examples for plotting and customizing radial trees have been shown in the previous sections.
CEV gives the user control over numerous visual aspects of the tree. By specifying optional columns and values in the tree input `data.frame`, the user has individual control of the colour, width, and line type of each node, label border, and edge.
Style | Column | Defaults |
---|---|---|
Node presence | draw.node | TRUE |
Node Label | label | node.id column |
Node Colour | node.col | white |
Node Label Colour | node.label.col | black |
Node Border Colour | border.col | black |
Node Border Width | border.width | 1 |
Node Border Line Type | border.type | solid |
Node Size | node.size | 1 |
Edge Colour | edge.col.1, edge.col.2 | black, green |
Edge Width | edge.width.1, edge.width.2 | 3, 3 |
Edge Line Type | edge.type.1, edge.type.2 | solid, solid |
Connector Colour | connector.col | black |
Connector Width | connector.width | 3 |
Connector Line Type | connector.type | solid |
Default values replace missing columns and `NA` values, allowing node-by-node and edge-by-edge control as needed. Connector parameters are set only when dendrogram mode is used. For sparsely defined values (for example, only specifying a single edge), it can be convenient to initialize a column with `NA`s, then manually assign specific nodes as needed.
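The `NA`-initialization pattern described above might look like the following in base R. The choice of node 3 and the red dashed style are arbitrary examples.

```r
styled.tree <- data.frame(parent = c(NA, 1, 1, 2));

# Start with every value unset, so defaults apply everywhere
styled.tree$edge.col.1 <- NA;
styled.tree$edge.type.1 <- NA;

# Override only the edge leading to node 3
styled.tree$edge.col.1[3] <- 'red';
styled.tree$edge.type.1[3] <- 'dashed';
```

The resulting data frame can then be passed to `create.phylogenetic.tree` as usual; rows left as `NA` fall back to the defaults in the table above.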
Valid values for line type columns are based on lattice’s values (with some additions and differences).
Line Type |
---|
NA |
'none' |
'solid' |
'dashed' |
'dotted' |
'dotdash' |
'longdash' |
'twodash' |
Complex trees may benefit from simpler visual styles. For example, there may not be room to render the node ellipses. CEV provides node-by-node control with the `draw.node` column.
nodeless <- spread.tree.input;
nodeless$draw.node <- TRUE;
n <- table(nodeless$parent);
nodeless$draw.node[nodeless$parent %in% n[n >= 4]] <- FALSE;
row | parent | spread | draw.node |
---|---|---|---|
1 | NA | 1.00 | TRUE |
2 | 1 | 2.00 | TRUE |
3 | 1 | 2.00 | TRUE |
4 | 1 | 3.00 | TRUE |
5 | 1 | 2.00 | TRUE |
6 | 2 | 0.50 | TRUE |
7 | 2 | 0.50 | TRUE |
8 | 3 | 0.75 | TRUE |
9 | 3 | 0.75 | TRUE |
10 | 3 | 0.75 | TRUE |
11 | 4 | 1.75 | FALSE |
12 | 4 | 1.75 | FALSE |
13 | 4 | 1.75 | FALSE |
14 | 4 | 1.75 | FALSE |
15 | 4 | 1.75 | FALSE |
16 | 4 | 1.75 | FALSE |
17 | 5 | 0.50 | TRUE |
18 | 5 | 0.75 | TRUE |
19 | 6 | 0.75 | FALSE |
20 | 6 | 0.75 | FALSE |
21 | 6 | 0.75 | FALSE |
22 | 7 | 1.00 | TRUE |
23 | 8 | 1.00 | TRUE |
24 | 9 | 0.50 | TRUE |
25 | 9 | 0.50 | TRUE |
Annotations can be added using a secondary data frame to specify additional text corresponding to each node.
Each row must include a node ID for the text. Text will be stacked next to the branch preceding the specified node.
simple.text.data <- simple.example$text[, c('name', 'node')];
simple.text.tree <- create.phylogenetic.tree(
tree = parent.only,
node.text = simple.text.data
);
To specify the distance between edges and text, the `node.text.line.dist` parameter can be used.
- A `col` column can be included to specify the colour of each text.
- A `fontface` column can be included to bold, italicize, etc. These values correspond to the standard R `fontface` values.
- `NA` values in each column will default to `black` and `plain` respectively.
name | node | col | fontface |
---|---|---|---|
GENE1 | 2 | red | plain |
GENE2 | 2 | black | plain |
GENE3 | 2 | blue | NA |
GENE4 | 3 | NA | italic |
GENE5 | 3 | red | plain |
The default settings should produce a reasonable baseline plot, but many users will want more control over their plot. This section highlights some of the most common parameters in `SRCGrob` that can be passed in through `create.phylogenetic.tree`.
The most recent common ancestor (MRCA) of all malignant subclones is a descendant of normal or germline cells. The normal node is added to the tree by setting the `add.normal` parameter to `TRUE`. The size of this node can be specified with the `normal.cex` parameter.
Some plots require more or less horizontal padding between the x-axes and the tree itself. The `horizontal.padding` parameter scales the default padding proportionally. For example, `horizontal.padding = -0.8` would reduce the padding by 80%.
The main title of the plot is referred to as `main` in plot parameters. `main` sets the title text, `main.cex` sets the font size, and `main.y` is used to move the main title up if more space is required for the plot.
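The plot-level options in this section can be passed together through `create.phylogenetic.tree`. This is a sketch with assumed values: the title text, `normal.cex` and `main.cex` settings are illustrative, and `main.y` is omitted because a sensible offset depends on the plot.

```r
plot.level.tree <- data.frame(parent = c(NA, 1, 1, 2));

plot.level.args <- list(
    tree = plot.level.tree,
    add.normal = TRUE,            # draw the normal/germline node above the MRCA
    normal.cex = 1.2,             # assumed size for the normal node
    horizontal.padding = -0.8,    # reduce the default padding by 80%
    main = 'Example tree',
    main.cex = 1.5
    );

if (requireNamespace('CancerEvolutionVisualization', quietly = TRUE)) {
    plot.level.plot <- do.call(
        CancerEvolutionVisualization::create.phylogenetic.tree,
        plot.level.args
        );
    }
```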
The `create.phylogenetic.tree` function can save the plot to a file by specifying the `filename` parameter. The file type is determined by the file extension. Supported file types are TIFF, PNG, PDF, and SVG. If a file extension is not specified, or is not one of the supported formats, CEV will use the default TIFF format. Below are the available parameters, some of which are only applicable to certain file formats.
Parameters | Description | TIFF | PNG | SVG | PDF |
---|---|---|---|---|---|
filename | Path to output file | x | x | x | x |
width | Plot width | x | x | x | x |
height | Plot height | x | x | x | x |
units | Units for dimensions | x | x | | |
res | Resolution | x | x | | |
bg | Background colour | x | x | | |
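A sketch of saving to PNG with the parameters listed above. The dimensions, units and resolution are assumed values typical of R bitmap devices rather than CEV defaults, and the file is written to a temporary directory.

```r
# The extension decides the output format; here it is PNG
output.file <- file.path(tempdir(), 'example-tree.png');
output.format <- tools::file_ext(output.file);

save.args <- list(
    tree = data.frame(parent = c(NA, 1, 1, 2)),
    filename = output.file,
    width = 6,        # assumed size, interpreted in `units`
    height = 6,
    units = 'in',     # only applicable to TIFF and PNG
    res = 300         # only applicable to TIFF and PNG
    );

if (requireNamespace('CancerEvolutionVisualization', quietly = TRUE)) {
    saved.plot <- do.call(
        CancerEvolutionVisualization::create.phylogenetic.tree,
        save.args
        );
    }
```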
The input data frame for visualizing CCF distribution should contain the following columns:
- `ID`: Sample identifier
- `SNV.id`: Unique SNV identifier, typically in the format `chr_pos_ref_alt`
- `CCF`: Cancer cell fraction (CCF) of the SNV in the sample
- `clone.id`: Unique identifier for the subclone to which the SNV is assigned
ID | SNV.id | CCF | clone.id |
---|---|---|---|
LN1 | 10_102747966_G_A | 1.08 | MRCA |
LN2 | 10_102747966_G_A | 1.25 | MRCA |
LN3 | 10_102747966_G_A | 1.35 | MRCA |
R1 | 10_102747966_G_A | 1.16 | MRCA |
R2 | 10_102747966_G_A | 0.87 | MRCA |
The `create.ccf.heatmap` function is a wrapper for `BBoutrosLab.plotting.general::create.heatmap`, which creates a heatmap of the SNV CCF distribution across multiple samples. The input must be a 2D array containing CCF values, where each row represents an SNV and each column represents a sample. Users can use the `data.frame.to.array` function to convert the input data frame to the required array format.
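To make the required shape concrete, the base R sketch below reshapes a long-format table into an SNV-by-sample array. It illustrates the target layout only and is not the implementation of `data.frame.to.array`; the toy values and the second SNV identifier are made up.

```r
# Toy long-format SNV table (values are illustrative, not real data)
toy.snv <- data.frame(
    ID = rep(c('LN1', 'LN2', 'R1'), times = 2),
    SNV.id = rep(c('10_102747966_G_A', '3_5000_C_T'), each = 3),
    CCF = c(1.08, 1.25, 1.16, 0.42, 0.00, 0.37),
    clone.id = rep(c('MRCA', 'C2'), each = 3)
    );

# Rows are SNVs and columns are samples; each cell is that SNV's CCF
ccf.array <- tapply(
    X = toy.snv$CCF,
    INDEX = list(SNV.id = toy.snv$SNV.id, ID = toy.snv$ID),
    FUN = mean
    );
```

Each SNV and sample pair appears once in this toy table, so `mean` simply returns the single CCF value for that cell.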
To visualize the CCF distribution across subclones, the `create.cluster.heatmap` function can be used. This function creates a heatmap of the SNV CCF values, ordered by the cluster to which they are assigned. To limit CCF values to a certain range, the `ccf.limits` parameter can be used.
The `create.ccf.summary.heatmap` function creates a summary plot of the CCF distribution across samples. The plot shows the median CCF values for each sample and subclone, as well as the number of unique SNVs detected in a patient and assigned to each subclone. To specify an order for the samples and subclones, the `sample.order` and `clone.order` parameters can be used.
# Calculate median CCF per sample
median.ccf <- aggregate(CCF ~ ID + clone.id, data = snv, FUN = median);
names(median.ccf)[names(median.ccf) == 'CCF'] <- 'median.ccf.per.sample';
snv <- merge(snv, median.ccf, by = c('ID', 'clone.id'));
create.ccf.summary.heatmap(
snv,
ccf.limits = c(0, 1),
clone.order = levels(snv$clone.id),
sample.order = levels(snv$ID)
);
To survey the genome-wide distribution of SNVs across subclones, the `create.clone.genome.distribution.plot` function can be used. In the scatterplot, each point is an SNV and is coloured based on the subclone it is assigned to, which is also visualized as a density plot in the bottom plot.
Legend position can be defined using the `legend.x` and `legend.y` parameters. By default, legends are plotted inside the plot area. To move the legend outside the plot area, to its right, set `legend.x` to a value greater than 1.
# Subset for clone size > 10
clone.nsnv <- aggregate(
SNV.id ~ clone.id,
data = snv,
FUN = function(x) length(unique(x))
);
large.clone <- clone.nsnv[clone.nsnv$SNV.id > 10, 'clone.id'];
sub.snv <- snv[snv$clone.id %in% large.clone, ];
sub.snv$clone.id <- factor(sub.snv$clone.id, levels = large.clone);
create.clone.genome.distribution.plot(
sub.snv,
legend.x = 1.1
);
## [1] "Plotting clone distribution across the genome for sample: all"