
## 2.2.3.2 Hierarchical Clustering

Given a dissimilarity matrix (computed from Euclidean distances, for example), a hierarchical clustering procedure works iteratively: at each step it joins the two most similar individuals (genes) or clusters.

At each iteration, the dissimilarities between the newly formed cluster and the remaining genes or clusters are recalculated according to the chosen linkage criterion (Ward, UPGMA, etc.; these can all be expressed as Lance-Williams updates). In the final tree, genes with similar expression patterns are grouped together.
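As a brief illustration of how these recalculations relate to one another, the Lance-Williams update can be written as below. The notation is an assumption introduced here, not taken from the original text: $i$ and $j$ are the two clusters just merged, $k$ is any other cluster, $n_i$ and $n_j$ are cluster sizes, and $d$ is the dissimilarity.

$$
d(k,\, i \cup j) = \alpha_i \, d(k, i) + \alpha_j \, d(k, j) + \beta \, d(i, j) + \gamma \, \lvert d(k, i) - d(k, j) \rvert
$$

For UPGMA, $\alpha_i = n_i / (n_i + n_j)$, $\alpha_j = n_j / (n_i + n_j)$ and $\beta = \gamma = 0$; for WPGMA, $\alpha_i = \alpha_j = 1/2$ and $\beta = \gamma = 0$. Other criteria, such as Ward's, correspond to other choices of the coefficients.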

```r
# R script
library(ggplot2)

# calculate the dissimilarity matrix (fpkm.table: genes in rows, samples in columns)
d.mat <- dist(fpkm.table, method = "euclidean")

# cluster genes
hclust.dmat <- hclust(d.mat, method = "ward.D2")

# plot the tree
plot(hclust.dmat)

# create clusters by cutting the tree
# first, observe how the tree is cut for different thresholds; try different k values
rect.hclust(hclust.dmat, k = 5)

# get cluster composition and size
cluster.hclust.dmat <- cutree(hclust.dmat, k = 5)
cluster.size <- sapply(levels(as.factor(cluster.hclust.dmat)), function(x) length(which(cluster.hclust.dmat == x)))
names(cluster.size) <- paste("Cluster", levels(as.factor(cluster.hclust.dmat)), sep = "")

# plot cluster expression profiles (mean expression per sample within each cluster)
mean.clust <- lapply(1:nlevels(as.factor(cluster.hclust.dmat)), function(x) {
  cbind.data.frame(
    "Mean" = apply(fpkm.table[which(cluster.hclust.dmat == x), , drop = FALSE], 2, mean),
    "Sample" = colnames(fpkm.table),
    "Cluster" = rep(paste(x, cluster.size[x], sep = "_"), ncol(fpkm.table)))
})
mean.clust.table <- do.call("rbind", mean.clust)
p <- ggplot(mean.clust.table, aes(x = Sample, y = Mean))
p + geom_point() + geom_line(aes(group = Cluster)) + facet_wrap(~ Cluster, ncol = 4)

# retrieve annotation for a given cluster
# (clusternumber is a placeholder for the cluster of interest; annot.ok is the
# annotation table, with gene identifiers in its first column)
tmp.list <- names(which(cluster.hclust.dmat == clusternumber))
annot.tmp <- sapply(tmp.list, function(x) which(annot.ok[, 1] %in% x)[1])
annot.tmp.2 <- annot.ok[annot.tmp, ]

# export as a CSV file, readable in Microsoft Excel or LibreOffice Calc
write.csv(annot.tmp.2, "annotation_genes_clusternumber.csv")
```

In addition, agglomerative clustering may also be explored with the `agnes` function from the cluster package. Several linkage methods are available, some of them with parameters (arguments to the method) that may be fine-tuned to improve the clustering.
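As a minimal sketch of what such an exploration might look like, the example below runs `agnes` with two linkage methods and compares their agglomerative coefficients. The `toy` matrix is simulated data invented for illustration (it stands in for an expression table), and the value `par.method = 0.625` and the choice of `k = 3` are arbitrary assumptions, not recommendations.

```r
library(cluster)

# Simulated stand-in for an expression table: 20 "genes" x 3 "samples" (hypothetical data)
set.seed(1)
toy <- matrix(rnorm(60), nrow = 20,
              dimnames = list(paste0("gene", 1:20), paste0("S", 1:3)))

# Average linkage (UPGMA) on Euclidean distances
ag.avg <- agnes(toy, metric = "euclidean", method = "average")

# Flexible linkage; par.method is the tunable parameter of this method
ag.flex <- agnes(toy, metric = "euclidean", method = "flexible", par.method = 0.625)

# Agglomerative coefficient: values closer to 1 suggest stronger clustering structure
print(c(average = ag.avg$ac, flexible = ag.flex$ac))

# Convert to an hclust object to reuse cutree() as in the script above
clusters.flex <- cutree(as.hclust(ag.flex), k = 3)
print(table(clusters.flex))
```

Comparing the coefficient across methods and parameter values is one possible way to guide the tuning; note that the coefficient tends to grow with the number of observations, so it is best compared between runs on the same data.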

## UPGMA analysis

An example with the Pair Group Method

Given a matrix of pairwise distances among taxa, cluster analysis attempts to represent this information in a diagram called a phenogram that expresses the overall similarities among taxa. The Pair Group Method uses the following algorithm (a repetitive process for accomplishing a task): (1) identify the minimum distance between any two taxa, (2) combine these two taxa as a single pair, (3) recalculate the average distance between this pair and all other taxa to form a new matrix, (4) identify the closest pair in the new matrix, and (5) so on, until the last two clusters are joined.

Consider five taxa (A, B, C, D, E) with the following distance matrix (the data could be molecular or morphological distances):

|   | A | B | C | D | E |
|---|---|---|---|---|---|
| A | 0 | — | — | — | — |
| B | 20 | 0 | — | — | — |
| C | 60 | 50 | 0 | — | — |
| D | 100 | 90 | 40 | 0 | — |
| E | 90 | 80 | 50 | 30 | 0 |

A & B are closest (20 units): join them into one cluster (AB) joining at 20, and recalculate the average distance from C, D, and E to (AB). [For example, the distance from C to (AB) = (60 + 50)/2 = 55, and the distance from D to (AB) = (100 + 90)/2 = 95.] This gives:

|   | (AB) | C | D | E |
|---|---|---|---|---|
| (AB) | 0 | — | — | — |
| C | 55 | 0 | — | — |
| D | 95 | 40 | 0 | — |
| E | 85 | 50 | 30 | 0 |

D & E are closest (30 units): join them into one cluster (DE) joining at 30, and recalculate the average distances between (AB), C, and (DE). [For example, the distance from (AB) to (DE) = (95 + 85)/2 = 90.] This gives:

|   | (AB) | C | (DE) |
|---|---|---|---|
| (AB) | 0 | — | — |
| C | 55 | 0 | — |
| (DE) | 90 | 45 | 0 |

C & (DE) are closest (45 units): join them into one cluster (CDE) joining at 45, and recalculate the average distance between (CDE) and (AB) [(55 + 90)/2 = 72.5]. This gives:

|   | (AB) | (CDE) |
|---|---|---|
| (AB) | 0 | — |
| (CDE) | 72.5 | 0 |

The two clusters join at 72.5. This completes the analysis.
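The steps of the worked example can be sketched in a few lines of R. This is an illustrative implementation written for this example only: the function name `pair.group` and its internals are assumptions, and it uses the simple averaging shown in the tables (each merged cluster's distance is the mean of the two rows being joined).

```r
# Distance matrix for the five taxa of the worked example (symmetric form)
taxa <- c("A", "B", "C", "D", "E")
m <- matrix(0, 5, 5, dimnames = list(taxa, taxa))
m[lower.tri(m)] <- c(20, 60, 100, 90, 50, 90, 80, 40, 50, 30)
m <- m + t(m)

# Hypothetical helper: pair group method with simple averaging of the joined rows
pair.group <- function(m) {
  joins <- data.frame(cluster = character(0), height = numeric(0),
                      stringsAsFactors = FALSE)
  while (nrow(m) > 1) {
    # (1) find the closest pair, ignoring the zero diagonal
    dm <- m
    diag(dm) <- Inf
    idx <- which(dm == min(dm), arr.ind = TRUE)[1, ]
    i <- min(idx)
    j <- max(idx)
    # (2) combine the pair under a single label such as "(AB)"
    label <- paste0("(", gsub("[()]", "", paste0(rownames(m)[i], rownames(m)[j])), ")")
    joins <- rbind(joins, data.frame(cluster = label, height = m[i, j],
                                     stringsAsFactors = FALSE))
    # (3) average the two rows to get distances from the new cluster to all others
    new.dist <- (m[i, ] + m[j, ]) / 2
    m[i, ] <- new.dist
    m[, i] <- new.dist
    m[i, i] <- 0
    rownames(m)[i] <- colnames(m)[i] <- label
    # (4) drop the absorbed row and column, then repeat on the smaller matrix
    m <- m[-j, -j, drop = FALSE]
  }
  joins
}

res <- pair.group(m)
print(res)  # joins at heights 20, 30, 45 and 72.5
```

The loop stops once a single cluster remains, so the last row of `res` records the final join of (AB) with (CDE).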

The method illustrated is a Weighted PGM with Averaging (WPGMA). See the commentary on calculations for the difference between weighted and unweighted analyses (WPGMA and UPGMA).
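To see the difference between the two variants on the same five-taxon matrix, one option is R's `hclust`, where `method = "mcquitty"` corresponds to WPGMA and `method = "average"` to UPGMA. The sketch below is an illustration added here, not part of the original analysis; the object names are arbitrary.

```r
# Distance matrix for taxa A-E from the worked example
taxa <- c("A", "B", "C", "D", "E")
m <- matrix(0, 5, 5, dimnames = list(taxa, taxa))
m[lower.tri(m)] <- c(20, 60, 100, 90, 50, 90, 80, 40, 50, 30)
m <- m + t(m)
d <- as.dist(m)

# WPGMA: each merged cluster counts equally, whatever its size
wpgma <- hclust(d, method = "mcquitty")

# UPGMA: averages are weighted by the number of taxa in each cluster
upgma <- hclust(d, method = "average")

print(wpgma$height)  # 20 30 45 72.5
print(upgma$height)  # 20 30 45 78.33333
```

The two methods agree until the last join. WPGMA averages the two cluster-level distances, (55 + 90)/2 = 72.5, whereas UPGMA averages all six pairwise distances between {A, B} and {C, D, E}, (60 + 50 + 100 + 90 + 90 + 80)/6 ≈ 78.3. Calling `plot(wpgma, hang = -1)` draws the corresponding tree.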

These results may be presented as a phenogram with nodes at 20, 30, 45, and 72.5 units. The phenogram can be interpreted as indicating that A & B are similar to each other, as are D & E, and that C is more similar to D & E than to A & B.

Text material © 2007 by Steven M. Carr