UPGMA — an overview | ScienceDirect Topics

2.2.3.2 Hierarchical Clustering

Given a dissimilarity matrix (computed from Euclidean distances, for example), a hierarchical clustering procedure works iteratively, joining the most similar individuals (genes) into clusters at each step.

At each iteration, the dissimilarities between the newly formed cluster and the remaining genes are recalculated according to the chosen method (Ward, UPGMA, Lance-Williams, etc.). In the final tree, genes with similar expression patterns are grouped together.
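For reference, the recalculation step can be written with the Lance-Williams recurrence, a standard result that the text names but does not spell out. The symbols are assumptions introduced here: clusters i and j (of sizes n_i and n_j) have just been merged, and k is any remaining cluster.

```latex
d(k,\, i \cup j) = \alpha_i\, d(k,i) + \alpha_j\, d(k,j) + \beta\, d(i,j) + \gamma\, \lvert d(k,i) - d(k,j) \rvert

% UPGMA (unweighted average):  \alpha_i = \frac{n_i}{n_i + n_j},\quad \alpha_j = \frac{n_j}{n_i + n_j},\quad \beta = \gamma = 0
% WPGMA (weighted average):    \alpha_i = \alpha_j = \tfrac{1}{2},\quad \beta = \gamma = 0
```

Each clustering method then corresponds to one choice of the coefficients.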

# R script

# calculate the dissimilarity matrix (fpkm.table: genes in rows, samples in columns)
d.mat <- dist(fpkm.table, method="euclidean")

# cluster genes
hclust.dmat <- hclust(d.mat, method="ward.D2")

# plot tree
plot(hclust.dmat)

# create clusters by cutting the tree
# first, observe how the tree is cut for different thresholds; try different k values
# (rect.hclust draws boxes on the tree that is currently plotted)
rect.hclust(hclust.dmat, k=5)

# get cluster composition and size
cluster.hclust.dmat <- cutree(hclust.dmat, k=5)
cluster.size <- sapply(levels(as.factor(cluster.hclust.dmat)), function(x) length(which(cluster.hclust.dmat==x)))
names(cluster.size) <- paste("Cluster", levels(as.factor(cluster.hclust.dmat)), sep="")

# plot cluster expression profiles (requires the ggplot2 package)
library(ggplot2)

# mean expression per sample for each cluster; panels are labelled "cluster_size"
mean.clust <- lapply(1:nlevels(as.factor(cluster.hclust.dmat)), function(x){
  cbind.data.frame("Mean"=apply(fpkm.table[which(cluster.hclust.dmat==x), , drop=FALSE], 2, mean),
                   "Sample"=colnames(fpkm.table),
                   "Cluster"=rep(paste(x, cluster.size[x], sep="_"), ncol(fpkm.table)))})
mean.clust.table <- do.call("rbind", mean.clust)

p <- ggplot(mean.clust.table, aes(x=Sample, y=Mean))
p + geom_point() + geom_line(aes(group=Cluster)) + facet_wrap(~ Cluster, ncol=4)

# retrieve annotation for a given cluster
# (clusternumber must hold the cluster of interest; annot.ok is the annotation table, gene identifiers in column 1)
tmp.list <- names(which(cluster.hclust.dmat==clusternumber))
annot.tmp <- sapply(tmp.list, function(x) which(annot.ok[,1] %in% x)[1])
annot.tmp.2 <- annot.ok[annot.tmp,]

# export as a CSV file, readable in Microsoft Excel or LibreOffice Calc
write.csv(annot.tmp.2, "annotation_genes_clusternumber.csv")
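
The script above assumes that fpkm.table already exists. To try the clustering steps without real data, a minimal sketch with a simulated table may help. The simulated values, the gene and sample names, and the choice of k = 3 are assumptions made for illustration only.

```r
# Simulated stand-in for fpkm.table (hypothetical data, not from the source):
# 30 genes x 6 samples, built from three distinct expression patterns.
set.seed(1)
patterns <- rbind(c(1, 1, 1, 8, 8, 8),
                  c(8, 8, 8, 1, 1, 1),
                  c(1, 8, 1, 8, 1, 8))
fpkm.table <- patterns[rep(1:3, each = 10), ] +
  matrix(rnorm(30 * 6, sd = 0.3), nrow = 30)
rownames(fpkm.table) <- paste0("gene", 1:30)
colnames(fpkm.table) <- paste0("sample", 1:6)

# Same steps as in the script: distances, Ward clustering, tree cut
d.mat <- dist(fpkm.table, method = "euclidean")
hclust.dmat <- hclust(d.mat, method = "ward.D2")
cluster.hclust.dmat <- cutree(hclust.dmat, k = 3)

# Cluster sizes; table() is a compact alternative to the sapply() call
cluster.size <- table(cluster.hclust.dmat)
print(cluster.size)
```

Because the three simulated patterns are far apart relative to the noise, each cluster should contain the ten genes generated from one pattern.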

In addition, agglomerative clustering may also be tested for exploratory purposes with the agnes function from the cluster package. Many dissimilarity measures are available, with parameters (arguments to the dissimilarity methods) that may be fine-tuned to improve clustering.
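
As a rough illustration of agnes, the sketch below compares several linkage methods on a small simulated matrix by their agglomerative coefficient. The toy data and the variable names x, ac, ag and groups are assumptions, not part of the original script.

```r
library(cluster)  # recommended package, normally installed alongside R

# Hypothetical toy data: two well-separated groups of 5 observations each
set.seed(42)
x <- rbind(matrix(rnorm(10, mean = 0, sd = 0.2), ncol = 2),
           matrix(rnorm(10, mean = 5, sd = 0.2), ncol = 2))
rownames(x) <- paste0("g", 1:10)

# Compare linkage methods through the agglomerative coefficient ($ac);
# values closer to 1 suggest a stronger clustering structure
methods <- c("average", "single", "complete", "ward")
ac <- sapply(methods, function(m) agnes(x, metric = "euclidean", method = m)$ac)
print(round(ac, 3))

# Cut the average-linkage (UPGMA) tree into two groups
ag <- agnes(x, metric = "euclidean", method = "average")
groups <- cutree(as.hclust(ag), k = 2)
print(groups)
```

The agglomerative coefficient tends to grow with the number of observations, so it is better suited to comparing methods on the same data set than to comparing data sets of different sizes.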

UPGMA analysis

Cluster Analysis: an example with the Pair Group Method

Given a matrix of pairwise distances among taxa, cluster analysis attempts to represent this information in a diagram called a phenogram that expresses the overall similarities among taxa. The Pair Group Method uses the following algorithm [a repetitive process for accomplishing a task]: (1) identify the minimum distance between any two taxa, (2) combine these two taxa as a single pair, (3) recalculate the average distance between this pair and all other taxa to form a new matrix, (4) identify the closest pair in the new matrix, and (5) repeat until the last two clusters are joined.
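
The five steps can be sketched as a short R function. This is an illustrative implementation using simple averaging at each join, not code from the original page. The function name pair_group and the three-taxon matrix (X, Y, Z) are hypothetical.

```r
# Pair group clustering with simple averaging of the two old distances.
# `d` is a symmetric distance matrix with row and column names.
pair_group <- function(d) {
  d <- as.matrix(d)
  diag(d) <- Inf                       # ignore self-distances when searching
  labels <- character(0)
  heights <- numeric(0)
  while (nrow(d) > 1) {
    # Steps 1 and 4: find the closest pair in the current matrix
    idx <- which(d == min(d), arr.ind = TRUE)[1, ]
    i <- min(idx)
    j <- max(idx)
    h <- d[i, j]
    # Step 2: combine the two members into one cluster
    new.name <- paste0("(", rownames(d)[i], rownames(d)[j], ")")
    # Step 3: distance from the new cluster to each other cluster is the
    # average of the two old distances
    new.dist <- (d[i, ] + d[j, ]) / 2
    d[i, ] <- new.dist
    d[, i] <- new.dist
    d[i, i] <- Inf
    rownames(d)[i] <- colnames(d)[i] <- new.name
    d <- d[-j, -j, drop = FALSE]       # the merged cluster now lives in slot i
    labels <- c(labels, new.name)
    heights <- c(heights, h)
  }                                    # Step 5: repeat until one cluster is left
  data.frame(cluster = labels, height = heights)
}

# Hypothetical three-taxon matrix: X-Y = 2, X-Z = 6, Y-Z = 8
taxa <- c("X", "Y", "Z")
toy <- matrix(c(0, 2, 6,
                2, 0, 8,
                6, 8, 0), nrow = 3, dimnames = list(taxa, taxa))
result <- pair_group(toy)
print(result)
```

In this toy case X and Y join first at 2, and Z then joins (XY) at (6 + 8)/2 = 7. Ties are broken by taking the first minimum found, which is an arbitrary choice.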

Consider five taxa (A, B, C, D, E) with the following distance matrix (the data could be molecular or morphological distances):

       A     B     C     D     E
A      0
B     20     0
C     60    50     0
D    100    90    40     0
E     90    80    50    30     0

A & B are closest (20 units): join them into one cluster (AB), joining at 20, and recalculate the average distance from C, D, and E to (AB). [For example, the distance from C to (AB) = (60 + 50)/2 = 55, and the distance from D to (AB) = (100 + 90)/2 = 95.] This gives:

      (AB)     C     D     E
(AB)     0
C       55     0
D       95    40     0
E       85    50    30     0

D & E are closest (30 units): join them into one cluster (DE), joining at 30, and recalculate the average distances between (AB), C, and (DE). [For example, the distance from (AB) to (DE) = (95 + 85)/2 = 90.] This gives:

      (AB)     C   (DE)
(AB)     0
C       55     0
(DE)    90    45     0

C & (DE) are closest (45 units): join them into one cluster (CDE), joining at 45, and recalculate the average distance between (CDE) and (AB): (55 + 90)/2 = 72.5. This gives:

       (AB)  (CDE)
(AB)      0
(CDE)  72.5      0

The two clusters join at 72.5. This completes the analysis.

The method illustrated is a Weighted PGM with Averaging (WPGMA). See the commentary on calculations for the difference between weighted and unweighted analyses (WPGMA and UPGMA).
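
The difference only appears at the last join in this example, where the merged clusters C and (DE) have unequal sizes. A small arithmetic sketch, with variable names chosen here for illustration:

```r
# Distances to (AB) taken from the worked example
d_AB_C  <- 55   # (AB) to C
d_AB_DE <- 90   # (AB) to (DE)

# Number of taxa in each of the two clusters being merged into (CDE)
n_C  <- 1
n_DE <- 2

# WPGMA: simple average of the two cluster-level distances
wpgma <- (d_AB_C + d_AB_DE) / 2

# UPGMA: average weighted by cluster size, which equals the mean of all six
# taxon-to-taxon distances between {A, B} and {C, D, E}
upgma <- (n_C * d_AB_C + n_DE * d_AB_DE) / (n_C + n_DE)
upgma_check <- mean(c(60, 100, 90, 50, 90, 80))

print(c(WPGMA = wpgma, UPGMA = upgma))
```

WPGMA gives 72.5, as in the example, whereas UPGMA gives 470/6, about 78.3. The names can be confusing: in the "unweighted" method every original taxon counts equally, while in the "weighted" method the single taxon C counts as much as D and E together.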

These results may be presented as a phenogram with nodes at 20, 30, 45, and 72.5 units. The phenogram can be interpreted as indicating that A & B are similar to each other, as are D & E, and that C is more similar to D & E than to A & B.
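
One way to reproduce the phenogram is with hclust in base R, where method = "mcquitty" is the name used for WPGMA. The sketch below enters the five-taxon matrix from the example; the variable names are assumptions.

```r
# Five-taxon distance matrix from the worked example (lower triangle,
# filled column by column)
taxa <- c("A", "B", "C", "D", "E")
m <- matrix(0, 5, 5, dimnames = list(taxa, taxa))
m[lower.tri(m)] <- c(20, 60, 100, 90,   # A to B, C, D, E
                     50, 90, 80,        # B to C, D, E
                     40, 50,            # C to D, E
                     30)                # D to E
d <- as.dist(m)   # as.dist() reads the lower triangle

# WPGMA clustering; the merge heights are the node positions
wpgma <- hclust(d, method = "mcquitty")
print(wpgma$height)   # 20.0 30.0 45.0 72.5

# Draw the phenogram with all taxa aligned at the baseline
plot(wpgma, hang = -1, main = "WPGMA phenogram",
     xlab = "Taxa", sub = "", ylab = "Distance")

# Cutting the tree into two groups separates (AB) from (CDE)
groups <- cutree(wpgma, k = 2)
print(groups)
```

Replacing "mcquitty" with "average" would give the UPGMA tree instead.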


Text material © 2007 by Steven M. Carr