UPGMA (unweighted pair group method with arithmetic mean)
The UPGMA is the simplest distance-matrix method, and it employs sequential clustering to build a rooted phylogenetic tree. It first compares all sequences through pairwise alignment to compute the distance matrix.
Given a distance matrix, it starts by grouping two taxa with the smallest pairwise distance in the distance matrix. A node is placed at the midpoint or half distance between them. It then creates a reduced matrix by treating the new cluster as a single taxon. The distances between this new composite taxon and all remaining taxa are calculated to create a reduced matrix. The same grouping process is repeated and another newly reduced matrix is created. The iteration continues until all taxa are placed on the tree. The last taxon added is considered the outgroup producing a rooted tree. The basic assumption of the UPGMA method is that all taxa evolve at a constant rate and that they are equally distant from the root, implying that a molecular clock is in effect. Thus, UPGMA often produces erroneous tree topologies. However, owing to its fast speed of calculation, it has found extensive usage in clustering analysis of DNA microarray data.
The great disadvantage of UPGMA is that it assumes the same evolutionary speed on all lineages, i.e. the rate of mutations is constant over time and for all lineages in the tree. This is called a ‘molecular clock hypothesis’.
This would mean that all leaves (terminal nodes) have the same distance from the root. In reality the individual branches are very unlikely to have the same mutation rate. Therefore, UPGMA frequently generates wrong tree topologies.
The UPGMA algorithm
- UPGMA starts with a matrix of pairwise distances D[1..n, 1..m]
- In this method each sample (taxon, operational taxonomic unit=OTU) is shown as a ‘cluster’
- It starts by assigning all clusters (samples) to a star-like tree
- Find that pair (cluster i and j) with the smallest distance value in the distance matrix: D[i,j]
- Define a new cluster comprising cluster i and j
- Cluster i is connected by a branch to the common ancestor node. The same applies for cluster j.
Therefore, the distance D[i,j] is split onto the two branches. So, each of the two branches obtains a length of D[i,j]/2.
- If i and j were the last 2 clusters, the tree is finished. If not the algorithm finds a new cluster called u.
- Define the distance from u to each other cluster (k, with k <> i or j) to be an average of the distances dki and dkj.
For ‘Weighted PGMA (WPGM)’: dku = dki+dkj/2).
For ‘Complete linkage’: dku = max(dki, dkj).
For ‘Single linkage’: dku = min(dki, dkj).
- Delete columns and rows corresponding to I and j and add one for (ij). If there are two or more groups left, go back to step 1 with one less cluster. Clusters i and j are eliminated, and cluster u is added to the tree.