Hierarchical Clustering in Machine Learning

Mohit Uniyal


Hierarchical clustering is a powerful unsupervised machine learning algorithm used to group data points into a hierarchy of clusters. It is particularly useful when the number of clusters is not predefined, and it helps to visualize the data’s structure through a dendrogram, which represents the nested clustering relationships.

Hierarchical clustering finds applications across various domains, such as biology, marketing, and social network analysis, where understanding relationships between data points is crucial. This guide will explore the types of hierarchical clustering, how they work, and their advantages over other clustering methods, along with a step-by-step Python implementation.

Why Hierarchical Clustering?

Hierarchical clustering is often preferred over other clustering methods, such as K-means, because it does not require specifying the number of clusters upfront. It also provides more flexibility in terms of visualizing data at different levels of granularity, making it suitable for hierarchical relationships like those seen in biological taxonomy.

Advantages:

  • No predefined clusters: Hierarchical clustering does not require specifying the number of clusters in advance.
  • Dendrogram for visualization: It allows the creation of dendrograms, providing a visual representation of clusters at various levels.
  • Cluster hierarchy: The hierarchical nature helps reveal nested clusters.

Limitations:

  • Computational complexity: Hierarchical clustering can be computationally expensive for large datasets.
  • Inflexibility: Once a merge or split is made, it cannot be undone.

Despite its limitations, hierarchical clustering excels in applications where understanding the hierarchical relationships between data points is important.

Types of Hierarchical Clustering

Hierarchical clustering can be classified into two main types:

Agglomerative Hierarchical Clustering:

The agglomerative approach is a bottom-up method where each data point starts as its own cluster, and the two closest clusters are merged iteratively until only one cluster remains. Which clusters count as “closest” is determined by a linkage criterion such as single linkage, complete linkage, or Ward’s method.

Divisive Hierarchical Clustering:

The divisive approach is the opposite, using a top-down method. It starts with all data points in a single cluster and recursively splits them into smaller clusters. This method is less commonly used due to its complexity.

Comparison:

Agglomerative clustering is more popular due to its simplicity and lower computational demands compared to divisive clustering. Divisive clustering may be more appropriate in cases where a top-down partitioning of data is needed.

Agglomerative Hierarchical Clustering

Agglomerative hierarchical clustering starts with each data point as its own cluster and repeatedly merges the closest clusters based on a chosen distance metric. The process continues until a single cluster encompasses all data points or until the desired number of clusters is reached.

Example:

Consider a dataset with five points. The algorithm begins by calculating the pairwise distances between all points and merges the two closest points into a cluster. This process repeats, merging the nearest clusters at each step, until only one cluster remains.
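
To make this concrete, here is a small self-contained sketch of such a five-point example using SciPy (the coordinates are made up for illustration). Each row of the resulting linkage matrix records one merge: the indices of the two clusters joined, the distance at which they were merged, and the size of the new cluster.

import numpy as np
from scipy.cluster.hierarchy import linkage

# Five illustrative 2-D points: two tight pairs and one outlier
points = np.array([[0.0, 0.0], [0.1, 0.2], [3.0, 3.0], [3.1, 2.9], [6.0, 0.5]])

# Single linkage: repeatedly merge the two clusters whose closest points are nearest
Z = linkage(points, method='single')
print(Z)   # one row per merge: cluster i, cluster j, merge distance, new cluster size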

Agglomerative clustering is computationally less expensive than divisive clustering, making it the more commonly used method. The final result is a dendrogram, which shows the hierarchy of clusters and can be used to determine the optimal number of clusters by cutting the dendrogram at a certain height.

How Agglomerative Hierarchical Clustering Works

The agglomerative hierarchical clustering algorithm follows a structured process:

Step-by-Step Process:

  1. Initial Clusters: Each data point starts as its own cluster.
  2. Merging Clusters: At each iteration, the two closest clusters are merged based on a distance metric.
  3. Stopping Criteria: The process stops when a single cluster remains or when the desired number of clusters is reached.

Distance Measures:

Different linkage criteria are used to calculate the distance between clusters:

  • Single Linkage: The distance between the two closest points in different clusters.
  • Complete Linkage: The distance between the two farthest points in different clusters.
  • Average Linkage: The average distance between all points in two different clusters.
  • Ward’s Method: Merges the pair of clusters that yields the smallest increase in total within-cluster variance; it is often preferred for its tendency to create compact clusters of similar size.

Distance Measurement Example:

In single linkage, if clusters A and B are being merged, the algorithm finds the pair of points between A and B that are closest together and merges the clusters based on that distance.
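
As a quick illustration of how these linkage distances differ, the sketch below computes the single, complete, and average linkage distances between two small made-up clusters A and B using SciPy’s pairwise distance helper.

import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[0.0, 0.0], [0.0, 1.0]])   # illustrative cluster A
B = np.array([[3.0, 0.0], [4.0, 1.0]])   # illustrative cluster B

pairwise = cdist(A, B)                        # all point-to-point distances between A and B
print("Single linkage:  ", pairwise.min())    # distance of the closest pair
print("Complete linkage:", pairwise.max())    # distance of the farthest pair
print("Average linkage: ", pairwise.mean())   # mean distance over all pairs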

Agglomerative clustering continues this process until all points are merged into one cluster, with the dendrogram visualizing the hierarchy of cluster merges.
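
The merge loop itself can be made explicit with a short from-scratch sketch. This toy implementation uses single linkage and a brute-force search for the closest pair of clusters; the function name and data are hypothetical, and real libraries use far more efficient algorithms.

import numpy as np

def agglomerative_single_linkage(X, n_clusters):
    clusters = [[i] for i in range(len(X))]          # 1. each point starts as its own cluster
    while len(clusters) > n_clusters:                # 3. stop at the desired number of clusters
        best = (0, 1, np.inf)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between the closest pair of points
                d = min(np.linalg.norm(X[a] - X[b]) for a in clusters[i] for b in clusters[j])
                if d < best[2]:
                    best = (i, j, d)
        i, j, _ = best
        clusters[i] += clusters[j]                   # 2. merge the two closest clusters
        del clusters[j]
    return clusters

X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [9, 9]], dtype=float)
print(agglomerative_single_linkage(X, n_clusters=2))   # e.g., [[0, 1], [2, 3, 4]]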

Dendrogram in Hierarchical Clustering

A dendrogram is a tree-like diagram used to visualize the arrangement of clusters formed by hierarchical clustering. It shows how individual data points or clusters are merged at each iteration of the algorithm.

Interpreting a Dendrogram:

Each leaf of the dendrogram represents a single data point. As you move upward, the branches represent merges between clusters. The height at which two clusters are merged represents the distance or dissimilarity between them.

Example:

In a dataset of animals, a dendrogram might show how individual species are grouped based on similarities, with closely related species merging into groups at lower heights, and more distantly related species merging at higher points.

Deciding the Number of Clusters:

The dendrogram helps determine the optimal number of clusters by “cutting” it at a certain level. A horizontal line drawn across the dendrogram at a chosen height intersects a certain number of vertical branches, and that count is the number of clusters at that level.
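
In code, this “cut” corresponds to SciPy’s fcluster function, which returns a flat cluster label for every point given a cut height (the data and the height of 3.0 below are chosen purely for illustration).

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])
Z = linkage(points, method='ward')

# Cut the dendrogram at height 3.0: every merge above that height is undone
labels = fcluster(Z, t=3.0, criterion='distance')
print(labels)   # three clusters: the two tight pairs plus the lone point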

Dendrograms provide an intuitive way to visualize hierarchical clustering and decide how many clusters best represent the data.

Divisive Hierarchical Clustering

Divisive hierarchical clustering starts with all data points in a single cluster and recursively splits them into smaller clusters until each data point forms its own cluster. This top-down approach is less commonly used due to its computational intensity, but it can be useful in situations where a global perspective of the data structure is needed from the outset.
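
Exact divisive algorithms (such as DIANA) evaluate many candidate splits, which is what makes them expensive. The hedged sketch below only approximates the top-down idea by repeatedly bisecting the largest cluster with 2-means; the helper name divisive_split is hypothetical, and this is a simplification rather than the canonical algorithm.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def divisive_split(X, max_clusters=4):
    clusters = [np.arange(len(X))]                       # start with every point in one cluster
    while len(clusters) < max_clusters:
        # pick the largest current cluster and split it in two
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        members = clusters.pop(idx)
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[members])
        clusters.append(members[labels == 0])
        clusters.append(members[labels == 1])
    return clusters

X, _ = make_blobs(n_samples=100, centers=4, cluster_std=0.60, random_state=0)
for i, members in enumerate(divisive_split(X)):
    print(f"Cluster {i}: {len(members)} points")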

Key Differences from Agglomerative Clustering:

  • Complexity: Divisive clustering is more computationally expensive as it requires considering all possible ways to split a cluster at each step.
  • Use Cases: Divisive clustering is ideal when the data has a natural top-down hierarchy, such as organizational structures or family trees.

While agglomerative clustering is more widely used, divisive clustering offers valuable insights in specific scenarios where a top-down understanding of the data structure is required.

Hierarchical Clustering vs. K-Means Clustering

Hierarchical clustering and K-means clustering are both popular clustering techniques, but they differ in several key ways:

Key Differences:

  1. Cluster Numbers: K-means requires specifying the number of clusters in advance, while hierarchical clustering does not.
  2. Algorithm Structure: Hierarchical clustering builds a hierarchy of clusters, while K-means aims to partition the data into K clusters directly.
  3. Dendrogram: Hierarchical clustering produces a dendrogram, allowing for better visualization of the data’s structure, whereas K-means does not.

Benefits of Hierarchical Clustering:

  • No need to predefine the number of clusters.
  • Better for small datasets and hierarchical data.

Benefits of K-Means Clustering:

  • Faster and more efficient for large datasets.
  • Works well when the number of clusters is known.

When to Use:

Hierarchical clustering is ideal when you want to visualize the data hierarchy and don’t know the number of clusters upfront. K-means is preferred for larger datasets where speed and efficiency are more important.
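
A short side-by-side sketch (assuming scikit-learn is available) makes the practical difference visible: both models are fit on the same synthetic blobs and compared against the known labels with the adjusted Rand index.

from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Both are asked for four clusters here; the hierarchical model could instead be cut later via a dendrogram
hc_labels = AgglomerativeClustering(n_clusters=4, linkage='ward').fit_predict(X)
km_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

print("Hierarchical vs. ground truth:", adjusted_rand_score(y_true, hc_labels))
print("K-means vs. ground truth:     ", adjusted_rand_score(y_true, km_labels))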

Python Implementation of Hierarchical Clustering

Here’s a step-by-step guide to implementing agglomerative hierarchical clustering in Python using SciPy and Matplotlib.

Step 1: Data Preprocessing

import numpy as np
from sklearn.datasets import make_blobs

# Generate sample data with four well-separated blobs
X, y = make_blobs(n_samples=100, centers=4, cluster_std=0.60, random_state=0)

Step 2: Using Dendrogram to Find the Optimal Number of Clusters

import scipy.cluster.hierarchy as sch
import matplotlib.pyplot as plt

# Create the dendrogram
plt.figure(figsize=(10, 7))
dendrogram = sch.dendrogram(sch.linkage(X, method='ward'))
plt.title('Dendrogram')
plt.xlabel('Data Points')
plt.ylabel('Euclidean distances')
plt.show()

This dendrogram helps decide the optimal number of clusters: look for the largest vertical distance that can be spanned without crossing any horizontal merge line, and place the cut within that gap.
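
Continuing from the code above, the “largest vertical gap” heuristic can also be computed directly from the linkage matrix: its third column holds the merge heights, and the biggest jump between successive heights suggests where to cut. This is a rule of thumb, not a guarantee of the best clustering.

Z = sch.linkage(X, method='ward')             # same linkage matrix used for the dendrogram
gaps = np.diff(Z[:, 2])                       # height differences between successive merges
suggested_k = len(Z) - int(np.argmax(gaps))   # clusters remaining just below the largest jump
print("Suggested number of clusters:", suggested_k)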

Step 3: Training the Hierarchical Clustering Model

from sklearn.cluster import AgglomerativeClustering

# Train the hierarchical clustering model
# (recent scikit-learn releases removed the old 'affinity' argument;
# Ward linkage uses Euclidean distance by default)
hc = AgglomerativeClustering(n_clusters=4, linkage='ward')
y_hc = hc.fit_predict(X)

Step 4: Visualizing the Clusters

plt.scatter(X[y_hc == 0, 0], X[y_hc == 0, 1], s=100, c='red', label='Cluster 1')
plt.scatter(X[y_hc == 1, 0], X[y_hc == 1, 1], s=100, c='blue', label='Cluster 2')
plt.scatter(X[y_hc == 2, 0], X[y_hc == 2, 1], s=100, c='green', label='Cluster 3')
plt.scatter(X[y_hc == 3, 0], X[y_hc == 3, 1], s=100, c='cyan', label='Cluster 4')
plt.title('Clusters of Data')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()

This code visualizes the data points in their respective clusters.

Applications of Hierarchical Clustering

Hierarchical clustering is widely used across various fields:

  • Biology: Used to classify species based on genetic information, hierarchical clustering helps visualize evolutionary relationships.
  • Marketing: Businesses use hierarchical clustering for customer segmentation, enabling them to target specific groups with tailored marketing strategies.
  • Social Network Analysis: It helps uncover hidden community structures in social networks by clustering individuals based on interactions and relationships.
  • Image Processing: Hierarchical clustering is useful in image segmentation, helping identify regions of interest in an image.

Real-World Example:

In the healthcare industry, hierarchical clustering is used to analyze patient data, helping identify subgroups of patients with similar characteristics for personalized treatments.

Hierarchical clustering’s flexibility makes it a valuable tool for a range of domains that require a deeper understanding of relationships within the data.

Challenges and Limitations

While hierarchical clustering offers unique benefits, it also faces several challenges:

  • Memory and Computational Complexity: Hierarchical clustering is computationally expensive, especially for large datasets, as it requires calculating and storing the distances between all pairs of data points.
  • Difficulty with Large Datasets: As the dataset size increases, the performance of hierarchical clustering can degrade, making it less suitable for big data applications.
  • Irreversible Merging: Once clusters are merged or split, the algorithm does not revisit earlier decisions, which could result in suboptimal clusters.

Solutions:

To work around these limitations, cluster a representative sample of a large dataset and assign the remaining points afterwards, or switch to a more scalable method such as K-means when the dataset is very large. Ward’s method remains a good default linkage choice because its variance-minimizing merges tend to produce compact, similarly sized clusters.
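
As a concrete example of the sampling idea, the hedged sketch below clusters a 10% random sample hierarchically and then assigns every remaining point to the centroid of its nearest sample cluster; this is an approximation of clustering the full dataset, and the variable names are illustrative.

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances_argmin

X, _ = make_blobs(n_samples=10000, centers=4, cluster_std=0.60, random_state=0)

# Hierarchically cluster a 1,000-point random sample instead of all 10,000 points
rng = np.random.default_rng(0)
sample_idx = rng.choice(len(X), size=1000, replace=False)
sample_labels = AgglomerativeClustering(n_clusters=4, linkage='ward').fit_predict(X[sample_idx])

# Assign every point in the full dataset to the nearest sample-cluster centroid
centroids = np.vstack([X[sample_idx][sample_labels == k].mean(axis=0) for k in range(4)])
full_labels = pairwise_distances_argmin(X, centroids)
print(np.bincount(full_labels))   # rough cluster sizes over the full dataset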

Conclusion

Hierarchical clustering is a versatile and intuitive clustering technique that provides a clear visual representation of data through dendrograms. Its ability to uncover nested relationships makes it particularly useful for small to medium-sized datasets and applications like biology, marketing, and social network analysis. While it has computational limitations, especially with large datasets, it remains a valuable tool for data scientists who need to explore and visualize complex relationships between data points. By following the steps outlined in this guide, readers can implement and experiment with hierarchical clustering to gain deeper insights into their data.
