DCF: An Efficient and Robust Density-Based Clustering Method

Mimi Zhang [ https://orcid.org/0000-0002-3807-297X ]

This notebook introduces the DCF algorithm for cluster analysis.

Introduction

DCF is the acronym for density-core finding, an improvement on the popular density-peak clustering (DPC) method. The DPC method detects modes as points with (1) high density and (2) large distance to points of higher density. Because it represents each cluster by a single point mode, DPC often fails to adequately represent clusters with areas of relatively uniform density, where no single point stands out on both criteria.
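As a concrete reference, here is a minimal sketch of the DPC peak-finding criterion, assuming densities have already been estimated; the function name and the brute-force pairwise distances are purely illustrative choices, not a reference implementation.

```python
import numpy as np
from scipy.spatial.distance import cdist

def dpc_criterion(X, density):
    """DPC peak-finding criterion: each point's density times the
    distance to its nearest neighbor of higher density."""
    n = X.shape[0]
    D = cdist(X, X)                       # brute-force pairwise distances, O(n^2)
    order = np.argsort(-density)          # indices from highest to lowest density
    delta = np.empty(n)
    delta[order[0]] = D[order[0]].max()   # convention for the global density peak
    for rank, i in enumerate(order[1:], start=1):
        delta[i] = D[i, order[:rank]].min()   # nearest point of higher density
    return density * delta
```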

DCF is both efficient and robust, gracefully scaling to big datasets. Both improvements are the result of directing the peak-finding technique to discover modal sets, rather than point modes.

The Algorithm

The DCF algorithm consists of the following main steps:

(1) Compute the peak-finding criterion for each data point and select the instance $\mathbf{x}$ with maximal value.

(2) The component set $\mathbf{S}_{\beta}(\mathbf{x})$ containing $\mathbf{x}$ is the first cluster core. All points in $\mathbf{S}_{\beta}(\mathbf{x})$ are labelled with $Assessed$ and excluded from further consideration.

(3) Among the points not yet assessed, the instance $\mathbf{x}$ with maximal value of the peak-finding criterion is selected.

(4) The component set $\mathbf{S}_{\beta}(\mathbf{x})$ containing $\mathbf{x}$ is found.

  • All points in $\mathbf{S}_{\beta}(\mathbf{x})$ are labelled with $Assessed$ and excluded from further consideration.
  • $\mathbf{S}_{\beta}(\mathbf{x})$ is another cluster core only if it is disjoint from all previously detected cluster cores.

(5) Repeat Steps 3 and 4 until all points have been labelled with $Assessed$.

Each cluster core represents a different cluster, and the number of detected cluster cores is the number of clusters in the data. After Step 5, there will be many data points that are labelled with $Assessed$ but do not belong to any cluster core. These non-core points are from the component sets $\mathbf{S}_{\beta}(\mathbf{x})$ in Step 4 that overlap with previously detected cluster cores.

(6) Each non-core point is assigned to the same cluster as its nearest neighbor of higher density.

DCF uses the peak-finding criterion of DPC to detect cluster cores. The peak-finding criterion is computed for each data point, and the instance with maximum value is selected as a center (Step 1). The cluster core containing this center is then found, and all instances belonging to it are removed from consideration as potential centers (Step 2). The algorithm continues detecting cluster cores until none remain in the data (Steps 3-5). The allocation procedure is unchanged from the DPC method: each non-core point is allocated to the same cluster as its nearest neighbor of higher estimated density (Step 6).
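The following is a compact, self-contained sketch of the whole procedure, assuming Euclidean data and a monotone proxy for the k-NN density estimate; the parameter defaults, the overlap test via labels, and the brute-force distance matrix are all illustrative choices, not the reference implementation.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import NearestNeighbors

def dcf_sketch(X, k=20, beta=0.4):
    """Illustrative sketch of DCF (Steps 1-6); not the reference implementation."""
    n, d = X.shape
    dist, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    density = 1.0 / (dist[:, -1] ** d + 1e-12)   # monotone proxy for the k-NN density

    # Mutual k-NN graph: edge (i, j) iff i and j are among each other's k nearest neighbors.
    rows = np.repeat(np.arange(n), k)
    A = csr_matrix((np.ones(n * k), (rows, idx[:, 1:].ravel())), shape=(n, n))
    A = csr_matrix(A.minimum(A.T))

    # Peak-finding criterion of DPC (brute force; fine for small demos).
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    order = np.argsort(-density)
    delta = np.empty(n)
    delta[order[0]] = D[order[0]].max()
    for rank, i in enumerate(order[1:], start=1):
        delta[i] = D[i, order[:rank]].min()
    crit = density * delta

    labels = -np.ones(n, dtype=int)     # -1 = not (yet) in any cluster core
    assessed = np.zeros(n, dtype=bool)
    n_cores = 0
    while not assessed.all():           # Steps 1, 3, 5
        rest = np.flatnonzero(~assessed)
        x = rest[np.argmax(crit[rest])]
        S = np.flatnonzero(density >= (1 - beta) * density[x])   # level set of x
        _, comp = connected_components(A[S][:, S], directed=False)
        core = S[comp == comp[np.searchsorted(S, x)]]            # component set of x
        assessed[core] = True                                    # Steps 2, 4
        if labels[core].max() < 0:      # disjoint from all previously detected cores
            labels[core] = n_cores
            n_cores += 1

    # Step 6: assign each non-core point to the cluster of its nearest
    # neighbor of higher density, sweeping from high to low density.
    for pos, i in enumerate(order):
        if labels[i] < 0:
            higher = order[:pos]
            if higher.size:             # guard for the degenerate top-density case
                labels[i] = labels[higher[np.argmin(D[i, higher])]]
    return labels
```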

The Two Inputs

The algorithm requires two user inputs: the neighborhood parameter $k$ (for density estimation) and the fluctuation parameter $\beta$ (determining how much the density estimates can vary within a cluster core). In particular, the algorithm adopts a non-parametric density estimator: for any data point $\mathbf{x}\in\mathrm{R}^d$, its density estimate is \begin{equation*} \hat{f}(\mathbf{x})=\frac{k}{\text{sample size}\times \text{volume of the unit ball in } \mathrm{R}^d \times r_k(\mathbf{x})^d}, \end{equation*} where $r_k(\mathbf{x})$ is the distance between $\mathbf{x}$ and its $k$th nearest neighbor in $\mathcal{X}$. For a user, specifying the value of $k$ is much easier than specifying the cutoff distance in the DPC algorithm.
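The estimator translates directly into code; a small sketch (the function name is ours):

```python
import numpy as np
from math import gamma, pi
from sklearn.neighbors import NearestNeighbors

def knn_density(X, k):
    """k-NN density estimate: k / (n * v_d * r_k(x)^d)."""
    n, d = X.shape
    dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    r_k = dist[:, -1]                        # distance to the k-th nearest neighbor
    v_d = pi ** (d / 2) / gamma(d / 2 + 1)   # volume of the unit ball in R^d
    return k / (n * v_d * r_k ** d)
```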

The definition of a cluster core depends on the notion of the mutual $k$-NN graph. Let $G$ denote the undirected mutual $k$-NN graph constructed from the whole dataset $\mathcal{X}$: there is an edge between two vertices in $G$ if and only if each is among the $k$ nearest neighbors of the other. We here explain how the first cluster core in Step 2 is determined. Let $\mathbf{x}^*$ be the point with the maximal value of the peak-finding criterion. The subset $S=\{\mathbf{x}\in\mathcal{X}: \hat{f}(\mathbf{x})\geq (1-\beta)\hat{f}(\mathbf{x}^*)\}$ contains all the data points whose density estimates lie in the interval $[(1-\beta)\hat{f}(\mathbf{x}^*), \hat{f}(\mathbf{x}^*)]$. The subset $S$ induces a sub-graph of $G$, and the cluster core $\mathbf{S}_{\beta}(\mathbf{x}^*)$ is simply the connected component of this sub-graph that contains the point/vertex $\mathbf{x}^*$.
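A sketch of this construction, assuming densities from the estimator above; the function and variable names are illustrative:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import NearestNeighbors

def cluster_core(X, density, x_star, k, beta):
    """Connected component of the (1 - beta) level set of x_star in the
    mutual k-NN graph; a sketch of S_beta(x*) as defined above."""
    n = X.shape[0]
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    rows = np.repeat(np.arange(n), k)
    A = csr_matrix((np.ones(n * k), (rows, idx[:, 1:].ravel())), shape=(n, n))
    A = csr_matrix(A.minimum(A.T))               # mutual k-NN graph G
    S = np.flatnonzero(density >= (1 - beta) * density[x_star])   # the subset S
    _, comp = connected_components(A[S][:, S], directed=False)    # sub-graph of G
    return S[comp == comp[np.searchsorted(S, x_star)]]
```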

Point Modes vs Cluster Cores

An illustrative example demonstrating the benefits of seeking cluster cores of the density. The black curve represents the underlying density, and the grey histogram represents a sample from the density. Left: DPC incorrectly selects both centers from the first cluster, as noise in the density estimate causes the peak-finding method to favor the high-density cluster. Right: cluster cores, represented by dashed lines, better represent the cluster centers.

Comparison of the DPC and DCF methods when applied to the Noisy Circles dataset. The Noisy Circles dataset contains two clusters: a high-density cluster (the inner circle) and a low-density cluster (the outer circle). The DPC method, as seen on the left of the figure, proceeds by searching for the points with maximal values of the peak-finding criterion. It erroneously selects multiple centers from the inner cluster, and the allocation mechanism then incorrectly assigns all points in the outer cluster. For this example, seven points in the inner cluster have a larger value of the peak-finding criterion than the maximum value in the outer cluster.

The DCF procedure is shown on the right of the figure. As in DPC, the instance of maximum density is selected as the first peak. However, DCF then computes the cluster core associated with this point (the highlighted larger green points) and removes all elements of the core from consideration as centers. Of the remaining points, the one with maximal value of the peak-finding criterion lies in the outer cluster. The associated cluster core is visible in yellow. As no edge in the mutual $k$-NN graph exists between this cluster core and the first cluster core, it is accepted as a valid cluster core. The termination procedure is invoked when assessing a third center: the third center is selected as before, but because the cluster core associated with this point contains all of the instances in the dataset, the algorithm terminates.
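The failure mode is easy to reproduce; below is a small self-contained check using scikit-learn's make_circles as a stand-in for the Noisy Circles data (the sample size, noise level, and $k$ are arbitrary choices):

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.neighbors import NearestNeighbors

# Two circles with the same number of points: the inner one (radius factor 0.3)
# is necessarily denser than the outer one.
X, _ = make_circles(n_samples=1000, factor=0.3, noise=0.05, random_state=0)

k, d = 20, X.shape[1]
dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
density = 1.0 / dist[:, -1] ** d   # monotone proxy for the k-NN density estimate

# The highest-density points all sit near radius 0.3, i.e. on the inner circle,
# which is why plain peak-finding keeps selecting centers there.
top = np.argsort(-density)[:10]
print(np.linalg.norm(X[top], axis=1).round(2))
```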

Code Examples

The code examples below were mainly contributed by Sachit Bhardwaj.

Example 1

Example 2

Example 3

Example 4

Example 5

The seed variety dataset consists of measurements of geometrical properties of kernels belonging to three different varieties of wheat. A soft X-ray technique and the GRAINS package were used to construct all seven real-valued attributes. The target class is dropped for the demonstration of the DCF algorithm. This dataset is widely available; the copy used here was downloaded from Kaggle.
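A minimal loading sketch, assuming a local whitespace-separated copy of the file; the path and column names below are ours (mirrors on Kaggle and the UCI repository use this layout):

```python
import pandas as pd

cols = ["area", "perimeter", "compactness", "kernel_length",
        "kernel_width", "asymmetry", "groove_length", "variety"]
seeds = pd.read_csv("seeds_dataset.txt", sep=r"\s+", names=cols)  # hypothetical local path
X = seeds.drop(columns="variety").to_numpy()   # drop the target class
# labels = dcf_sketch(X, k=20, beta=0.4)       # e.g. using the sketch from above
```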

Three clusters were detected.

Example 6

The wine quality dataset contains records related to red and white variants of the Portuguese Vinho Verde wine. It contains information from 1599 red wine samples and 4898 white wine samples. Input variables in the dataset consist of the type of wine (either red or white) and metrics from objective tests (acidity levels, pH values, ABV, etc.), while the target/output variable is a numerical score based on sensory data: the median of at least 3 evaluations made by wine experts. Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). This dataset was downloaded from the UCI Machine Learning Repository.

We first check for null values and replace them with zero.
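A pandas sketch of this preprocessing, assuming local copies of the two UCI files (the paths are hypothetical; the UCI files are semicolon-separated):

```python
import pandas as pd

red = pd.read_csv("winequality-red.csv", sep=";")      # hypothetical local paths
white = pd.read_csv("winequality-white.csv", sep=";")
red["type"], white["type"] = "red", "white"
wine = pd.concat([red, white], ignore_index=True)

print(wine.isnull().sum())   # check for null values
wine = wine.fillna(0)        # replace any nulls with zero
X = wine.drop(columns=["quality", "type"]).to_numpy()  # numeric inputs only
```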

Two clusters were detected.