Selecting optimal number of clusters in KMeans Algorithm(Silhouette Score)
KMeans is a unsupervised machine learning technique primarily used for the clubbing similar instances together. In case of supervised learning, there are multiple ways of validating the results. But in case of unsupervised learning, the only way to validate the score is trough visualization. But, this technique also has its limitations which is the higher number of dimensions.
While training the model, an expert is required to decide the number of the clusters that should be there. This becomes very difficult as the dimension of the data increases. There are two simple and popular methods of doing this.
- Inertia: It is defined as the mean squared distance between each instance and its closest centroid. Logically, as per the definition lower the inertia better the model. In general, the KMeans package of sklearn runs the algorithm for ‘n_init’ number of times and chooses the one having the lowest inertia. This approcah has a limitation, as the number of clusters increases, closest will be the clusters from the centroids and lower will be the inertia. Therefore, in this case, it is a good idea to plot the graph of inertia as the number of clusters increase. Elbow rule is used in orde rto find the optimal number of clusters. As depicted in the following diagram, curve looks like a hand and the number of clusters to be chosen over there should be equal to 3 as after that curve reaches a plateau.
2. Silhouette Score: This is a better measure to decide the number of clusters to be formulated from the data. It is calculated for each instance and the formula goes like this:
Silhouette Coefficient = (x-y)/ max(x,y)
where, y is the mean intra cluster distance: mean distance to the other instances in the same cluster. x depicts mean nearest cluster distance i.e. mean distance to the instances of the next closest cluster.
The coefficient varies between -1 and 1. A value close to 1 implies that the instance is close to its cluster is a part of the right cluster. Whereas, a value close to -1 means that the value is assigned to the wrong cluster.
As per this method k=3 was a local optima, whereas k=5 should be chosen for the number of clusters. This method is better as it makes the decision regarding the optimal number of clusters more meaningful and clear. But this metric is computation expensive as the coefficient is calculated for every instance. Therefore, decision regarding the optimal metric to be chosen for the number of cluster decision is to be made according to the needs of the product.