K-means clustering is a machine learning algorithm that is used to partition the data into K clusters in which each data belongs to the cluster with the nearest mean.
KMC works by calculating the distance of each data in the dataset to K randomly selected points. After this, each data point is assigned to the cluster to which it is nearest. Then, the mean is calculated in each cluster to find a new center for each cluster. Again, the process is repeated till the center remains the same for all the clusters.
In this post, we will write the program for the K-means clustering. We will use python to write this program and we will not use any libraries.
Input
We have the dataset given below. It consists of 2D points. We will use matplotlib to plot the points, although this is optional.
data = [ [5,2], [2,4], [9,5], [4,6], [5,2], [1,5], [6,7], [4,2], [6,4], [9,2], [4,5], [1,6], [4,7], [3,6], [1,1], [8,4], [8,7], [7,2], [2,2], [2,1], [1,2], [1,4], [2,6], [7,7], [7,4], [3,4], [1,4] ] x = [i[0] for i in data] y = [i[1] for i in data] import matplotlib.pyplot as plt plt.scatter(x,y) plt.show()
Distance function
Now, we will define a function to calculate the distance between two points.
import math def dist(center, point): d = 0.0 for i in range(0,len(point)): d += (center[i]-point[i])**2 return math.sqrt(d)
Defining functions
We will now define the functions to assign centers to the dataset and also to find the mean of the clusters.
def assignCenters(centers, dataset): clusters = [] for i in range(len(dataset)): distances = [] for center in centers: distances.append(dist(center, dataset[i])) temp = [z for z, val in enumerate(distances) if val==min(distances)] clusters.append(temp[0]) return clusters def mean_center(k, dataset, clusters): nCenters = [] for i in range(k): x = 0.0 y = 0.0 count = 0 for j in range(len(clusters)): if(i == clusters[j]): x += dataset[j][0] y += dataset[j][1] count += 1 x = x/count y = y/count nCenters.append([x,y]) return nCenters
Inputting initial points
Now we will input the initial points and value of K from the user.
print("enter k") k = int(input()) centers = [] for i in range(k): print("enter center "+str(i)) temp = [int(x) for x in input().split()] centers.append(temp)
enter k 2 enter center 0 6 4 enter center 1 9 2
Algorithm
Now, we will apply the K-means clustering algorithm.
print("Initial centers: ") print(centers) print("Initial clusters: ") clusters = assignCenters(centers, data) for i in range(k): print("cluster "+str(i)) for j in range(len(clusters)): if(i == clusters[j]): print(data[j],end=' ') print() print() for itr in range(10): print("Iteration "+str(itr)) centers = mean_center(k,data,clusters) print("Updated centers: ") print(centers) clusters = assignCenters(centers, data) print("Updated clusters: ") for i in range(k): print("cluster "+str(i)) for j in range(len(clusters)): if(i == clusters[j]): print(data[j],end=' ') print() print()
Initial centers: [[6, 4], [9, 2]] Initial clusters: cluster 0 [5, 2] [2, 4] [4, 6] [5, 2] [1, 5] [6, 7] [4, 2] [6, 4] [4, 5] [1, 6] [4, 7] [3, 6] [1, 1] [8, 4] [8, 7] [2, 2] [2, 1] [1, 2] [1, 4] [2, 6] [7, 7] [7, 4] [3, 4] [1, 4] cluster 1 [9, 5] [9, 2] [7, 2] Iteration 0 Updated centers: [[3.6666666666666665, 4.25], [8.333333333333334, 3.0]] Updated clusters: cluster 0 [5, 2] [2, 4] [4, 6] [5, 2] [1, 5] [6, 7] [4, 2] [6, 4] [4, 5] [1, 6] [4, 7] [3, 6] [1, 1] [2, 2] [2, 1] [1, 2] [1, 4] [2, 6] [3, 4] [1, 4] cluster 1 [9, 5] [9, 2] [8, 4] [8, 7] [7, 2] [7, 7] [7, 4] Iteration 1 Updated centers: [[2.9, 4.0], [7.857142857142857, 4.428571428571429]] Updated clusters: cluster 0 [5, 2] [2, 4] [4, 6] [5, 2] [1, 5] [4, 2] [4, 5] [1, 6] [4, 7] [3, 6] [1, 1] [2, 2] [2, 1] [1, 2] [1, 4] [2, 6] [3, 4] [1, 4] cluster 1 [9, 5] [6, 7] [6, 4] [9, 2] [8, 4] [8, 7] [7, 2] [7, 7] [7, 4] Iteration 2 Updated centers: [[2.5555555555555554, 3.8333333333333335], [7.444444444444445, 4.666666666666667]] Updated clusters: cluster 0 [5, 2] [2, 4] [4, 6] [5, 2] [1, 5] [4, 2] [4, 5] [1, 6] [4, 7] [3, 6] [1, 1] [2, 2] [2, 1] [1, 2] [1, 4] [2, 6] [3, 4] [1, 4] cluster 1 [9, 5] [6, 7] [6, 4] [9, 2] [8, 4] [8, 7] [7, 2] [7, 7] [7, 4] Iteration 3 Updated centers: [[2.5555555555555554, 3.8333333333333335], [7.444444444444445, 4.666666666666667]] Updated clusters: cluster 0 [5, 2] [2, 4] [4, 6] [5, 2] [1, 5] [4, 2] [4, 5] [1, 6] [4, 7] [3, 6] [1, 1] [2, 2] [2, 1] [1, 2] [1, 4] [2, 6] [3, 4] [1, 4] cluster 1 [9, 5] [6, 7] [6, 4] [9, 2] [8, 4] [8, 7] [7, 2] [7, 7] [7, 4] Iteration 4 Updated centers: [[2.5555555555555554, 3.8333333333333335], [7.444444444444445, 4.666666666666667]] Updated clusters: cluster 0 [5, 2] [2, 4] [4, 6] [5, 2] [1, 5] [4, 2] [4, 5] [1, 6] [4, 7] [3, 6] [1, 1] [2, 2] [2, 1] [1, 2] [1, 4] [2, 6] [3, 4] [1, 4] cluster 1 [9, 5] [6, 7] [6, 4] [9, 2] [8, 4] [8, 7] [7, 2] [7, 7] [7, 4] Iteration 5 Updated centers: [[2.5555555555555554, 3.8333333333333335], [7.444444444444445, 4.666666666666667]] Updated clusters: cluster 0 [5, 2] [2, 4] [4, 6] [5, 2] [1, 5] [4, 2] [4, 5] [1, 6] [4, 7] [3, 6] [1, 1] [2, 2] [2, 1] [1, 2] [1, 4] [2, 6] [3, 4] [1, 4] cluster 1 [9, 5] [6, 7] [6, 4] [9, 2] [8, 4] [8, 7] [7, 2] [7, 7] [7, 4] Iteration 6 Updated centers: [[2.5555555555555554, 3.8333333333333335], [7.444444444444445, 4.666666666666667]] Updated clusters: cluster 0 [5, 2] [2, 4] [4, 6] [5, 2] [1, 5] [4, 2] [4, 5] [1, 6] [4, 7] [3, 6] [1, 1] [2, 2] [2, 1] [1, 2] [1, 4] [2, 6] [3, 4] [1, 4] cluster 1 [9, 5] [6, 7] [6, 4] [9, 2] [8, 4] [8, 7] [7, 2] [7, 7] [7, 4] Iteration 7 Updated centers: [[2.5555555555555554, 3.8333333333333335], [7.444444444444445, 4.666666666666667]] Updated clusters: cluster 0 [5, 2] [2, 4] [4, 6] [5, 2] [1, 5] [4, 2] [4, 5] [1, 6] [4, 7] [3, 6] [1, 1] [2, 2] [2, 1] [1, 2] [1, 4] [2, 6] [3, 4] [1, 4] cluster 1 [9, 5] [6, 7] [6, 4] [9, 2] [8, 4] [8, 7] [7, 2] [7, 7] [7, 4] Iteration 8 Updated centers: [[2.5555555555555554, 3.8333333333333335], [7.444444444444445, 4.666666666666667]] Updated clusters: cluster 0 [5, 2] [2, 4] [4, 6] [5, 2] [1, 5] [4, 2] [4, 5] [1, 6] [4, 7] [3, 6] [1, 1] [2, 2] [2, 1] [1, 2] [1, 4] [2, 6] [3, 4] [1, 4] cluster 1 [9, 5] [6, 7] [6, 4] [9, 2] [8, 4] [8, 7] [7, 2] [7, 7] [7, 4] Iteration 9 Updated centers: [[2.5555555555555554, 3.8333333333333335], [7.444444444444445, 4.666666666666667]] Updated clusters: cluster 0 [5, 2] [2, 4] [4, 6] [5, 2] [1, 5] [4, 2] [4, 5] [1, 6] [4, 7] [3, 6] [1, 1] [2, 2] [2, 1] [1, 2] [1, 4] [2, 6] [3, 4] [1, 4] cluster 1 [9, 5] [6, 7] [6, 4] [9, 2] [8, 4] [8, 7] [7, 2] [7, 7] [7, 4]
As we can see from the output of the program, the clusters does not change after Iteration 2, which means that we have correctly assigned the clusters to the dataset.
Complete code
data = [ [5,2], [2,4], [9,5], [4,6], [5,2], [1,5], [6,7], [4,2], [6,4], [9,2], [4,5], [1,6], [4,7], [3,6], [1,1], [8,4], [8,7], [7,2], [2,2], [2,1], [1,2], [1,4], [2,6], [7,7], [7,4], [3,4], [1,4] ] x = [i[0] for i in data] y = [i[1] for i in data] import matplotlib.pyplot as plt plt.scatter(x,y) plt.show() import math def dist(center, point): d = 0.0 for i in range(0,len(point)): d += (center[i]-point[i])**2 return math.sqrt(d) def assignCenters(centers, dataset): clusters = [] for i in range(len(dataset)): distances = [] for center in centers: distances.append(dist(center, dataset[i])) temp = [z for z, val in enumerate(distances) if val==min(distances)] clusters.append(temp[0]) return clusters def mean_center(k, dataset, clusters): nCenters = [] for i in range(k): x = 0.0 y = 0.0 count = 0 for j in range(len(clusters)): if(i == clusters[j]): x += dataset[j][0] y += dataset[j][1] count += 1 x = x/count y = y/count nCenters.append([x,y]) return nCenters print("enter k") k = int(input()) centers = [] for i in range(k): print("enter center "+str(i)) temp = [int(x) for x in input().split()] centers.append(temp) print("Initial centers: ") print(centers) print("Initial clusters: ") clusters = assignCenters(centers, data) for i in range(k): print("cluster "+str(i)) for j in range(len(clusters)): if(i == clusters[j]): print(data[j],end=' ') print() print() for itr in range(10): print("Iteration "+str(itr)) centers = mean_center(k,data,clusters) print("Updated centers: ") print(centers) clusters = assignCenters(centers, data) print("Updated clusters: ") for i in range(k): print("cluster "+str(i)) for j in range(len(clusters)): if(i == clusters[j]): print(data[j],end=' ') print() print()
Other Machine Learning algorithms:
- Naive Bayes Classification
- K Nearest Neighbors
- Linear Regression
- K Means Clustering
- Apriori Algorithm
- Principal Component Analysis (PCA)
Let us know in the comments if you are having any questions regarding this machine learning algorithm.
And if you found this post helpful, then please help us by sharing this post with your friends. Thank You