K Means Clustering Program in Python from Scratch

K-means clustering is a machine learning algorithm that is used to partition the data into K clusters in which each data belongs to the cluster with the nearest mean.

KMC works by calculating the distance of each data in the dataset to K randomly selected points. After this, each data point is assigned to the cluster to which it is nearest. Then, the mean is calculated in each cluster to find a new center for each cluster. Again, the process is repeated till the center remains the same for all the clusters.

In this post, we will write the program for the K-means clustering. We will use python to write this program and we will not use any libraries.

Input

We have the dataset given below. It consists of 2D points. We will use matplotlib to plot the points, although this is optional.

data = [
        [5,2],
        [2,4],
        [9,5],
        [4,6],
        [5,2],
        [1,5],
        [6,7],
        [4,2],
        [6,4],
        [9,2],
        [4,5],
        [1,6],
        [4,7],
        [3,6],
        [1,1],
        [8,4],
        [8,7],
        [7,2],
        [2,2],
        [2,1],
        [1,2],
        [1,4],
        [2,6],
        [7,7],
        [7,4],
        [3,4],
        [1,4]
        ]

x = [i[0] for i in data]
y = [i[1] for i in data]

import matplotlib.pyplot as plt

plt.scatter(x,y)
plt.show()
Plotted points

Distance function

Now, we will define a function to calculate the distance between two points.

import math
def dist(center, point):
    d = 0.0
    for i in range(0,len(point)):
        d += (center[i]-point[i])**2
    return math.sqrt(d)

Defining functions

We will now define the functions to assign centers to the dataset and also to find the mean of the clusters.

def assignCenters(centers, dataset):
    clusters = []
    for i in range(len(dataset)):
        distances = []
        for center in centers:
            distances.append(dist(center, dataset[i]))
        temp = [z for z, val in enumerate(distances) if val==min(distances)]
        clusters.append(temp[0])
    return clusters

def mean_center(k, dataset, clusters):
    nCenters = []
    for i in range(k):
        x = 0.0
        y = 0.0
        count = 0
        for j in range(len(clusters)):
            if(i == clusters[j]):
                x += dataset[j][0]
                y += dataset[j][1]
                count += 1
        x = x/count
        y = y/count
        nCenters.append([x,y])
    return nCenters

Inputting initial points

Now we will input the initial points and value of K from the user.

print("enter k")
k = int(input())
centers = []
for i in range(k):
    print("enter center "+str(i))
    temp = [int(x) for x in input().split()]
    centers.append(temp)
enter k
2
enter center 0
6 4
enter center 1
9 2

Algorithm

Now, we will apply the K-means clustering algorithm.

print("Initial centers: ")
print(centers)
print("Initial clusters: ")
clusters = assignCenters(centers, data)
for i in range(k):
    print("cluster "+str(i))
    for j in range(len(clusters)):
        if(i == clusters[j]):
            print(data[j],end=' ')
    print()
print()
for itr in range(10):
    print("Iteration "+str(itr))
    centers = mean_center(k,data,clusters)
    print("Updated centers: ")
    print(centers)
    clusters = assignCenters(centers, data)
    print("Updated clusters: ")
    for i in range(k):
        print("cluster "+str(i))
        for j in range(len(clusters)):
            if(i == clusters[j]):
                print(data[j],end=' ')
        print()
    print()
Initial centers: 
[[6, 4], [9, 2]]
Initial clusters: 
cluster 0
[5, 2] [2, 4] [4, 6] [5, 2] [1, 5] [6, 7] [4, 2] [6, 4] [4, 5] [1, 6] [4, 7] [3, 6] [1, 1] [8, 4] [8, 7] [2, 2] [2, 1] [1, 2] [1, 4] [2, 6] [7, 7] [7, 4] [3, 4] [1, 4] 
cluster 1
[9, 5] [9, 2] [7, 2] 

Iteration 0
Updated centers: 
[[3.6666666666666665, 4.25], [8.333333333333334, 3.0]]
Updated clusters: 
cluster 0
[5, 2] [2, 4] [4, 6] [5, 2] [1, 5] [6, 7] [4, 2] [6, 4] [4, 5] [1, 6] [4, 7] [3, 6] [1, 1] [2, 2] [2, 1] [1, 2] [1, 4] [2, 6] [3, 4] [1, 4] 
cluster 1
[9, 5] [9, 2] [8, 4] [8, 7] [7, 2] [7, 7] [7, 4] 

Iteration 1
Updated centers: 
[[2.9, 4.0], [7.857142857142857, 4.428571428571429]]
Updated clusters: 
cluster 0
[5, 2] [2, 4] [4, 6] [5, 2] [1, 5] [4, 2] [4, 5] [1, 6] [4, 7] [3, 6] [1, 1] [2, 2] [2, 1] [1, 2] [1, 4] [2, 6] [3, 4] [1, 4] 
cluster 1
[9, 5] [6, 7] [6, 4] [9, 2] [8, 4] [8, 7] [7, 2] [7, 7] [7, 4] 

Iteration 2
Updated centers: 
[[2.5555555555555554, 3.8333333333333335], [7.444444444444445, 4.666666666666667]]
Updated clusters: 
cluster 0
[5, 2] [2, 4] [4, 6] [5, 2] [1, 5] [4, 2] [4, 5] [1, 6] [4, 7] [3, 6] [1, 1] [2, 2] [2, 1] [1, 2] [1, 4] [2, 6] [3, 4] [1, 4] 
cluster 1
[9, 5] [6, 7] [6, 4] [9, 2] [8, 4] [8, 7] [7, 2] [7, 7] [7, 4] 

Iteration 3
Updated centers: 
[[2.5555555555555554, 3.8333333333333335], [7.444444444444445, 4.666666666666667]]
Updated clusters: 
cluster 0
[5, 2] [2, 4] [4, 6] [5, 2] [1, 5] [4, 2] [4, 5] [1, 6] [4, 7] [3, 6] [1, 1] [2, 2] [2, 1] [1, 2] [1, 4] [2, 6] [3, 4] [1, 4] 
cluster 1
[9, 5] [6, 7] [6, 4] [9, 2] [8, 4] [8, 7] [7, 2] [7, 7] [7, 4] 

Iteration 4
Updated centers: 
[[2.5555555555555554, 3.8333333333333335], [7.444444444444445, 4.666666666666667]]
Updated clusters: 
cluster 0
[5, 2] [2, 4] [4, 6] [5, 2] [1, 5] [4, 2] [4, 5] [1, 6] [4, 7] [3, 6] [1, 1] [2, 2] [2, 1] [1, 2] [1, 4] [2, 6] [3, 4] [1, 4] 
cluster 1
[9, 5] [6, 7] [6, 4] [9, 2] [8, 4] [8, 7] [7, 2] [7, 7] [7, 4] 

Iteration 5
Updated centers: 
[[2.5555555555555554, 3.8333333333333335], [7.444444444444445, 4.666666666666667]]
Updated clusters: 
cluster 0
[5, 2] [2, 4] [4, 6] [5, 2] [1, 5] [4, 2] [4, 5] [1, 6] [4, 7] [3, 6] [1, 1] [2, 2] [2, 1] [1, 2] [1, 4] [2, 6] [3, 4] [1, 4] 
cluster 1
[9, 5] [6, 7] [6, 4] [9, 2] [8, 4] [8, 7] [7, 2] [7, 7] [7, 4] 

Iteration 6
Updated centers: 
[[2.5555555555555554, 3.8333333333333335], [7.444444444444445, 4.666666666666667]]
Updated clusters: 
cluster 0
[5, 2] [2, 4] [4, 6] [5, 2] [1, 5] [4, 2] [4, 5] [1, 6] [4, 7] [3, 6] [1, 1] [2, 2] [2, 1] [1, 2] [1, 4] [2, 6] [3, 4] [1, 4] 
cluster 1
[9, 5] [6, 7] [6, 4] [9, 2] [8, 4] [8, 7] [7, 2] [7, 7] [7, 4] 

Iteration 7
Updated centers: 
[[2.5555555555555554, 3.8333333333333335], [7.444444444444445, 4.666666666666667]]
Updated clusters: 
cluster 0
[5, 2] [2, 4] [4, 6] [5, 2] [1, 5] [4, 2] [4, 5] [1, 6] [4, 7] [3, 6] [1, 1] [2, 2] [2, 1] [1, 2] [1, 4] [2, 6] [3, 4] [1, 4] 
cluster 1
[9, 5] [6, 7] [6, 4] [9, 2] [8, 4] [8, 7] [7, 2] [7, 7] [7, 4] 

Iteration 8
Updated centers: 
[[2.5555555555555554, 3.8333333333333335], [7.444444444444445, 4.666666666666667]]
Updated clusters: 
cluster 0
[5, 2] [2, 4] [4, 6] [5, 2] [1, 5] [4, 2] [4, 5] [1, 6] [4, 7] [3, 6] [1, 1] [2, 2] [2, 1] [1, 2] [1, 4] [2, 6] [3, 4] [1, 4] 
cluster 1
[9, 5] [6, 7] [6, 4] [9, 2] [8, 4] [8, 7] [7, 2] [7, 7] [7, 4] 

Iteration 9
Updated centers: 
[[2.5555555555555554, 3.8333333333333335], [7.444444444444445, 4.666666666666667]]
Updated clusters: 
cluster 0
[5, 2] [2, 4] [4, 6] [5, 2] [1, 5] [4, 2] [4, 5] [1, 6] [4, 7] [3, 6] [1, 1] [2, 2] [2, 1] [1, 2] [1, 4] [2, 6] [3, 4] [1, 4] 
cluster 1
[9, 5] [6, 7] [6, 4] [9, 2] [8, 4] [8, 7] [7, 2] [7, 7] [7, 4] 

As we can see from the output of the program, the clusters does not change after Iteration 2, which means that we have correctly assigned the clusters to the dataset.

Complete code

data = [
        [5,2],
        [2,4],
        [9,5],
        [4,6],
        [5,2],
        [1,5],
        [6,7],
        [4,2],
        [6,4],
        [9,2],
        [4,5],
        [1,6],
        [4,7],
        [3,6],
        [1,1],
        [8,4],
        [8,7],
        [7,2],
        [2,2],
        [2,1],
        [1,2],
        [1,4],
        [2,6],
        [7,7],
        [7,4],
        [3,4],
        [1,4]
        ]

x = [i[0] for i in data]
y = [i[1] for i in data]

import matplotlib.pyplot as plt

plt.scatter(x,y)
plt.show()

import math

def dist(center, point):
    d = 0.0
    for i in range(0,len(point)):
        d += (center[i]-point[i])**2
    return math.sqrt(d)

def assignCenters(centers, dataset):
    clusters = []
    for i in range(len(dataset)):
        distances = []
        for center in centers:
            distances.append(dist(center, dataset[i]))
        temp = [z for z, val in enumerate(distances) if val==min(distances)]
        clusters.append(temp[0])
    return clusters

def mean_center(k, dataset, clusters):
    nCenters = []
    for i in range(k):
        x = 0.0
        y = 0.0
        count = 0
        for j in range(len(clusters)):
            if(i == clusters[j]):
                x += dataset[j][0]
                y += dataset[j][1]
                count += 1
        x = x/count
        y = y/count
        nCenters.append([x,y])
    return nCenters

print("enter k")
k = int(input())
centers = []
for i in range(k):
    print("enter center "+str(i))
    temp = [int(x) for x in input().split()]
    centers.append(temp)

print("Initial centers: ")
print(centers)
print("Initial clusters: ")
clusters = assignCenters(centers, data)
for i in range(k):
    print("cluster "+str(i))
    for j in range(len(clusters)):
        if(i == clusters[j]):
            print(data[j],end=' ')
    print()
print()
for itr in range(10):
    print("Iteration "+str(itr))
    centers = mean_center(k,data,clusters)
    print("Updated centers: ")
    print(centers)
    clusters = assignCenters(centers, data)
    print("Updated clusters: ")
    for i in range(k):
        print("cluster "+str(i))
        for j in range(len(clusters)):
            if(i == clusters[j]):
                print(data[j],end=' ')
        print()
    print()

Other Machine Learning algorithms:

Let us know in the comments if you are having any questions regarding this machine learning algorithm.

And if you found this post helpful, then please help us by sharing this post with your friends. Thank You

Leave a Reply