# K Means Clustering Program in Python from Scratch

K-means clustering is a machine learning algorithm that is used to partition the data into K clusters in which each data belongs to the cluster with the nearest mean.

KMC works by calculating the distance of each data in the dataset to K randomly selected points. After this, each data point is assigned to the cluster to which it is nearest. Then, the mean is calculated in each cluster to find a new center for each cluster. Again, the process is repeated till the center remains the same for all the clusters.

In this post, we will write the program for the K-means clustering. We will use python to write this program and we will not use any libraries.

### Input

We have the dataset given below. It consists of 2D points. We will use matplotlib to plot the points, although this is optional.

```data = [
[5,2],
[2,4],
[9,5],
[4,6],
[5,2],
[1,5],
[6,7],
[4,2],
[6,4],
[9,2],
[4,5],
[1,6],
[4,7],
[3,6],
[1,1],
[8,4],
[8,7],
[7,2],
[2,2],
[2,1],
[1,2],
[1,4],
[2,6],
[7,7],
[7,4],
[3,4],
[1,4]
]

x = [i[0] for i in data]
y = [i[1] for i in data]

import matplotlib.pyplot as plt

plt.scatter(x,y)
plt.show()```

### Distance function

Now, we will define a function to calculate the distance between two points.

```import math
def dist(center, point):
d = 0.0
for i in range(0,len(point)):
d += (center[i]-point[i])**2
return math.sqrt(d)```

### Defining functions

We will now define the functions to assign centers to the dataset and also to find the mean of the clusters.

```def assignCenters(centers, dataset):
clusters = []
for i in range(len(dataset)):
distances = []
for center in centers:
distances.append(dist(center, dataset[i]))
temp = [z for z, val in enumerate(distances) if val==min(distances)]
clusters.append(temp[0])
return clusters

def mean_center(k, dataset, clusters):
nCenters = []
for i in range(k):
x = 0.0
y = 0.0
count = 0
for j in range(len(clusters)):
if(i == clusters[j]):
x += dataset[j][0]
y += dataset[j][1]
count += 1
x = x/count
y = y/count
nCenters.append([x,y])
return nCenters```

### Inputting initial points

Now we will input the initial points and value of K from the user.

```print("enter k")
k = int(input())
centers = []
for i in range(k):
print("enter center "+str(i))
temp = [int(x) for x in input().split()]
centers.append(temp)```
```enter k
2
enter center 0
6 4
enter center 1
9 2```

### Algorithm

Now, we will apply the K-means clustering algorithm.

```print("Initial centers: ")
print(centers)
print("Initial clusters: ")
clusters = assignCenters(centers, data)
for i in range(k):
print("cluster "+str(i))
for j in range(len(clusters)):
if(i == clusters[j]):
print(data[j],end=' ')
print()
print()
for itr in range(10):
print("Iteration "+str(itr))
centers = mean_center(k,data,clusters)
print("Updated centers: ")
print(centers)
clusters = assignCenters(centers, data)
print("Updated clusters: ")
for i in range(k):
print("cluster "+str(i))
for j in range(len(clusters)):
if(i == clusters[j]):
print(data[j],end=' ')
print()
print()```
```Initial centers:
[[6, 4], [9, 2]]
Initial clusters:
cluster 0
[5, 2] [2, 4] [4, 6] [5, 2] [1, 5] [6, 7] [4, 2] [6, 4] [4, 5] [1, 6] [4, 7] [3, 6] [1, 1] [8, 4] [8, 7] [2, 2] [2, 1] [1, 2] [1, 4] [2, 6] [7, 7] [7, 4] [3, 4] [1, 4]
cluster 1
[9, 5] [9, 2] [7, 2]

Iteration 0
Updated centers:
[[3.6666666666666665, 4.25], [8.333333333333334, 3.0]]
Updated clusters:
cluster 0
[5, 2] [2, 4] [4, 6] [5, 2] [1, 5] [6, 7] [4, 2] [6, 4] [4, 5] [1, 6] [4, 7] [3, 6] [1, 1] [2, 2] [2, 1] [1, 2] [1, 4] [2, 6] [3, 4] [1, 4]
cluster 1
[9, 5] [9, 2] [8, 4] [8, 7] [7, 2] [7, 7] [7, 4]

Iteration 1
Updated centers:
[[2.9, 4.0], [7.857142857142857, 4.428571428571429]]
Updated clusters:
cluster 0
[5, 2] [2, 4] [4, 6] [5, 2] [1, 5] [4, 2] [4, 5] [1, 6] [4, 7] [3, 6] [1, 1] [2, 2] [2, 1] [1, 2] [1, 4] [2, 6] [3, 4] [1, 4]
cluster 1
[9, 5] [6, 7] [6, 4] [9, 2] [8, 4] [8, 7] [7, 2] [7, 7] [7, 4]

Iteration 2
Updated centers:
[[2.5555555555555554, 3.8333333333333335], [7.444444444444445, 4.666666666666667]]
Updated clusters:
cluster 0
[5, 2] [2, 4] [4, 6] [5, 2] [1, 5] [4, 2] [4, 5] [1, 6] [4, 7] [3, 6] [1, 1] [2, 2] [2, 1] [1, 2] [1, 4] [2, 6] [3, 4] [1, 4]
cluster 1
[9, 5] [6, 7] [6, 4] [9, 2] [8, 4] [8, 7] [7, 2] [7, 7] [7, 4]

Iteration 3
Updated centers:
[[2.5555555555555554, 3.8333333333333335], [7.444444444444445, 4.666666666666667]]
Updated clusters:
cluster 0
[5, 2] [2, 4] [4, 6] [5, 2] [1, 5] [4, 2] [4, 5] [1, 6] [4, 7] [3, 6] [1, 1] [2, 2] [2, 1] [1, 2] [1, 4] [2, 6] [3, 4] [1, 4]
cluster 1
[9, 5] [6, 7] [6, 4] [9, 2] [8, 4] [8, 7] [7, 2] [7, 7] [7, 4]

Iteration 4
Updated centers:
[[2.5555555555555554, 3.8333333333333335], [7.444444444444445, 4.666666666666667]]
Updated clusters:
cluster 0
[5, 2] [2, 4] [4, 6] [5, 2] [1, 5] [4, 2] [4, 5] [1, 6] [4, 7] [3, 6] [1, 1] [2, 2] [2, 1] [1, 2] [1, 4] [2, 6] [3, 4] [1, 4]
cluster 1
[9, 5] [6, 7] [6, 4] [9, 2] [8, 4] [8, 7] [7, 2] [7, 7] [7, 4]

Iteration 5
Updated centers:
[[2.5555555555555554, 3.8333333333333335], [7.444444444444445, 4.666666666666667]]
Updated clusters:
cluster 0
[5, 2] [2, 4] [4, 6] [5, 2] [1, 5] [4, 2] [4, 5] [1, 6] [4, 7] [3, 6] [1, 1] [2, 2] [2, 1] [1, 2] [1, 4] [2, 6] [3, 4] [1, 4]
cluster 1
[9, 5] [6, 7] [6, 4] [9, 2] [8, 4] [8, 7] [7, 2] [7, 7] [7, 4]

Iteration 6
Updated centers:
[[2.5555555555555554, 3.8333333333333335], [7.444444444444445, 4.666666666666667]]
Updated clusters:
cluster 0
[5, 2] [2, 4] [4, 6] [5, 2] [1, 5] [4, 2] [4, 5] [1, 6] [4, 7] [3, 6] [1, 1] [2, 2] [2, 1] [1, 2] [1, 4] [2, 6] [3, 4] [1, 4]
cluster 1
[9, 5] [6, 7] [6, 4] [9, 2] [8, 4] [8, 7] [7, 2] [7, 7] [7, 4]

Iteration 7
Updated centers:
[[2.5555555555555554, 3.8333333333333335], [7.444444444444445, 4.666666666666667]]
Updated clusters:
cluster 0
[5, 2] [2, 4] [4, 6] [5, 2] [1, 5] [4, 2] [4, 5] [1, 6] [4, 7] [3, 6] [1, 1] [2, 2] [2, 1] [1, 2] [1, 4] [2, 6] [3, 4] [1, 4]
cluster 1
[9, 5] [6, 7] [6, 4] [9, 2] [8, 4] [8, 7] [7, 2] [7, 7] [7, 4]

Iteration 8
Updated centers:
[[2.5555555555555554, 3.8333333333333335], [7.444444444444445, 4.666666666666667]]
Updated clusters:
cluster 0
[5, 2] [2, 4] [4, 6] [5, 2] [1, 5] [4, 2] [4, 5] [1, 6] [4, 7] [3, 6] [1, 1] [2, 2] [2, 1] [1, 2] [1, 4] [2, 6] [3, 4] [1, 4]
cluster 1
[9, 5] [6, 7] [6, 4] [9, 2] [8, 4] [8, 7] [7, 2] [7, 7] [7, 4]

Iteration 9
Updated centers:
[[2.5555555555555554, 3.8333333333333335], [7.444444444444445, 4.666666666666667]]
Updated clusters:
cluster 0
[5, 2] [2, 4] [4, 6] [5, 2] [1, 5] [4, 2] [4, 5] [1, 6] [4, 7] [3, 6] [1, 1] [2, 2] [2, 1] [1, 2] [1, 4] [2, 6] [3, 4] [1, 4]
cluster 1
[9, 5] [6, 7] [6, 4] [9, 2] [8, 4] [8, 7] [7, 2] [7, 7] [7, 4] ```

As we can see from the output of the program, the clusters does not change after Iteration 2, which means that we have correctly assigned the clusters to the dataset.

### Complete code

```data = [
[5,2],
[2,4],
[9,5],
[4,6],
[5,2],
[1,5],
[6,7],
[4,2],
[6,4],
[9,2],
[4,5],
[1,6],
[4,7],
[3,6],
[1,1],
[8,4],
[8,7],
[7,2],
[2,2],
[2,1],
[1,2],
[1,4],
[2,6],
[7,7],
[7,4],
[3,4],
[1,4]
]

x = [i[0] for i in data]
y = [i[1] for i in data]

import matplotlib.pyplot as plt

plt.scatter(x,y)
plt.show()

import math

def dist(center, point):
d = 0.0
for i in range(0,len(point)):
d += (center[i]-point[i])**2
return math.sqrt(d)

def assignCenters(centers, dataset):
clusters = []
for i in range(len(dataset)):
distances = []
for center in centers:
distances.append(dist(center, dataset[i]))
temp = [z for z, val in enumerate(distances) if val==min(distances)]
clusters.append(temp[0])
return clusters

def mean_center(k, dataset, clusters):
nCenters = []
for i in range(k):
x = 0.0
y = 0.0
count = 0
for j in range(len(clusters)):
if(i == clusters[j]):
x += dataset[j][0]
y += dataset[j][1]
count += 1
x = x/count
y = y/count
nCenters.append([x,y])
return nCenters

print("enter k")
k = int(input())
centers = []
for i in range(k):
print("enter center "+str(i))
temp = [int(x) for x in input().split()]
centers.append(temp)

print("Initial centers: ")
print(centers)
print("Initial clusters: ")
clusters = assignCenters(centers, data)
for i in range(k):
print("cluster "+str(i))
for j in range(len(clusters)):
if(i == clusters[j]):
print(data[j],end=' ')
print()
print()
for itr in range(10):
print("Iteration "+str(itr))
centers = mean_center(k,data,clusters)
print("Updated centers: ")
print(centers)
clusters = assignCenters(centers, data)
print("Updated clusters: ")
for i in range(k):
print("cluster "+str(i))
for j in range(len(clusters)):
if(i == clusters[j]):
print(data[j],end=' ')
print()
print()```

### Other Machine Learning algorithms:

Let us know in the comments if you are having any questions regarding this machine learning algorithm.