Principal Component Analysis (PCA) Program in Python from Scratch

Principal Component Analysis (PCA) is a machine learning algorithm for dimensionality reduction.

It uses matrix operations from statistics and linear algebra to find the directions that contribute the most to the variance of the data, which in turn reduces training time.

For example, suppose we have a dataset with 1000 features. Training on it takes a long time because of the 1000 dimensions. If we could keep only the features that account for most of the variance in the data, we could discard the rest without affecting accuracy much. Principal Component Analysis lets us do exactly that: select the top k components and thereby reduce the training time.

In this post, we will write a program for the PCA algorithm in Python, using only NumPy for array handling and no machine learning libraries.

Input

We have the dataset given below. It consists of 20 samples with 4 features each. First, we will convert the dataset into a NumPy array.

```
import numpy as np

data = [[5.1, 3.5, 1.4, 0.2],
        [4.9, 3.0, 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5.0, 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5.0, 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5.4, 3.7, 1.5, 0.2],
        [4.8, 3.4, 1.6, 0.2],
        [4.8, 3.0, 1.4, 0.1],
        [4.3, 3.0, 1.1, 0.1],
        [5.8, 4.0, 1.2, 0.2],
        [5.7, 4.4, 1.5, 0.4],
        [5.4, 3.9, 1.3, 0.4],
        [5.1, 3.5, 1.4, 0.3],
        [5.7, 3.8, 1.7, 0.3],
        [5.1, 3.8, 1.5, 0.3]]
odata = np.array(data)  # untouched copy of the original values
print(np.array(data))
```
```
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [5.4 3.7 1.5 0.2]
 [4.8 3.4 1.6 0.2]
 [4.8 3.  1.4 0.1]
 [4.3 3.  1.1 0.1]
 [5.8 4.  1.2 0.2]
 [5.7 4.4 1.5 0.4]
 [5.4 3.9 1.3 0.4]
 [5.1 3.5 1.4 0.3]
 [5.7 3.8 1.7 0.3]
 [5.1 3.8 1.5 0.3]]
```

Subtract the mean

Next, we will subtract each feature's mean from the dataset so that the data is centered at the origin; the principal components are defined relative to this center.

```
# Subtract each column's mean so the data is centered at the origin
for j in range(len(data[0])):
    col_sum = 0  # avoid shadowing the built-in sum
    for i in range(len(data)):
        col_sum += data[i][j]
    for i in range(len(data)):
        data[i][j] -= col_sum / len(data)
print(np.array(data))
```
```
[[ 0.065  0.02  -0.035 -0.035]
 [-0.135 -0.48  -0.035 -0.035]
 [-0.335 -0.28  -0.135 -0.035]
 [-0.435 -0.38   0.065 -0.035]
 [-0.035  0.12  -0.035 -0.035]
 [ 0.365  0.42   0.265  0.165]
 [-0.435 -0.08  -0.035  0.065]
 [-0.035 -0.08   0.065 -0.035]
 [-0.635 -0.58  -0.035 -0.035]
 [-0.135 -0.38   0.065 -0.135]
 [ 0.365  0.22   0.065 -0.035]
 [-0.235 -0.08   0.165 -0.035]
 [-0.235 -0.48  -0.035 -0.135]
 [-0.735 -0.48  -0.335 -0.135]
 [ 0.765  0.52  -0.235 -0.035]
 [ 0.665  0.92   0.065  0.165]
 [ 0.365  0.42  -0.135  0.165]
 [ 0.065  0.02  -0.035  0.065]
 [ 0.665  0.32   0.265  0.065]
 [ 0.065  0.32   0.065  0.065]]
```
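The nested loops above can also be written as a single NumPy broadcast, which subtracts each column's mean from that column. A minimal sketch on a small hypothetical array:

```python
import numpy as np

# Hypothetical 3x4 sample; the loop version is equivalent to this one-liner.
X = np.array([[5.1, 3.5, 1.4, 0.2],
              [4.9, 3.0, 1.4, 0.2],
              [4.7, 3.2, 1.3, 0.2]])
centered = X - X.mean(axis=0)    # column means broadcast across rows
print(centered.mean(axis=0))     # each column now averages to (nearly) zero
```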

Covariance matrix

Next, we will calculate the covariance matrix.

```
# np.cov expects variables as rows, so transpose the (samples x features) array
data = np.array(data)
covdata = np.cov(data.T)
print(covdata)
```
```
[[0.18239474 0.15231579 0.01976316 0.02239474]
 [0.15231579 0.16589474 0.01547368 0.02863158]
 [0.01976316 0.01547368 0.02134211 0.00502632]
 [0.02239474 0.02863158 0.00502632 0.00871053]]
```
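For intuition, `np.cov` on centered data is just the matrix product `X.T @ X` divided by `n - 1` (the sample covariance). A small sketch on a hypothetical centered array to confirm:

```python
import numpy as np

# Hypothetical centered 3x2 data: np.cov(X.T) should equal (X.T @ X) / (n - 1)
X = np.array([[ 0.2,  0.3],
              [-0.1, -0.2],
              [-0.1, -0.1]])
X = X - X.mean(axis=0)           # ensure the columns are centered
n = X.shape[0]
manual_cov = X.T @ X / (n - 1)   # sample covariance, computed by hand
print(np.allclose(manual_cov, np.cov(X.T)))
```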

Eigenvalues and eigenvectors

Next, we will compute the eigenvalues and eigenvectors of the covariance matrix. Each eigenvector is a principal direction, and its eigenvalue measures the variance along that direction.

```
values, vectors = np.linalg.eig(covdata)
print(values)
print()
print(vectors)
```
```
[0.33276835 0.002671   0.02383619 0.01906657]

[[ 0.71816179  0.131601    0.61745716 -0.2926969 ]
 [ 0.68211748 -0.27163784 -0.65996887  0.15927874]
 [ 0.08126075 -0.16686365  0.37215116  0.90942659]
 [ 0.1111579   0.93864295 -0.21140307  0.24880129]]
```
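As a sanity check, each pair should satisfy the defining equation C v = λ v. A covariance matrix is symmetric, so `np.linalg.eigh` is arguably the better routine here: it is numerically stabler for symmetric matrices and returns the eigenvalues sorted in ascending order. A minimal sketch on a hypothetical 2x2 matrix:

```python
import numpy as np

# Hypothetical symmetric matrix standing in for a covariance matrix
C = np.array([[2.0, 0.5],
              [0.5, 1.0]])
values, vectors = np.linalg.eigh(C)     # eigh: for symmetric/Hermitian matrices
for lam, v in zip(values, vectors.T):   # columns of `vectors` are eigenvectors
    print(np.allclose(C @ v, lam * v))  # C v = lambda v holds for every pair
```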

Choosing components and forming a feature vector

Now, we will compute each eigenvalue's share of the total variance and pick the k components with the largest shares. The code below then keeps the columns of the original dataset at those top-k positions to form the reduced dataset.

```
# Each eigenvalue's share of the total variance
var = []
for i in range(len(values)):
    var.append(values[i] / np.sum(values))
print(var)
```
`[0.8795435298377663, 0.007059744713736781, 0.06300167349739227, 0.0503950519511047]`
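Rather than fixing k by hand, a common heuristic is to pick the smallest k whose cumulative explained-variance ratio passes a threshold such as 90%. A sketch using the eigenvalues printed above (the 0.90 threshold is an assumption, not part of the original code):

```python
import numpy as np

# Eigenvalues from the step above
values = np.array([0.33276835, 0.002671, 0.02383619, 0.01906657])
ratios = np.sort(values)[::-1] / values.sum()   # variance shares, largest first
cumulative = np.cumsum(ratios)                  # running total of the shares
k = int(np.searchsorted(cumulative, 0.90)) + 1  # smallest k covering 90%
print(k)  # 2
```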
```
k = 2
s = np.argsort(np.array(var))  # indices sorted ascending by variance share
s = s[::-1]                    # largest first
s = s[0:k]                     # keep the top k
s = sorted(s)                  # restore original column order
ndata = []
odata = odata.T                # one row per feature
for i in s:
    ndata.append(odata[i])
print("New data:")
print(np.array(ndata).T)
```
```
New data:
[[5.1 1.4]
 [4.9 1.4]
 [4.7 1.3]
 [4.6 1.5]
 [5.  1.4]
 [5.4 1.7]
 [4.6 1.4]
 [5.  1.5]
 [4.4 1.4]
 [4.9 1.5]
 [5.4 1.5]
 [4.8 1.6]
 [4.8 1.4]
 [4.3 1.1]
 [5.8 1.2]
 [5.7 1.5]
 [5.4 1.3]
 [5.1 1.4]
 [5.7 1.7]
 [5.1 1.5]]
```
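Note that the code above keeps k of the original columns. Textbook PCA instead *projects* the centered data onto the top-k eigenvectors, producing new composite features. A minimal sketch of that projection step, on a small hypothetical 2-D dataset so it stands alone:

```python
import numpy as np

# Hypothetical 6x2 dataset for illustration
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])
Xc = X - X.mean(axis=0)                # center the data first
values, vectors = np.linalg.eig(np.cov(Xc.T))
order = np.argsort(values)[::-1]       # eigenvalue indices, largest first
k = 1
W = vectors[:, order[:k]]              # feature matrix: top-k eigenvectors as columns
projected = Xc @ W                     # each sample becomes k new coordinates
print(projected.shape)                 # (6, 1)
```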

That’s it. Below is the complete code for the PCA algorithm.

Complete code

```
import numpy as np

data = [[5.1, 3.5, 1.4, 0.2],
        [4.9, 3.0, 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5.0, 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5.0, 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5.4, 3.7, 1.5, 0.2],
        [4.8, 3.4, 1.6, 0.2],
        [4.8, 3.0, 1.4, 0.1],
        [4.3, 3.0, 1.1, 0.1],
        [5.8, 4.0, 1.2, 0.2],
        [5.7, 4.4, 1.5, 0.4],
        [5.4, 3.9, 1.3, 0.4],
        [5.1, 3.5, 1.4, 0.3],
        [5.7, 3.8, 1.7, 0.3],
        [5.1, 3.8, 1.5, 0.3]]
odata = np.array(data)  # untouched copy of the original values

print(np.array(data))

# Subtract each column's mean so the data is centered at the origin
for j in range(len(data[0])):
    col_sum = 0  # avoid shadowing the built-in sum
    for i in range(len(data)):
        col_sum += data[i][j]
    for i in range(len(data)):
        data[i][j] -= col_sum / len(data)
print(np.array(data))

# Covariance matrix of the centered data (features as rows after transpose)
data = np.array(data)
covdata = np.cov(data.T)
print(covdata)

# Eigenvalues and eigenvectors of the covariance matrix
values, vectors = np.linalg.eig(covdata)
print(values)
print()
print(vectors)

# Each eigenvalue's share of the total variance
var = []
for i in range(len(values)):
    var.append(values[i] / np.sum(values))
print(var)

# Keep the columns corresponding to the k largest eigenvalues
k = 2
s = np.argsort(np.array(var))  # indices sorted ascending by variance share
s = s[::-1]                    # largest first
s = s[0:k]                     # keep the top k
s = sorted(s)                  # restore original column order
ndata = []
odata = odata.T                # one row per feature
for i in s:
    ndata.append(odata[i])
print("New data:")
print(np.array(ndata).T)
```


Let us know in the comments if you have any questions about this machine learning algorithm.