Principal Component Analysis (PCA) is a machine learning algorithm for dimensionality reduction.
It uses matrix operations from statistics and linear algebra to find the directions that contribute the most to the variance of the data, which lets us drop the remaining dimensions and reduce training time.
For example, suppose we have a dataset with 1000 features. Training a model on it will be slow because of those 1000 dimensions. But if we could keep only the features that account for most of the variance in the data, we could discard the rest without hurting accuracy much. PCA lets us do exactly that: select the top k components and cut the training time.
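As a toy illustration of that variance intuition (with made-up numbers, not the dataset used below): a feature that barely varies carries almost no information, so dropping it loses very little.

```python
import numpy as np

# Hypothetical 3-sample dataset: the first feature varies a lot,
# while the second is nearly constant and could be dropped cheaply.
X = np.array([[1.0, 100.0],
              [2.0, 100.1],
              [3.0,  99.9]])
print(X.var(axis=0))  # per-feature variance: large vs. tiny
```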
In this post, we will write the program for the PCA algorithm in Python. We will not use any machine learning libraries, only NumPy for the array and matrix operations.
Input
We will use the dataset given below (the first 20 rows of the classic Iris dataset): 20 items, each with 4 features. First, we convert it into a NumPy array.
```python
import numpy as np

data = [[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2], [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2], [5.0, 3.6, 1.4, 0.2], [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3], [5.0, 3.4, 1.5, 0.2], [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1], [5.4, 3.7, 1.5, 0.2], [4.8, 3.4, 1.6, 0.2],
        [4.8, 3.0, 1.4, 0.1], [4.3, 3.0, 1.1, 0.1], [5.8, 4.0, 1.2, 0.2],
        [5.7, 4.4, 1.5, 0.4], [5.4, 3.9, 1.3, 0.4], [5.1, 3.5, 1.4, 0.3],
        [5.7, 3.8, 1.7, 0.3], [5.1, 3.8, 1.5, 0.3]]

# Keep an untouched copy of the original data for the selection step later.
odata = np.copy(np.array(data))
print(np.array(data))
```
```
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [5.4 3.7 1.5 0.2]
 [4.8 3.4 1.6 0.2]
 [4.8 3.  1.4 0.1]
 [4.3 3.  1.1 0.1]
 [5.8 4.  1.2 0.2]
 [5.7 4.4 1.5 0.4]
 [5.4 3.9 1.3 0.4]
 [5.1 3.5 1.4 0.3]
 [5.7 3.8 1.7 0.3]
 [5.1 3.8 1.5 0.3]]
```
Subtract the mean
Next, we will center the data by subtracting each feature's mean from its column. PCA measures variance around the mean, so the data must be centered before we compute the covariance matrix.
```python
# Center each column (feature) by subtracting its mean.
for j in range(len(data[0])):
    total = 0
    for i in range(len(data)):
        total += data[i][j]
    mean = total / len(data)
    for i in range(len(data)):
        data[i][j] -= mean
print(np.array(data))
```
```
[[ 0.065  0.02  -0.035 -0.035]
 [-0.135 -0.48  -0.035 -0.035]
 [-0.335 -0.28  -0.135 -0.035]
 [-0.435 -0.38   0.065 -0.035]
 [-0.035  0.12  -0.035 -0.035]
 [ 0.365  0.42   0.265  0.165]
 [-0.435 -0.08  -0.035  0.065]
 [-0.035 -0.08   0.065 -0.035]
 [-0.635 -0.58  -0.035 -0.035]
 [-0.135 -0.38   0.065 -0.135]
 [ 0.365  0.22   0.065 -0.035]
 [-0.235 -0.08   0.165 -0.035]
 [-0.235 -0.48  -0.035 -0.135]
 [-0.735 -0.48  -0.335 -0.135]
 [ 0.765  0.52  -0.235 -0.035]
 [ 0.665  0.92   0.065  0.165]
 [ 0.365  0.42  -0.135  0.165]
 [ 0.065  0.02  -0.035  0.065]
 [ 0.665  0.32   0.265  0.065]
 [ 0.065  0.32   0.065  0.065]]
```
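As an aside, NumPy broadcasting can do the same centering in one line; this is an equivalent alternative to the nested loops above (applied here to the untouched copy, since `data` is already centered at this point):

```python
# Equivalent centering with broadcasting: subtract each column's
# mean from every row in a single vectorized operation.
centered = np.array(odata, dtype=float)
centered = centered - centered.mean(axis=0)
print(centered)
```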
Covariance matrix
Next, we will calculate the covariance matrix of the features. Note that np.cov treats each row as a variable, so we pass in the transposed data.
```python
# np.cov expects variables (features) as rows, hence the transpose.
data = np.array(data)
covdata = np.cov(data.T)
print(covdata)
```
```
[[0.18239474 0.15231579 0.01976316 0.02239474]
 [0.15231579 0.16589474 0.01547368 0.02863158]
 [0.01976316 0.01547368 0.02134211 0.00502632]
 [0.02239474 0.02863158 0.00502632 0.00871053]]
```
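If you would rather not rely on np.cov either, the same matrix follows directly from the definition of sample covariance; this is a minimal sketch, assuming `data` is already mean-centered as above:

```python
# Sample covariance from the definition C = X^T X / (n - 1),
# valid here because the columns of `data` have zero mean.
n = data.shape[0]
covdata_manual = data.T @ data / (n - 1)
print(np.allclose(covdata_manual, covdata))  # True
```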
Eigenvalues and eigenvectors
Next, we will calculate the eigenvalues and eigenvectors of the covariance matrix. The eigenvectors give the directions of the principal components, and each eigenvalue measures how much of the variance lies along its eigenvector.
```python
# Each column of `vectors` is the eigenvector that corresponds
# to the entry at the same index in `values`.
values, vectors = np.linalg.eig(covdata)
print(values)
print()
print(vectors)
```
```
[0.33276835 0.002671   0.02383619 0.01906657]

[[ 0.71816179  0.131601    0.61745716 -0.2926969 ]
 [ 0.68211748 -0.27163784 -0.65996887  0.15927874]
 [ 0.08126075 -0.16686365  0.37215116  0.90942659]
 [ 0.1111579   0.93864295 -0.21140307  0.24880129]]
```
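Since a covariance matrix is always symmetric, np.linalg.eigh is a natural alternative here; keep in mind that it returns the eigenvalues in ascending order, so the largest component comes last:

```python
# eigh is specialized for symmetric matrices and returns the
# eigenvalues sorted in ascending order.
w, v = np.linalg.eigh(covdata)
print(w)  # same eigenvalues as above, but sorted
```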
Choosing components and forming a feature vector
Now, we will rank the components by their eigenvalues: the larger the eigenvalue, the more of the total variance its component explains. First we convert each eigenvalue into a fraction of the total, then we keep the top k and build the reduced dataset from them.
```python
# Fraction of the total variance explained by each eigenvalue.
var = []
for i in range(len(values)):
    var.append(values[i] / np.sum(values))
print(var)
```
```
[0.8795435298377663, 0.007059744713736781, 0.06300167349739227, 0.0503950519511047]
```
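The same ratios can be computed in one vectorized step (`var_ratio` is just an illustrative name for this alternative):

```python
# Vectorized equivalent of the loop above.
var_ratio = values / values.sum()  # NumPy array instead of a Python list
print(var_ratio)
```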
```python
# Keep the k components with the largest variance ratios.
k = 2
s = np.argsort(np.array(var))  # indices in ascending order of ratio
s = s[::-1]                    # descending: largest ratios first
s = s[0:k]                     # top-k indices
s = sorted(s)                  # restore the original column order

# Build the reduced dataset from the original (uncentered) data,
# keeping only the columns at the selected indices.
ndata = []
odata = np.array(odata).T
for i in s:
    ndata.append(odata[i])
print("New data:")
print(np.array(ndata).T)
```
```
New data:
[[5.1 1.4]
 [4.9 1.4]
 [4.7 1.3]
 [4.6 1.5]
 [5.  1.4]
 [5.4 1.7]
 [4.6 1.4]
 [5.  1.5]
 [4.4 1.4]
 [4.9 1.5]
 [5.4 1.5]
 [4.8 1.6]
 [4.8 1.4]
 [4.3 1.1]
 [5.8 1.2]
 [5.7 1.5]
 [5.4 1.3]
 [5.1 1.4]
 [5.7 1.7]
 [5.1 1.5]]
```
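Note that this keeps k of the original feature columns, which is a feature-selection shortcut rather than the textbook PCA transform. Standard PCA instead projects the centered data onto the top-k eigenvectors, producing k new composite features. A minimal sketch of that projection step, reusing the `values`, `vectors`, and centered `data` from above:

```python
# Textbook PCA transform: project the centered data onto the
# eigenvectors belonging to the k largest eigenvalues.
idx = np.argsort(values)[::-1][:k]  # indices of the top-k eigenvalues
projected = data @ vectors[:, idx]  # shape: (n_samples, k)
print(projected.shape)              # (20, 2)
```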
That’s it. Below is the complete code for the PCA algorithm.
Complete code
```python
import numpy as np

data = [[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2], [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2], [5.0, 3.6, 1.4, 0.2], [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3], [5.0, 3.4, 1.5, 0.2], [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1], [5.4, 3.7, 1.5, 0.2], [4.8, 3.4, 1.6, 0.2],
        [4.8, 3.0, 1.4, 0.1], [4.3, 3.0, 1.1, 0.1], [5.8, 4.0, 1.2, 0.2],
        [5.7, 4.4, 1.5, 0.4], [5.4, 3.9, 1.3, 0.4], [5.1, 3.5, 1.4, 0.3],
        [5.7, 3.8, 1.7, 0.3], [5.1, 3.8, 1.5, 0.3]]

# Keep an untouched copy of the original data for the selection step.
odata = np.copy(np.array(data))
print(np.array(data))

# Center each column (feature) by subtracting its mean.
for j in range(len(data[0])):
    total = 0
    for i in range(len(data)):
        total += data[i][j]
    mean = total / len(data)
    for i in range(len(data)):
        data[i][j] -= mean
print(np.array(data))

# Covariance matrix: np.cov expects features as rows, hence the transpose.
data = np.array(data)
covdata = np.cov(data.T)
print(covdata)

# Eigendecomposition of the covariance matrix.
values, vectors = np.linalg.eig(covdata)
print(values)
print()
print(vectors)

# Fraction of the total variance explained by each eigenvalue.
var = []
for i in range(len(values)):
    var.append(values[i] / np.sum(values))
print(var)

# Keep the k components with the largest variance ratios.
k = 2
s = np.argsort(np.array(var))  # ascending order of ratios
s = s[::-1]                    # descending: largest first
s = s[0:k]                     # top-k indices
s = sorted(s)                  # restore the original column order

ndata = []
odata = np.array(odata).T
for i in s:
    ndata.append(odata[i])
print("New data:")
print(np.array(ndata).T)
```
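As a sanity check, you can compare the variance ratios against scikit-learn, assuming it is installed; note that sklearn's PCA returns projected components, so its transformed output will differ from the feature-selection result above:

```python
# Optional cross-check (requires scikit-learn).
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
projected = pca.fit_transform(odata.T)  # odata was transposed above
print(pca.explained_variance_ratio_)   # ~[0.8795, 0.0630]
```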
Other Machine Learning algorithms:
- Naive Bayes Classification
- K Nearest Neighbors
- Linear Regression
- K Means Clustering
- Apriori Algorithm
Let us know in the comments if you have any questions about this machine learning algorithm.
And if you found this post helpful, please share it with your friends. Thank you!