Principal Component Analysis (PCA) implementation in Python
Principal Component Analysis is a fundamental dimensionality reduction technique in machine learning. It uses concepts from statistics and linear algebra to reduce the dimensionality of a dataset by examining the variance, spread, patterns, and correlations of its features, in such a way that the important, significant information in the data is retained.
Let us perform PCA on the MNIST dataset, a simple computer vision dataset containing 28x28 pixel images of handwritten digits from 0 to 9. Each image is flattened row-wise into a vector of 28x28 = 784 pixel values, so every data point is 784-dimensional.
Step 1: Dataset Loading: The MNIST dataset is stored in CSV (comma-separated values) format.
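A minimal loading sketch. The filename mnist_train.csv and the label-in-the-first-column layout (as in the common Kaggle MNIST CSV) are assumptions here, not part of the original description:

```python
import pandas as pd

# Assumed layout: first column is the digit label (0-9), the remaining
# 784 columns are the flattened pixel intensities of one image.
df = pd.read_csv("mnist_train.csv")  # hypothetical filename

labels = df.iloc[:, 0].values
data = df.iloc[:, 1:].values  # shape: (n_samples, 784)

print(data.shape)
```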
Step 2: Data Standardization: We need to standardize the data to prevent biases in the final outcome. Data is standardized by removing the mean and scaling to unit variance. scikit-learn provides the "StandardScaler" class for this. Centering and scaling happen independently on each feature by computing the formula below:

z = (x - u) / s

where u is the mean of the feature and s is its standard deviation.
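A short sketch of this step with scikit-learn, continuing from the data array loaded above:

```python
from sklearn.preprocessing import StandardScaler

# Center each of the 784 pixel features to zero mean and scale it to unit
# variance; each feature is transformed independently of the others.
standardized_data = StandardScaler().fit_transform(data)

print(standardized_data.shape)  # (n_samples, 784)
```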
Step 3: Calculate the Covariance Matrix: The aim of this step is to understand how the variables of the input dataset vary from the mean with respect to each other, or in other words, to see whether there is any relationship between them. Sometimes variables are so highly correlated that they contain redundant information. To identify these correlations, we compute the covariance matrix.
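One way to sketch this step in NumPy, using the standardized data from the previous step:

```python
import numpy as np

# Covariance matrix of the standardized data. rowvar=False tells NumPy
# that each column (not each row) is a variable, yielding a 784x784
# matrix whose (i, j) entry is the covariance between pixels i and j.
cov_matrix = np.cov(standardized_data, rowvar=False)

print(cov_matrix.shape)  # (784, 784)
```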
Step 4: Calculate Eigenvalues and Eigenvectors: Intuitively, the eigenvalues give us the proportion of variance captured along each direction, and the eigenvectors give us those directions. They are computed from the covariance matrix and used to construct the principal components. Principal components are new variables built as linear combinations of the initial variables. These combinations are made in such a way that the new variables are uncorrelated and most of the information in the initial variables is compressed into the first components. So 784-dimensional data gives us 784 principal components, but PCA puts the maximum possible information into the first component, then the maximum remaining information into the second, and so on. In short, the principal components represent the directions of the data that explain a maximal amount of variance.
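A sketch of the eigendecomposition and a projection onto the top principal components. Keeping two components here is just for illustration, not a prescribed choice:

```python
import numpy as np

# eigh is meant for symmetric matrices like the covariance matrix; it
# returns eigenvalues in ascending order, so reverse to put the
# direction of largest variance first.
eigen_values, eigen_vectors = np.linalg.eigh(cov_matrix)
eigen_values = eigen_values[::-1]
eigen_vectors = eigen_vectors[:, ::-1]

# Each eigenvalue's share of the total is the fraction of variance
# explained by the corresponding principal component.
explained_variance_ratio = eigen_values / eigen_values.sum()
print(explained_variance_ratio[:2])

# Project the 784-dimensional points onto the first two eigenvectors,
# giving a 2-dimensional representation of each image.
projected = standardized_data @ eigen_vectors[:, :2]
print(projected.shape)  # (n_samples, 2)
```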