Kick Out the Curse of Dimensionality Using Principal Component Analysis (PCA)

Abhishekrastogi
5 min read · May 11, 2021

When we work on a huge raw dataset with many dimensions, we usually end up with a lot of inconsistent and redundant features. These features add little useful information, yet they increase computation time and complicate exploratory data analysis and data processing. This phenomenon is called the Curse of Dimensionality. To overcome it, we should reduce or drop the features that are not important, which we can do using dimensionality reduction techniques. PCA is one of the simplest dimensionality reduction techniques used in the industry.

PCA examines the variance, spread, patterns and correlations of the features and reduces the dimensionality in such a way that the important or significant information in the data is retained. To implement this, we need to understand the different steps performed in PCA.

Step 1: Data Preprocessing: Column Normalization/Standardization: Column normalization and standardization are two different techniques for scaling data so that all the data points lie within a comparable range. Column standardization is the one most often used in industry. Scaling prevents features with large ranges from biasing the outcome.

Column Normalization: Let’s assume we have a feature f with data points (a1, a2, ..., an). Let max(a) be the maximum value among all data points and min(a) the minimum value. We create a new set of values (a1', a2', ..., an') such that ai' = (ai - min(a)) / (max(a) - min(a)) for all i in 1...n. After this transformation every ai' lies in [0, 1]. So column normalization squashes the data so that all values lie in [0, 1] irrespective of the original scale.

Column Normalization
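As a quick illustration, here is a minimal NumPy sketch of column normalization (the toy values and variable names are mine, not from the article):

```python
import numpy as np

# Hypothetical toy feature column
a = np.array([4.0, 8.0, 15.0, 16.0, 23.0, 42.0])

# Min-max normalization: ai' = (ai - min(a)) / (max(a) - min(a))
a_norm = (a - a.min()) / (a.max() - a.min())

print(a_norm)   # every value now lies in [0, 1]
```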

Column Standardization: Let’s assume we have a feature f with data points (a1, a2, ..., an). We create a new set of values (a1', a2', ..., an') such that the mean and standard deviation of the standardized data are 0 and 1 respectively. To achieve this we compute the sample mean (ā) and sample standard deviation (s) and set ai' = (ai - ā)/s for all i in 1...n. This centers the data around 0 and gives it unit spread.

Column Standardization
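And a matching sketch for column standardization, again on made-up values:

```python
import numpy as np

# Same hypothetical feature column as above
a = np.array([4.0, 8.0, 15.0, 16.0, 23.0, 42.0])

# Standardization: ai' = (ai - mean(a)) / std(a)
a_std = (a - a.mean()) / a.std()

print(a_std.mean())   # ~0 (up to floating-point error)
print(a_std.std())    # 1.0
```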

Step 2: Computing Covariance Matrix: The covariance matrix helps us measure the relationship between every pair of features. If we have two features X and Y, the covariance between them is calculated as below:

Cov(X, Y) = (1/n) * Σ (Xi - mean(X)) * (Yi - mean(Y)), summed over i = 1…n

Formula for Covariance between X and Y features

There are 2 important properties of Covariance:

Cov(X,X) = Var(X)

Cov(X,Y) = Cov(Y,X)

So the covariance matrix is a square, symmetric matrix. If the features are standardized as in the previous section, their means are 0, so Cov(X, Y) = (1/n) * Σ Xi * Yi. In matrix terms this is just a dot product: Cov(X, Y) = (1/n) * Transpose(X) * Y. Below are the key points for the covariance matrix (a small computational sketch follows the list):

  1. The covariance value denotes how strongly two variables vary together.
  2. If Cov(X, Y) is positive, the features tend to increase or decrease together.
  3. If Cov(X, Y) is negative, one feature tends to increase while the other decreases.
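As a rough sketch of Step 2, assuming a toy standardized data matrix X with samples in rows (the data and names below are made up for illustration):

```python
import numpy as np

# Hypothetical data matrix: 100 samples x 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # Step 1: standardize each column

# For standardized columns (mean 0), Cov = (1/n) * X^T X
n = X.shape[0]
cov = (X.T @ X) / n

print(cov.shape)                          # (3, 3): square
print(np.allclose(cov, cov.T))            # True: symmetric
print(np.allclose(np.diag(cov), 1.0))     # True: each standardized feature has variance 1
```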

Step 3: Calculate Eigen Vectors and Eigen Values:

Eigen Values and Eigen vectors

If the covariance matrix is a d x d matrix, then we get d eigenvalues and their corresponding eigenvectors. Because the covariance matrix is symmetric, its eigenvectors are perpendicular to each other (Transpose(xi) * xj = 0 for i ≠ j), and we can sort the eigenvalues from largest to smallest.

Eigenvectors and eigenvalues are the mathematical constructs that must be computed from the covariance matrix in order to determine the principal components of the data set. Principal components are the new set of features that are obtained from the initial set of features. These are computed in such a manner that newly obtained features are highly significant and independent of each other.

For every eigenvector there is an eigenvalue. The number of dimensions in the data determines the number of eigenvectors we need to calculate: for a 2-D data set, 2 eigenvectors (and their respective eigenvalues) are computed. The idea behind eigenvectors is to use the covariance matrix to find the directions in which the data has the most variance. Since more variance in the data means more information about the data, eigenvectors are used to identify and compute the principal components. Each eigenvalue simply denotes how much variance lies along its eigenvector.
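A minimal sketch of Step 3, reusing the toy covariance matrix from the previous snippet (numpy.linalg.eigh is used here because the covariance matrix is symmetric):

```python
import numpy as np

# Rebuild the toy standardized data and its covariance matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X = (X - X.mean(axis=0)) / X.std(axis=0)
cov = (X.T @ X) / X.shape[0]

# eigh returns eigenvalues in ascending order for a symmetric matrix
eig_vals, eig_vecs = np.linalg.eigh(cov)

print(eig_vals)                                        # d = 3 eigenvalues
print(np.allclose(eig_vecs.T @ eig_vecs, np.eye(3)))   # True: eigenvectors are orthonormal
```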

Step 4: Computing Principal Components: Once we have computed the eigenvectors and eigenvalues, all we have to do is order them in descending order of eigenvalue: the eigenvector with the highest eigenvalue is the most significant and forms the first principal component. The principal components of lesser significance can be dropped to reduce the dimensions of the data. The final step in computing the principal components is to form a matrix, known as the feature matrix, whose columns are the significant eigenvectors that capture the maximum information about the data.
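Continuing the same toy example, Step 4 might look like this (the value of k and the variable names are my own choices):

```python
import numpy as np

# Toy data, covariance matrix and eigen decomposition as in the earlier sketches
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X = (X - X.mean(axis=0)) / X.std(axis=0)
cov = (X.T @ X) / X.shape[0]
eig_vals, eig_vecs = np.linalg.eigh(cov)

# Sort eigenpairs by descending eigenvalue and keep the top k components
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

k = 2
feature_matrix = eig_vecs[:, :k]   # d x k matrix of the most significant eigenvectors

explained = eig_vals[:k].sum() / eig_vals.sum()
print(f"variance retained by the top {k} components: {explained:.2%}")
```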

Step 5: Replacing and Recreating the Data Matrix: The last step is to re-express the original data in terms of the selected principal components, which represent the most significant information in the data set. To replace the original data axes with the newly formed principal components, we multiply the transpose of the feature matrix by the transpose of the standardized data set, which is the same as projecting the standardized data onto the selected eigenvectors.
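Putting all five steps together, a minimal end-to-end sketch (on the same hypothetical toy data) could look like this:

```python
import numpy as np

# Hypothetical toy data: 100 samples x 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X = (X - X.mean(axis=0)) / X.std(axis=0)     # Step 1: standardize
cov = (X.T @ X) / X.shape[0]                 # Step 2: covariance matrix
eig_vals, eig_vecs = np.linalg.eigh(cov)     # Step 3: eigenvalues / eigenvectors
order = np.argsort(eig_vals)[::-1]           # Step 4: sort and keep the top k
W = eig_vecs[:, order][:, :2]

# Step 5: project the data onto the principal components.
# With samples as rows this is X @ W, i.e. the transpose of
# W^T @ X^T (the "transpose times transpose" form described above).
X_pca = X @ W

print(X_pca.shape)   # (100, 2): same samples, fewer dimensions
```

In practice you would usually reach for a library implementation such as scikit-learn's sklearn.decomposition.PCA; note that it centers the data for you but leaves standardization to you.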
