Principal component analysis (PCA) is a dimensionality-reduction method often used on large data sets: it transforms a large set of variables into a smaller one that still contains most of the information in the original set.
PCA simplifies the complexity in high-dimensional data while retaining trends and patterns. It does this by transforming the data into fewer dimensions, which act as summaries of features. High-dimensional data are very common in biology and arise when multiple features, such as the expression of many genes, are measured for each sample. This type of data presents several challenges that PCA mitigates, including computational expense and an increased error rate due to multiple-test correction when testing each feature for association with an outcome. PCA is an unsupervised learning method and, like clustering, it finds patterns without reference to prior knowledge about whether the samples come from different treatment groups or have phenotypic differences.
PCA reduces data by geometrically projecting them onto lower dimensions called principal components (PCs), with the goal of finding the best summary of the data using a limited number of PCs. The first PC is chosen to minimize the total squared distance between the data and their projection onto the PC. Because the total squared distance of the points from their centroid is fixed, minimizing this distance also maximizes the variance of the projected points. The second (and subsequent) PCs are selected similarly, with the additional requirement that they be uncorrelated with all previous PCs. PCA has been used in many fields, including the biological, physical, and engineering sciences.
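The link between minimizing distance and maximizing variance can be checked numerically. The following is a minimal sketch, assuming NumPy and synthetic two-dimensional data; the helper name `split_along_line` and the data are illustrative and not part of the original text. For each candidate line through the centroid, it splits the total squared distance into the part captured along the line and the part missed by it.

```python
import numpy as np

def split_along_line(X, direction):
    """Return (sum of squared projected coordinates, sum of squared distances to the line)."""
    Xc = X - X.mean(axis=0)                    # the candidate line passes through the centroid
    u = direction / np.linalg.norm(direction)  # unit vector along the line
    along = Xc @ u                             # signed coordinate of each point on the line
    residual = Xc - np.outer(along, u)         # vector from each projection back to its point
    return float(np.sum(along ** 2)), float(np.sum(residual ** 2))

# Synthetic, purely illustrative data with one elongated direction.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
total = float(np.sum((X - X.mean(axis=0)) ** 2))

# Scan candidate directions in one-degree steps.
angles = np.linspace(0.0, np.pi, 181)
parts = np.array([split_along_line(X, np.array([np.cos(a), np.sin(a)])) for a in angles])
captured, missed = parts[:, 0], parts[:, 1]

# Captured and missed parts always add up to the same total (Pythagoras),
# so the direction with the smallest missed distance has the largest captured variance.
best = int(np.argmin(missed))
print(np.allclose(captured + missed, total))
print(np.isclose(captured[best], captured.max()))
```

A grid scan is used here only to keep the illustration transparent; in practice the direction is obtained from a matrix decomposition rather than by search.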
The starting point of PCA is the matrix of correlation coefficients derived from the original data set. Strictly speaking, the rationale behind the method requires that the correlations be obtained from variables measured on some continuous scale.
PCA can be introduced through a geometrical projection analogy, which leads to the derivation of bilinear data models and focuses on scores, loadings, residuals, and data rank reduction.
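One common way to write such a bilinear model is as centered data = scores × loadings (transposed) + residuals, truncated to k components. The sketch below assumes NumPy and uses the SVD to obtain the factors; the function name `bilinear_model` and the symbols T, P, and E are illustrative choices rather than notation taken from the text.

```python
import numpy as np

def bilinear_model(X, k):
    """Rank-k bilinear approximation of the centered data: Xc = T @ P.T + E."""
    Xc = X - X.mean(axis=0)                       # mean-center each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :k] * s[:k]                          # scores: coordinates of samples on the first k PCs
    P = Vt[:k].T                                  # loadings: one orthonormal column per PC
    E = Xc - T @ P.T                              # residuals: what the rank-k model leaves out
    return T, P, E

# Illustrative usage on random data (not from the original text).
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 6))
T, P, E = bilinear_model(X, k=2)
print(T.shape, P.shape, E.shape)
```

Keeping only k columns is the rank-reduction step: the product of scores and loadings has rank at most k, and the residual matrix carries the variation assigned to the discarded components.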
Given a collection of points in two-, three-, or higher-dimensional space, a “best-fitting” line can be defined as one that minimizes the average squared distance from a point to the line. The next best-fitting line can be similarly chosen from directions perpendicular to the first. Repeating this process yields an orthogonal basis in which different individual dimensions of the data are uncorrelated. These basis vectors are called principal components, and the family of related procedures for finding them is called principal component analysis (PCA).
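The repeated "best line, then best perpendicular line" process can be sketched directly. The version below is an illustration under stated assumptions, not a production method: it assumes NumPy, finds each direction by power iteration on the covariance matrix, and then removes that direction's variance (deflation) so that the next search is confined to perpendicular directions. The names `leading_direction` and `greedy_components` are hypothetical, and the approach assumes the leading variances are well separated.

```python
import numpy as np

def leading_direction(C, iters=500, seed=0):
    """Power iteration: converges to the direction of largest variance under C."""
    v = np.random.default_rng(seed).normal(size=C.shape[0])
    for _ in range(iters):
        v = C @ v
        v /= np.linalg.norm(v)
    return v

def greedy_components(X, k):
    """Find k directions one at a time, each perpendicular to the ones before it."""
    Xc = X - X.mean(axis=0)
    C = Xc.T @ Xc / (len(Xc) - 1)          # sample covariance matrix
    components = []
    for _ in range(k):
        v = leading_direction(C)
        components.append(v)
        # Deflation: subtract the variance along v so later searches ignore it.
        C = C - (v @ C @ v) * np.outer(v, v)
    return np.array(components)            # one component per row

# Illustrative data: three independent sources with different spreads, then rotated.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
X = (rng.normal(size=(300, 3)) * np.array([5.0, 2.0, 0.5])) @ Q.T

W = greedy_components(X, 3)
coords = (X - X.mean(axis=0)) @ W.T        # data expressed in the new basis
print(np.round(W @ W.T, 6))                # close to the identity: orthogonal basis
print(np.round(np.cov(coords, rowvar=False), 6))  # close to diagonal: uncorrelated coordinates
```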
PCA is carried out either by singular value decomposition (SVD) of the data (design) matrix or by the following two steps:
- calculating the covariance (or correlation) matrix of the original data
- performing eigenvalue decomposition on that matrix
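Both routes can be compared side by side. The sketch below assumes NumPy and mean-centered data; the function names `pca_via_covariance` and `pca_via_svd` are illustrative. The two routes should agree on the component variances, and on the component directions up to an arbitrary sign flip per component.

```python
import numpy as np

def pca_via_covariance(X):
    """Two-step route: covariance matrix, then eigenvalue decomposition."""
    Xc = X - X.mean(axis=0)
    C = np.cov(Xc, rowvar=False)              # step 1: covariance matrix (N - 1 denominator)
    eigvals, eigvecs = np.linalg.eigh(C)      # step 2: eigh suits symmetric matrices
    order = np.argsort(eigvals)[::-1]         # eigh returns ascending order; PCA wants descending
    return eigvals[order], eigvecs[:, order]

def pca_via_svd(X):
    """Direct route: SVD of the centered data matrix."""
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    variances = s ** 2 / (len(Xc) - 1)        # squared singular values rescaled to variances
    return variances, Vt.T                    # columns are the principal directions

# Illustrative comparison on synthetic data.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4)) * np.array([4.0, 2.0, 1.0, 0.5])
var_eig, V_eig = pca_via_covariance(X)
var_svd, V_svd = pca_via_svd(X)
print(np.allclose(var_eig, var_svd))
print(np.allclose(np.abs(V_eig.T @ V_svd), np.eye(4), atol=1e-8))
```

The SVD route avoids forming the covariance matrix explicitly, which is one reason it is often preferred numerically; the two-step route is closer to the textbook description.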
Usually the original data are normalized before PCA is performed. Normalizing each attribute consists of mean centering: subtracting the variable’s measured mean from each data value so that its empirical mean (average) is zero. Some fields, in addition to centering the mean, also normalize each variable’s variance (to make it equal to 1). The results of a PCA are usually discussed in terms of component scores, sometimes called factor scores (the transformed variable values corresponding to a particular data point), and loadings (the weight by which each standardized original variable should be multiplied to get the component score).
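The normalization and the relationship between scores and loadings can be sketched as follows, assuming NumPy. The helper names `standardize` and `scores_and_loadings` are illustrative. Loadings are taken here in the sense used above, as the weights applied to the standardized variables; some texts instead rescale them by the square root of the component variance, which this sketch does not do.

```python
import numpy as np

def standardize(X):
    """Mean-center each variable and scale it to unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)               # sample standard deviation
    return (X - mean) / std

def scores_and_loadings(X):
    """PCA on the correlation matrix, returning scores, loadings, and component variances."""
    Z = standardize(X)
    R = np.corrcoef(X, rowvar=False)          # correlation matrix = covariance of standardized data
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]         # largest variance first
    loadings = eigvecs[:, order]              # column j holds the weights for component j
    scores = Z @ loadings                     # each score is a weighted sum of standardized variables
    return scores, loadings, eigvals[order]

# Illustrative data with variables on very different scales.
rng = np.random.default_rng(3)
X = rng.normal(size=(80, 3)) * np.array([1000.0, 1.0, 0.01]) + np.array([5.0, -2.0, 0.3])
scores, loadings, variances = scores_and_loadings(X)
print(scores.shape, loadings.shape)
print(np.isclose(variances.sum(), X.shape[1]))  # unit-variance variables: total variance = number of variables
```

Scaling to unit variance matters most when variables are measured in different units, as in the synthetic example, since otherwise the variable with the largest numeric spread would dominate the first component.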