Gene set enrichment analysis (GSEA), also known as functional enrichment analysis, is a method to identify classes of genes or proteins that are over-represented in a large set of genes or proteins and may be associated with disease phenotypes. The method uses statistical approaches to identify significantly enriched or depleted groups of genes. Transcriptomic and proteomic experiments often identify thousands of genes, which are used as input for the analysis.
Researchers performing high-throughput experiments that produce sets of genes usually want to retrieve a functional profile of that gene set in order to better understand the underlying biological processes. This can be done by comparing the input gene set to each of the bins (terms) in the Gene Ontology; a statistical test can be performed for each bin to determine whether it is enriched for the input genes.
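The per-term test described above is often framed as an upper-tail hypergeometric test, which is one common choice rather than the only one. The following is a minimal sketch under that assumption; the function name `hypergeom_enrichment_p` and all counts in the usage lines are hypothetical and chosen only for illustration.

```python
from math import comb

def hypergeom_enrichment_p(n_universe, n_term, n_input, n_overlap):
    """Upper-tail hypergeometric p-value, P(X >= n_overlap).

    n_universe: number of genes in the background (hypothetical universe)
    n_term:     number of background genes annotated to the term (bin)
    n_input:    number of genes in the input gene set
    n_overlap:  number of input genes annotated to the term
    """
    max_overlap = min(n_term, n_input)
    total = comb(n_universe, n_input)
    # Sum the probability of every overlap at least as large as the observed one.
    tail = sum(
        comb(n_term, k) * comb(n_universe - n_term, n_input - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total

# Illustrative numbers only: 500 input genes, a term with 200 annotated genes,
# a background of 20,000 genes, and 15 input genes falling in the term.
p_value = hypergeom_enrichment_p(
    n_universe=20000, n_term=200, n_input=500, n_overlap=15
)
print(p_value)
```

Because one such test is run per term, the resulting p-values would normally be corrected for multiple testing before interpretation.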
Gene set enrichment analysis uses a priori gene sets that have been grouped together by their involvement in the same biological pathway, or by proximal location on a chromosome. A database of these predefined sets is available at the Molecular Signatures Database (MSigDB). In GSEA, DNA microarray or, more recently, RNA-Seq experiments are still performed and compared between two cell categories, but instead of focusing on individual genes in a long list, the focus is put on a gene set.
In the method typically known as standard GSEA, the analytical process involves three steps.
1. Calculate the enrichment score (ES), which represents the degree to which the genes in the set are over-represented at either the top or bottom of the ranked list. This score is a Kolmogorov–Smirnov-like statistic.
2. Estimate the statistical significance of the ES. This calculation is done by a phenotype-based permutation test in order to produce a null distribution for the ES. The P value is determined by comparison to the null distribution. Calculating significance this way tests for the dependence of the gene set on the diagnostic/phenotypic labels.
3. Adjust for multiple hypothesis testing when a large number of gene sets are analyzed at one time. The enrichment scores for each set are normalized and a false discovery rate is calculated.
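The first two steps above can be sketched in a few lines of Python. This is an illustrative simplification, not the reference implementation: genes are ranked by a plain difference of class means (an assumed stand-in for the ranking metric), the running sum is unweighted, the P value is two-sided on the absolute ES, and the normalization and false discovery rate of step 3 are omitted. All function names and the toy expression values are hypothetical.

```python
import random
from statistics import mean

def rank_genes(expression, labels):
    """Order genes by difference of class means (label 1 minus label 0), descending.
    Assumption: a simple stand-in for the ranking metric."""
    def score(values):
        a = [v for v, lab in zip(values, labels) if lab == 1]
        b = [v for v, lab in zip(values, labels) if lab == 0]
        return mean(a) - mean(b)
    return sorted(expression, key=lambda g: score(expression[g]), reverse=True)

def enrichment_score(ranked_genes, gene_set):
    """Unweighted Kolmogorov-Smirnov-like running sum; returns the
    maximum deviation from zero (sign preserved)."""
    n_hits = sum(1 for g in ranked_genes if g in gene_set)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the ranked list without covering it")
    running, best = 0.0, 0.0
    for g in ranked_genes:
        # Step up on a set member, step down otherwise.
        running += 1.0 / n_hits if g in gene_set else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

def permutation_p_value(expression, labels, gene_set, n_perm=1000, seed=0):
    """Shuffle phenotype labels to build a null distribution for the ES."""
    rng = random.Random(seed)
    observed = enrichment_score(rank_genes(expression, labels), gene_set)
    shuffled = list(labels)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        null_es = enrichment_score(rank_genes(expression, shuffled), gene_set)
        if abs(null_es) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return observed, (extreme + 1) / (n_perm + 1)

# Toy data invented for illustration: 6 genes, 3 samples per phenotype.
labels = [1, 1, 1, 0, 0, 0]
expression = {
    "g1": [9.0, 8.5, 9.2, 5.0, 5.3, 4.9],
    "g2": [7.8, 8.1, 8.0, 6.0, 6.2, 5.9],
    "g3": [5.0, 5.2, 4.9, 5.1, 5.0, 5.2],
    "g4": [6.1, 5.9, 6.0, 6.0, 6.2, 5.8],
    "g5": [4.0, 4.2, 4.1, 6.9, 7.1, 7.0],
    "g6": [3.0, 3.1, 2.9, 3.0, 3.2, 2.8],
}
es, p = permutation_p_value(expression, labels, {"g1", "g2"}, n_perm=200)
print(es, p)
```

With only six samples there are few distinct label arrangements, so the P value from this toy example cannot become very small; real analyses rely on more samples and many permutations.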
GSEA is an algorithm that performs differential expression analysis at the level of gene sets. The input to GSEA consists of a collection of gene sets and microarray expression data with replicates for the two conditions to be compared. GSEA uses a permutation-based test built on Kolmogorov–Smirnov running sum statistics to determine which of the gene sets from the collection are differentially expressed between the two conditions. GSEA differs from differential gene expression analysis in that it may identify genes that are part of a differentially expressed set but that would not be identified as significantly differentially expressed on their own.
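For reference, the running sum statistic can be written out explicitly. The notation here is introduced for illustration and follows the weighted form commonly given for GSEA: the N genes are ranked as g_1, ..., g_N by a correlation r_j with the phenotype, S is a gene set with N_H members, and p is a weighting exponent.

```latex
\begin{align*}
P_{\mathrm{hit}}(S, i)  &= \sum_{\substack{g_j \in S \\ j \le i}} \frac{|r_j|^{p}}{N_R},
  \qquad N_R = \sum_{g_j \in S} |r_j|^{p} \\
P_{\mathrm{miss}}(S, i) &= \sum_{\substack{g_j \notin S \\ j \le i}} \frac{1}{N - N_H} \\
ES(S) &= P_{\mathrm{hit}}(S, i^{*}) - P_{\mathrm{miss}}(S, i^{*}),
  \qquad i^{*} = \arg\max_{i} \bigl| P_{\mathrm{hit}}(S, i) - P_{\mathrm{miss}}(S, i) \bigr|
\end{align*}
```

With p = 0 every member of S contributes equally (1/N_H per hit) and the statistic reduces to the standard Kolmogorov–Smirnov form; larger values of p give more weight to genes with stronger correlation to the phenotype.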