The problem of identifying genes that are differentially expressed between two conditions has received much attention from the statistical community and from data analysts in general. Most of the work has focused on designing appropriate test statistics and developing procedures to account for multiple comparisons.
A common approach to interpreting gene expression data is gene set enrichment analysis based on the functional annotation of the differentially expressed genes. This is useful for finding out if the differentially expressed genes are associated with a certain biological process or molecular function.
There are three key elements of the Gene Set Enrichment Analysis (GSEA) method:
- Calculation of an Enrichment Score. We calculate an enrichment score (ES) that reflects the degree to which a set S is overrepresented at the extremes (top or bottom) of the entire ranked list L. The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter genes not in S. The magnitude of the increment depends on the correlation of the gene with the phenotype. The enrichment score is the maximum deviation from zero encountered in the random walk; it corresponds to a weighted Kolmogorov–Smirnov-like statistic.
- Estimation of Significance Level of ES. We estimate the statistical significance (nominal P value) of the ES by using an empirical phenotype-based permutation test procedure that preserves the complex correlation structure of the gene expression data. Specifically, we permute the phenotype labels and recompute the ES of the gene set for the permuted data, which generates a null distribution for the ES.
- Adjustment for Multiple Hypothesis Testing. When an entire database of gene sets is evaluated, we adjust the estimated significance level to account for multiple hypothesis testing. We first normalize the ES for each gene set to account for the size of the set, producing a normalized enrichment score (NES). We then control the proportion of false positives by calculating the false discovery rate (FDR) corresponding to each NES. The FDR is the estimated probability that a set with a given NES represents a false positive finding.
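The first two elements above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the reference GSEA implementation: the function names, the toy expression values, the mean-difference ranking metric, and the "+1" correction in the p-value are all choices made for this example. The sketch follows the description above in comparing the observed ES with a null distribution built by permuting phenotype labels, restricted here to null scores with the same sign as the observed one.

```python
import random

def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Running-sum ES; scores[i] is the ranking metric of ranked_genes[i]."""
    hits = [g in gene_set for g in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")
    # A hit raises the running sum in proportion to |score|^p.
    weights = [abs(s) ** p if h else 0.0 for s, h in zip(scores, hits)]
    norm = sum(weights)
    if norm == 0:  # all hit scores are zero: fall back to equal steps (assumption)
        weights = [1.0 if h else 0.0 for h in hits]
        norm = float(n_hit)
    running = best = 0.0
    for w, h in zip(weights, hits):
        running += w / norm if h else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running  # maximum deviation from zero, sign preserved
    return best

def mean_difference(values, labels):
    """Toy ranking metric (assumption): mean of class 1 minus mean of class 0."""
    a = [v for v, l in zip(values, labels) if l == 1]
    b = [v for v, l in zip(values, labels) if l == 0]
    return sum(a) / len(a) - sum(b) / len(b)

def es_for_labels(expression, labels, gene_set):
    """Rank genes by the metric for the given labels, then compute the ES."""
    metric = {g: mean_difference(v, labels) for g, v in expression.items()}
    ranked = sorted(metric, key=metric.get, reverse=True)
    return enrichment_score(ranked, [metric[g] for g in ranked], gene_set)

def permutation_p_value(expression, labels, gene_set, n_perm=1000, seed=0):
    """Nominal p-value from phenotype-label permutations."""
    rng = random.Random(seed)
    observed = es_for_labels(expression, labels, gene_set)
    shuffled = list(labels)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute phenotype labels, keep expression intact
        null.append(es_for_labels(expression, shuffled, gene_set))
    same_sign = [x for x in null if (x >= 0) == (observed >= 0)]
    extreme = sum(abs(x) >= abs(observed) for x in same_sign)
    # The +1 terms avoid reporting p = 0 (a common convention, assumed here).
    return observed, (extreme + 1) / (len(same_sign) + 1)

# Hypothetical toy data: six genes measured in three samples per phenotype.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_expression = {
    "gA": [9, 8, 9, 1, 2, 1],
    "gB": [7, 8, 7, 2, 1, 2],
    "gC": [5, 4, 5, 5, 4, 6],
    "gD": [3, 4, 3, 4, 3, 4],
    "gE": [2, 1, 2, 8, 9, 8],
    "gF": [1, 2, 1, 9, 8, 9],
}
es, p_nominal = permutation_p_value(toy_expression, toy_labels, {"gA", "gB"})
print(f"ES = {es:.3f}, nominal p = {p_nominal:.3f}")
```

Permuting the labels rather than the genes keeps each gene's expression profile intact, which is how the gene-gene correlation structure mentioned above is preserved. Normalizing the ES to an NES and estimating the FDR across many gene sets are left out of this sketch.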
Using the GO enrichment analysis tools
1. Paste or type the names of the genes to be analyzed, one per row or separated by commas. The tool accepts both gene names specific to a model organism database (MOD) and UniProt IDs (e.g. Rad54 or P38086).
2. Select the GO aspect for your analysis (biological process is the default).
3. Select the species your genes come from (Homo sapiens is the default).
4. Press the submit button. Note that you will be able to upload a REFERENCE (aka “background”) LIST at a later step.
5. You will be redirected to the results on the PANTHER website. These results are based on enrichment relative to the set of all protein-coding genes in the genome selected in step 3.
6. Add a custom REFERENCE LIST and re-run the analysis. Press the “change” button on the “Reference list” line of the PANTHER analysis summary at the top of the results page, upload the reference list file, and press the “Launch analysis” button to re-run the analysis.
The results page displays a table of the significant shared GO terms (or parents of GO terms) that describe the set of genes entered on the previous page. For each term, the table gives the background frequency, the sample frequency, the expected number of genes, an indication of over- or underrepresentation, and the p-value. The results page also displays all the criteria used in the analysis. Any unresolved gene names are listed at the top of the table.
Background frequency is the number of genes annotated to a GO term in the entire background set, while sample frequency is the number of genes annotated to that GO term in the input list. The symbols + and – indicate over- or underrepresentation of a term, respectively.
The p-value is the probability of seeing at least x genes, out of the n genes in the list, annotated to a particular GO term, given the proportion of genes in the whole genome that are annotated to that GO term.
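To make these quantities concrete, the sketch below computes the expected count, the over/under sign, and an upper-tail p-value for a single GO term. It assumes a hypergeometric model (drawing the n list genes without replacement from the background), which is one common way to formalize the definition above; the test the tool actually runs may differ. The function name and the numbers in the example are hypothetical.

```python
from math import comb

def go_term_enrichment(background_size, background_hits, list_size, list_hits):
    """Summarize enrichment of one GO term in a gene list.

    background_size: genes in the background set
    background_hits: background genes annotated to the term (background frequency)
    list_size:       genes in the input list (n)
    list_hits:       input genes annotated to the term (sample frequency, x)
    """
    # Count expected in the list if it were a random draw from the background.
    expected = list_size * background_hits / background_size
    sign = "+" if list_hits > expected else "-"

    # P(X >= x) under the hypergeometric distribution (assumed model).
    total = comb(background_size, list_size)
    upper = min(background_hits, list_size)
    tail = sum(
        comb(background_hits, i) * comb(background_size - background_hits, list_size - i)
        for i in range(list_hits, upper + 1)
    )
    return expected, sign, tail / total

# Hypothetical example: 10 background genes, 4 annotated; list of 3 genes, 2 annotated.
expected, sign, p_value = go_term_enrichment(10, 4, 3, 2)
print(f"expected = {expected:.2f}, sign = {sign}, p = {p_value:.4f}")
```

This p-value only covers the overrepresentation case described above; for a term marked as underrepresented, the analogous quantity would be the lower tail, the probability of seeing at most x annotated genes. When many GO terms are tested at once, the raw p-values also need a multiple-testing correction.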