A pan-genome is defined as the set of all unique gene families found in one or more strains of a prokaryotic species. Studies of pan-genomes have become popular due to the easy access to whole-genome sequence data for prokaryotes.
In the fields of molecular biology and genetics, a pan-genome (or supragenome) is the entire set of genes for all strains within a clade. The pan-genome comprises the core genome, containing genes present in all strains within the clade; the accessory genome, containing ‘dispensable’ genes present in a subset of the strains; and strain-specific genes. The study of the pan-genome is called pangenomics.
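The partition described above can be sketched with plain set operations. This is a minimal illustration, not the method of any particular tool: the function name `partition_pangenome`, the toy strain names and the gene assignments are all hypothetical, and the sketch assumes that "accessory" means present in at least two strains but not in all of them.

```python
from typing import Dict, Set


def partition_pangenome(strain_genes: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Split the gene families of a clade into core, accessory and strain-specific sets."""
    if not strain_genes:
        raise ValueError("at least one strain is required")
    n_strains = len(strain_genes)

    # Count how many strains carry each gene family.
    counts: Dict[str, int] = {}
    for genes in strain_genes.values():
        for gene in genes:
            counts[gene] = counts.get(gene, 0) + 1

    core = {g for g, c in counts.items() if c == n_strains}
    # With a single strain every gene is core, so nothing is strain-specific.
    strain_specific = {g for g, c in counts.items() if c == 1} - core
    # Assumption: accessory = everything that is neither core nor strain-specific.
    accessory = set(counts) - core - strain_specific
    return {
        "pan": set(counts),
        "core": core,
        "accessory": accessory,
        "strain_specific": strain_specific,
    }


# Hypothetical toy data: gene assignments are invented for illustration only.
toy_strains = {
    "strain_A": {"dnaA", "gyrB", "recA", "stx1"},
    "strain_B": {"dnaA", "gyrB", "recA", "fimH"},
    "strain_C": {"dnaA", "gyrB", "fimH", "blaTEM"},
}

parts = partition_pangenome(toy_strains)
for name in ("core", "accessory", "strain_specific"):
    print(name, sorted(parts[name]))
```

In practice the input sets would come from a gene-family clustering step, since deciding which genes in different strains belong to the same family is the harder part of the analysis.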
Some species have open (or extensive) pan-genomes, while others have closed pan-genomes. For species with a closed pan-genome, very few genes are added per additional sequenced genome once many strains have been sequenced, and the size of the full pan-genome can be predicted theoretically. Species with an open pan-genome gain enough genes with each additional sequenced genome that the size of the full pan-genome cannot be predicted. Population size and niche versatility have been suggested as the most influential factors in determining pan-genome size. The pan-genome can also be broken down into a “core pan-genome” that contains genes present in all individuals, a “shell pan-genome” that contains genes present in two or more strains but not in all of them, and a “cloud pan-genome” that contains genes found in only a single strain.
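The open versus closed distinction rests on how many genes each additional genome contributes. The sketch below counts new genes, pan-genome size and core size as genomes are added one at a time. The function name `accumulation` and the toy gene sets are hypothetical, and the result depends on the order in which genomes are added, so a real analysis would typically average over many random orderings.

```python
from typing import List, Sequence, Set, Tuple


def accumulation(genomes: Sequence[Set[str]]) -> List[Tuple[int, int, int]]:
    """Return (new genes, pan-genome size, core size) after each added genome."""
    pan: Set[str] = set()
    core: Set[str] = set()
    rows: List[Tuple[int, int, int]] = []
    for i, genes in enumerate(genomes):
        new_genes = genes - pan          # genes not seen in any earlier genome
        pan |= genes                     # the pan-genome can only grow
        core = set(genes) if i == 0 else core & genes  # the core can only shrink
        rows.append((len(new_genes), len(pan), len(core)))
    return rows


# Hypothetical toy genomes, each given as a set of gene-family labels.
toy_genomes = [
    {"a", "b", "c", "d"},
    {"a", "b", "c", "e"},
    {"a", "b", "d", "e"},
    {"a", "b", "c", "d"},
]

curve = accumulation(toy_genomes)
for n, (new, pan_size, core_size) in enumerate(curve, start=1):
    print(f"genome {n}: new={new} pan={pan_size} core={core_size}")
```

In this toy input the number of new genes drops to zero after the second genome, which is the behavior expected of a closed pan-genome. For an open pan-genome the count of new genes would stay clearly above zero as more genomes are added.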
Pan-genomes were originally constructed for species of bacteria and archaea, but more recently eukaryotic pan-genomes have been developed, particularly for plant species. Plant studies have shown that pan-genome dynamics are linked to transposable elements. The pan-genome concept is most significant in an evolutionary context, especially with relevance to metagenomics, but it is also used in broader genomics contexts.
As sequencing costs have dropped, thousands of new genomes have been sequenced, and scientists have come to realize that a single reference genome is inadequate for many purposes. By sampling a diverse set of individuals, one can begin to assemble a pan-genome: a collection of all the DNA sequences that occur in a species. As scientists continue to survey and catalogue the genomic variation across human populations and begin to assemble a human pan-genome, these efforts will increase our power to connect variation to human diversity, disease and beyond.
Pan-genome analysis is used for:
- characterizing strains by their individual gene set (e.g., detecting virulence factors present in only one particular strain of a species)
- developing vaccines against pathogenic strains
- detecting, identifying and tracking new strains in metagenomic samples based on their individual subset of the species pan-genome
- studying the evolutionary impact of horizontal gene transfer
- exploring strain diversity in environmental population genomics studies
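As a rough illustration of identifying a strain from its gene subset, the sketch below compares the gene families detected in a sample against reference strains using Jaccard similarity. This is an assumed, simplified approach and not the algorithm of any specific tool; the names `jaccard` and `closest_strain`, the strain labels and the gene sets are hypothetical.

```python
from typing import Dict, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity: shared genes divided by all genes in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def closest_strain(
    sample_genes: Set[str], reference_strains: Dict[str, Set[str]]
) -> Tuple[str, float]:
    """Return the reference strain whose gene set is most similar to the sample."""
    if not reference_strains:
        raise ValueError("at least one reference strain is required")
    best = max(
        reference_strains,
        key=lambda name: jaccard(sample_genes, reference_strains[name]),
    )
    return best, jaccard(sample_genes, reference_strains[best])


# Hypothetical reference gene sets and sample, invented for illustration.
references = {
    "strain_A": {"dnaA", "gyrB", "stx1"},
    "strain_B": {"dnaA", "gyrB", "fimH"},
}
sample = {"dnaA", "gyrB", "stx1", "recA"}

name, score = closest_strain(sample, references)
print(name, round(score, 2))
```

Real metagenomic samples contain mixtures of organisms and incomplete gene coverage, so a production method would need to account for both before comparing gene sets this directly.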
Software tools for pan-genome analysis include:
- Roary: fast tool for extracting complete pan-genomes, core gene sets, or differences between reference genomes
- panX: pan-genome analysis and web-based visualization
- PanOCT: considers both gene homology and conserved gene neighborhoods
- OrthoMCL: clustering of orthologous genes, used for example to extract core genomes
- LS-BSR: rapid comparison of the genetic content of large numbers of genomes
- PanPhlAn: pan-genome-based detection of the gene composition of strains in environmental WGS samples