Proteins generally consist of one or more functional regions, commonly known as domains. The presence of different domains in varying combinations in different proteins gives rise to the diverse repertoire of proteins found in nature. Identifying the domains present in a protein can provide insights into the function of that protein.
The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and hidden Markov models (HMMs).
Each Pfam family, often known as a Pfam-A entry, consists of a curated seed alignment containing a small set of representative members of the family, profile hidden Markov models (profile HMMs) built from the seed alignment, and an automatically generated full alignment, which contains all detectable protein sequences belonging to the family.
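The step from a seed alignment to a position-specific model can be illustrated with a deliberately simplified sketch. It is not a profile HMM and not how Pfam builds its models: it has no insert or delete states, it assumes an ungapped alignment, a uniform background distribution and a fixed pseudocount, and the seed sequences are made up for illustration. It only shows the idea of deriving per-column scores from a small set of representative sequences.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption: every amino acid is equally likely in the background model.
BACKGROUND = 1.0 / len(AMINO_ACIDS)


def build_profile(seed_alignment, pseudocount=1.0):
    """Return one dict of log-odds scores (in bits) per alignment column."""
    length = len(seed_alignment[0])
    if any(len(seq) != length for seq in seed_alignment):
        raise ValueError("all seed sequences must have the same aligned length")
    total = len(seed_alignment) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in seed_alignment)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / BACKGROUND)
            for aa in AMINO_ACIDS
        })
    return profile


def score(profile, sequence):
    """Sum the per-column log-odds for an ungapped sequence of profile length."""
    if len(sequence) != len(profile):
        raise ValueError("sequence length must equal profile length")
    return sum(column[aa] for column, aa in zip(profile, sequence))


# Hypothetical seed alignment: columns 0 and 1 are fully conserved.
seed = ["GKSTL", "GKTAL", "GKSSV", "GKTGI"]
profile = build_profile(seed)
print(round(profile[0]["G"], 2), round(profile[3]["T"], 2))
print(score(profile, "GKSTL") > score(profile, "AKSTL"))
```

In this toy model a match in a fully conserved column contributes more to the total score than a match in a variable column, and a mismatch in a conserved column is penalised.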
Pfam entries are classified in one of six ways:
Regions that are conserved:
Related Pfam entries are grouped together into clans; the relationship may be defined by similarity of sequence, structure or profile HMM.
Pfam aims to provide a complete and accurate classification of protein families and domains. Originally, the rationale behind creating the database was to have a semi-automated method of curating information on known protein families to improve the efficiency of annotating genomes. The Pfam classification of protein families has been widely adopted by biologists because of its wide coverage of proteins and sensible naming conventions.
Pfam is used by experimental biologists researching specific proteins, by structural biologists to identify new targets for structure determination, by computational biologists to organise sequences and by evolutionary biologists tracing the origins of proteins.
The Pfam website allows users to submit protein or DNA sequences to search for matches to families in the database. If DNA is submitted, a six-frame translation is performed and each frame is then searched. Rather than performing a typical BLAST search, Pfam uses profile hidden Markov models, which give greater weight to matches at conserved sites. This allows better remote homology detection and makes the models more suitable for annotating genomes of organisms with no well-annotated close relatives. By contrast, the protein family databases Prints and Blocks describe each family using a set of short ungapped blocks of aligned residues.
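The six-frame translation mentioned above can be sketched as follows. This is a minimal illustration and not the code used by the Pfam website: it assumes the standard genetic code, marks stop codons with `*`, reports codons containing ambiguous bases as `X`, and ignores a trailing partial codon. The function names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with amino acids listed in TCAG order for each codon position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate from the first base; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
for label, protein in sorted(frames.items()):
    print(label, protein)
```

Each of the six resulting protein strings would then be searched against the family models separately.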
Pfam has also been used in the creation of other resources such as iPfam, which catalogs domain-domain interactions within and between proteins, based on information in structure databases and mapping of Pfam domains onto these structures.
- View a description of the family
- View protein domain architectures
- Examine species distribution
- Follow links to other databases
- View known protein structures
- Search a protein or DNA sequence against the Pfam models
- Browse Pfam families and clans
- Retrieve text annotation about any given family or entry
- View multiple sequence alignments of a family or clan
- View relationships between families in a clan
- See protein structure information in the context of a family
- View families according to their taxonomic spread
- Search the database by keywords
Pfam data are available in a variety of formats, which include flat files and relational table dumps, both of which can be downloaded from the FTP site.
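As an illustration of working with the flat files, the sketch below reads alignment records in Stockholm format, the format in which Pfam alignment flat files are distributed. It is a minimal reader written under assumptions, not an official parser: it collects only `#=GF` annotation lines and sequence lines, skips other markup, and the sample record (including the accession `PF99999.1`) is made up for illustration.

```python
# Made-up record for illustration; not taken from a real Pfam release.
SAMPLE = """\
# STOCKHOLM 1.0
#=GF ID   Example_dom
#=GF AC   PF99999.1
#=GF DE   Hypothetical example domain
#=GF TP   Domain
seqA/1-5  GKSTL
seqB/1-5  GKTAL
//
"""


def parse_stockholm(text):
    """Return a list of entries, each with 'annotations' and 'sequences' dicts."""
    entries = []
    current = None
    for raw in text.splitlines():
        line = raw.rstrip()
        if line.startswith("# STOCKHOLM"):
            current = {"annotations": {}, "sequences": {}}
        elif line == "//":
            # End-of-record marker.
            if current is not None:
                entries.append(current)
            current = None
        elif current is None or not line:
            continue
        elif line.startswith("#=GF "):
            parts = line.split(None, 2)
            if len(parts) == 3:
                _, tag, value = parts
                # Repeated tags are joined into a single string.
                previous = current["annotations"].get(tag)
                current["annotations"][tag] = (
                    value if previous is None else previous + " " + value
                )
        elif line.startswith("#"):
            continue  # other markup lines (#=GS, #=GR, #=GC) are skipped here
        else:
            parts = line.split(None, 1)
            if len(parts) == 2:
                name, residues = parts
                # Interleaved alignments repeat names; concatenate their blocks.
                current["sequences"][name] = (
                    current["sequences"].get(name, "") + residues.strip()
                )
    return entries


entries = parse_stockholm(SAMPLE)
for entry in entries:
    print(entry["annotations"]["AC"], entry["annotations"]["TP"], len(entry["sequences"]))
```

For real downloads, an established parser such as the one in Biopython is likely a better choice than a hand-written reader.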