Biopython is a set of freely available tools for biological computation written in Python by an international team of developers.

It is a distributed collaborative effort to develop Python libraries and applications which address the needs of current and future work in bioinformatics. The source code is made available under the Biopython License, which is extremely liberal and compatible with almost every license in the world.

The Biopython project is an open-source collection of non-commercial Python tools for computational biology and bioinformatics, created by an international association of developers. It contains classes to represent biological sequences and sequence annotations, and it is able to read and write to a variety of file formats. It also allows for a programmatic means of accessing online databases of biological information, such as those at NCBI. Separate modules extend Biopython’s capabilities to sequence alignment, protein structure, population genetics, phylogenetics, sequence motifs, and machine learning. Biopython is one of a number of Bio* projects designed to reduce code duplication in computational biology.

The Biopython Project is an international association of developers of freely available Python tools for computational molecular biology. Python is an object-oriented, interpreted, flexible language that is becoming increasingly popular for scientific computing. It is easy to learn, has a very clear syntax and can easily be extended with modules written in C, C++ or FORTRAN.

The Biopython web site provides an online resource for modules, scripts, and web links for developers of Python-based software for bioinformatics use and research. Biopython’s features include parsers for various bioinformatics file formats (BLAST, Clustalw, FASTA, GenBank, …), access to online services (NCBI, ExPASy, …), interfaces to common and not-so-common programs (Clustalw, DSSP, MSMS, …), a standard sequence class, various clustering modules, a KD tree data structure, and documentation.

Basically, we just like to program in Python and want to make it as easy as possible to use Python for bioinformatics by creating high-quality, reusable modules and scripts.

The main Biopython releases have lots of functionality, including:

  • The ability to parse bioinformatics files into data structures that Python can work with, including support for the following formats:
    • Blast output – both from standalone and WWW Blast
    • Clustalw
    • FASTA
    • GenBank
    • PubMed and Medline
    • ExPASy files, like Enzyme and Prosite
    • SCOP, including ‘dom’ and ‘lin’ files
    • UniGene
    • SwissProt
  • Files in the supported formats can be iterated over record by record or indexed and accessed via a Dictionary interface.
  • Code to deal with popular on-line bioinformatics destinations such as:
    • NCBI – Blast, Entrez and PubMed services
    • ExPASy – Swiss-Prot and Prosite entries, as well as Prosite searches
  • Interfaces to common bioinformatics programs such as:
    • Standalone Blast from NCBI
    • Clustalw alignment program
    • EMBOSS command line tools
  • A standard sequence class that deals with sequences, ids on sequences, and sequence features.
  • Tools for performing common operations on sequences, such as translation, transcription and weight calculations.
  • Code to perform classification of data using k Nearest Neighbors, Naive Bayes or Support Vector Machines.
  • Code for dealing with alignments, including a standard way to create and deal with substitution matrices.
  • Code making it easy to split up parallelizable tasks into separate processes.
  • GUI-based programs to do basic sequence manipulations, translations, BLASTing, etc.
  • Extensive documentation and help with using the modules, including on-line wiki documentation, the web site, and the mailing list.
  • Integration with BioSQL, a sequence database schema also supported by the BioPerl and BioJava projects.
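
The record-by-record iteration and dictionary-style access mentioned in the list above can be sketched with Bio.SeqIO. This is a minimal illustration, assuming Biopython is installed; the two FASTA records and the helper names read_records and index_records are made up for the example and are not part of Biopython.

```python
from io import StringIO

from Bio import SeqIO

# Two made-up FASTA records kept in memory so that no input file is needed.
FASTA_TEXT = """>seq1 hypothetical example record
ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG
>seq2 another hypothetical record
ATGAAATTTGGGCCC
"""


def read_records(text):
    """Parse FASTA text and return a list of SeqRecord objects."""
    return list(SeqIO.parse(StringIO(text), "fasta"))


def index_records(text):
    """Parse FASTA text into a dictionary keyed by record id."""
    return SeqIO.to_dict(SeqIO.parse(StringIO(text), "fasta"))


# Iterate over the parsed records one by one.
records = read_records(FASTA_TEXT)
for record in records:
    print(record.id, len(record.seq))

# Look a record up by its identifier.
by_id = index_records(FASTA_TEXT)
print(by_id["seq2"].seq)
```

With a real file, the same SeqIO.parse call accepts a file name in place of the StringIO handle.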

Biopython is the largest and most popular bioinformatics package for Python. It contains a number of different sub-modules for common bioinformatics tasks. It was originally developed by Chapman and Chang and is written mainly in Python. It also contains C code to speed up the computationally intensive parts of the software. It runs on Windows, Linux, Mac OS X, etc.

Biopython is a collection of Python modules that provide functions for DNA, RNA and protein sequence operations, such as reverse complementing a DNA string or finding motifs in protein sequences. It provides many parsers for reading major sequence databases and file formats such as GenBank, SwissProt and FASTA, as well as wrappers and interfaces for running or accessing other popular bioinformatics tools and services, such as NCBI BLASTN and Entrez, from within the Python environment. It has sibling projects such as BioPerl, BioJava and BioRuby.
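
As a rough illustration of the sequence operations described above, the following sketch uses the Seq class from Bio.Seq. It assumes Biopython is installed; the DNA string and the "VM" motif are arbitrary examples chosen for the illustration.

```python
from Bio.Seq import Seq

# A short made-up coding sequence (eight codons, ending in a stop codon).
dna = Seq("ATGGCCATTGTAATGGGCCGCTGA")

# Reverse complement of the DNA strand.
rev = dna.reverse_complement()

# Transcription replaces T with U to give the RNA sequence.
rna = dna.transcribe()

# Translation with the standard genetic code, stopping at the first stop codon.
protein = dna.translate(to_stop=True)

# A very simple motif search: position of the first exact match, or -1 if absent.
motif_position = protein.find("VM")

print(rev)
print(rna)
print(protein)
print(motif_position)
```

Seq objects behave much like Python strings, so slicing, len() and str() work as expected.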


Biopython is portable and clear, and has an easy-to-learn syntax. Some of its salient features are listed below:

  • Interpreted, interactive and object-oriented.
  • Supports FASTA, PDB, GenBank, Blast, SCOP, PubMed/Medline, ExPASy-related formats.
  • Options for reading, writing and converting between sequence formats.
  • Tools to manage protein structures.
  • BioSQL: a standard set of SQL tables for storing sequences plus features and annotations.
  • Access to online services and databases, including NCBI services (Blast, Entrez, PubMed) and ExPASy services (SwissProt, Prosite).
  • Access to local services, including Blast, Clustalw, EMBOSS.
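
Working with several sequence formats, as listed above, often comes down to converting one format into another. The sketch below converts FASTA into a simple tab-separated format with Bio.SeqIO.convert. It assumes Biopython is installed; the input records and the helper name fasta_to_tab are hypothetical.

```python
from io import StringIO

from Bio import SeqIO

# Made-up FASTA input held in memory for the example.
FASTA_TEXT = """>seq1 hypothetical example record
ATGGCCATTGTAATGGGCCGCTGA
>seq2 another hypothetical record
ATGAAATTTGGGCCC
"""


def fasta_to_tab(fasta_text):
    """Convert FASTA text to tab-separated 'id<TAB>sequence' lines."""
    out_handle = StringIO()
    # SeqIO.convert returns the number of records it converted.
    count = SeqIO.convert(StringIO(fasta_text), "fasta", out_handle, "tab")
    return count, out_handle.getvalue()


count, tab_text = fasta_to_tab(FASTA_TEXT)
print(count)
print(tab_text, end="")
```

The same call works with file names instead of in-memory handles, and with other format names supported by Bio.SeqIO.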


The goal of Biopython is to provide simple, standard and extensive access to bioinformatics through the Python language. The specific goals of Biopython are listed below:

  • Providing standardized access to bioinformatics resources.
  • High-quality, reusable modules and scripts.
  • Fast array manipulation that can be used in Cluster code, PDB, NaiveBayes and Markov Model.
  • Genomic data analysis.


Biopython requires very little code and comes with the following advantages:

  • Provides microarray data type used in clustering.
  • Reads and writes TreeView-type files.
  • Supports structure data used for PDB parsing, representation and analysis.
  • Supports journal data used in Medline applications.
  • Supports the BioSQL database schema, a standard that is widely used among bioinformatics projects.
  • Supports parser development by providing modules to parse a bioinformatics file into a format-specific record object or a generic class of sequence plus features.
  • Clear, cookbook-style documentation.

Modules for Biopython

  • Bio.File module
  • Bio.SeqRecord module
  • Bio.bgzf module
  • Bio.kNN module
  • Bio.Index module
  • Bio.pairwise2 module
  • Bio.LogisticRegression module
  • Bio.MarkovModel module
  • Bio.MaxEntropy module
  • Bio.NaiveBayes module
  • Bio.Seq module
  • Bio.SeqFeature module
  • Bio.triefind module
  • Bio.Phylo module
  • Bio.PDB module
  • Bio.PopGen module
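
To give a feel for how some of these modules fit together, here is a small sketch that combines Bio.Seq, Bio.SeqRecord and Bio.SeqFeature. It assumes Biopython is installed; the sequence, identifiers and feature coordinates are invented purely for illustration.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord

# A SeqRecord wraps a Seq together with an identifier and a description.
record = SeqRecord(
    Seq("ATGGCCATTGTAATGGGCCGCTGA"),
    id="demo1",
    name="demo",
    description="hypothetical record for illustration",
)

# A SeqFeature marks a region of the sequence; positions are 0-based and
# the end position is exclusive, as with Python slices.
feature = SeqFeature(FeatureLocation(3, 12, strand=1), type="misc_feature")
record.features.append(feature)

# Extract the part of the sequence that the feature covers.
fragment = feature.extract(record.seq)

print(record.id, len(record))
print(feature.type, fragment)
```

Records parsed from annotated formats such as GenBank carry their features in the same record.features list.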
