Information Retrieval Systems

Information retrieval (IR) is the science and practice of identification and efficient use of recorded media. Although medical informatics has traditionally concentrated on the retrieval of text from the biomedical literature, the domain over which IR can be applied effectively has broadened considerably with the arrival of multimedia publishing and vast storehouses of chemical structures, cartographic materials, gene and protein sequences, video clippings, and a wide range of other digital media of relevance to biomedical education, research, and patient care.

As mentioned above, a major goal in developing databases is to provide efficient and user-friendly access to the stored data. There are a number of retrieval systems for biological data. Performing complex queries in a database often requires Boolean operators, which join a series of keywords using the logical terms AND, OR, and NOT to indicate the relationships between them. AND means that a result must contain both words; OR means that a result may contain either word or both; NOT excludes results containing the word that follows it. In addition, parentheses can be used to group terms when multiple words and relationships are involved, so that the computer knows which part of the search to execute first: items within parentheses are evaluated first. Quotes can be used to specify an exact phrase. Most search engines of public biological databases use some form of this Boolean logic.
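The Boolean operators described above map directly onto set operations. The following is a minimal sketch, assuming a tiny invented collection (`DOCS`) and a hypothetical helper `matching`; it is not how any particular database implements its search, only an illustration of the logic.

```python
# Toy collection: document id -> text (invented for illustration only).
DOCS = {
    1: "human insulin gene sequence",
    2: "mouse insulin receptor protein",
    3: "human hemoglobin protein structure",
    4: "insulin like growth factor in human and mouse",
}


def matching(term):
    """Return the ids of documents containing the word or exact phrase."""
    # Padding with spaces makes the match respect word boundaries.
    return {doc_id for doc_id, text in DOCS.items() if f" {term} " in f" {text} "}


human = matching("human")
insulin = matching("insulin")
mouse = matching("mouse")

both = human & insulin                    # human AND insulin
either = human | mouse                    # human OR mouse
without = insulin - mouse                 # insulin NOT mouse
grouped = (human | mouse) & insulin       # (human OR mouse) AND insulin
regrouped = human | (mouse & insulin)     # human OR (mouse AND insulin)
phrase = matching("insulin receptor")     # "insulin receptor" as a quoted phrase

print(both, either, without, grouped, regrouped, phrase)
```

Note that `grouped` and `regrouped` use the same three keywords but return different document sets, which is why parentheses matter when several operators are combined.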

The main data retrieval systems of particular application to molecular biology include:

  • Sequence Retrieval System (SRS)
  • Entrez

These systems allow text searching of multiple molecular biology databases and provide links to relevant information for entries that match the search criteria.

An information retrieval system is made up of two components: the indexing system and the query system. The indexing system analyzes the documents downloaded from the Web and creates the indexes against which search queries are run. The query system is the search engine’s visible interface, that is, the part with which users interact.

In essence, a search engine answers a query by consulting its internal indexes rather than by scanning the documents themselves.
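The division between an indexing system and a query system can be sketched with a simple inverted index. This is an illustrative sketch only: the function names `build_index` and `search`, the whitespace tokenization, and the sample documents are assumptions made for the example, not a description of any real search engine.

```python
from collections import defaultdict


def build_index(documents):
    """Indexing system: map each lowercase word to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Query system: return the ids of documents containing every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Invented sample documents.
documents = {
    "d1": "Insulin gene in human",
    "d2": "Insulin receptor in mouse",
    "d3": "Hemoglobin structure",
}

index = build_index(documents)
print(search(index, "insulin human"))
print(search(index, "insulin"))
print(search(index, "kinase"))
```

The index is built once, ahead of time; each query then touches only the index entries for its own words, which is what makes lookups fast on large collections.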

Describing and evaluating an IR system involves the following elements:

  • Indexing system: indexing and searching methods and procedures (an indexing system can be human or automated)
  • Collection of documents: text, image or multimedia documents, or document surrogates (for example bibliographical records)
  • Defined set of queries: which are input into the system, with or without the involvement of a human
  • Evaluation criteria: specified measures by which each system is evaluated, for example ‘precision’ and ‘recall’ as measures of retrieval performance. Recall is the proportion of all relevant documents in the collection that are retrieved in response to the query. Precision is the proportion of the documents retrieved in response to the query that are relevant.
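The two evaluation measures can be computed directly from the set of retrieved documents and the set of relevant documents. The sketch below assumes document ids are available for both sets; the function name `precision_recall` and the sample ids are invented for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant retrieved documents / all retrieved documents
    recall    = relevant retrieved documents / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Invented example: 4 documents retrieved, 5 relevant in the collection, 2 in common.
retrieved = {1, 2, 3, 4}
relevant = {2, 4, 5, 6, 7}

precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)  # 0.5 0.4
```

Here half of what was retrieved is relevant (precision 0.5), but only two of the five relevant documents were found (recall 0.4). Broadening a query with OR tends to raise recall at the expense of precision, while narrowing it with AND or NOT tends to do the opposite.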

Search Engines

1. Entrez (NCBI)

Entrez is a retrieval system for searching several linked databases. It provides access to:

  • PubMed: biomedical literature
  • GenBank: nucleotide sequence database
  • Protein: protein sequence database
  • Structure: three-dimensional macromolecular structures
  • Genome: complete genome assemblies
  • OMIM: Online Mendelian Inheritance in Man
  • Taxonomy: Organisms in GenBank
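Entrez can also be queried programmatically. The sketch below is not part of the original discussion: it assumes NCBI's E-utilities `esearch` endpoint and its `db`, `term`, `retmax` and `retmode` parameters, and it only builds the request URL without sending it. The helper name `build_esearch_url` and the example query are invented for illustration.

```python
from urllib.parse import urlencode

# Assumed endpoint: the ESearch service of NCBI's E-utilities.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db, term, retmax=20):
    """Build (but do not send) an ESearch request URL for an Entrez database."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# A Boolean query with a quoted phrase and a parenthesized group.
term = '"information retrieval" AND (protein OR gene) NOT review'
url = build_esearch_url("pubmed", term)
print(url)
```

Changing the `db` argument directs the same Boolean query to a different Entrez database; the exact database names accepted should be checked against NCBI's documentation.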

2. SRS (EBI and DDBJ)

SRS is a data retrieval system that integrates heterogeneous databanks in molecular biology and genome analysis. Dozens of servers worldwide currently provide access to over 300 different databanks via the World Wide Web.

In summary, the principal tools for retrieving and submitting sequences are:

  • Sequence retrieval tools: Entrez from NCBI and SRS (Sequence Retrieval System) from EBI
  • Sequence submission tools: Sequin and BankIt from NCBI and WebIn from EBI

