NATIONAL CENTER FOR BIOTECHNOLOGY INFORMATION: AN OVERVIEW

(September 2004)

It is widely acknowledged that scientific information is being generated at an exponentially increasing rate. One recent molecular biology endeavor is of particular public interest: the Human Genome Project (HGP), which sequenced and mapped the complete human genome. Though the HGP was completed successfully, the work it began is far from over. The structure, function, and molecular mechanisms of all the genetic elements comprising the human genome have yet to be discovered. Bioinformatics, which can be defined as the application of computing tools to biological problems, is one approach being used in this area. The Internet provides an accessible and efficient platform for hosting bioinformatics tools and data. Many scientists refer to the next wave in bioinformatics as systems biology, an approach for tackling new and complex biological questions. Systems biology involves the integration of genomics, proteomics, and bioinformatics to create a whole-system view of a biological entity.

A plethora of bioinformatic tools exist on the Internet, but two particularly good sources of information, tools, and resources can be easily accessed at the National Center for Biotechnology Information (NCBI) website (http://www.ncbi.nlm.nih.gov/) and UBC’s Bioinformatics Center (UBiC) links directory (http://bioinformatics.ubc.ca/resources/links_directory/). The NCBI website is currently the paramount bioinformatics resource made available to researchers and the public. The NCBI offers many services of interest to scientists and students alike. However, even the NCBI’s resources are not exhaustive.

This article provides a brief overview of the NCBI and the various resources made available for scientific research and public education. The NCBI is a very general resource for bioinformatic tools and there are more powerful and specialized tools available elsewhere on the Internet. The importance of the NCBI is that it is an accessible and comprehensive source of molecular biology information.

History of the NCBI

The National Center for Biotechnology Information (NCBI) is a multi-disciplinary research group that serves as a resource for molecular biology information. It was formed in 1988 as a complement to the activities of the National Institutes of Health (NIH) and the National Library of Medicine (NLM). Its facilities are located in Bethesda, Maryland, USA. Initially, NCBI was created to aid in understanding the molecular mechanisms that affect human health and disease, with the following goals: to create and maintain public databases, to develop software to analyze genomic data, and to conduct research in computational biology. In time, and through widespread use of the Internet, NCBI became increasingly aware of the role of pure biological research. Molecular biology became as prominent as biomedical research. This was evident as NCBI created various specialized databases to complement those that dealt directly with human health. NCBI began offering services as well:

-developing new methods to deal with the volume and complexity of data, and researching methods that can analyze the structure and function of macromolecules.

-creating computerized systems for storing and analyzing data about molecular biology.

-providing access to analysis and computing tools (which facilitate the use of databases and software) to researchers and the public.

In the process of database development, NCBI formed database standards such as database nomenclature that are also used by other non-NCBI databases. One NCBI database is GenBank, the nucleic acid sequence database that contains sequence information from over 200 000 different organisms. GenBank is probably the most popular database in use, and actually predates NCBI. To many, its name is synonymous with the NCBI.

GenBank as the model database

One of NCBI’s roles is to maintain publicly available databases. But what exactly are databases, and why are they important for molecular biology? Basically, a database is a large and organized body of data. But one of the key criteria for a biological database is persistent data. In other words, the information encoded and represented by the data may change but the type of data is more resistant to change. This inflexibility of data is a reflection of what comprises macromolecules and how scientists have chosen to symbolize nature. For instance, the sequence of nucleic acids can be symbolized by letters representing nucleotides and a protein sequence can be represented by 20 letters symbolizing the amino acids. These strings of letter symbols constitute a staggering amount of information, but for computerized systems they can easily be organized and manipulated in an optimal way. A model sequence database is GenBank.

GenBank, a database containing all known nucleic acid sequences, is one of the members of the “Triple Entente” of sequence databases; the other two are the European Molecular Biology Laboratory (EMBL) database and the DNA Data Bank of Japan (DDBJ). All three are part of the International Nucleotide Sequence Database Collaboration. The two latter sequence databases include all of GenBank’s sequences. As of August 2003, GenBank contained 27.2 million different sequences. Over 185 complete microbial genomes are available (September 2004), as well as over fifty eukaryotic genomes (July 2003), including the human genome. Approximately 26% of sequences in the database are of human origin (1).

Searching for a sequence in GenBank is referred to as “making a query”. The information that is returned is called the “record” (entry) for the query. The record for each sequence in GenBank contains a brief description of the sequence, the scientific name and taxonomy of the source organism from which the sequence was derived, bibliographic references, and a list of “features”. Features include the coding sequence regions of the nucleic acid and other sites of biological importance. In addition, the protein sequences of the translated nucleic acid coding regions are included. Each GenBank record is assigned an “accession number”, which is a stable and unique identifier of the record that doesn’t change with time. In addition, each sequence is assigned a “GenInfo (gi) number” and a “version of the accession number”; unlike the accession number, these do change. For example, when the sequence is updated for CUT1-Receptor (Accession number: AB123456, Version: AB123456.1, gi number: 123456789), the version and gi numbers change while the accession number stays the same. This facilitates archiving of data and prevents inconsistencies of sequence information in the literature.
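
The relationship between the stable accession number and its changing version can be sketched in a few lines of Python. This is only an illustration, assuming the "accession.version" layout shown in the example above; the helper names are hypothetical and are not part of any NCBI software, and AB123456 is the article's illustrative accession rather than a record looked up in GenBank.

```python
from typing import NamedTuple


class VersionedAccession(NamedTuple):
    accession: str  # stable part of the identifier, e.g. "AB123456"
    version: int    # changes when the sequence is updated


def parse_versioned_accession(identifier: str) -> VersionedAccession:
    """Split an "accession.version" string into its two parts."""
    accession, sep, version = identifier.strip().rpartition(".")
    if not sep or not accession or not version.isdigit():
        raise ValueError(f"not an accession.version identifier: {identifier!r}")
    return VersionedAccession(accession, int(version))


def same_record(a: str, b: str) -> bool:
    """True when two versioned identifiers point at the same record."""
    return (parse_versioned_accession(a).accession
            == parse_versioned_accession(b).accession)


old = parse_versioned_accession("AB123456.1")
new = parse_versioned_accession("AB123456.2")  # assumed identifier after an update
print(same_record("AB123456.1", "AB123456.2"))  # accession part is unchanged
print(new.version > old.version)                # only the version differs
```

Comparing only the accession part is what lets a citation in the literature keep pointing at the right record, while the version shows which revision of the sequence was actually used.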

Various methods are used to generate the sequence information found in GenBank. Roughly 70% of all sequences in GenBank are ESTs (Expressed Sequence Tags), which are generated by reverse transcribing mRNAs into complementary DNAs (cDNAs), and then performing single-pass sequencing on those cDNAs. ESTs thus represent segments of DNA that code for an mRNA, and are a fast, inexpensive way to determine which genes are being actively transcribed in a tissue at a given stage of development. Other common experimental methods for sequence generation include Sequence-Tagged Sites (STS), used to derive physical maps in genome construction, and Genome Survey Sequences (GSS), short random sequences commonly used to quickly sample the type of DNA sequences that could be found in a genome.

NCBI offers online software to help researchers submit sequence data to GenBank. Individual researchers may submit a single sequence. Larger submissions often come from sequencing centers, which may submit many sequences or entire genomes. Submission to GenBank is also coordinated with publication: journals that publish sequence data require GenBank submission as a condition for publication. The online submission tools include BankIt, Sequin, and tbl2asn. BankIt is the simplest of these tools, and requires the author to enter the sequence and then add any biological annotations such as coding regions. Sequin allows for the submission of multiple or complex sequences and has a more organized method of sequence submission. Genome centers use programs such as tbl2asn, a more powerful command-line analog of Sequin.

Once a sequence has been added to the database, what preparations are necessary before analysis of the data can begin? The answer is found in database retrieval tools.

Retrieving GenBank data and data from other NCBI databases

The primary database retrieval system at NCBI is Entrez, which links together several databases including GenBank. The central database in Entrez is the nucleotide database GenBank, which links to the following databases: PubMed, Protein Sequence, Genomes, Taxonomy, Structure, Population, Online Mendelian Inheritance in Man (OMIM), Books, and 3D Domains. Connections between entries in a database are called neighbours, and connections between entries of different databases are called hardlinks. For example, a sequence retrieved from GenBank can hardlink to a literature citation in PubMed for that particular sequence. PubMed is the NCBI literature citation database, which contains abstracts of over 12 million journal articles. Once a sequence is found in GenBank, or once any data is found in any of the various databases, a list of topic-related journal abstracts can be retrieved from PubMed using hardlinks. Unfortunately, full-text electronic journals cannot be accessed free of charge through all of NCBI’s databases; only PubMed Central provides free access to full-text articles. Fortunately, university libraries (such as the UBC library) do pay for journal subscriptions so that their users can read full-text articles at no cost.
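
Entrez can also be queried by programs rather than through the website. The sketch below is not described in this article: it assumes NCBI's E-utilities "efetch" service and its db, id, rettype and retmode parameters, and it only builds the request address without contacting NCBI. The function name is hypothetical, and AB123456.1 is the illustrative identifier used earlier, not a record that is known to exist.

```python
from urllib.parse import urlencode

# Base address of the E-utilities efetch service (an assumption of this sketch).
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession: str, database: str = "nuccore") -> str:
    """Build (but do not send) a request for one sequence record as FASTA text."""
    params = {
        "db": database,      # Entrez database to search; "nuccore" holds nucleotide records
        "id": accession,     # accession.version or gi number of the record
        "rettype": "fasta",  # ask for the sequence in FASTA format
        "retmode": "text",   # plain text rather than XML
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


url = build_efetch_url("AB123456.1")
print(url)
```

Keeping the address construction separate from the actual download makes it easy to inspect exactly what would be requested before any query is sent.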

Other database retrieval systems offered by NCBI include LocusLink, the Taxonomy Browser, and Gene. LocusLink offers descriptive information about genes and is based on curated data. The Taxonomy Browser offers information on the lineage of organisms that have corresponding sequences in GenBank. Taxonomic and phylogenetic trees can also be viewed through the Taxonomy Browser. Gene is poised to become the successor of LocusLink, with greater scope and integration into NCBI’s Entrez system.

NCBI offers a number of sequence data formats, including FASTA. FASTA is a simple text format that is commonly used by many pieces of bioinformatics software.
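
To make the format concrete, here is a minimal sketch of reading FASTA text in Python. It assumes the usual convention that each record starts with a ">" header line followed by one or more lines of sequence; the function name and the two example records are hypothetical, made up purely for illustration.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {header: sequence} for FASTA-formatted text lines."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:]      # drop the leading ">" marker
            records[header] = ""
        elif header is None:
            raise ValueError("sequence data found before any '>' header line")
        else:
            records[header] += line  # join sequence lines that wrap
    return records


example = """>AB123456.1 hypothetical example record
ATGGCGTACG
TTAGC
>XY000001.1 second hypothetical record
GGGCCCAAATTT
"""
records = parse_fasta(example.splitlines())
for header, sequence in records.items():
    print(header, len(sequence))
```

Because the format is just headers and letters, almost any tool can read or write it, which is a large part of why it is so widely used.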

NCBI’s data-analytic software tools

The ultimate goal of bioinformatics is to enable researchers to make novel connections between data. Analytic software tools allow researchers to conduct experiments, reject hypotheses, and draw conclusions concerning molecular biology. Although not a substitute for the workbench, bioinformatics acts as a useful complement to laboratory-generated data. Many data-analytic tools exist at NCBI, UBiC and other places on the web. Because of the overwhelming number of techniques available for analyzing data, and because much of the analytic software is relatively new, the conditions for the use of any of these tools may be confusing. Mistakes due to unfamiliarity with the tools remain quite common. Other tools have gained widespread use simply by being easy to use. One such tool is the Basic Local Alignment Search Tool (BLAST), which is most commonly used to analyze nucleic acid sequences from GenBank.

BLAST is a software tool that aligns two sequences in order to decide whether sequence similarity exists between them. The sequences can be either two nucleotide sequences or two protein sequences. From the sequence similarity, homology can be inferred, although there is a distinct difference between the two. Homology indicates that the sequences studied came from a common ancestral sequence. Homology between sequences is also indicative of (but not sufficient to prove) similar function at the molecular level. Misunderstanding about the meaning of the term can be illustrated by statements like, “these two sequences are 66% homologous” and “homology exists to this degree”. Homology is not based on percentage or degree; it is all-or-nothing. Homology either exists between sequences or it doesn’t. So how does BLAST support inferences of homology? BLAST is based on the notion of percent similarity between sequences, combined with statistical models of how likely a given degree of similarity is to arise by chance. If two nucleotide sequences show a degree of similarity that, according to the statistical model, is unlikely to have arisen by chance, a researcher can use that result to infer that the sequences are homologous. Different statistical models exist for protein sequences. NCBI offers a variety of BLAST-based tools for analyzing different data types. Besides using BLAST to support an inference about homology between two sequences, it is possible to BLAST a query sequence against the human genome or the mouse genome to look for homologous sequences.
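
The distinction between similarity and homology is easier to see with a number in hand. The sketch below computes percent identity for two sequences that have already been aligned; it is not BLAST's algorithm and includes none of its statistics, and the function name, the use of "-" for gaps, and the two short sequences are assumptions made for illustration.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of compared positions that match in two pre-aligned sequences.

    Positions where either sequence has a gap ("-") are left out of the count.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" or y == "-":
            continue  # skip gap positions
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return 100.0 * matches / compared


# 8 ungapped positions, 7 of them identical
print(percent_identity("ATGC-TAGC", "ATGCATTGC"))  # 87.5
```

A figure such as 87.5% describes similarity only. Whether the two sequences are homologous is a separate yes-or-no inference that the researcher draws from the similarity and the accompanying statistics.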

Other NCBI data-analytic tools include Electronic-PCR, which locates Sequence-Tagged Sites, and BLAST-Link (Blink), which shows protein BLAST alignments for every protein sequence found in Entrez. Many more tools can be accessed through NCBI’s website. Some of these data-analytic tools are also databases. A non-exhaustive list of tools includes: ORF Finder (for open reading frames), RefSeq, UniGene, SNP Database (for single-nucleotide polymorphisms), Human Genome Sequencing, Human MapViewer (to view the draft of the human genome project), Gene Expression Omnibus, Online Mendelian Inheritance in Man (OMIM), which catalogues human genetic diseases, the Molecular Modeling Database (MMDB), which is a 3D protein structure database, and the Conserved Domain Database (CDD).

Databases and public education

One Entrez database serves as a potential source for public education in molecular biology: the Books database. Not only do the web-based books supplement and clarify topics, they also serve as a highly credible resource for science reporters and journalists. The news is often the only mode of scientific information transfer between the researcher and the public. In addition, university students may find some required course textbooks in the database. For instance, the contents of Lodish’s Molecular Cell Biology (UBC’s Biology 350), Alberts’ Essential Cell Biology (UBC’s Biology 441), Gilbert’s Developmental Biology (UBC’s Biology 331), Modern Genetic Analysis (UBC’s Biology 334 & 335), and Janeway’s Immunobiology (UBC’s Microbiology 301) are fully available.

In addition, NCBI provides “Science Primers” on areas that form the theoretical foundations of NCBI itself, with tutorials on topics such as bioinformatics, ESTs, microarray technology, STSs, and molecular modeling. Lastly, NCBI offers tutorials on how to use its various databases and data-analytic software tools.

Conclusions

With input in mapping the human genome, NCBI’s services are undeniably important. NCBI offers a comprehensive array of databases and software tools to analyze information. The advantage of NCBI is that it offers a sizable quantity of accessible information to the public. NCBI continues the scientific tradition of making scientific knowledge free for all, which is an uncommon phenomenon in today’s world of biotech companies and their closely guarded patents. Bioinformatics, as a discipline, continues to grow at an exponential rate. The NCBI currently combats the problem of redundancy of information by establishing non-redundant databases to limit search times and increase the ease of making a query. The NCBI website currently handles its services efficiently, despite the overwhelming number of services present. To continue this efficiency, NCBI must be aware of and receptive to new ways of assimilating data into an organized form.

Glossary

1. Curated data = the information supplied is based on the consensus and opinions of a number of researchers.

2. BLAST a query sequence = To input a sequence under study into the database and compare it to the entire collection of sequences in the GenBank database in order to search for homologous sequences.

References

1. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. GenBank: Update. Nucleic Acids Research, 2004, vol 32, Database Issue: D23-D26.

Recommended Resources for Further Information

1. UBC’s Bioinformatics Center website http://bioinformatics.ubc.ca
From UBiC, “The mission of the UBC Bioinformatics Centre is to be a world-class centre of excellence for bioinformatics research, training and support. The centre will facilitate bioinformatics research and education across campus and will promote the development and integration of research from the diverse fields associated with bioinformatics.”

2. The NCBI Website http://www.ncbi.nlm.nih.gov/
There is a never-ending series of links. The most useful place to start is probably the SiteMap. The best place to visualize the databases and software tools is the website itself. Experimenting and playing with NCBI’s services is the best way to learn about how they work.

3. A printed resource is the book by Baxevanis and Ouellette entitled Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins.
It contains colour plates of many different databases (some of which are NCBI databases).

4. Journals
A good journal for information on bioinformatics databases is Nucleic Acids Research.
This journal publishes an issue devoted entirely to databases at the beginning of each year.