Basic Local Alignment Search Tool (BLAST)
The Basic Local Alignment Search Tool (BLAST) finds regions of similarity between sequences. The program compares nucleotide or protein sequences and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.
There are several types of BLAST searches. NCBI’s WebBLAST offers four main search types:
- BLASTn (Nucleotide BLAST): compares one or more nucleotide query sequences to a subject nucleotide sequence or a database of nucleotide sequences. This is useful when trying to determine the evolutionary relationships among different organisms (see Comparing two or more sequences below).
- BLASTx (translated nucleotide sequence searched against protein sequences): compares a nucleotide query sequence that is translated in six reading frames (resulting in six protein sequences) against a database of protein sequences. Because blastx translates the query sequence in all six reading frames and provides combined significance statistics for hits to different frames, it is particularly useful when the reading frame of the query sequence is unknown or it contains errors that may lead to frame shifts or other coding errors. Thus blastx is often the first analysis performed with a newly determined nucleotide sequence.
- tBLASTn (protein sequence searched against translated nucleotide sequences): compares a protein query sequence against the six-frame translations of a database of nucleotide sequences. Tblastn is useful for finding homologous protein coding regions in unannotated nucleotide sequences such as expressed sequence tags (ESTs) and draft genome records (HTG), located in the BLAST databases est and htgs, respectively. ESTs are short, single-read cDNA sequences. They comprise the largest pool of sequence data for many organisms and contain portions of transcripts from many uncharacterized genes. Since ESTs have no annotated coding sequences, there are no corresponding protein translations in the BLAST protein databases. Hence a tblastn search is the only way to search for these potential coding regions at the protein level. The HTG sequences, draft sequences from various genome projects or large genomic clones, are another large source of unannotated coding regions.
- BLASTp (Protein BLAST): compares one or more protein query sequences to a subject protein sequence or a database of protein sequences. This is useful when trying to identify a protein (see From sequence to protein and gene below).
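The six-frame translation that blastx and tblastn rely on can be sketched in a few lines of Python. This is only an illustration of the idea, assuming the standard genetic code and an unambiguous DNA string; the function names are hypothetical and this is not BLAST's own implementation.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter amino acid ("*" marks a stop codon).
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate codon by codon from the first base; unknown codons become X."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Return the three forward (+1..+3) and three reverse (-1..-3) translations."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translations("ATGGCCTAA")
for name, protein in frames.items():
    print(name, protein)
```

Each of the six protein strings is what a translated search would compare against protein sequences, which is why the reading frame of the query does not need to be known in advance.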
Standalone and API BLAST options, as well as pre-populated specialized searches, are also available on the BLAST homepage.
From sequence to protein and gene
Object: Starting with a sequence, identify the protein or gene and the source.
Example: From the following sequence (available at http://tinyurl.com/blastp-sequence, or copy the sequence below), identify the most probable protein and organism:
MSKRKAPQET LNGGITDMLT ELANFEKNVS QAIHKYNAYR KAASVIAKYP HKIKSGAEAK
KLPGVGTKIA EKIDEFLATG KLRKLEKIRQ DDTSSSINFL TRVSGIGPSA ARKFVDEGIK
TLEDLRKNED KLNHHQRIGL KYFGDFEKRI PREEMLQMQD IVLNEVKKVD SEYIATVCGS
FRRGAESSGD MDVLLTHPSF TSESTKQPKL LHQVVEQLQK VHFITDTLSK GETKFMGVCQ
LPSKNDEKEY PHRRIDIRLI PKDQYYCGVL YFTGSDIFNK NMRAHALEKG FTINEYTIRP
LGVTGVAGEP LPVDSEKDIF DYIQWKYREP KDRSE
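The sequence above is printed in blocks of ten residues. If a tool expects a FASTA record instead, the blocks can be joined with a small helper such as the sketch below. The function name, the title, and the 60-character line width are illustrative assumptions; only the first few blocks of the sequence are used here.

```python
def to_fasta(raw_sequence, title, width=60):
    """Turn a block-formatted sequence into a FASTA record.

    Keeps letters only, so spaces, line breaks, and position numbers are dropped.
    """
    residues = "".join(ch for ch in raw_sequence if ch.isalpha()).upper()
    lines = [residues[i:i + width] for i in range(0, len(residues), width)]
    return f">{title}\n" + "\n".join(lines) + "\n"


# First three blocks of the example sequence, as copied from the page.
example = to_fasta("MSKRKAPQET LNGGITDMLT\nELANFEKNVS", "unknown_protein")
print(example)
```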
Querying a sequence
Protein and gene sequence comparisons are done with BLAST (Basic Local Alignment Search Tool).
To access BLAST, go to Resources > Sequence Analysis > BLAST.
Because the query is an unknown protein sequence that we want to identify by comparing it to known protein sequences, select Protein BLAST from the BLAST menu.
Enter the query sequence in the search box, provide a job title, choose a database to query, and click BLAST.
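The same search can in principle be submitted programmatically through the BLAST API mentioned earlier. The sketch below only builds the request parameters and does not contact NCBI. The parameter names (CMD, PROGRAM, DATABASE, QUERY, RID, FORMAT_TYPE) are given as recalled from NCBI's BLAST URL API documentation and should be checked against it before use; the helper names are hypothetical.

```python
from urllib.parse import urlencode

# Endpoint of the NCBI BLAST URL API (assumed; confirm in the NCBI documentation).
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_submit_params(query, program="blastp", database="nr"):
    """Encode the parameters that submit a new search (CMD=Put)."""
    return urlencode(
        {"CMD": "Put", "PROGRAM": program, "DATABASE": database, "QUERY": query}
    )


def build_fetch_params(request_id, format_type="Text"):
    """Encode the parameters that retrieve results for a request ID (CMD=Get)."""
    return urlencode({"CMD": "Get", "RID": request_id, "FORMAT_TYPE": format_type})


submit_body = build_submit_params("MSKRKAPQET")
print(BLAST_URL, submit_body)
```

A submission returns a request ID, which is then used to poll for results. NCBI asks API users to limit how often they submit and poll, so the web form remains the simpler choice for a single query.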
Viewing your results
On the Alignments tab, set Alignment view to Pairwise with dots for identities.
View the Descriptions tab to see a list of significant alignments. Note that the first match is a synthetic construct (that is, an artificially engineered sequence rather than one isolated from a naturally occurring organism).
Key for default display:
- Max[imum] Score: the highest alignment score, calculated from the sum of the rewards for matched nucleotides or amino acids and the penalties for mismatches and gaps.
- Total Score: the sum of alignment scores of all segments from the same subject sequence.
- Query Cover[age]: the percent of the query length that is included in the aligned segments.
- E[xpect] Value: the number of alignments expected by chance with the calculated score or better. The expect value is the default sorting metric; for significant alignments the E value should be very close to zero.
- Ident[ity]: the highest percent identity for a set of aligned segments to the same subject sequence.
- Acc[ession] Len[gth]: the number of nucleotides or amino acids in the result sequence identified by the accession number.
- Accession [number]: a unique identifier assigned to records in the NCBI databases.
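To get a feel for how the E value relates to the score, the commonly quoted approximation E = m × n × 2^(−S′) can be evaluated directly, where S′ is the bit score, m is the query length, and n is the total database length. The sketch below is a simplification: BLAST itself uses effective lengths with edge corrections, so these numbers will not exactly match a report. The function name and the example lengths are illustrative assumptions.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E value from a bit score: E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Each additional bit of score halves the number of chance alignments expected.
for score in (10, 11, 50):
    print(score, expect_value(score, query_length=32, database_length=32))
```

Because the E value scales with database size, the same alignment becomes less significant when searched against a larger database.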
Clicking on a protein name displays the pairwise sequence alignment and links to additional information about the protein and its associated gene (if available).
In the Pairwise with dots for identities display, residues in the subject sequence that match the query are replaced by dots, and any differing amino acid is displayed in red.
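The dots display is easy to reproduce for two aligned sequences of equal length, which can help when reading the output. The following is a minimal sketch with made-up sequences and hypothetical function names; gaps are assumed to be written as "-".

```python
def _check_aligned(query, subject):
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")


def dots_for_identities(query, subject):
    """Return the subject line with a dot wherever it matches the query."""
    _check_aligned(query, subject)
    return "".join(
        "." if q == s and q != "-" else s for q, s in zip(query, subject)
    )


def percent_identity(query, subject):
    """Identical positions as a percentage of the alignment length."""
    _check_aligned(query, subject)
    matches = sum(q == s and q != "-" for q, s in zip(query, subject))
    return 100.0 * matches / len(query)


query = "MSKRKAPQET"
subject = "MSKRRAPQDT"  # hypothetical subject with two substitutions
print(query)
print(dots_for_identities(query, subject))
print(percent_identity(query, subject))
```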
Saving your results
To save your search queries and settings, click on the Save Search link, then log in to My NCBI using the Sign in or Register link at the upper right. Once you do this, your search strategies should appear in the Saved Search Strategies tab.
Comparing two or more sequences
Object: Starting with two or more sequences, compare them and find the differences.
Example: In the NCBI database Nucleotide, enter the following search:
human[organism] AND mitochondrion[title]
This will search for nucleic acid sequences from humans with the word “mitochondrion” in the title. Mitochondrial DNA is often used in evolutionary comparisons because in humans it is inherited almost exclusively through the maternal lineage and does not recombine, so differences between lineages arise only through mutation.
Limit the results to NCBI Reference Sequences by selecting the RefSeq limit under Source databases in the left-hand Filter menu. These are high-quality sequences that have been curated and annotated by NCBI staff.
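The same search can be expressed as an Entrez E-utilities request. The sketch below only builds the URL and does not send it. It assumes the esearch endpoint and the nuccore database name, and it assumes that the RefSeq limit corresponds to the Entrez filter srcdb_refseq[prop]; verify both against the E-utilities documentation. The function name is hypothetical.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, database="nuccore", refseq_only=False):
    """Build an esearch URL for an Entrez query, optionally limited to RefSeq."""
    if refseq_only:
        # Assumed equivalent of ticking the RefSeq limit in the Filter menu.
        term = f"({term}) AND srcdb_refseq[prop]"
    return ESEARCH_URL + "?" + urlencode({"db": database, "term": term})


url = build_esearch_url("human[organism] AND mitochondrion[title]")
refseq_url = build_esearch_url(
    "human[organism] AND mitochondrion[title]", refseq_only=True
)
print(url)
print(refseq_url)
```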
There are three Reference Sequences for the mitochondrial genome in humans: one for modern humans (Homo sapiens), one for Neanderthals (Homo sapiens neanderthalensis), and one for Denisovans (Homo sp. Altai).
In the right-hand discovery menu, under Analyze these sequences, click Run BLAST.
This will open BLASTn, Nucleotide BLAST, and automatically add the accession numbers of these Reference Sequences into the Query Sequence box.
To compare sequences, check the box next to Align two or more sequences under the Query Sequence box. To BLAST the modern human mitochondrial genome sequence (NC_012920.1) against the subject sequences of Neanderthal (NC_011137.1) and Denisovan (NC_013993.1), move the latter two accession numbers from the Query Sequence box into the Subject Sequence box using copy and paste.
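If you would rather run the comparison offline, the three records can be downloaded as FASTA first. The sketch below builds an E-utilities efetch URL for the subject accessions without sending the request; the endpoint and the db, id, rettype, and retmode parameters are given as recalled from the E-utilities documentation, and the function name is hypothetical.

```python
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

QUERY_ACCESSION = "NC_012920.1"                      # modern human
SUBJECT_ACCESSIONS = ["NC_011137.1", "NC_013993.1"]  # Neanderthal, Denisovan


def build_efetch_url(accessions):
    """Build a URL that requests the given nucleotide records in FASTA format."""
    params = {
        "db": "nuccore",
        "id": ",".join(accessions),
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)


subjects_url = build_efetch_url(SUBJECT_ACCESSIONS)
print(build_efetch_url([QUERY_ACCESSION]))
print(subjects_url)
```

Keeping the query and subject accessions in separate variables mirrors the split between the Query Sequence and Subject Sequence boxes on the web form.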
Enter a job title and click BLAST, leaving the other settings at their default options.
You should see two results, each comparing the query sequence (modern human) to one of the subject sequences, Neanderthal or Denisovan. Note that the query sequence is 99% identical to the Neanderthal sequence and 98% identical to the Denisovan sequence.
To see how the sequences differ and what the biological significance might be:
- Go to the Alignments tab and in the Alignment view drop-down menu select Pairwise with dots for identities.
- Click the checkbox next to CDS feature.
Click on the name of the first result (Homo sapiens neanderthalensis). You should see a base-by-base comparison of the two sequences in two lines. The top line is the query sequence (modern human). In the second line, representing the subject sequence (ancient human), bases where the subject sequence is identical to the query sequence are replaced by dots, and bases where the subject sequence differs from the query sequence appear in red.
Scroll down to the first coding sequence (CDS). The CDS regions are displayed in four lines: the first line shows the amino acid translation of the query sequence, the second line is the query sequence itself (modern human), the third line is the subject sequence (ancient human), and the fourth line shows the amino acid translation of the subject sequence.
Note that there are two additional amino acids, M (methionine) and P (proline), at the beginning of the protein sequence in modern humans compared to Neanderthal. This is due to the substitution of T (thymine) at position 3308 in the modern human sequence for C (cytosine) in the analogous position in the Neanderthal sequence.
Note as well that the substitution of A (adenine) at position 3334 in the modern human sequence for G (guanine) in the Neanderthal sequence results in an amino acid difference in the protein sequences. In the modern human protein sequence an I (isoleucine) replaces a V (valine) present in the Neanderthal protein sequence.
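The effect of a single-base substitution on the encoded amino acid can be checked with a small codon lookup. The sketch below assumes the vertebrate mitochondrial genetic code (NCBI translation table 2), which differs from the standard code at four codons. The codons used in the example are illustrative choices consistent with the substitutions described above, not values read from the alignment, and the function names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
STANDARD = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
codon_table = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), STANDARD)}
# Vertebrate mitochondrial code: ATA codes for M, TGA for W, AGA and AGG are stops.
codon_table.update({"ATA": "M", "TGA": "W", "AGA": "*", "AGG": "*"})


def translate_codon(codon):
    """Translate one codon with the vertebrate mitochondrial code."""
    return codon_table[codon.upper()]


def substitution_effect(codon, position, new_base):
    """Return (old_aa, new_aa) after replacing the base at 0-based position."""
    mutated = codon[:position] + new_base + codon[position + 1:]
    return translate_codon(codon), translate_codon(mutated)


# Illustrative codons only: an A-to-G change in the first codon position,
# and a T-to-C change in the second position of an ATA codon.
print(substitution_effect("ATT", 0, "G"))
print(substitution_effect("ATA", 1, "C"))
```

In the mitochondrial code ATA encodes methionine, so a change in that codon can remove a potential start site and shift where translation is annotated to begin.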
To investigate the biological significance of this change, go to the Amino Acid Explorer. In the left-hand menu, use the Compare tool to see what effects a change from V to I might have. Look at both the text and graphics comparisons. Does this seem to be a conservative mutation (that is, one that results in little or no change in protein structure or function) or a non-conservative mutation (that is, one that results in a significant change in protein structure or function)?
Now scroll down to the Denisovan result and look at positions 3308 and 3334 in the query sequence. Are there any differences in the Denisovan sequence at these positions?
To see how the species are related in evolutionary terms:
- Go to the Descriptions tab and click on the Distance tree of results link.
- When the rectangle cladogram displays, go to the menu Tools > Layout and select Slanted Cladogram.
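A distance tree groups sequences by how much they differ. The simplest such measure is the p-distance, the fraction of aligned positions at which two sequences differ. The sketch below uses short made-up sequences with placeholder names and is not the method BLAST uses to build its tree, which applies its own distance models and tree-building algorithms.

```python
def p_distance(seq_a, seq_b):
    """Fraction of aligned, ungapped positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped positions to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def closest_relative(reference, others):
    """Return the name of the sequence with the smallest p-distance to reference."""
    return min(others, key=lambda name: p_distance(reference, others[name]))


# Hypothetical toy alignment; real mitochondrial genomes are about 16,500 bases long.
reference = "ACGTACGTAC"
others = {"taxon_a": "ACGTACGTAT", "taxon_b": "ACGAACGTTT"}
for name, seq in others.items():
    print(name, p_distance(reference, seq))
print(closest_relative(reference, others))
```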
To which species, Denisovans or Neanderthals, are modern humans more closely related?