Genomics Practical 2010

EST Sequencing: Analysis part 2
In today's session you will take your Expressed Sequence Tag (EST) sequences, and identify what proteins (if any) they encode and investigate what support they give to the debate on the relationship of onychophorans to Annelids (ARTICULATA hypothesis) and Nematodes (ECDYSOZOA hypothesis).
READ THROUGH THIS DOCUMENT BEFORE STARTING THE DAY'S WORK...
Your results are available as hyperlinks from here: GGIII2010 Results
1
Log in, launch a web browser, and direct it to
http://www.nematodes.org/teaching/genomics/Practical_2010_Euperipatoides_kanagrensis/practical_results2.html
(this page)
You should also launch MS Word or Notepad and open a file to act as your notebook for the practical.
This practical will be assessed, and thus you should check the requirements for the writeup (see here) before you continue through the sequence analysis work below.
2
Finding the open reading frame
OPEN READING FRAMES
Do your ESTs have open reading frames?
Of course they do, even if they are only very short... How do you identify the longest (and thus most likely) one?
There are several ways to find out:
(1) by eye (painful unless you happen to have memorised the 64-codon table of the genetic code, which of course you should have ;-)
(2) Using online translation programmes.
For these you just paste your sequence in a window, choose the genetic code, and press go. The DNA is translated into protein and you can then choose which you believe might be the correct open reading frame.
Try: EMBOSS Transeq at http://www.ebi.ac.uk/emboss/transeq/
or the EXPASY Translate Tool at http://expasy.org/tools/dna.html
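What the translation tools do can be sketched in a few lines of Python. This is a minimal illustration, not the code behind Transeq or the Translate Tool: it assumes the standard genetic code, treats any codon containing an ambiguous base as X, and (because ESTs are often partial and may lack the start codon) reports the longest stretch without a stop codon rather than insisting on an ATG. All function names are made up for this example.

```python
# Standard genetic code, written in the usual compact form with bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate from the first base; codons not in the table (e.g. with N) become X."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return {frame name: protein} for frames +1, +2, +3, -1, -2 and -3."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def longest_open_stretch(dna):
    """Return (frame, peptide) for the longest stop-free stretch in any frame."""
    best = ("", "")
    for frame, protein in six_frame_translations(dna).items():
        for segment in protein.split("*"):  # '*' marks a stop codon
            if len(segment) > len(best[1]):
                best = (frame, segment)
    return best
```

Calling longest_open_stretch() on your EST gives a first candidate frame; you should still check it against the BLASTX alignments later in the practical, since a frameshift error can split a real open reading frame across two frames.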
3 
Annotating your genes with putative functions 1
Now you need to find out what the gene you have cloned might do for Euperipatoides kanagrensis. These data will be useful in describing the biology of the gene in your writeup, as part of the "molecular CV" for each gene.
To annotate your genes you need to use a range of programmes.
For each of your ESTs perform all of the following searches and RECORD what the results are.
There are three parts to this section:
(1) Annotation by BLAST: what proteins are your sequences similar to?
A BLASTX search of your EST DNA sequence against a universal protein database such as the NCBI 'nr' (non-redundant protein database) will identify previously sequenced genes and/or proteins that may have functional annotation. If your sequence is similar to these proteins you can infer that your proteins may have similar functions (an annotation of this kind is labelled 'Inferred from Electronic Annotation', or IEA).
BLASTX performs a search of a PROTEIN database with your NUCLEOTIDE sequence translated in all SIX FRAMES (both forward and reverse).
Using BLASTX you can also identify frameshifts and other errors in your sequence.
http://www.ncbi.nlm.nih.gov/blast/
The 'standard' options under "Choose Search Set" in NCBI BLASTX allow you to choose which database to compare your sequences to. You should use "Non-redundant protein sequences (nr)". This is the best to use as it contains ALL the sequences in GenBank's protein database, but not the repeat entries that are present for many genes.
(a) open http://www.ncbi.nlm.nih.gov

(b) select BLAST from the "Popular Resources" window

(c) choose 'blastx' from the list under Basic BLAST

(d) Paste your sequence in the text box under "Enter accession number, gi, or FASTA sequence"

(e) press the blue BLAST button, and wait for results

(f) review your results, remembering
(1) that a low E-value means that the match is less likely to have arisen by random chance, and thus that it is more likely to be biologically significant. E-values of greater than 1e-06 are not particularly good.
(2) that the graphical view of the matches shows the quality of each match using its 'bit score'. In the case of bit scores, bigger is better. Scores of >40 are interesting, and scores of greater than 100 are very good.
(3) that matches that extend over more of your sequence are likely to be more informative than short matches.

If you hover your mouse over the coloured bars in the image, the text box above will show you the 'definition line' of the match found.
Clicking on the bar will take you to the details of the alignment between your sequence and the database sequence (lower down the report).
(g) Just below the image is a list of the top matches to your sequence.
The link to the left (beginning 'gi') is to the sequence in GenBank.
The link in the 'Score (Bits)' column takes you to a detailed description of the match.

(h) The detailed view of the match gives the full description line, some statistics and an alignment of the two sequences (your Query, and the Subject sequence matched).

(i) Check the species-of-origin of the matched sequences. Is this a surprise? How closely related to Euperipatoides kanagrensis is it?
(j) What is the biological function of the sequence matched? If the top sequence does not have a function ascribed to it, do any others in the list of SIGNIFICANT matches?
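The two numbers you are asked to weigh up in step (f) are linked: for a bit score S, a query of length m and a database of total length n, the expected number of chance matches is approximately E = m * n * 2^(-S). The sketch below applies that relationship and the rules of thumb given above. It is illustrative only: BLAST uses corrected 'effective' lengths, so its reported E-values will differ somewhat, and the function names and the "weak" label are assumptions made for this example.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E-value from a bit score: E = m * n * 2^(-S).

    Uses the lengths as given; BLAST itself applies edge-effect
    corrections, so treat the result as an approximation.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


def rate_hit(bit_score, e_value, e_cutoff=1e-6):
    """Label a hit using the handout's rules of thumb.

    e_cutoff=1e-6, 40 bits and 100 bits come from the text above;
    the label "weak" is an assumption added for completeness.
    """
    if e_value > e_cutoff:
        return "not significant"
    if bit_score > 100:
        return "very good"
    if bit_score > 40:
        return "interesting"
    return "weak"
```

Note how every additional bit of score halves the E-value, while doubling the size of the database doubles it. This is why the same alignment can look less significant against 'nr' than against a small, taxon-limited database.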
(2) Annotation by BLAST: using BLASTP
If you have identified a putative translation or open reading frame in your sequence, you can use BLASTP to compare this to the protein database (similar to above).
(a) Check the species-of-origin of the matched sequences. Is this a surprise? How closely related to Euperipatoides kanagrensis is it?
(b) What is the biological function of the sequence matched? If the top sequence does not have a function ascribed to it, do any others in the list of SIGNIFICANT matches?
(3) Annotation by BLAST: what nucleotide sequences are your sequences similar to?
Perform a BLAST search of your EST DNA sequence against nucleotide databases.
This will identify
(a) whether your gene has been sequenced previously (very few E. kanagrensis genes have been, but for those that have there will be a perfect or near-perfect match),
(b) whether a very similar sequence has been identified in a closely related onychophoran species and
(c) whether your sequence derives from an RNA gene (such as a ribosomal RNA gene).
We suggest you perform three comparisons of your sequence to two databases:
(I) BLASTN against Nucleotide collection (nr/nt) - this is all the nucleotide sequences in the 'normal' division of GenBank
(II) BLASTN against Non-human, non-mouse ESTs (est_others) - this includes the expressed sequence tags from other onychophora.
(III) TBLASTX against Non-human, non-mouse ESTs (est_others) - the same database, but here both your sequence and the database sequences are translated in all six frames, so matches are found at the protein level.
http://www.ncbi.nlm.nih.gov/blast/
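The BLAST programmes used in this section differ only in what kind of query and database they take, and in which of the two is translated before comparison. The lookup below summarises this; the function choose_program() and its argument names are invented for this illustration and are not part of any NCBI tool.

```python
# program: (query type, database type, query translated?, database translated?)
BLAST_PROGRAMS = {
    "blastn":  ("nucleotide", "nucleotide", False, False),
    "blastp":  ("protein",    "protein",    False, False),
    "blastx":  ("nucleotide", "protein",    True,  False),
    "tblastn": ("protein",    "nucleotide", False, True),
    "tblastx": ("nucleotide", "nucleotide", True,  True),
}


def choose_program(query_is_protein, database_is_protein, compare_as_protein=True):
    """Pick a BLAST programme for a given query/database combination.

    compare_as_protein only matters for nucleotide-versus-nucleotide
    searches: True gives tblastx (six-frame translation of both sides),
    False gives blastn (direct nucleotide comparison).
    """
    if query_is_protein and database_is_protein:
        return "blastp"
    if query_is_protein:
        return "tblastn"
    if database_is_protein:
        return "blastx"
    return "tblastx" if compare_as_protein else "blastn"
```

For an EST against est_others, choose_program(False, False, compare_as_protein=False) corresponds to search (II) and choose_program(False, False) to search (III).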
4 
Annotation with Putative Function 2
BLAST is good at finding significantly similar matches, but it is not always easy to work out what it is the matched sequences do, in terms of their biology. As proteins are made up of functional domains, we can compare our protein sequence to a database of domains, and use any matches to these as indicators of function.
There are many domain databases, but probably the best one to search is InterPro, a system that aggregates a wide range of different domain databases under one unified search engine.
http://www.ebi.ac.uk/InterProScan/
Go to InterProScan and follow the instructions there...
(a) at the InterPro site, go to the InterProScan page and enter your protein sequence

(b) wait
(c) when the results appear, you can review them in the graphic display shown. For each domain matched there is a significance score. You can go to a detailed description of the biology of the domain by following the links given. Note that as InterPro aggregates many domain databases, one region of your protein can have matches to entries in more than one database.
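When you record InterProScan results in your notebook, it can help to group matches that cover the same region of the protein, since these usually describe the same domain as seen by different member databases. The sketch below shows one way to do that from coordinates you have noted down by hand. The DomainMatch record and its fields are assumptions for this example, not the InterProScan output format.

```python
from typing import NamedTuple


class DomainMatch(NamedTuple):
    database: str  # member database that reported the match
    name: str      # domain name as shown in the results
    start: int     # first matched residue in your protein
    end: int       # last matched residue in your protein


def group_overlapping(matches):
    """Group matches whose residue ranges overlap into shared regions."""
    groups = []
    current = []
    current_end = None
    for match in sorted(matches, key=lambda m: (m.start, m.end)):
        if current and match.start <= current_end:
            # Overlaps the region being built: extend it.
            current.append(match)
            current_end = max(current_end, match.end)
        else:
            # Starts beyond the current region: close it and begin a new one.
            if current:
                groups.append(current)
            current = [match]
            current_end = match.end
    if current:
        groups.append(current)
    return groups
```

Each group returned is one region of the protein, listed with every database entry that matched it.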
5 
Comparison with Annelid and Nematode Genes
In order to assess whether Onychophora (here represented by Euperipatoides kanagrensis) are more closely related to Annelids (ragworms, earthworms, leeches and relatives) or Nematoda (roundworms and relatives), you need to compare your sequences to previously sequenced genes from these two groups.
Basically, if your sequence is more similar to an annelid sequence then it supports the ARTICULATA hypothesis, whereas if it is more similar to a nematode sequence it supports the ECDYSOZOA hypothesis.
See the Onychophora lecture for more information.

You can do this using the NCBI BLAST service and FILTERING the results by taxonomic group.
(a) Open a browser window at the NCBI ENTREZ Taxonomy homepage http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/
(b) On the Taxonomy homepage, type in the taxon you would like to see records for ("Annelida" or "Nematoda"), and press GO.

(c) On the result page that is returned, select the taxon name ("Annelida" or "Nematoda").
(d) You should see a page similar to this:

This view shows a summary of the sequence and other data available for the taxon "Annelida". There are 30,398 nucleotide and 3,951 protein sequences. Why so few protein sequences compared to nucleotide?
(e) Select the highlighted number of nucleotide sequences. This will take you to a standard NCBI ENTREZ window showing the first 20 of the sequences. Note the tabs on the page that indicate different kinds of sequence:

What are these different kinds of sequence? What kinds are most useful to you? What do you get when you click on the "Protein" or "Genome" links on the Taxonomy page?
(f) In the query box at the top of the ENTREZ page, you will see that what you have displayed are all nucleotide sequences for the ENTREZ query "txid6340[Organism:exp]". The number 6340 is the unique numerical identifier (taxid) in NCBI ENTREZ for "Annelida", and so this search has found ALL annelid sequences; the taxid is, of course, different for Nematoda. This query can be used to limit the matches shown in a BLAST search at NCBI. Copy the query text.
(g) Open a web browser window at the NCBI BLAST start page http://www.ncbi.nlm.nih.gov/blast/
What sort of search do you want to do? Remember that nucleotide-nucleotide searches are optimised for finding nearly exact matches, and that you are looking for similarities between organisms that last shared a common ancestor over 600 million years ago.
If you are searching with a protein translation, choose BLASTP.
If you are searching with the nucleotide sequence of your EST or of its cluster consensus, use BLASTX.
Paste your sequence into the search box, choose the nr database, and in the "Choose Search Set" section below paste your ENTREZ query text into the "ENTREZ query" box. Submit your BLAST search.
(h) The results will show ONLY those matches to your chosen 'limited' database.
Record the best hit, its score and E-value, and also the length of the match and the percentage identity and similarity.
Repeat for both an "Annelida"- and a "Nematoda"-limited search.
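The bookkeeping for this section can be sketched as follows. The code builds the ENTREZ limit text from a taxid, turns the identity and positive counts from a BLAST alignment into percentages, and compares the best Annelida-limited and Nematoda-limited hits. It is an illustration only: the Hit record, the function names, the 1e-06 cutoff and the rule "higher bit score wins" are assumptions for this example, and a higher score is evidence of similarity rather than proof of relationship. Only the Annelida taxid (6340) is given above; look up the Nematoda taxid on the Taxonomy page as described in steps (a) to (f).

```python
from typing import NamedTuple


class Hit(NamedTuple):
    description: str   # definition line of the matched sequence
    bit_score: float
    e_value: float
    align_length: int  # length of the alignment
    identities: int    # identical positions in the alignment
    positives: int     # identical or similar positions


def taxon_query(taxid):
    """Build the ENTREZ query text that limits a search to one taxon."""
    return f"txid{taxid}[Organism:exp]"


def percent(count, align_length):
    """Convert a count of alignment positions into a percentage."""
    return 100.0 * count / align_length


def is_significant(hit, e_cutoff=1e-6):
    """A missing hit (None) or one above the cutoff is not significant."""
    return hit is not None and hit.e_value <= e_cutoff


def supported_hypothesis(annelid_hit, nematode_hit, e_cutoff=1e-6):
    """Return 'ARTICULATA', 'ECDYSOZOA' or 'undecided' for one EST."""
    annelid_ok = is_significant(annelid_hit, e_cutoff)
    nematode_ok = is_significant(nematode_hit, e_cutoff)
    if annelid_ok and nematode_ok:
        if annelid_hit.bit_score > nematode_hit.bit_score:
            return "ARTICULATA"
        if nematode_hit.bit_score > annelid_hit.bit_score:
            return "ECDYSOZOA"
        return "undecided"
    if annelid_ok:
        return "ARTICULATA"
    if nematode_ok:
        return "ECDYSOZOA"
    return "undecided"
```

Tallying the outcome over all of your ESTs, and noting how many are undecided, gives a fairer picture than any single gene. Bear in mind that the two limited databases differ greatly in size, so the absence of a hit may reflect sampling rather than biology.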