Genomics Practical 2011

EST Sequencing: Analysis part 2
In today's session you will take your Expressed Sequence Tag (EST) sequences, identify what proteins (if any) they encode and investigate what their functions are likely to be.
READ THROUGH THIS DOCUMENT BEFORE STARTING THE DAY'S WORK...
Your results are available as hyperlinks from here: GGIII2010 Results; the extra sequences are here.
1
Log in, launch a web browser, and direct it to
http://www.nematodes.org/teaching/genomics/Practical_2011/computer_practical_day_2.shtml (this page).
You should also launch MS Word or Notepad and open a file to act as your notebook for the practical.
This practical will be assessed, so you should check the requirements for the writeup (see here [Microsoft Word docx document]) before you continue through the sequence analysis work below.
2
Finding the open reading frame
OPEN READING FRAMES
Do your ESTs have open reading frames?
Of course they do, even if they are only very short... How do you identify the longest (and thus most likely) one?
There are several ways to find out:
(1) by eye (painful unless you happen to have memorised the 64-codon table of the genetic code, which of course you should have ;-)
(2) Using online translation programmes.
For these you just paste your sequence in a window, choose the genetic code, and press go. The DNA is translated into protein and you can then choose which you believe might be the correct open reading frame.
Try: EMBOSS Transeq at http://www.ebi.ac.uk/emboss/transeq/
or the EXPASY Translate Tool at http://expasy.org/tools/dna.html
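What these translation tools do can be sketched in a few lines of Python. This is only a minimal illustration, not the code behind Transeq or the Translate Tool. It assumes the standard genetic code, and it treats an open reading frame as the longest stop-free stretch in any of the six frames (a reasonable choice for ESTs, which are often partial and may lack a start codon). The function names are invented here for illustration.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate codon by codon; unknown codons (e.g. containing N) become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return a dict mapping frame names (+1..+3, -1..-3) to translations."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def longest_orf(dna):
    """Return (frame, peptide) for the longest stop-free stretch in any frame."""
    best = ("", "")
    for frame, protein in six_frame_translations(dna).items():
        for segment in protein.split("*"):  # '*' marks a stop codon
            if len(segment) > len(best[1]):
                best = (frame, segment)
    return best


if __name__ == "__main__":
    est = "ATGGCCAAATGA"  # toy sequence, not a real EST
    for frame, protein in six_frame_translations(est).items():
        print(frame, protein)
    print("longest:", longest_orf(est))
```

Note that the longest stretch is not guaranteed to be the true coding frame, particularly for short sequences, so the BLASTX results in the next section remain the better guide.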
3
Annotating your genes with putative functions 1
Now you need to find out what the gene you have cloned might do for Euperipatoides kanagrensis. These data will be useful in describing the biology of the gene in your writeup, as part of the "molecular CV" for each gene.
To annotate your genes you need to use a range of programmes.
For each of your ESTs perform all of the following searches and RECORD what the results are.
There are three parts to this section:
(1) Annotation by BLAST: what proteins are your sequences similar to?
Performing a BLAST search of your EST DNA sequence using BLASTX against a universal protein database, such as the NCBI 'nr' (non-redundant) protein database, will identify previously sequenced genes and/or proteins that may have functional annotation. If your sequence is similar to these proteins, you can infer that your protein may have a similar function (this is called 'Inferred from Electronic Annotation' or IEA).
BLASTX performs a search of a PROTEIN database with your NUCLEOTIDE sequence translated in all SIX FRAMES (both forward and reverse).
Using BLASTX you can also identify frameshifts and other errors in your sequence.
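One way a frameshift shows up in BLASTX output is as two adjacent alignments to the same protein that sit in different reading frames on the same strand. The sketch below illustrates that check. It assumes you have copied out, by hand, the query start, query end and frame of each alignment to one subject protein; the tuple layout and the function name are invented for this example and are not a BLAST output format.

```python
def possible_frameshifts(alignments):
    """Flag adjacent alignments that switch frame on the same strand.

    alignments: list of (query_start, query_end, frame) tuples for ONE
    subject protein, where frame is +1, +2, +3, -1, -2 or -3.
    Returns a list of (end_of_first, start_of_second, frame1, frame2).
    """
    ordered = sorted(alignments)  # order along the query sequence
    shifts = []
    for (s1, e1, f1), (s2, e2, f2) in zip(ordered, ordered[1:]):
        same_strand = (f1 > 0) == (f2 > 0)
        if same_strand and f1 != f2:
            # The frame changes somewhere between e1 and s2.
            shifts.append((e1, s2, f1, f2))
    return shifts


if __name__ == "__main__":
    # Hypothetical alignments: the match jumps from frame +1 to frame +2.
    hits = [(1, 150, 1), (152, 400, 2)]
    print(possible_frameshifts(hits))
```

A flagged region is only a candidate: it is worth going back to the sequence trace around those coordinates to see whether a base has been missed or inserted.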
http://www.ncbi.nlm.nih.gov/blast/
The 'standard' options under "Choose Search Set" in NCBI BLASTX allow you to choose which database to compare your sequences to. You should use "Non-redundant protein sequences (nr)". This is the best to use as it contains ALL the sequences in GenBank's protein database, but not the repeat entries that are present for many genes.
(a) open http://www.ncbi.nlm.nih.gov

(b) select BLAST from the "Popular Resources" window

(c) choose 'blastx' from the list under Basic BLAST

(d) Paste your sequence in the text box under "Enter accession number, gi, or FASTA sequence"

(e) press the blue BLAST button, and wait for results

(f) review your results, remembering
(1) that a low E-value means a less likely match. Here 'less likely' means that the match is less likely to have arisen by random chance, and thus a lower E-value indicates a more biologically significant match. E-values greater than 1e-06 are not particularly good.
(2) that the graphical view of the matches shows the quality of each match using its 'bit score'. In the case of bit scores, bigger is better. Scores of >40 are interesting, and scores greater than 100 are very good.
(3) that matches that extend over more of your sequence are likely to be more informative than short matches.

If you hover your mouse over the coloured bars in the image, the text box above will show you the 'definition line' of the match found.
Clicking on the bar will take you to the details of the alignment between your sequence and the database sequence (lower down the report).
(g) Just below the image is a list of the top matches to your sequence.
The link to the left (beginning 'gi') is to the sequence in GenBank.
The link in the 'Score (Bits)' column takes you to a detailed description of the match.

(h) The detailed view of the match gives the full description line, some statistics and an alignment of the two sequences (your Query, and the Subject sequence matched).

(i) Check the species-of-origin of the matched sequences. Is this a surprise? How closely related to Euperipatoides kanagrensis is it?
(j) What is the biological function of the sequence matched? If the top sequence does not have a function ascribed to it, do any others in the list of SIGNIFICANT matches?
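The rules of thumb in step (f) can be written down as a small Python sketch. The first function uses the standard relationship between a bit score S and the E-value, E = m x n x 2^(-S), where m and n are the (effective) lengths of the query and the database. The second applies the thresholds quoted above. Combining the E-value and bit-score cut-offs into a single grade is an assumption made for this example, as are the function names and labels.

```python
def expected_chance_hits(bit_score, query_length, database_length):
    """E-value implied by a bit score: E = m * n * 2**(-S)."""
    return query_length * database_length * 2.0 ** (-bit_score)


def grade_hit(evalue, bit_score):
    """Grade a hit using the thresholds given in step (f).

    Assumption: a hit must pass the E-value cut-off (1e-06) before its
    bit score (>40 interesting, >100 very good) is considered.
    """
    if evalue > 1e-6:
        return "weak"
    if bit_score > 100:
        return "very good"
    if bit_score > 40:
        return "interesting"
    return "weak"


if __name__ == "__main__":
    # Hypothetical hits, for illustration only.
    for evalue, bits in [(1e-30, 150.0), (1e-8, 55.0), (0.5, 30.0)]:
        print(evalue, bits, grade_hit(evalue, bits))
```

The formula also shows why the same bit score gives a worse (larger) E-value against a bigger database: there are simply more places for a chance match to occur.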
(2) Annotation by BLAST: using BLASTP
If you have identified a putative translation or open reading frame in your sequence, you can use BLASTP to compare this to the protein database (similar to above).
(a) Check the species-of-origin of the matched sequences. Is this a surprise? How closely related to Sphaerularia bombi is it?
(b) What is the biological function of the sequence matched? If the top sequence does not have a function ascribed to it, do any others in the list of SIGNIFICANT matches?
(3) Annotation by BLAST: what nucleotide sequences are your sequences similar to?
Perform a BLAST search of your EST DNA sequence against nucleotide databases.
This will identify:
(a) whether your gene has been sequenced previously, in which case there will be a perfect or near-perfect match (very few Sphaerularia bombi genes have been sequenced, but there are a few),
(b) whether a very similar sequence has been identified in a closely related nematode species, and
(c) whether your sequence derives from an RNA gene (such as a ribosomal RNA gene).
We suggest you perform three comparisons of your sequence to two databases:
(I) BLASTN against Nucleotide collection (nr/nt) - this is all the nucleotide sequences in the 'normal' division of GenBank
(II) BLASTN against Non-human, non-mouse ESTs (est_others) - this includes the expressed sequence tags from other nematodes.
(III) TBLASTX against Non-human, non-mouse ESTs (est_others) - this compares the six-frame translation of your sequence with six-frame translations of the ESTs, so it can find more distant matches than BLASTN.
http://www.ncbi.nlm.nih.gov/blast/
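The BLAST programs used in this practical differ only in what kind of query and database they accept, and in which of the two is translated before comparison. The lookup below summarises this as a Python sketch; the function name and the 'protein'/'nucleotide' labels are invented for this example, and the `translate_database` flag simply means "compare at the protein level against a nucleotide database".

```python
# (query type, database type, compare translated nucleotide database?) -> program
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide", False): "blastn",   # DNA vs DNA
    ("protein", "protein", False): "blastp",         # protein vs protein
    ("nucleotide", "protein", False): "blastx",      # translated DNA vs protein
    ("protein", "nucleotide", True): "tblastn",      # protein vs translated DNA
    ("nucleotide", "nucleotide", True): "tblastx",   # translated DNA vs translated DNA
}


def choose_blast_program(query_type, database_type, translate_database=False):
    """Return the BLAST program name for a query/database combination."""
    key = (query_type, database_type, translate_database)
    if key not in BLAST_PROGRAMS:
        raise ValueError(f"no BLAST program for {key}")
    return BLAST_PROGRAMS[key]


if __name__ == "__main__":
    # The three searches suggested above, all with a nucleotide EST query.
    print(choose_blast_program("nucleotide", "nucleotide"))        # vs nr/nt
    print(choose_blast_program("nucleotide", "nucleotide"))        # vs est_others
    print(choose_blast_program("nucleotide", "nucleotide", True))  # vs est_others
```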
Annotation with Putative Function 2
BLAST is good at finding significantly similar matches, but it is not always easy to work out what it is the matched sequences do, in terms of their biology. As proteins are made up of functional domains, we can compare our protein sequence to a database of domains, and use any matches to these as indicators of function.
There are many domain databases, but probably the best one to search is InterPro, a system that aggregates a wide range of different domain databases under one unified search engine.
http://www.ebi.ac.uk/InterProScan/
Go to InterProScan and follow the instructions there...
(a) at the InterPro site, go to the InterProScan page and enter your protein sequence

(b) wait
(c) when the results appear, you can review them in the graphic display shown. For each domain matched there is a significance score. You can go to a detailed description of the biology of the domain by following the links given. Note that as InterPro aggregates many domain databases, one region of your protein can have matches to entries in more than one database.
Is your gene a bee gene?
It is possible that some of the mRNAs we purified came from adhering bee immune cells, or from bee tissue the nematodes had eaten. We can assess whether the gene is a bee gene or a nematode gene by comparing the quality of its BLAST matches against a bee database with those against a nematode database.
Basically, if your sequence is more closely related to an arthropod sequence then it is likely to be from the bee, whereas if it is more similar to a nematode sequence it is most likely to be from the nematode.
You can do this using the NCBI BLAST service and FILTERING the results by taxonomic group.
(a) Open a browser window at the NCBI ENTREZ Taxonomy homepage http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/
(b) On the Taxonomy homepage, type in the taxon you would like to see records for ("Arthropoda" or "Nematoda"), and press GO.

(c) On the result page that is returned, select the taxon name ("Arthropoda" or "Nematoda").

(d) You should see a page similar to this:

This view shows a summary of the sequence and other data available for the taxon "Nematoda". There are 378,255 nucleotide and 255,3121 protein sequences.
(e) Select the highlighted number of nucleotide sequences. This will take you to a standard NCBI ENTREZ window showing the first 20 of the sequences. Note the tabs on the page that indicate different kinds of sequence:

What are these different kinds of sequence? What kinds are most useful to you? What do you get when you click on the "Protein" or "Genome" links on the Taxonomy page?
(f) In the query box at the top of the ENTREZ page, you will see that what you have displayed are all nucleotide sequences for the ENTREZ query "txid6231[Organism:exp]". This taxid is a unique numerical identifier in NCBI ENTREZ for "Nematoda" and so this search has found ALL nematode sequences; obviously the number of the taxon is different for Arthropoda.

This query can be used to limit the matches shown in a BLAST search at NCBI. Copy the query text.
(g) Open a web browser window at the NCBI BLAST start page http://www.ncbi.nlm.nih.gov/blast/
What sort of search do you want to do? Remember that nucleotide-nucleotide searches are optimised for finding nearly exact matches, and that you are looking for similarities between organisms that last shared a common ancestor over 600 million years ago.
If you are searching with a protein translation, choose BLASTP against protein, or TBLASTN against EST-others.
If you are searching with the nucleotide sequence of your EST or of its cluster consensus, use BLASTX against protein, or TBLASTX against EST-others.
Paste your sequence into the search box, choose the nr database, and in the "Choose Search Set" section below paste your ENTREZ query text into the "ENTREZ query" box. Submit your BLAST search.
(h) The results will show ONLY those matches to your chosen 'limited' database.
Record the best hit and the score and E-value, and also the length of the match and the percentage identity and similarity.
Repeat for both a "Nematoda"- and an "Arthropoda"-limited search. Try both the EST-others (TBLASTX with a nucleotide query) and the protein (BLASTX with a nucleotide query) databases.
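Once you have recorded the best hit from each taxon-limited search, the comparison can be sketched as below. This is an illustration under stated assumptions, not a validated classifier: the function names, the labels returned and the 10-bit margin used to call a "clear" winner are all choices made for this example. The Nematoda taxid (6231) is the one shown in step (f); 6656 is given for Arthropoda, but you should confirm it on the Taxonomy page yourself.

```python
def entrez_taxon_query(taxid):
    """Build the ENTREZ query text used to limit a BLAST search to one taxon."""
    return f"txid{taxid}[Organism:exp]"


def likely_origin(nematode_bits, arthropod_bits, min_difference=10.0):
    """Compare best bit scores from the two taxon-limited searches.

    Pass None when a search returned no significant hit.
    min_difference is an arbitrary margin (an assumption of this sketch).
    """
    nem = nematode_bits or 0.0
    art = arthropod_bits or 0.0
    if nem - art >= min_difference:
        return "probably nematode"
    if art - nem >= min_difference:
        return "probably arthropod (possibly bee)"
    return "undetermined"


if __name__ == "__main__":
    print(entrez_taxon_query(6231))  # Nematoda
    print(entrez_taxon_query(6656))  # Arthropoda (confirm on the Taxonomy page)
    # Hypothetical best bit scores, for illustration only.
    print(likely_origin(nematode_bits=180.0, arthropod_bits=95.0))
```

Highly conserved genes (ribosomal proteins, for example) may score almost equally well against both groups, which is why the sketch returns "undetermined" rather than forcing a decision; in those cases percentage identity and match length from step (h) are worth comparing as well.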