BaNG - Blaxter Nematode and Neglected Genomics

Genomes & Genomics
Mark Blaxter's Teaching WebSite

  at the Institute of Evolutionary Biology, University of Edinburgh
Genomics Practical 2008


EST Sequencing: Analysis part 2

In today's session you will take your Expressed Sequence Tag (EST) sequences, and identify what proteins (if any) they encode and investigate whether they might play a role in the nematode parasite - mammal host interaction.

READ THROUGH THIS DOCUMENT BEFORE STARTING THE DAY'S WORK...

Your results are available as hyperlinks from here: GGIII2008 Results



1

Log in, launch a web browser, and direct it to
http://xyala.cap.ed.ac.uk/teaching/genomics/Practical_2008/practical_results2.html
(this page)

You should also launch MS Word or Notepad and open a file to act as your notebook for the practical.

This practical will be assessed, and thus you should check the requirements for the writeup
(see here)
before you continue through the sequence analysis work below.


2

Finding the open reading frame and
Identifying Spliced Leaders

SPLICED LEADERS

In nematodes (such as the model nematode Caenorhabditis elegans) only some of the mRNAs have a trans-spliced leader (tSL), and there are multiple different tSL sequences, forming two major groups (SL1-like and SL2-like). (C. elegans also does cis-splicing.)

In Heligmosomoides polygyrus … we expect there to be trans-splicing and we predict that there will be two kinds of tSLs:

SL1-like

GGTTTAATTACCCAAGTTTGAG
GGTTTAATAACCCAAGTTTGAG

SL2-like

GGTTTTAACCCAGTTACTCAAG

GGTTTTAACCCAGTTAACCAAG

GGTTTTAACCCAGTTTAACCAAG

GGTTTTAACCCAGTTACCAAG

GGTTTAAAACCCAGTTACCAAG

GGTTTTAACCCAGTTAATTGAG

GGTTTTTACCCAGTTAACCAAG

GGTTTATACCCAGTTAACCAAG

GGTTTTAACCCAAGTTAACCAAG

GGTTTTAACCAGTTAACTAAG

GTTTTAACCCATATAACCAAG

GGTTTTAACCCAGTTAACTAAG

GGTTTTAACCCAGTTACTCAAG

GGTTTTAACCCAGTTAACCAAG

GGTTTTAACCCAGTTTAACCAAG

GGTTTTAACCCAAGTTAACCAAG

GGATTTATCCCAGATAACCAAG

GGTTTTTACCCTGATAACCAAG

GGTAATTAACCAAGTATCTCAAG

GGTTAATACCCAGTATCTCAAG

GGTAATTAACCCAGTATCTCAAG

GGTAATTACCCAGTATCTCAAG

GGTTTAAACCCAGTATCTCAAG

GGTTTTTACCCGGTATCTTAAG

How to recognise SLs in your sequences.

(1) Look at the pattern of bases at the very 5’ end of the insert sequence.
(2) If there is a match to “TTTGAG”, it is very likely that there is an SL1 on your cDNA.
(3) If there is a match to “TCAAG”, “CCAAG”, “TTAAG” or “CTAAG”, it is very likely that there is an SL2-like SL on your cDNA.

As the SL marks the 5’ end of an mRNA, if you have an SL on your cDNA it implies that the first ATG (methionine) codon downstream of the SL will be the initiation codon.
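The recognition rules above can be sketched in a few lines of Python. This is only an illustration, not part of the practical: the function name, the 30-base search window and the return format are assumptions made for the example. SL1 is tested first because the SL1 sequences listed above also contain "CCAAG".

```python
# Motifs taken from the recognition rules above.
SL1_MOTIFS = ("TTTGAG",)
SL2_MOTIFS = ("TCAAG", "CCAAG", "TTAAG", "CTAAG")


def classify_spliced_leader(seq, window=30):
    """Return (label, sl_end, first_atg), or None if no SL motif is found.

    window is an assumed cut-off for "the very 5' end" of the insert.
    first_atg is the index of the first ATG downstream of the SL (-1 if none).
    """
    seq = seq.upper()
    head = seq[:window]
    # Test SL1 before SL2: the SL1 sequences also contain the SL2 motif CCAAG.
    for label, motifs in (("SL1", SL1_MOTIFS), ("SL2-like", SL2_MOTIFS)):
        hits = [(head.find(m), m) for m in motifs if m in head]
        if hits:
            start, motif = min(hits)  # earliest match in the window
            sl_end = start + len(motif)
            first_atg = seq.find("ATG", sl_end)
            return label, sl_end, first_atg
    return None


if __name__ == "__main__":
    # An SL1 sequence from the list above, followed by a made-up coding start.
    print(classify_spliced_leader("GGTTTAATTACCCAAGTTTGAG" + "ATGAAACGT"))
```

Five- and six-base motifs will sometimes turn up by chance, and vector sequence may sit in front of the insert, so treat a hit as a prompt to look at the sequence by eye rather than as proof of an SL.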

OPEN READING FRAMES

Do your ESTs have open reading frames?

Of course they do, even if they are only very short... How do you identify the longest (and thus most likely) one?

There are several ways to find out:

(1) by eye (painful unless you happen to have memorised the 64-codon table of the genetic code, which of course you should have ;-)

(2) using ARTEMIS as you did in TechSession4.

Save your sequence as a FASTA file (where the name of the sequence is on the first line, preceded by a '>', and the sequence is on the second and subsequent lines). This file MUST be in raw or plain text format, not in Word or another program-specific format.

e.g.
>HP_ADY_000A00 my cool sequence
AGAGACAGTATCCGCGATAGAGCAGATCGGACAGCTAGGACA
GCTAGGACCGCTCGGACCGGTGTCGACAAGCTGACACAGCTA
GCACAGCT

Open the fasta file in ARTEMIS (see HERE if you have forgotten how!) and look for ORFs.
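If you would rather generate the FASTA file than type it, a minimal Python sketch is shown here. The function names, the 60-base line width and the output file name are assumptions for illustration; any plain-text editor does the same job.

```python
def format_fasta(name, seq, description="", width=60):
    """Return one FASTA record as plain text: a '>' header line, then wrapped sequence."""
    header = f">{name} {description}".rstrip()
    seq = "".join(seq.split()).upper()  # drop spaces and line breaks
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join([header] + lines) + "\n"


def write_fasta(path, name, seq, description=""):
    """Write the record as plain ASCII text (not a word-processor format)."""
    with open(path, "w", encoding="ascii", newline="\n") as handle:
        handle.write(format_fasta(name, seq, description))


if __name__ == "__main__":
    # Hypothetical file name; the sequence is the example record shown above.
    write_fasta(
        "HP_ADY_000A00.fasta",
        "HP_ADY_000A00",
        "AGAGACAGTATCCGCGATAGAGCAGATCGGACAGCTAGGACA"
        "GCTAGGACCGCTCGGACCGGTGTCGACAAGCTGACACAGCTA"
        "GCACAGCT",
        description="my cool sequence",
    )
```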

(3) Using online translation programmes.

For these you just paste your sequence in a window, choose the genetic code, and press go. The DNA is translated into protein and you can then choose which you believe might be the correct open reading frame.

Try: EMBOSS Transeq at http://www.ebi.ac.uk/emboss/transeq/

or the EXPASY Translate Tool at http://expasy.org/tools/dna.html
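To see what these translation tools are doing, here is a rough Python sketch of a six-frame translation using the standard genetic code. It is an illustration under stated assumptions, not a replacement for ARTEMIS or Transeq: an "ORF" here means the longest stretch without a stop codon (an EST may lack the true start or stop), and the function names are invented for the example.

```python
# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate_frame(seq):
    """Translate from the first base; '*' is a stop, 'X' an unreadable codon."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def longest_orf(seq):
    """Return (frame, peptide) for the longest stop-free stretch in six frames.

    Frames are labelled 1, 2, 3 (forward strand) and -1, -2, -3 (reverse strand).
    """
    best = (0, "")
    for strand, s in ((1, seq), (-1, reverse_complement(seq))):
        for offset in range(3):
            for segment in translate_frame(s[offset:]).split("*"):
                if len(segment) > len(best[1]):
                    best = (strand * (offset + 1), segment)
    return best
```

The longest stop-free stretch is only a candidate. Check it against the position of any spliced leader and the frame reported by BLASTX before accepting it.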


3

Annotating your genes with putative functions

Now you need to find out what the gene you have cloned might do for Heligmosomoides polygyrus .

To annotate your genes you need to use a range of programmes. For each of your ESTs perform all of the following searches and RECORD what the results are. You will use these in your writeup to compile a "molecular CV" for each gene.

The tools we suggest using are developed by others worldwide (US, UK, Japan and Denmark in the current set below) and offered as a web service. These websites offer access to some very powerful programmes, and to use them properly you should have a look at the 'instructions', 'how to's and 'FAQ' (frequently asked questions)...

(1) Annotation by BLAST: what proteins are your sequences similar to?

Performing a BLAST search of your EST DNA sequence, using BLASTX against a universal protein database such as nr, will identify previously sequenced genes/proteins that may have functional annotation. If your sequence is sufficiently similar to the database sequence, it is very likely that it has a similar function. By using BLASTX you can also identify frameshifts and other errors in your sequence.

http://www.ncbi.nlm.nih.gov/blast/

(2) Annotation by BLAST: what proteins are your sequences similar to? (take 2)

Performing a BLAST search of the protein sequence from your EST, using BLASTP against a protein database, will identify whether your sequence encodes peptides similar to anything previously sequenced. This may be more efficient than a BLASTX search as it will only be using the peptide from the open reading frame you have identified as being very likely to be the correct one.

http://www.ncbi.nlm.nih.gov/blast/

In examining the BLAST report, you might be interested to look at the taxonomic sources of the best-matching sequences identified. You can access these using the "Taxonomy Reports" link that appears near the top of the BLAST report (above the image showing the matches). This report gives a simplified lineage of the organisms against which BLAST matches were found.

(3) Annotation by BLAST: what nucleotide sequences are your sequences similar to?

Performing a BLAST search of your EST DNA sequence, using BLASTN against a nucleotide database will identify

(a) whether your gene has been sequenced previously (although very few H. polygyrus genes have been sequenced previously there are a few - in which case there will be a perfect or near-perfect match),

(b) whether a very similar sequence has been identified in a closely related nematode species and

(c) whether your sequence derives from an RNA gene (such as a ribosomal RNA gene).

http://www.ncbi.nlm.nih.gov/blast/

(4) Annotation of protein domains in your sequence

If you have a protein translation, does it have a match to any protein family or domain pattern signatures?

Protein domains are modular segments of proteins that (usually) carry out one of the modular functions of a protein, such as mediating protein-protein interaction, carrying out some enzymatic activity, or mediating interaction with biological membranes. Protein families are collections of proteins that share sequence similarity and are likely to be involved in similar biological processes.

Domains and families can be searched in several places, using several different databases. The European Bioinformatics Institute (EBI) has developed a 'unified' system called INTERPRO available at http://www.ebi.ac.uk/interpro/. The INTERPROSCAN tool, at http://www.ebi.ac.uk/InterProScan/, lets you paste in your protein sequence and search all the different databases of families and domains simultaneously.

http://www.ebi.ac.uk/InterProScan/

(5) Inferring cellular localisation

Where is the protein from your gene likely to be located in the cell?

Secreted proteins have a signal peptide, membrane proteins have membrane-spanning segments, nuclear located proteins have a nuclear localisation signal, etcetera. You can search for these sorts of patterns in the server offered by the PSORTII consortium at

http://psort.nibb.ac.jp/ [Use the PSORTII search form.]

The Psort folks have a new Psort tool too, called Wolf PSort, which you can also try... at http://wolfpsort.org/

(6) Is your protein likely to be cell surface located or secreted?

Secreted proteins and membrane proteins have a particular signature of transmembrane segments. While PSort also predicts membrane association and topology, the dedicated SignalP server focusses on secretion and membrane association. Use the SignalP 3.0 server system at

http://www.cbs.dtu.dk/services/SignalP/ .


4

Identifying genes that could be involved in host immune interactions

H. polygyrus genes that play roles in host-parasite interaction might be expected to display one or more characteristics including:

(1) secretion from the nematode

(2) interaction with host receptors (protein-protein interaction)

(3) biochemical activities that might affect host physiology

(4) mimicry or similarity to host proteins, especially signalling molecules and regulators (next section)

For your writeup we are asking you to examine all four of your ESTs for potential roles in host-parasite interaction, and then to choose one to discuss in more detail. For this 'chosen one' you will need to consider the evidence you have concerning its functional role in the nematode and provide an argument, backed up by the evidence from your bioinformatic experiments, as to why it might be a good candidate for further study. Obviously, some candidates may have conflicting evidence concerning their possible roles, and so you will need to weigh up these different strands of information.

Secretion

This would be predicted if the analyses with PSort and SignalP suggested the presence of secretory signal peptides.

Interaction

Does your protein have similarities to proteins that are receptors or ligands for receptors? Does it contain a protein-protein interaction domain? Does it contain a domain that binds components of the extracellular matrix (for example, lectins are proteins that bind carbohydrate molecules, often those attached to cell surface or extracellular proteins)? You will have gathered this information using the BLAST search analyses and domain searches above.

Biochemical activity

Does your protein have a biochemical activity that might interact with the host and modify the environment to benefit the parasite? Is the protein providing some function that the parasite might require in order to survive in the host?

Mimicry (see below)
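One way to keep these four lines of evidence straight in your notebook is a small record per EST. The Python sketch below is a hypothetical bookkeeping aid: the class name, field names and example values are invented for illustration, and a simple tally does not replace the weighing-up of conflicting evidence asked for in the writeup.

```python
from dataclasses import dataclass

# The four characteristics listed above, as yes/no flags.
CRITERIA = (
    "secretion_signal",      # PSort / SignalP predict a signal peptide
    "interaction_domain",    # receptor, ligand or binding domain found
    "biochemical_activity",  # activity that might affect host physiology
    "host_mimicry",          # unexpectedly high similarity to a host protein
)


@dataclass
class HostInteractionEvidence:
    est_name: str
    secretion_signal: bool = False
    interaction_domain: bool = False
    biochemical_activity: bool = False
    host_mimicry: bool = False

    def supporting_lines(self):
        """Return the names of the criteria recorded as supported."""
        return [name for name in CRITERIA if getattr(self, name)]


if __name__ == "__main__":
    # Placeholder values, not real results.
    record = HostInteractionEvidence("HP_ADY_000A00", secretion_signal=True)
    print(record.est_name, record.supporting_lines())
```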

 


6

Detecting molecular mimicry

Is your protein significantly similar to a host protein? Could it be mimicking the functions of a host protein?

One way to tell is to determine whether the H. polygyrus protein is more similar to the mouse homologue than expected: is the similarity greater than would be expected on the basis of the level of evolutionary conservation of the protein in all animals, or on the basis of the level of conservation of other H. polygyrus proteins to homologous mouse proteins?

There are two ways to do this: using the NCBI BLAST service and FILTERING the results by taxonomic group, or by constructing taxonomically restricted databases and performing BLAST searches on these.

Using the NCBI BLAST service

Open a browser window at the NCBI ENTREZ Taxonomy homepage

http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/

On the Taxonomy homepage, type in the taxon you would like to see records for (mouse), and press GO.

On the result page that is returned, select the correct taxon. In this case it is "Mus musculus".

[Screenshot: Mus musculus taxonomy summary]

This view shows a summary of the sequence and other data available for the taxon "Mus musculus". There are 8.7 million nucleotide and 239,000 protein sequences. [Questions in passing: Why so few protein sequences compared to nucleotide? Why so many protein sequences if mouse has only ~25,000 protein coding genes?]

Select the highlighted number of protein sequences. This will take you to a standard NCBI ENTREZ window showing the first 20 of the sequences.

[Screenshot: Mus musculus protein records in ENTREZ]

In the query box at the top of the ENTREZ page, you will see that what you have displayed are all protein sequences for the ENTREZ query "txid10090[Organism:exp]". This taxid is a unique numerical identifier in NCBI ENTREZ for "Mus musculus", and so this search has found ALL Mus musculus sequences; each taxon has its own taxid number. This query can be used to limit the matches shown in a BLAST search at NCBI. Copy the query text "txid10090[Organism:exp]".

Open a web browser window at the NCBI BLAST start page http://www.ncbi.nlm.nih.gov/blast/

What sort of search do you want to do? Remember that nucleotide-nucleotide searches are optimised for finding nearly exact matches, and that you are looking for similarities between organisms that last shared a common ancestor over 600 million years ago.

If you are searching with a protein translation, choose BLASTP.

If you are searching with the nucleotide sequence of your EST or of its cluster consensus, use BLASTX.

Paste your sequence into the search box, choose the nr (nonredundant proteins) database, and in the "Choose search set" section paste your ENTREZ query text into the "ENTREZ query" box. Submit your BLAST search.

[Screenshot: BLAST search form with the ENTREZ query filled in]
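The ENTREZ query text follows a fixed pattern, so it can be built from any taxid. The short Python sketch below shows the pattern only. The function name is invented for the example, and the only taxid used is the Mus musculus one given above. Look up any other taxid on the Taxonomy pages rather than guessing it.

```python
def taxon_entrez_query(taxid):
    """Build the ENTREZ query text that limits a search to one taxon and its subtree."""
    taxid = int(taxid)  # rejects anything that is not a whole number
    return f"txid{taxid}[Organism:exp]"


if __name__ == "__main__":
    # 10090 is the taxid for Mus musculus, as shown on the ENTREZ page.
    print(taxon_entrez_query(10090))
```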

The results will show ONLY those matches to your chosen 'limited' database.

Record the best hit and the score and E-value, and also the length of the match and the percentage identity and similarity.

How do these compare with the best match found in all of the non-redundant database from all organisms?

Is the match to mouse better than the match to, for example, other nematodes?

Is the match better than you would expect based on the phylogenetic relationship between mouse and nematode?

This last question is hard to test, but to help you answer it, we would expect a nematode sequence to be more similar to one from arthropods than to one from mouse (nematodes and arthropods are both protostome animals, which last shared a common ancestor more recently than either did with mouse, a deuterostome). So one way to check would be to compare the BLAST-based similarity score you get when comparing your protein to those of arthropods and those of mice. You can perform an arthropod-specific search using the same strategy as you used for the mouse search above.
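The comparison described above can be written down as a small calculation. The Python sketch below is an assumption-laden illustration: the record fields mirror the values you were asked to note for each best hit, the ratio of bit scores is just one possible summary, and the numbers are invented placeholders, not real BLAST results.

```python
from dataclasses import dataclass


@dataclass
class BestHit:
    search_set: str            # e.g. "mouse", "arthropods", "all of nr"
    bit_score: float
    e_value: float
    align_length: int
    percent_identity: float
    percent_similarity: float


def mouse_to_arthropod_ratio(mouse, arthropod):
    """Return mouse bit score divided by arthropod bit score.

    A ratio above 1 means the mouse match is stronger than the arthropod
    match, the opposite of what the phylogeny alone would predict.
    """
    if arthropod.bit_score <= 0:
        raise ValueError("arthropod bit score must be positive")
    return mouse.bit_score / arthropod.bit_score


if __name__ == "__main__":
    # Placeholder numbers for illustration only.
    mouse = BestHit("mouse", 180.0, 1e-45, 150, 55.0, 70.0)
    arthropod = BestHit("arthropods", 120.0, 1e-28, 148, 40.0, 58.0)
    print(round(mouse_to_arthropod_ratio(mouse, arthropod), 2))
```

Bit scores depend on alignment length, so the comparison is only meaningful when the two matches cover roughly the same region of your protein. Check the recorded match lengths before reading anything into the ratio.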


Your results are available as hyperlinks from here: GGIII2008_Results | Writeup instructions


The content of these pages is copyright Mark Blaxter and colleagues. Contact the webmaster if there are problems.