BaNG - Blaxter Nematode and Neglected Genomics

Teaching
Mark Blaxter's Teaching WebSite

  at the Institute of Evolutionary Biology, University of Edinburgh

Genomes and Genomics

Technology Session 4

Lecture as a PDF

 

Annotating a Genome:

Identifying genes from raw sequence data

Mark Blaxter

In this tutorial you will use the genome annotation tool ARTEMIS to examine two genomic fragments, one from a prokaryotic organism and one from a eukaryotic organism, and attempt to find genes within the sequences. This is not being graded; it is an exercise to introduce you to the practicalities and complexity of dealing with genomes and trying to extract biological information from them. Also, don't feel you have to complete the whole lot: you may not have time to look at the eukaryotic example.

 
Through the practical there are questions for you to answer in bold italic: do attempt these.

Find yourself a computer, and log on.

Open a text editor (Notepad, Word, ...) and start a new document to record your notes for the Tech Session.

Open a web browser (Firefox, Explorer, ...)

ARTEMIS is a Java application, which means it will run on (almost) any operating system. Rather than install it locally we can "run it through the web" by launching the Java application from a remote web page. Go to

http://www.sanger.ac.uk/Software/ARTEMIS/v7/

On the web page that opens, click on "LAUNCH ARTEMIS".

It will take a while to download the Java application. Answer yes to the "security" questions that are asked.

Once the application has been downloaded, the Java "Virtual machine" will launch, and then you will be presented with an opening window.

 

If you want to know more about ARTEMIS, you can go to:

http://www.sanger.ac.uk/Software/Artemis/stable/manual/index.html

 


PROKARYOTIC ANNOTATION

1 OPENING THE SEQUENCE FILE

From the "File" menu in ARTEMIS, select "Open..." and Navigate to

N:\DeptSoftware\SciEng\Biological Sciences\BIOLOGY\GGIII

Set the "File Type" to "All Files"

"Open" the file clbot.fsa

This will launch the ARTEMIS view for the prokaryotic sequence contained in the file clbot.fsa.

(OR YOU CAN GET THE clbot.fsa FILE FROM HERE, DOWNLOAD IT TO YOUR MACHINE, SAVE AS clbot.fsa AND THEN OPEN IT IN ARTEMIS)

2 NAVIGATING IN ARTEMIS

You will be presented with a number of views. The top panel shows an overview of the entire genomic fragment and any associated features (currently none). The top three lines represent the forward three frames of translation (from left to right). The bottom three lines represent the reverse three frames of translation (going from right to left). Each vertical black line represents a "stop" codon.
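The six-frame overview described above can be approximated in a few lines of Python. This is only a sketch of the idea, not how ARTEMIS itself is implemented: it assumes the standard genetic code, an upper-case DNA string, and uses a made-up toy sequence rather than the practical's data. The function names are illustrative.

```python
# Sketch: locate stop codons in all six reading frames of a DNA string.
STOPS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def stop_positions(seq):
    """Map each frame (+1..+3, -1..-3) to 0-based stop-codon offsets.

    Offsets for the reverse frames are counted along the
    reverse-complemented strand, i.e. reading right to left on the
    original sequence.
    """
    frames = {}
    for sign, strand in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            hits = [i for i in range(offset, len(strand) - 2, 3)
                    if strand[i:i + 3] in STOPS]
            frames[f"{sign}{offset + 1}"] = hits
    return frames


toy = "ATGAAATAGCCCTTATGA"  # invented sequence, not from clbot.fsa
for frame, hits in stop_positions(toy).items():
    print(frame, hits)
```

Each list of offsets corresponds to the vertical black lines drawn on one of the six frame lines in the overview.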

Play with the sliders below and to the right of each view. You can use the sliders to zoom and move around this view. The numbers indicate the number of bases from the beginning of the sequence.

In the lower panel is a more detailed view of the genomic fragment showing the two strands of DNA and their associated 6 frame peptide translations (represented by the single letter amino acid code). Again the top three frames read from left to right and the bottom three frames read from right to left. By clicking on an amino acid residue you will highlight the three bases (codon) that it refers to. Stop codons are marked #, * and +. When you are zoomed out, the stop codons are represented by vertical black lines.

Why are there three different symbols for stop codons?

You can also "click" and "drag" along the sequences to highlight the corresponding bases; note that this also highlights the matching region of the overview in the top panel. Again, play around with the scroll bars to move and zoom around this window. Note that double-clicking a region in the top panel automatically focuses the lower panel on the selected region. The bottom panel shows text descriptions of any selected features.

Now we've familiarised ourselves with the views, let's see if we can find anything interesting in the sequence.

Can you determine the length of the sequence?

Handy Hint : Just under the menu options is a single text pane which describes the current coordinates (with respect to the start of the sequence) of the bases that are currently selected.

 

3 FINDING GENES: OPEN READING FRAMES

Can you identify any long ORFs?

There are a couple of very obvious ones: just look for stretches of sequence where one frame on one strand has no stop codons.

How many open reading frames - and thus protein-coding genes - do you think there are?

ARTEMIS can automatically find ORFs.

From the menu choose "Create" and select "Mark open reading frames".

When the dialog box comes up click on "OK". (For now leave the minimum length of ORF at 100 residues.)

The program will now search for all ORFs which are 100 residues (amino acids) or longer in length.
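To make the idea of automatic ORF marking concrete, here is a minimal Python sketch. It assumes an ORF is simply a run of codons uninterrupted by a stop codon (it need not start with Met), that the stop codon itself is excluded from the reported range, and that the standard genetic code applies. Whether ARTEMIS uses exactly these conventions is not checked here, and the function name and toy sequence are invented for illustration.

```python
# Sketch: report stop-free runs of codons in all six frames.
STOPS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def find_orfs(seq, min_codons=100):
    """Return (strand, start, end, n_codons) tuples.

    start and end are 1-based inclusive coordinates on the forward
    strand; the terminating stop codon is not included.
    """
    n = len(seq)
    orfs = []
    for strand, s in (("+", seq), ("-", seq.translate(COMPLEMENT)[::-1])):
        for frame in range(3):
            run_start = frame
            last = frame + ((n - frame) // 3) * 3  # just past the last full codon
            for i in range(frame, last + 1, 3):
                if i >= last or s[i:i + 3] in STOPS:
                    codons = (i - run_start) // 3
                    if codons >= min_codons:
                        a, b = run_start + 1, i  # 1-based on strand s
                        if strand == "-":
                            # convert back to forward-strand coordinates
                            a, b = n - b + 1, n - a + 1
                        orfs.append((strand, a, b, codons))
                    run_start = i + 3
    return orfs


toy = "ATGAAATAGCCCTTATGA"  # invented sequence, far too short to hold real genes
for orf in find_orfs(toy, min_codons=2):
    print(orf)
```

Lowering `min_codons` quickly increases the number of ORFs reported, which is worth keeping in mind when you change the minimum length in ARTEMIS.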

How many open reading frames longer than 100 amino acids are there?

What are their start and end coordinates?

Handy Hint : If you look in the lower panel, there is now a set of annotations (in this case only the open reading frames you just defined) with their coordinates. Note that some of these annotations have the letter 'c'; what do you think this means?

Do they all start with Methionine (Met, M)?

How many ORFs do you get when you set the minimum length to less than 100 residues? Try 50, or 10!

How do you tell which of these open reading frames - putative protein-coding genes - are real?

What should you be looking for?

Handy Hint: review the lecture notes you made from this morning (and see the notes here)

 

4 OVERLAPPING GENE PREDICTIONS

Note that ARTEMIS identifies an ORF from ~7305-7646 on the reverse strand. Note also that this overlaps the large ORF spanning ~4123-7719 on the forward strand.
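The extent of the overlap can be worked out from the coordinates alone. The short Python sketch below uses the approximate coordinates quoted above, so the result is only as exact as those numbers; the function name is illustrative.

```python
def overlap_length(a_start, a_end, b_start, b_end):
    """Number of bases shared by two 1-based inclusive ranges."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


reverse_orf = (7305, 7646)  # approximate, as quoted in the text
forward_orf = (4123, 7719)  # approximate, as quoted in the text

shared = overlap_length(*reverse_orf, *forward_orf)
reverse_len = reverse_orf[1] - reverse_orf[0] + 1
print(f"{shared} of {reverse_len} bases of the reverse-strand ORF overlap")
```

With these coordinates the reverse-strand ORF lies entirely inside the forward-strand one.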

Is this reverse strand open reading frame likely to be a real gene? Why / Why not?

What other lines of evidence (informatic or experimental) could you look at to help you make this decision?

Once genes have been found within a genome, the next thing we want to do is find out what the proteins they encode may do biologically. One of the simplest ways of doing this is to perform a sequence similarity search with the unknown sequence to see if it is significantly similar to any proteins in the database which have been identified before. We'll perform some simple searches using BLAST of our predicted genes against a protein database.

 

5 EXTRACTING THE PUTATIVE GENE'S PROTEIN SEQUENCE

First you'll need to select the predicted protein sequence.

For each putative gene, select its ORF by clicking on the blue boxes in the upper panel.

If the ORF doesn't start with a Methionine, you can trim it to the first Met using "Edit" -> "Trim Selected Features to Met".

Now click on "View" and select "View Amino Acids of Selection as FASTA".

 

You'll see a panel come up with contents similar to:

> none - 9116: 11011 MW: 71244.16
KNFKGEDIMSLSIKELYYTKDKSINNVNLADGNYVVNRGDGWILSRQNQNLGGNISNNGC
TAIVGDLRIRETATPYYYPTASFNEEYIRNNVQNVFANFTEASEIPIGFEFSKTAPSNKG
LYMYLQYTYIRYEIIKVLRNTVIERAVLYVPSLGYAKSIEFNSGEQIDKNFYFTSEDKCI
LNEKFIYKKIAETTTAKESNDSNNTTNLNTSQTILPYPNGLYVINKGDGYMRTNDKDLIG
TLLIETNTSGSIIQPRLRNTTRPLFNTSNPTLFSQEYTEARLNDAFNIQLFNTSTTLFKF
VEEAPDNKNISMKAYNTYEKYELINYQNGNIADKAEYYLPSLGKCEVSDAPSPQAPVVET
PVEQDGFIQTGPNENIIVGVINPSENIEEISTPIPDDYTYNIPTSIQNNACYVLFTVNTT
GVYKINAQNNLPPLIIYESIGSDNMNIQSNTLSNNNIKAINYITGTDSSNAESYLIVSLI

 

This is what's known as a FASTA formatted sequence file. The top line (the "FASTA header line") gives information relating to the sequence, typically its unique ID, the organism from which it was derived and other sundry information, preceded by the '>' character. Here we just get a simple line showing the beginning and end bases and predicted molecular weight of the protein. You can safely ignore this information as we're only interested in the sequence.

Select the sequence including the FASTA header (Control-A) and copy it to the clipboard (Control-C). Paste this into your digital notebook for reference.
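If you want to handle the copied text outside ARTEMIS, the FASTA layout is simple to read programmatically. The Python sketch below parses a record and drops any residues before the first Met, which mimics in spirit (not necessarily in detail) the trimming step described above. The function names are illustrative, and the example uses only the first two sequence lines of the panel shown earlier.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between sequence rows
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def trim_to_first_met(protein):
    """Drop residues before the first M; unchanged if there is none."""
    i = protein.find("M")
    return protein[i:] if i >= 0 else protein


example = """> none - 9116: 11011 MW: 71244.16
KNFKGEDIMSLSIKELYYTKDKSINNVNLADGNYVVNRGDGWILSRQNQNLGGNISNNGC
TAIVGDLRIRETATPYYYPTASFNEEYIRNNVQNVFANFTEASEIPIGFEFSKTAPSNKG
"""

header, seq = next(parse_fasta(example))
trimmed = trim_to_first_met(seq)
print(header)
print(len(seq), len(trimmed), trimmed[:10])
```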

 

6 BLAST SEARCH TO CONFIRM GENE PREDICTIONS

Now to perform the sequence similarity search, you will need to open a web browser and go to NCBI's BLAST page:

http://www.ncbi.nlm.nih.gov/

Click on BLAST (top toolbar) and select "Protein-Protein BLAST (blastp) " from the "Proteins" set of options.

Paste the FASTA sequence into the box.

Then click the "BLAST!" button (using the default BLAST options is fine for this practical).

You'll be redirected to a new page...

Click on the "FORMAT!" button to see the results of your search.

[Depending on the load on the BLAST server this may take some time so be patient].

You'll be presented with the BLAST output, which has three parts:

1) A graphical display of sequences which share similarity with the search sequence you entered

2) A list of the sequences with the score and "E value" of the match. The E value is the number of matches with a score at least this good that you would expect to find by chance in a database of this size, so smaller values are better. All you need to know is that matches with an E value of e-05 or lower are significant; an E value of 0.0 shows that the proteins are identical (or very nearly so).

3) An alignment for each sequence against your query sequence.
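To get a feel for how the score and the E value in the list relate, a commonly quoted approximation is E = m x n x 2^(-S'), where S' is the bit score, m the query length and n the total length of the database. The Python sketch below applies that formula and the e-05 rule of thumb from this practical. It ignores the length corrections that BLAST itself applies, so treat the numbers as rough; the function names and the database size used are illustrative assumptions.

```python
def expected_hits(bit_score, query_len, db_len):
    """Approximate E value from a bit score: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


def is_significant(e_value, threshold=1e-5):
    """Apply the e-05 rule of thumb used in this practical."""
    return e_value <= threshold


# Hypothetical search: a 500-residue query against 10**9 database residues.
for bits in (30, 50, 100):
    e = expected_hits(bits, 500, 10**9)
    print(bits, f"{e:.2e}", is_significant(e))
```

Note how quickly the E value falls as the bit score rises: each extra bit halves the number of chance matches expected.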

Following this procedure you should be able to deduce the identity of your proteins.

What are they?

Can you guess what sort of organism (or even which species) this fragment of genomic DNA derives from?

What are the protein products likely to do for the organism from which they derive?

What do we use these protein products for?

Does this help you decide which of the overlapping reading frames is 'real'?


EUKARYOTIC ANNOTATION

If you have time, you can now look at a more complicated genome.

 

7 LOAD THE SECOND SEQUENCE FILE

Go back to the "original" ARTEMIS window - the one that first appeared when you selected it from the "Start" menu.

Click on "File" select "Open" and choose the file bmbac.fsa

N:\DeptSoftware\SciEng\Biological Sciences\BIOLOGY\GGIII\bmbac.fsa

(OR YOU CAN GET THE bmbac.fsa FILE FROM HERE, DOWNLOAD IT TO YOUR MACHINE, SAVE AS bmbac.fsa AND THEN OPEN IT IN ARTEMIS)

This piece of genomic DNA derives from the parasitic nematode Brugia malayi - the same species you assembled the mtDNA of earlier.

A new ARTEMIS sequence information window will pop up.

 

8 FINDING EUKARYOTIC ORFS

You'll notice immediately from looking at the top panel that there are no obvious large ORFs like there were with the previous fragment of genome.

As before, mark the ORFs which are 100 residues or longer... only 1 should appear.

Is this a true ORF?

It may not have a start Met, but it could still be an exon.

Can you find a splice acceptor site (AG) within this "ORF"?

Hints: The genome from which this piece of DNA comes has the following "consensus" for splice acceptor and donor sites. Not every intron conforms to these exactly, but they should help. The lower case letters are less conserved, while the upper case ones are almost universally conserved. The bases outside the intron (the "ag" before GT and the final "g" after AG) are part of the exon.

agGT - - - - intron - - - - tttcAGg

donor - - - - - - - - - - - acceptor
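Scanning for these signals by eye is tedious, so here is a small Python sketch that lists candidate sites. It is deliberately crude: it looks only for exact matches to the full consensus (AGGT for donors, TTTCAG for acceptors) on the forward strand, so it will miss real sites that deviate in the less conserved positions and will report chance matches. The function names and the toy sequence are invented for illustration.

```python
import re

# Lookaheads are used so that overlapping matches are all reported.
DONOR = re.compile(r"(?=AGGT)")
ACCEPTOR = re.compile(r"(?=TTTCAG)")


def candidate_donors(seq):
    """1-based position of the G in GT (first base of the intron)."""
    return [m.start() + 3 for m in DONOR.finditer(seq.upper())]


def candidate_acceptors(seq):
    """1-based position of the G in AG (last base of the intron)."""
    return [m.start() + 6 for m in ACCEPTOR.finditer(seq.upper())]


toy = "ccagGTaaatttttcAGgcc"  # invented sequence with one intron
print("donors:", candidate_donors(toy))
print("acceptors:", candidate_acceptors(toy))
```

A donor position followed downstream by an acceptor position defines a possible intron; whether it is a real one still needs other evidence.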

What could you look at to verify the presence / absence of an acceptor site?

 

9 IS THIS REGION OF THE BRUGIA MALAYI GENOME EXPRESSED?

We have generated a lot of expressed sequence tags for Brugia malayi. We can use these to see if this region of the genome is present in the transcriptome.

Select the region of this short ORF and select "View" -> "View bases of selection as fasta".

Then go to the NCBI page again http://www.ncbi.nlm.nih.gov/ , select BLAST from the top menu, but this time choose "standard nucleotide-nucleotide search" (a BLASTN search). NB: For database, select "est_others" rather than nr.

Is this region of the genome expressed?

Now use the "View" function in ARTEMIS to select the amino acid translation of the putative ORF, and perform a BLASTP search against the nr protein database.

What are the differences in the data you get back from these two searches?

Which (if either) was the more informative?

 

10 FINDING A REAL GENE

This piece of DNA was chosen for sequencing as it contains one of our genes of interest: ALT-1. ALT-1 was identified by EST sequencing initially, as it is abundantly expressed in the larval stages of the parasite and may be a good vaccine candidate.

The ALT-1 protein sequence starts with MNKLLIAFGLIILTVTL and is > 120 amino acids in length.

Can you find the exons that encode this protein?

Handy Hint: You can find its start Met between 2500 and 2800.

How long is the open reading frame?

You may be able to find its start but notice how there is a stop codon after only ~40 amino acids.

Why is the "gene" so short?

The alt-1 gene is composed of 4 exons. These may be located in different frames (but on the same strand) and will be separated by 10s-100s of bases.
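The effect of an intron on the reading frame can be seen in a toy example. The Python sketch below translates an invented genomic fragment directly, and then again after joining two exons. The sequence is made up for illustration (it is not the real alt-1 sequence, although the spliced peptide was chosen to spell MNKLL), the standard genetic code is assumed, and the function names are illustrative.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def translate(dna):
    """Translate in frame 1; '*' marks a stop, 'X' an unknown codon."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def splice(genomic, exons):
    """Join exons given as 1-based inclusive (start, end) pairs."""
    return "".join(genomic[start - 1:end] for start, end in exons)


# Invented fragment: exon 1, a 12-base intron (GT...AG), exon 2.
genomic = "ATGAACAA" + "GTAAGTTTTCAG" + "ACTGTTATAA"
exons = [(1, 8), (21, 30)]  # hypothetical coordinates for this toy only

print(translate(genomic))                 # reads through into the intron
print(translate(splice(genomic, exons)))  # translation of the joined exons
print([(start - 1) % 3 + 1 for start, _ in exons])  # frame of each exon start
```

Translating the unspliced fragment runs into a stop codon inside the intron, and the second exon begins in a different frame from the first.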

 

11 FINDING GENES USING BASE COMPOSITION PLOTS

Another useful feature of ARTEMIS is calculating and plotting % G/C content.

In Brugia malayi, high G/C content indicates coding regions as the genome is overall A/T rich and coding regions have a more "normal" G/C content.

Can we use this to find the exons ? Click on "Graph" and select "G/C content %".

A graph will be displayed showing the relative G/C content along the stretch of DNA, aligned to the other ARTEMIS views.
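A plot of this kind is based on a sliding window: the percentage of G and C bases is calculated for a short stretch of sequence, the window is moved along, and the calculation is repeated. The Python sketch below shows the calculation only. The window and step sizes are arbitrary choices for illustration, not the values ARTEMIS uses, and the toy sequence is invented.

```python
def gc_percent_windows(seq, window=100, step=10):
    """Return (1-based window start, %GC) for each window along seq."""
    seq = seq.upper()
    results = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        results.append((start + 1, 100.0 * gc / window))
    return results


toy = "ATATATGCGCGCATATAT"  # invented: an AT-rich stretch around a GC-rich core
for start, pct in gc_percent_windows(toy, window=6, step=6):
    print(start, pct)
```

Smaller windows give a noisier trace with sharper peaks; larger windows smooth the trace but blur the boundaries of short exons.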

Can the peaks in the graph be used to find the exons?

Looking at the G/C content will give you some idea about where the exons may be located.

Can you make a gene out of the exons you have found?

Hint: Look for the splice acceptor signal (tttcAG) as this is quite obvious.

agGT - - - - intron - - - - tttcAGg

donor - - - - - - - - - - - acceptor

 

12 FINDING GENES USING EST DATA

Another way to find genes is to search for matching ESTs. ESTs (expressed sequence tags) are short stretches (<600 bp) of sequence from randomly selected mRNAs. They are very good for mining genomes for genes, since a sequence similarity match between a stretch of genomic DNA and an EST indicates that the section is transcribed and hence part of an exon.

1) Select the forward DNA region from 2500-4500 and create a new feature

2) "Create" -> "Feature from Base Range"

3) "View" -> "View bases of selection as fasta".

4) Select the sequence to the clipboard (as you would for a Word document).

Then go to the NCBI page again http://www.ncbi.nlm.nih.gov/ , select BLAST from the top menu, but this time choose "standard nucleotide-nucleotide search" (a BLASTN search). NB: For database, select "est_others" rather than nr.

Paste your DNA sequence into the search box.

Click on the "BLAST!" button and follow the "Format" link when it's presented to you. Hopefully the search won't take too long.

This stretch of DNA contains 4 exons which should be clearly visible in the graphical view. Each EST has up to FOUR high scoring segment pairs (HSPs) against the genomic fragment. These are listed one after the other in the text and alignment portion of the BLAST report.

You can use this detailed BLAST information to map the ESTs (and hence exons) onto the genomic fragment.
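One way to do this mapping by hand is to note the query coordinates of every HSP, merge those that overlap, and then shift the merged ranges back onto the genomic fragment. The Python sketch below illustrates the bookkeeping. The HSP coordinates are invented placeholders, not results from a real search, and the function names are illustrative. The only value taken from the practical is that the query starts at base 2500 of the fragment.

```python
def merge_intervals(hsps, max_gap=0):
    """Merge overlapping or adjacent 1-based inclusive (start, end) ranges."""
    merged = []
    for start, end in sorted(hsps):
        if merged and start <= merged[-1][1] + max_gap + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(pair) for pair in merged]


def to_genomic(ranges, query_start=2500):
    """Shift query coordinates back onto the genomic fragment."""
    return [(s + query_start - 1, e + query_start - 1) for s, e in ranges]


# Hypothetical HSP coordinates on the query, pooled from two ESTs.
hsps = [(120, 250), (610, 700),                # invented EST 1
        (130, 250), (610, 705), (1100, 1180)]  # invented EST 2

merged = merge_intervals(hsps)
print(merged)
print(to_genomic(merged))
```

The merged ranges are only as precise as the alignments: HSP ends often run a few bases short of, or past, the true exon boundary, so check each end against the splice signals.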

 

13 GENEFINDERS

GENEFINDERS are programs that integrate ORF, codon bias, splicing, similarity and other information to produce automated gene predictions. The Brugia malayi sequence could be fed to one of these programs.

First copy the sequence to the clipboard:

1) "Select" -> "All bases"

2) "Create" -> "Feature from base range"

3) "View" -> "View bases of selection as fasta".

4) Select the sequence to the clipboard (as you would for a Word document).

Try one of the following genefinding websites:

genscan http://genes.mit.edu/GENSCAN.html

grail http://grail.lsd.ornl.gov/grailexp/

promoter scan http://bimas.dcrt.nih.gov/molbio/proscan/index.html


The Brugia malayi genome has been sequenced in its entirety (see http://www.tigr.org/tdb/e2k1/bma1/). It was the first sequenced genome of a parasitic nematode. Brugia infects ~20 million people and we hope that this information will help inform drug development and vaccine research. See PubMed for the paper.

 
the content of these pages is copyright Mark Blaxter and colleagues. Contact the webmaster if there are problems.