Genomes and Genomics
Technology Session 4
Lecture as a PDF

Annotating a Genome:
Identifying
genes from raw sequence data
Mark Blaxter
In this tutorial you will be using
the genome annotation tool called ARTEMIS to examine two genomic fragments, one from a prokaryotic organism and
one from a eukaryotic organism, and attempt to find genes within the
sequences. This is not being graded and is an exercise just to
introduce you to the practicalities and complexity of dealing with
genomes and trying to extract biological information from them. Also
don't feel you have to complete the whole lot: you may not have time
to look at the eukaryotic example.
Through the practical
there are questions for you to answer in bold italic:
do attempt these.
Find yourself a computer, and log
on.
Open a text editor (Notepad, Word, ...)
and start a new document to record your notes for the Tech
Session.
Open a web browser (Firefox, Explorer,
...)
ARTEMIS is a Java application, which
means it will run on (almost) any operating system. Rather than
install it locally we can "run it through the web" by launching the
Java application from a remote web page. Go to
http://www.sanger.ac.uk/Software/ARTEMIS/v7/
On the web page that opens, click on
"LAUNCH ARTEMIS".
It will take a while to download the
Java application. Answer yes to the "security" questions that are
asked.
Once the application has been
downloaded, the Java "Virtual machine" will launch, and then you will
be presented with an opening window.

If you want to know more about ARTEMIS,
you can go to:
http://www.sanger.ac.uk/Software/Artemis/stable/manual/index.html
PROKARYOTIC
ANNOTATION
1 OPENING THE SEQUENCE
FILE
From the "File" menu in ARTEMIS, select
"Open..." and Navigate to
N:\DeptSoftware\SciEng\Biological
Sciences\BIOLOGY\GGIII
Set the "File Type" to "All
Files"
"Open" the file clbot.fsa
This will launch the ARTEMIS view for
the prokaryotic sequence contained in the file
clbot.fsa.
(OR YOU CAN GET
THE clbot.fsa FILE FROM HERE,
DOWNLOAD IT TO YOUR MACHINE, SAVE AS clbot.fsa AND THEN OPEN IT IN
ARTEMIS)
2 NAVIGATING IN
ARTEMIS
You will be presented with a number of
views. The top panel shows an overview of the entire genomic fragment
and any associated features (currently none). The top three lines
represent the forward three frames of translation (from left to
right). The bottom three lines represent the reverse three frames of
translation (going from right to left). Each vertical black line
represents a "stop" codon.
Play with the sliders below and to
the right of each view. You can use the sliders to zoom and move
around this view. The numbers indicate the number of bases from the
beginning of the sequence.
In the lower panel is a more detailed
view of the genomic fragment showing the two strands of DNA and their
associated 6 frame peptide translations (represented by the single
letter amino acid code). Again the top three frames read from left to
right and the bottom three frames read from right to left. By
clicking on an amino acid residue you will highlight the three bases
(codon) that it refers to. Stop codons are marked #, * and +. When
you are zoomed out, the stop codons are represented by vertical black
lines.
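To see what the six-frame view is computing, here is a minimal Python
sketch of a six-frame translation. It is an illustration only, not ARTEMIS's
own code: the function names are made up for this example, the standard
genetic code is assumed, and every stop codon is written as '*' rather than
with the three ARTEMIS symbols.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order (assumption:
# the organism uses the standard code).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(seq):
    """Return the three forward (+1..+3) and three reverse (-1..-3) translations."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


frames = six_frames("ATGAAATAG")
print(frames["+1"])  # MK*
print(frames["-1"])  # LFH
```

Note that the reverse frames are translations of the reverse complement, which
is why ARTEMIS draws them running from right to left.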
Why are there three
different symbols for stop codons?
You can also "click" and "drag" along
the sequences, highlighting the bases that encode them. Note that this also
leads to highlighting regions of the overview in the top panel. Again
play around with the scroll bars to move and zoom around this window.
Note that by double-clicking a region in the top panel, the bottom panel
automatically focuses on the selected region. The bottom panel shows
text descriptions of any selected features.
Now we've familiarised ourselves with the
views, let's see if we can find anything interesting from the
sequence.
Can you determine the
length of the sequence?
Handy Hint : Just under the menu
options is a single text pane which describes the current
coordinates (with respect to the start of the sequence) of the
bases that are currently selected.
3 FINDING GENES: OPEN
READING FRAMES
Can you identify any long
ORFs?
There are a couple of very obvious ones:
just look for stretches of sequence where one frame on one strand has
no stop codons.
How many open reading frames - and thus protein-coding genes - do you
think there are?
ARTEMIS can automatically find ORFs.
From the menu choose "Create" and select
"Mark open reading frames".
When the dialog box comes up click on
"OK". (For now leave the minimum length of ORF at 100
residues.)
The program will now search for all ORFs
which are 100 residues or longer in length.
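The idea behind this kind of ORF marking can be sketched in a few lines of
Python. This is a simplified illustration and not the ARTEMIS algorithm
itself: it assumes an ORF is any stretch of whole codons between two stop
codons (or a sequence end) in one frame, it measures length in residues, and
it reports coordinates without the terminating stop codon. The function and
parameter names are invented for the example.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_residues=100):
    """Return (strand, start, end) tuples, 1-based on the forward strand."""
    seq = seq.upper()
    n = len(seq)
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            # Position just past the last complete codon in this frame;
            # reaching it is treated like hitting a stop.
            last = frame + 3 * ((n - frame) // 3)
            run_start = frame
            for pos in range(frame, last + 1, 3):
                if pos == last or s[pos:pos + 3] in STOPS:
                    if (pos - run_start) // 3 >= min_residues:
                        if strand == "+":
                            orfs.append((strand, run_start + 1, pos))
                        else:
                            # Map reverse-strand positions back to forward coordinates.
                            orfs.append((strand, n - pos + 1, n - run_start))
                    run_start = pos + 3
    return sorted(orfs, key=lambda orf: orf[1])


# Toy sequence: stop, then ATG + 5 codons, then stop.
toy = "TAA" + "ATG" + "GCT" * 5 + "TAA" + "CC"
for orf in find_orfs(toy, min_residues=5):
    print(orf)
```

Lowering min_residues makes the list grow quickly, because short stop-free
stretches occur by chance in every frame.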
How many open reading frames longer than 100 amino acids are there?
What are their start and end
coordinates?
Handy Hint : If you look in the lower panel, there is now a set of annotations (in this case only the open reading frames you just defined) with their coordinates. Note that some of these annotations have the letter 'c'; what do you think this means?
Do they all start with
Methionine (Met, M)?
How many ORFs do you get when you set
the minimum length to less than 100 residues? Try 50, or 10!
How do you tell which of these open reading frames - putative protein-coding genes - are real?
What should you be looking
for?
Handy Hint: review the lecture notes you made from this morning (and see the notes here)
4 OVERLAPPING GENE
PREDICTIONS
Note that ARTEMIS identifies an ORF from
~7305-7646 on the reverse strand. Note also that this overlaps the large
ORF spanning ~4123-7719 on the forward strand.
Is this reverse strand
open reading frame likely to be a real gene? Why / Why not?
What other lines of evidence
(informatic or experimental) could you look at to help you make
this decision?
Once genes have been found within a
genome, the next thing we want to do is find out what the proteins
they encode may do biologically. One of the simplest ways of doing
this is to perform a sequence similarity search with the unknown
sequence to see if it is significantly similar to any proteins in the
database which have been identified before. We'll perform some simple
searches using BLAST of our predicted genes against a protein
database.
5 EXTRACTING THE PUTATIVE
GENE'S PROTEIN SEQUENCE
First you'll need to select the
predicted protein sequence.
For each putative gene, select its ORF
by clicking on the blue boxes in the upper panel.
If the ORF doesn't start with a
Methionine, you can trim it to the first Met using the "Edit" ->
"Trim Selected Features to Met".

Now click on "View" and select "View
Amino Acids of Selection as FASTA".

You'll see a panel come up with contents
similar to:
> none - 9116:11011 MW: 71244.16
KNFKGEDIMSLSIKELYYTKDKSINNVNLADGNYVVNRGDGWILSRQNQNLGGNISNNGC
TAIVGDLRIRETATPYYYPTASFNEEYIRNNVQNVFANFTEASEIPIGFEFSKTAPSNKG
LYMYLQYTYIRYEIIKVLRNTVIERAVLYVPSLGYAKSIEFNSGEQIDKNFYFTSEDKCI
LNEKFIYKKIAETTTAKESNDSNNTTNLNTSQTILPYPNGLYVINKGDGYMRTNDKDLIG
TLLIETNTSGSIIQPRLRNTTRPLFNTSNPTLFSQEYTEARLNDAFNIQLFNTSTTLFKF
VEEAPDNKNISMKAYNTYEKYELINYQNGNIADKAEYYLPSLGKCEVSDAPSPQAPVVET
PVEQDGFIQTGPNENIIVGVINPSENIEEISTPIPDDYTYNIPTSIQNNACYVLFTVNTT
GVYKINAQNNLPPLIIYESIGSDNMNIQSNTLSNNNIKAINYITGTDSSNAESYLIVSLI
This is what's known as a FASTA
formatted sequence file. The top line (the "FASTA header line") gives information relating to
the sequence, typically its unique ID, the organism from which it was
derived and other sundry information, preceded by the '>' character. Here we just get a simple line
showing the beginning and end bases and predicted molecular weight of
the protein. You can safely ignore this information as we're only
interested in the sequence.
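If you ever need to handle FASTA text in a script rather than by copy and
paste, the format is simple enough to read by hand. The following is a
minimal sketch in Python (the function name parse_fasta and the shortened
example record are made up for this illustration): a line starting with '>'
begins a new record, and every following line up to the next '>' is sequence.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # sequence lines are joined without line breaks
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """> none - 9116:11011 MW: 71244.16
KNFKGEDIMSLSIKELYYTKDKSINNVNLADGNYVVNRGDGWILSRQNQNLGGNISNNGC
TAIVGDLRIRETATPYYYPTASFNEEYIRNNVQNVFANFTEASEIPIGFEFSKTAPSNKG
"""
for header, sequence in parse_fasta(example):
    print(header, len(sequence))
```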
Select the sequence including the FASTA
header (Control-A) and copy it to the clipboard (Control-C). Paste
this into your digital notebook for reference.
6 BLAST SEARCH TO CONFIRM
GENE PREDICTIONS
Now to perform the sequence homology
search, you will need to open a web browser
and go to NCBI's BLAST page:
http://www.ncbi.nlm.nih.gov/
Click on BLAST (top toolbar) and select
"Protein-Protein BLAST (blastp) " from the "Proteins" set of options.
Paste the FASTA sequence into the
box.
Then click the "BLAST!"
button (using the default BLAST options is fine for this practical).
You'll be redirected to a new
page...
Click on the "FORMAT!" button to see the
results of your search.
[Depending on the load on the BLAST
server this may take some time so be patient].
You'll be presented with the BLAST
output.
1) A graphical display of
sequences which share similarity with the search sequence you
entered
2) A list of the sequences with the
score and "E value" of the match. The E value is the number of
matches with a score at least this good that you would expect to find
by chance in a database of this size - all you need to
know is that matches with an E value of e-05 or lower are significant
matches, and an E value of 0.0 shows that the proteins are identical (or
very nearly so).
3) An alignment for each sequence
against your query sequence.
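For the curious, the E value can be related to the alignment "bit score" by
the standard formula E = m x n x 2^(-S), where S is the bit score and m and n
are the (effective) lengths of the query and the database. The short Python
sketch below just evaluates that formula and applies the e-05 rule of thumb
from this practical. The function names and the example numbers are invented
for illustration, and the real BLAST server applies length corrections that
are ignored here.

```python
def expected_hits(bit_score, query_len, db_len):
    """E value from a bit score: E = m * n * 2**(-S)."""
    return query_len * db_len * 2.0 ** (-bit_score)


def is_significant(e_value, threshold=1e-5):
    """Rule of thumb used in this practical: e-05 or lower counts as significant."""
    return e_value <= threshold


# Toy numbers chosen so the arithmetic is easy to check by hand:
# 4 * 256 = 1024 = 2**10, so a bit score of 10 gives E = 1.
print(expected_hits(10, 4, 256))  # 1.0

# A hypothetical 600-residue query against a hypothetical database
# of one billion residues, with a bit score of 80.
e = expected_hits(80, 600, 1_000_000_000)
print(e, is_significant(e))
```

Because each extra bit halves E, very strong matches have E values too small
to display, and these are the ones shown as 0.0.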
Following this procedure you should be
able to deduce the identity of your proteins.
What are they?
Can you guess what sort of
organism (or even which species) this fragment of genomic DNA
derives from?
What are the protein products likely to do for
the organism from which they derive?
What do we use these protein products for?
Does this help you decide which of the overlapping reading frames is 'real'?
EUKARYOTIC ANNOTATION
If you have time, you can now look at a
more complicated genome.
7 LOAD THE SECOND SEQUENCE
FILE
Go back to the "original" ARTEMIS window
- the one that first appeared when you selected it from the "Start"
menu.
Click on "File" select "Open" and choose
the file bmbac.fsa
N:\DeptSoftware\SciEng\Biological
Sciences\BIOLOGY\GGIII\bmbac.fsa
(OR YOU CAN GET
THE bmbac.fsa FILE FROM HERE,
DOWNLOAD IT TO YOUR MACHINE, SAVE AS bmbac.fsa AND THEN OPEN IT IN
ARTEMIS)
This piece of genomic DNA derives from
the parasitic nematode Brugia malayi - the same species you
assembled the mtDNA of earlier.
A new ARTEMIS sequence information
window will pop up.
8 FINDING EUKARYOTIC
ORFS
You'll notice immediately from looking
at the top panel that there are no obvious large ORFs like there were
with the previous fragment of genome.
As before, mark the ORFs which are 100 residues
or longer... only 1 should appear.
Is this a true
ORF?
It may not have a start Met, but it
could still be an exon.
Can you find a splice
acceptor site (AG) within this "ORF"?
Hints: The genome from which this
piece of DNA comes has the following "consensus" for splice
acceptor and donor sites. Not every intron conforms to these
exactly, but they should help. The lower case letters are less
conserved, while the upper case ones are almost universally
conserved. Italicised bases are part of the exon.
agGT - - - - intron - - - - tttcAGg
donor - - - - - - - - - - - acceptor
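One way to hunt for candidate acceptor sites is to scan for every AG and ask
how well the four bases before it match the less conserved "tttc" part of
the consensus. The Python sketch below does just that. It is a rough
illustration under several assumptions: the names and the min_matches cut-off
are invented here, only the forward strand is scanned, and a match to the
consensus is a hint, not proof, of a real splice site.

```python
def score_acceptors(seq, min_matches=3):
    """Find AG dinucleotides whose 4 upstream bases resemble 'TTTC'.

    Returns (position, matches) pairs, where position is the 1-based
    coordinate of the first base after the AG (the putative exon start)
    and matches is how many of the 4 upstream bases agree with TTTC.
    """
    seq = seq.upper()
    hits = []
    for i in range(4, len(seq) - 1):
        if seq[i:i + 2] == "AG":
            upstream = seq[i - 4:i]
            matches = sum(a == b for a, b in zip(upstream, "TTTC"))
            if matches >= min_matches:
                hits.append((i + 3, matches))
    return hits


print(score_acceptors("GGGTTTCAGGCC"))  # [(10, 4)]
```

To scan the reverse strand, run the same function on the reverse complement
and convert the positions back.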
What could you look at to
verify the presence / absence of an acceptor
site?
9 IS THIS REGION OF THE
BRUGIA MALAYI GENOME EXPRESSED?
We have generated a lot of expressed
sequence tags for Brugia malayi. We can use these to see if
this region of the genome is present in the
transcriptome.
Select the region of this short ORF and
select "View" -> "View bases of selection as fasta".
Then go to the NCBI page again http://www.ncbi.nlm.nih.gov/
, select BLAST from the top
menu, but this time choose "standard nucleotide-nucleotide search" (a
BLASTN search). NB: For database, select "est_others" rather than
nr.
Is this region of the
genome expressed?
Now use the "View" function in ARTEMIS
to select the amino acid translation of the putative ORF, and perform
a BLASTP search against the nr protein database.
What are the differences
in the data you get back from these two searches?
Which (if either) was the more
informative?
10 FINDING A REAL GENE
This piece of DNA was chosen for
sequencing as it contains one of our genes of interest: ALT-1. ALT-1
was identified by EST sequencing initially, as it is abundantly
expressed in the larval stages of the parasite and may be a good
vaccine candidate.
The ALT-1 protein sequence starts with
MNKLLIAFGLIILTVTL and is > 120 amino acids in length.
Can you find the exons that encode this protein?
Handy Hint: You can find its start Met
between 2500 and 2800.
How long is the open reading
frame?
You may be able to find its start but
notice how there is a stop codon after only ~40 amino
acids.
Why is the "gene" so
short?
The alt-1 gene is composed of 4
exons. These may be located in different frames (but on the same
strand) and will be separated by 10s-100s of bases.
11 FINDING GENES USING BASE
COMPOSITION PLOTS
Another useful feature of ARTEMIS is
calculating and plotting % G/C content.
In Brugia malayi, high G/C
content indicates coding regions as the genome is overall A/T rich
and coding regions have a more "normal" G/C content.
Can we use this to find the exons?
Click on "Graph" and select "G/C content %".
A graph will be displayed showing the
relative G/C content along the stretch of DNA, aligned to the other
ARTEMIS views.
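A G/C plot like this is a sliding-window calculation: take a window of
bases, work out the percentage that are G or C, move the window along and
repeat. Here is a minimal Python sketch of the idea. The window and step
sizes are arbitrary choices for this example and are not the values ARTEMIS
uses.

```python
def gc_percent_windows(seq, window=120, step=20):
    """Return (midpoint, percent_gc) for each window along the sequence.

    midpoint is a 1-based coordinate near the centre of the window.
    """
    seq = seq.upper()
    points = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        points.append((start + window // 2 + 1, 100.0 * gc / window))
    return points


# Toy sequence: an A/T-only half followed by a G/C-only half.
toy = "AT" * 5 + "GC" * 5
print(gc_percent_windows(toy, window=10, step=10))  # [(6, 0.0), (16, 100.0)]
```

Smaller windows give a noisier plot but sharper boundaries, which matters
when the exons you are looking for are short.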
Can the peaks in the
graph be used to find the exons?
Looking at the G/C content will give you
some idea about where the exons may be located.
Can you make a gene out
of the exons you have found?
Hint: Look for the splice acceptor
signal (tttcAG) as this is quite obvious.
agGT - - - - intron - - - - tttcAGg
donor - - - - - - - - - - - acceptor
12 FINDING GENES USING EST
DATA
Another way to find genes is to search
for matching ESTs. ESTs (expressed sequence tags) are short stretches
(<600bp) of sequence from randomly selected mRNAs. They are very
good for mining genomes for genes, since a sequence similarity match
of a stretch of genomic DNA to an EST indicates that the section
is transcribed and hence part of an exon.
1) Select the forward DNA
region from 2500-4500 and create a new feature
2) "Create" -> "Feature from Base
Range"
3) "View" -> "View bases of
selection as fasta".
4) Select the sequence to the
clipboard (as you would for a Word document).
Then go to the NCBI page again http://www.ncbi.nlm.nih.gov/
, select BLAST from the top
menu, but this time choose "standard nucleotide-nucleotide search" (a
BLASTN search). NB: For database, select "est_others" rather than
nr.
Paste your DNA sequence into the search
box.
Click on the "BLAST!" button and follow
the "Format" link when it's presented to you. Hopefully the search
won't take too long.
This stretch of DNA contains 4 exons
which should be clearly visible in the graphical view. Each EST has
up to FOUR high-scoring segment pairs (HSPs) against the genomic fragment.
These are listed one after the other in the text and alignment
portion of the BLAST report.
You can use this detailed BLAST
information to map the ESTs (and hence exons) onto the genomic
fragment.
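When several ESTs hit the same exon, their HSPs overlap on the genomic
fragment, so mapping exons comes down to merging overlapping coordinate
ranges. The Python sketch below shows one way to do this by hand. The HSP
coordinates in it are invented purely for illustration (they are NOT the
answer to this practical), and the function name is made up.

```python
def merge_intervals(hsps, max_gap=0):
    """Merge overlapping or touching (start, end) ranges into candidate exons.

    Coordinates are 1-based and inclusive on the genomic fragment. Ranges
    separated by more than max_gap bases are kept apart.
    """
    merged = []
    for start, end in sorted(hsps):
        if merged and start <= merged[-1][1] + max_gap + 1:
            merged[-1][1] = max(merged[-1][1], end)  # extend the current exon
        else:
            merged.append([start, end])  # start a new exon
    return [tuple(m) for m in merged]


# Hypothetical genomic coordinates of HSPs from several ESTs.
hsps = [(150, 200), (100, 180), (400, 450), (451, 500)]
print(merge_intervals(hsps))  # [(100, 200), (400, 500)]
```

The gaps between the merged ranges are your candidate introns, so check
their edges against the donor and acceptor consensus.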
13
GENEFINDERS
GENEFINDERS are programs that integrate
ORF, codon bias, splicing, similarity and other information to
produce automated gene predictions. The Brugia malayi sequence
could be fed to one of these programs.
First copy the sequence to the
clipboard:
1) "Select" -> "All bases"
2) "Create" -> "Feature from base
range"
3) "View" -> "View bases of
selection as fasta".
4) Select the sequence to the
clipboard (as you would for a Word document).
Try one of the following genefinding
websites:
genscan http://genes.mit.edu/GENSCAN.html
grail http://grail.lsd.ornl.gov/grailexp/
promoter scan http://bimas.dcrt.nih.gov/molbio/proscan/index.html

The Brugia malayi genome has been sequenced in its
entirety (see http://www.tigr.org/tdb/e2k1/bma1/). It was the
first sequenced genome of a parasitic nematode. Brugia infects ~20 million people and we hope that
this information will help inform drug development and
vaccine research. See PubMed for the paper.
