Genomes and Genomics 3
Tech session 1
Assembling sequences
Mark Blaxter
this file is at
http://www.nematodes.org/teaching/Tech1_Sequencing/index.shtml
The lecture presentation is available
as a pdf
here
General Notes for Tech Sessions
These Tech sessions
are meant to give you experience of the tools used in Genome Biology,
and to help you understand some of the issues and concepts used. They
can be examined, and some (not this one) also have work to hand in.
They are generally available on the web, and after the session, some
answers to questions posed may be posted on the websites.
It is good practice
for you to keep a "notebook" of what you do in the Tech sessions. We
suggest you make notes of your workings in, for example, a Word
document or Notepad document. You can use this document to record the
settings of the programs you use, the results (by copying and pasting
from other program windows), and your conclusions and questions. If
you keep these notes carefully, they will stand you in good stead for
later Tech sessions, the practical, and for revision.
Assembling sequences
In this practical session you will be
taking a set of sequence reads derived from a mitochondrial genome,
and using one of the programs used by the big genome sequencing
centres to "assemble" these sequence reads into a complete genome.
The mitochondrial genome is circular (in animals) and ~14000 base
pairs in length. It encodes 2 ribosomal RNAs and 12 or 13 proteins
(depending on which species is being examined). It also encodes 20
transfer RNAs.
Figure: The Brugia malayi mitochondrion, a circular molecule with 12 protein-coding genes and two rRNAs, just under 13,600 base pairs in length.
The sequencing reads were generated
by Jennifer Daub, a research scientist in IEB, University of
Edinburgh, as part of the Brugia malayi genome project. See
the Brugia
malayi genome project
(http://www.nematodes.org/fgn/index.shtml) home page for more information on
this parasitic nematode.
The reads are each 300-600 bases
long, and have been pre-trimmed of vector and "low quality"
sequence.
You will use the program "CAP" (which
stands for Contig Assembly Program), which is available through the
world wide web. CAP was developed by Xiaoqiu Huang; the program
website is http://deepc2.psi.iastate.edu/aat/cap/capdoc.html
1 IN A NEW BROWSER WINDOW
Open
the data file sequences.html
In the file are 69 shotgun sequences,
derived by performing sequencing reads on shotgun clones of the B.
malayi mitochondrial genome.
Each clone is named "clX" where X
is a number, and each sequence is labelled _forward or _reverse depending on
whether it was derived using a primer directed to the
left (forward) end or the right (reverse) end of the insert.
Thus "cl123_forward" and "cl123_reverse"
form a "read pair" from each end of a single clone.
Each of the sequences is in
"fasta" format.
Fasta format is a simple format for
DNA and protein sequences. The first line starts with a ">"
symbol, followed by a description of the sequence (a definition
line) and then a carriage return/new line. The definition in this
case includes the sequence name, eg "clone123_r.fa : TYPE = DNA".
The next lines are the sequence, and the end of each sequence
record is simply the next ">" symbol.
The same data are in the file BM_mt.fsa, but without the html web markup and other information.
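If you want to check your understanding of the fasta format described above, here is a minimal Python sketch that splits fasta text into records. The read names and sequences in the example are made up for illustration; they are not taken from BM_mt.fsa.

```python
def parse_fasta(text):
    """Return a list of (definition_line, sequence) tuples from fasta text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new ">" line ends the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical two-read example (invented names and bases).
example = """>cl1_forward : TYPE = DNA
ATGCATGCAT
GGCCTTAA
>cl1_reverse : TYPE = DNA
TTAACCGG
"""

for name, seq in parse_fasta(example):
    print(name, len(seq))
```

Running the same function on the contents of BM_mt.fsa would give you a quick count of the reads and their lengths.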
2 IN A NEW BROWSER WINDOW
Open a link to the CAP3
analysis server at the Japanese Human Genome Center
The Japanese Human Genome Center server offers bioinformatic tools to all to use. The Center is heavily involved in sequencing, annotating and understanding the human and other genomes.
3 Send your sequences to Japan
Select all of the sequence data (but ONLY sequence data) in sequences.html from the first browser window and paste it into the window "TEXT DATA , Please paste your sequences (FASTA format)". Alternatively, download BM_mt.fsa to your local user space and upload it using the "FILE UPLOAD(FASTA format)" interactive BROWSE button. [Don't put my name in the email address line... please!]

There are lots of additional options below the sequence entry part... Skim down the page, UNCHECKING all of these except PERFORM SEQUENCE ASSEMBLY, which must stay ON. The software at HGC allows the user to screen for known repeats, vector sequence contamination and other things. As the sequence you are assembling (a) is relatively simple and (b) does not come from one of the well-studied model organisms, these screening options are not necessary.
You can always come back to this point, turn the different analysis options ON again, and see what changes...
4 Select "SUBMIT" by clicking on the button.
Your data will be sent to the remote
computing facility which will carry out the alignment.
Depending on how many other users are on these computers, the assembly will take 2-10
minutes. If nothing happens, or you get a 'time out' or other error, try the alternate CAP3 server at Iowa State University (http://deepc2.psi.iastate.edu/aat/cap/cap.html). If you use this server the output files are more numerous than at HGC, but there will be ones called ..singlets, ...contigs and ...capout, which correspond to the single sequences, contigs and assembly details files mentioned below.
Please be patient while this happens:
DON'T keep clicking on the page while it is loading...

5 Examine the results of the
assembly.
Contigs contains the assembled
sequences in fasta format.
A "contig" is a set of
contiguous (overlapping) sequences.
How many contigs did you
get?
Each line is 60 bases long: how
much sequence in total has been assembled? (hint - copy and paste the sequences into Word and use Tools->Word count, or just count
lines and multiply by 60)
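If you would rather not count lines by hand, the following Python sketch adds up the bases in each contig of a fasta file. It is only an illustration: the contig names and sequences in the example are invented, and you would paste in your own Contigs output instead. Note that the last line of each contig is usually shorter than 60 bases, so "lines multiplied by 60" slightly overestimates the total.

```python
def contig_lengths(fasta_text):
    """Return a dict mapping each definition line to its sequence length."""
    lengths = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].strip()
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


# Hypothetical contigs: one full 60-base line plus a short final line,
# and a second contig that is shorter than one line.
example = ">Contig1\n" + "A" * 60 + "\n" + "C" * 10 + "\n>Contig2\n" + "G" * 25 + "\n"

lengths = contig_lengths(example)
for name, n in lengths.items():
    print(name, n)
print("number of contigs:", len(lengths))
print("total bases:", sum(lengths.values()))
```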
To assemble the sequences, the
program first compares each individual sequence to all the others,
remembering which had significant matches. It then attempts to align
all the ones with matches together.
We have used the default parameters
for the assembly, but it is possible to change these: for example,
what length and %identity is considered a "match".
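To get a feel for the "compare every sequence to all the others" step, here is a deliberately naive Python sketch. It is NOT how CAP3 works internally: it looks only for ungapped end-to-start overlaps on the same strand, and the parameter names min_len and min_identity are my own stand-ins for the "length and %identity" settings mentioned above. The reads in the example are invented.

```python
from itertools import combinations


def suffix_prefix_overlap(a, b, min_len=20, min_identity=0.9):
    """Length of the longest overlap where the end of a matches the start of b.

    Returns 0 if no overlap of at least min_len bases reaches min_identity.
    """
    shortest_allowed = max(min_len, 1)
    for length in range(min(len(a), len(b)), shortest_allowed - 1, -1):
        tail, head = a[-length:], b[:length]
        matches = sum(1 for x, y in zip(tail, head) if x == y)
        if matches / length >= min_identity:
            return length
    return 0


def find_overlaps(reads, min_len=20, min_identity=0.9):
    """Compare every unordered pair of reads, in both left-to-right orders."""
    hits = []
    for (name_a, seq_a), (name_b, seq_b) in combinations(reads.items(), 2):
        for left, right, s1, s2 in (
            (name_a, name_b, seq_a, seq_b),
            (name_b, name_a, seq_b, seq_a),
        ):
            n = suffix_prefix_overlap(s1, s2, min_len, min_identity)
            if n:
                hits.append((left, right, n))
    return hits


# Hypothetical toy reads, far shorter than real ones.
reads = {
    "r1": "AAAACCCCGGGG",
    "r2": "CCCCGGGGTTTT",
    "r3": "TTTTTTTTTTTT",
}
print(find_overlaps(reads, min_len=6, min_identity=1.0))
```

Try lowering min_len to 4 in the example and see which extra "matches" appear: this is the kind of trade-off the assembly parameters control.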

Single sequences is a list
of the single sequences (called "singletons") that didn't assemble
into contigs.
How many sequences didn't
assemble?
Assembly details gives an overview of the way CAP3 put the
sequence files together.
In the table describing the contigs,
the read name is given, followed by a "+" or a "-": these symbols
indicate the orientation of the read in the contig.
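A read shown with a "-" has been placed in the contig as its reverse complement, so that it reads along the same strand as the "+" reads. A minimal Python sketch of reverse complementing, using an invented five-base example:

```python
# Translation table for complementing bases; N (undetermined) stays N.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGCN"))  # prints NGCAT
```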

The reads are given in the order in
which they appear in the assembled contig, reading from left to
right. So, in the above table, the contig starts with 20_f (which is
on the + strand). 20_f overlaps with 19_r (on the - strand), then 3_r (+), then 26_f (+) joins the contig,
and so on. The text "clone4_f- is in clone 3_f-" means that the
sequence of the f read from clone 4 is completely contained within
the f read from clone 3.
This is where folk often get
confused! Each clone has a forward (f) and reverse (r) read.
The contig has a "+" and a "-"
strand.
Each pair of clone reads therefore
SHOULD appear in the assembly with the f and r reads on opposite
strands. It may be that the F read is assembled on the + strand, in
which case the R read should be on the - strand (and vice versa).
Because the clones are random insertions of the DNA into the vector,
and thus can be in either orientation, it is not the case that all F
reads should be on the + strand, etc.
(Sometimes the two strands of DNA are called Watson and Crick, or W/C, or top and
bottom.)
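The read-pair rule above can be checked mechanically once you have noted down the contig and strand of each read. The Python sketch below assumes you have transcribed the assembly table by hand into a dictionary; the placements shown are invented, and the function name check_read_pairs is my own.

```python
from collections import defaultdict


def check_read_pairs(placements):
    """Classify each clone by the strands of its two reads.

    placements maps a read name such as "cl20_forward" to a
    (contig, strand) tuple, where strand is "+" or "-".
    """
    by_clone = defaultdict(dict)
    for read, placement in placements.items():
        clone, end = read.rsplit("_", 1)
        by_clone[clone][end] = placement

    report = {}
    for clone, ends in by_clone.items():
        if len(ends) != 2:
            report[clone] = "no complete read pair"
            continue
        (contig_a, strand_a), (contig_b, strand_b) = ends.values()
        if contig_a != contig_b:
            report[clone] = "reads in different contigs"
        elif strand_a != strand_b:
            report[clone] = "consistent: opposite strands"
        else:
            report[clone] = "check: same strand"
    return report


# Hypothetical placements, not taken from a real assembly.
placements = {
    "cl20_forward": ("Contig1", "+"),
    "cl20_reverse": ("Contig1", "-"),
    "cl3_forward": ("Contig1", "-"),
    "cl3_reverse": ("Contig1", "-"),
    "cl4_reverse": ("Contig2", "+"),
}

for clone, verdict in check_read_pairs(placements).items():
    print(clone, verdict)
```

A clone flagged "check: same strand" is not necessarily an assembler error, but it is a good place to start looking at the alignments.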
Below the table are the alignments
for each contig.
Look at these.

6 Questions for you to
answer....
a What did the program have to
do?
How many sequence
comparisons did it carry out?
The text "Number of segment pairs"
means the number of matches of any quality between sequences
that were found.
"Number of pairwise comparisons"
indicates the number of final comparisons that were used in
assembly.
Why so many? Why so few?
Why is it very difficult to
assemble thousands or millions of reads into a genome?
b Draw maps of the sequence reads
and their place in the contigs.
For example, here is a map
of a short contig showing left and right reads with different
coloured arrows. I have aligned at the same horizontal level reads
from the same clone.
c Is the assembly a good
one?
Why did you get more than
one contig?
What does the program do
when there are "N" residues (undetermined bases) in the
sequences? (there are N's in
cl48_r and cl35_f)
What does the program do when
there are discrepancies between reads? (see,
for example, the 5' end of cl24_f)
Are there any problem areas?
Why do you think these problems
might have arisen?
What could be done to fix this sort of
problem?
What additional problems might these
"fixes" create?
d Completing this small
genome...
Can you see any way to link
the contigs, using either the sequence data you have, or the clone
information?
Look at the clone names: each
should have both left and right reads, but for some clones only
one is available. The left and right reads from one clone
should assemble with the correct relative orientation in
the final genome.
The map above ("contig2") suggests that
by finding read 4_f I might be able to link this contig with
another to the right, as 4_r and 4_f come from the same clone.
Similarly, finding 9_r and / or 23_f would link the left end of
this contig.
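The same reasoning can be sketched in Python: group reads by clone, then look for clones whose two reads sit in different contigs (a possible link) and clones with only one read in the assembly (a read worth finding). The read names and contig assignments below are invented, and contig_links is a hypothetical helper, not part of CAP3.

```python
from collections import defaultdict


def contig_links(read_to_contig):
    """Find clones that bridge two contigs, and clones missing a mate.

    read_to_contig maps read names such as "cl9_f" to a contig name.
    Returns (links, unpaired): links maps a sorted pair of contig names
    to the clones bridging them; unpaired is a sorted list of clones
    with only one read in the assembly.
    """
    clone_contigs = defaultdict(dict)
    for read, contig in read_to_contig.items():
        clone, end = read.rsplit("_", 1)
        clone_contigs[clone][end] = contig

    links = defaultdict(list)
    unpaired = []
    for clone, ends in clone_contigs.items():
        if len(ends) < 2:
            unpaired.append(clone)
            continue
        contigs = set(ends.values())
        if len(contigs) == 2:
            links[tuple(sorted(contigs))].append(clone)
    return dict(links), sorted(unpaired)


# Hypothetical assignments, loosely modelled on the map above.
read_to_contig = {
    "cl4_r": "Contig2",
    "cl9_f": "Contig2",
    "cl9_r": "Contig1",
    "cl23_r": "Contig2",
    "cl7_f": "Contig1",
    "cl7_r": "Contig1",
}

links, unpaired = contig_links(read_to_contig)
print("links:", links)
print("clones with a missing mate:", unpaired)
```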
Here is a precomputed
set of CAP results for these files....
"Answers" to the questions asked (pdf) - this is a very old set of specific answers, but the general answers are still valid ;-)