Mark Blaxter's teaching pages

@ The Blaxter Lab, Institute of Evolutionary Biology, School of Biological Sciences, The University of Edinburgh

Genomes and Genomics 3
Tech session 1
Assembling sequences
Mark Blaxter

this file is at http://www.nematodes.org/teaching/gg3/Tech1_Sequencing/index.shtml

The lecture presentation is available as a pdf here


General Notes for TechSessions

These Tech sessions are meant to give you experience of the tools used in Genome Biology, and to help you understand some of the issues and concepts used. They can be examined, and some (not this one) also have work to hand in. They are generally available on the web, and after the session, some answers to questions posed may be posted on the websites.

It is good practice for you to keep a "notebook" of what you do in the Tech sessions. We suggest you make notes of your workings in, for example, a Word document or Notepad document. You can use this document to record the settings of the programs you use, the results (by copying and pasting from other program windows), and your conclusions and questions. If you keep these notes carefully, they will stand you in good stead for later Tech sessions, the practical, and for revision.


Assembling sequences

In this practical session you will be taking a set of sequence reads derived from a mitochondrial genome, and using one of the programs used by the big genome sequencing centres to "assemble" these sequence reads into a complete genome. The mitochondrial genome is circular (in animals) and ~14,000 base pairs in length. It encodes 2 ribosomal RNAs, 12 or 13 proteins (depending on which species is being examined) and, typically, 22 transfer RNAs.


The Brugia malayi mitochondrion

a circular molecule, with 12 protein coding genes and two rRNAs, just under 13,600 base pairs in length.

The sequencing reads were generated by Jennifer Daub, a research scientist in IEB, University of Edinburgh, as part of the Brugia malayi genome project.

The reads are each 300-600 bases long, and have been pre-trimmed of vector and "low quality" sequence.

You will use the program "CAP3" (CAP stands for contig assembly program), which is available through the world wide web. CAP was developed by Xiaoqiu Huang; the program website is http://deepc2.psi.iastate.edu/aat/cap/capdoc.html


1 IN A NEW BROWSER WINDOW Open the data file BM_mt.txt

In the file are 69 shotgun sequences, derived by performing sequencing reads on shotgun clones of the B. malayi mitochondrial genome.

Each clone is named "cloneX" where X is a number, and each sequence is labelled _f or _r depending on whether it was derived using a primer directed to the left (forward) end or the right (reverse) end of the insert.

Thus "clone123_f" and "clone123_r" form a "read pair" from each end of a single clone.
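The naming convention makes it easy to group reads into pairs by program. The following is a minimal sketch, assuming read names of exactly the form "cloneX_f" / "cloneX_r"; the function name group_read_pairs and the example names are illustrative only, not taken from the data file.

```python
from collections import defaultdict


def group_read_pairs(read_names):
    """Group read names such as 'clone123_f' by clone, keyed on the f/r suffix."""
    clones = defaultdict(dict)
    for name in read_names:
        clone, sep, end = name.rpartition("_")
        if not sep or end not in ("f", "r"):
            raise ValueError(f"unexpected read name: {name!r}")
        clones[clone][end] = name
    return dict(clones)


# Made-up names for illustration.
reads = ["clone123_f", "clone123_r", "clone7_f"]
pairs = group_read_pairs(reads)
complete = sorted(c for c, ends in pairs.items() if len(ends) == 2)
unpaired = sorted(c for c, ends in pairs.items() if len(ends) == 1)
print("complete read pairs:", complete)
print("only one read available:", unpaired)
```

Clones with only one read are worth noting now: they become important when you try to link contigs later in the session.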

Each of the sequences is in "fasta" format.

Fasta format is a simple format for DNA and protein sequences. The first line starts with a ">" symbol, followed by a description of the sequence (a definition line) and then a carriage return/new line. The definition in this case includes the sequence name, eg "clone123_r.fa : TYPE = DNA". The next lines are the sequence, and the end of each sequence record is simply the next ">" symbol.
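The description above translates almost directly into code. Here is a small sketch of a fasta reader in Python; the function name parse_fasta and the toy sequences are made up for illustration (the real reads are 300-600 bases long), and it assumes the whole file has been read into a single string.

```python
def parse_fasta(text):
    """Return a list of (definition_line, sequence) tuples from fasta text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new ">" ends the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Toy example using the definition-line style described above.
example = """>clone123_r.fa : TYPE = DNA
ATGCATGC
GGCC
>clone123_f.fa : TYPE = DNA
TTAA
"""
for header, seq in parse_fasta(example):
    name = header.split(".fa")[0]
    print(name, len(seq))
```

Note that the sequence of a record may be spread over several lines; the reader simply joins them until it meets the next ">".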


2 IN A NEW BROWSER WINDOW Open a link to the CAP3 analysis server at Iowa State

The Iowa State server offers bioinformatic tools to all to use. The Center is heavily involved in sequencing, annotating and understanding the human and other genomes.

Depending on how many other people are using these computers, the assembly will take 2-10 minutes. If nothing happens, or you get a 'time out' or other error, try the alternate CAP3 server at PBIL (http://pbil.univ-lyon1.fr/cap3.php)


3 Send your sequences to Iowa (or Lyon)

Select all of the sequence data in FASTA format from BM_mt.txt and paste it into the window "Please enter your sequences in FASTA format".

Alternatively, download the file to your local user space and upload it using the "read sequences from local file" BROWSE button, or select the fasta sequence data from sequence.html (but ONLY the fasta format sequence data).

There are lots of additional entry options below the sequence entry part... You don't need to enter data in these. The software at Iowa allows the user to screen for known repeats, vector sequence contamination and other things. As the sequence you are assembling (a) is relatively simple and (b) does not come from one of the well-studied model organisms, these options are not necessary.

You can always come back to this point, turn the different analysis options ON again, and see what changes...


4 Select "COMPUTE ASSEMBLY WITH CAP3" by clicking on the button.

Your data will be sent to the remote computing facility which will carry out the alignment. ...

Please be patient while this happens: DON'T keep clicking on the page while it is loading...


5 Examine the results of the assembly.
5.1 The file "contigs" contains the assembled sequences in fasta format.
A "contig" is a contiguous stretch of sequence built from a set of overlapping reads.

How many contigs did you get?

Each line is 60 bases long: how much sequence in total has been assembled? (hint - just count lines and multiply by 60, or copy and paste into Word or NotePad and use the Tools->Word count options)
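Counting lines and multiplying by 60 slightly overestimates the total, because the last line of each contig is usually shorter than 60 bases. If you prefer an exact count, a few lines of Python will do it. This is only a sketch: the function name contig_lengths and the toy contigs are made up, and it assumes you have pasted the contents of the "contigs" file into a string.

```python
def contig_lengths(fasta_text):
    """Map each contig name to its length in bases."""
    lengths = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            lengths[name] = 0
        elif line and name is not None:
            lengths[name] += len(line)
    return lengths


# Toy contigs for illustration; paste your own "contigs" output here.
contigs_text = """>Contig1
ACGTACGTAC
GTAC
>Contig2
TTGGCC
"""
lengths = contig_lengths(contigs_text)
for name, length in lengths.items():
    print(name, length)
print("number of contigs:", len(lengths))
print("total bases:", sum(lengths.values()))
```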

To assemble the sequences, the program first compares each individual sequence to all the others, remembering which had significant matches. It then attempts to align all the ones with matches together.

We have used the default parameters for the assembly, but it is possible to change these: for example, what length and %identity is considered a "match".
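To make the idea of a "match" concrete, here is a deliberately simplified sketch of overlap detection between two reads. It is NOT the algorithm CAP3 uses: it only looks for an ungapped overlap between the end of one read and the start of another, and it ignores the reverse strand. The function name best_overlap and the cutoffs min_length and min_identity are illustrative stand-ins for the kind of parameters mentioned above.

```python
def best_overlap(left, right, min_length=4, min_identity=0.9):
    """Find the longest suffix of `left` that matches a prefix of `right`.

    Returns (overlap_length, identity), or None if no overlap passes the
    cutoffs. Comparison is ungapped: position i is compared with position i.
    """
    if min_length < 1:
        raise ValueError("min_length must be at least 1")
    longest = min(len(left), len(right))
    # Try the longest possible overlap first, then shorter ones.
    for length in range(longest, min_length - 1, -1):
        suffix = left[-length:]
        prefix = right[:length]
        matches = sum(a == b for a, b in zip(suffix, prefix))
        identity = matches / length
        if identity >= min_identity:
            return length, identity
    return None


# Toy reads: the last 6 bases of read_a equal the first 6 bases of read_b.
read_a = "ACGTACGGATTC"
read_b = "GGATTCAAGT"
print(best_overlap(read_a, read_b))
```

Try lowering min_identity or min_length in the sketch: more pairs of reads will "match", but some of those matches will be spurious. The same trade-off applies when you change the real assembly parameters.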


5.2 The file "Singlets" ("single sequences" at PBIL) is a list of the single sequences (called "singletons") that didn't assemble into contigs.

How many sequences did not assemble?


5.3 The file "capout" ("assembly details" at PBIL) gives an overview of the way cap3 put the sequence files together.

In the table describing the contigs, the read name is given, followed by a "+" or a "-": these symbols indicate the orientation of the read in the contig.

assembly

The reads are given in the order in which they appear in the assembled contig, reading from left to right. So, in the above table, the contig starts with 20_for (which is on the + strand). 20_for overlaps with 3_rev, then 26_for joins the contig, and so on. The text "clone4_for- is in clone 3_for-" means that the sequence of the forward read from clone 4 is completely contained within the forward read from clone 3.

This is where folk often get confused! DNA is double stranded...

The strands in the clone are forward and reverse, and each clone has a forward (F) and reverse (R) read.

CAP3 calls the strands in the contig the "+" and "-" strands.

Let's call the strands in the original mitochondrial genome Watson and Crick, or W and C.

Each pair of clone reads therefore SHOULD appear in the final genome assembly with the F and R reads on opposite strands.

In a contig, it may be that the F read is assembled on the + strand, in which case the R read should be on the - strand (and vice versa).

Because the clones are random insertions of the DNA into the vector, and thus can be in either orientation, and the contigs are not oriented with respect to each other or the genome, it is not the case that all F reads should be on the + strand, etc.

The + strand of the contig may correspond to either strand (W or C) of the genome... We can work this out by looking at read pairs (the F and R from one insert) that are mapped to different contigs.
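The rule that the F and R reads of one clone should lie on opposite strands can be checked mechanically. The sketch below assumes you have copied the read names, contig names and +/- symbols out of the capout table by hand; the function name same_strand_pairs and all the placements shown are hypothetical, not real output.

```python
def same_strand_pairs(placements):
    """Return clones whose _f and _r reads sit on the same strand of one contig.

    `placements` maps a read name to a (contig, strand) tuple, with strand
    written as "+" or "-" as in the capout table.
    """
    suspicious = []
    for read, (contig, strand) in placements.items():
        if not read.endswith("_f"):
            continue
        clone = read[: -len("_f")]
        mate = placements.get(clone + "_r")
        if mate is None:
            continue  # the R read is missing or did not assemble
        mate_contig, mate_strand = mate
        if mate_contig == contig and mate_strand == strand:
            suspicious.append(clone)
    return sorted(suspicious)


# Hypothetical placements for illustration only.
placements = {
    "clone3_f": ("Contig1", "+"),
    "clone3_r": ("Contig1", "-"),  # opposite strands: as expected
    "clone8_f": ("Contig1", "-"),
    "clone8_r": ("Contig1", "-"),  # same strand: worth a closer look
    "clone4_f": ("Contig2", "+"),  # mate is in a different contig
    "clone4_r": ("Contig1", "+"),
}
print(same_strand_pairs(placements))
```

Pairs split across two contigs (like clone4 in the example) are not flagged, because the + strands of different contigs need not correspond to the same strand of the genome.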

Below the table are the alignments for each contig. Look at these.

 


6 Questions for you to answer....

6a What did the program have to do?

How many sequence comparisons did it carry out?
The text "Number of segment pairs" means the number of matches of any quality between sequences that were found

"number of pairwise comparisons" indicates the number of final comparisons that were used in assembly

Why so many?

Why is it very difficult to assemble thousands or millions of reads into a genome?
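As a rough guide to the scale of the problem, consider a naive approach in which every read is compared once with every other read. This is a sketch of that arithmetic only: it is not a description of how CAP3 counts the comparisons it reports, and the function name pairwise_comparisons is illustrative. For n reads the number of distinct pairs is n(n-1)/2, and each pair may need to be tested in both relative orientations.

```python
from math import comb


def pairwise_comparisons(n_reads):
    """Number of distinct pairs among n_reads reads, i.e. n(n-1)/2."""
    return comb(n_reads, 2)


for n in (69, 10_000, 1_000_000):
    print(f"{n:>9,} reads -> {pairwise_comparisons(n):,} pairs")
```

The number of pairs grows with the square of the number of reads, which is one reason why assemblers for large genomes cannot simply compare everything with everything.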

6b Draw maps of the sequence reads and their place in the contigs.

For example, here is a map of a short contig showing left and right reads with different coloured arrows. I have aligned at the same horizontal level reads from the same clone.

 

6c Is the assembly a good one? Why did you get more than one contig?

What does the program do when there are "N" residues (undetermined bases) in the sequences? (there are N's in clone48_r and clone35_f)

What does the program do when there are discrepancies between reads? (see, for example, 3' end of the first read in contig1, clone 17, or the polyA stretch that occurs ~ 100 bases after this in contig1)

Are there any problem areas?

Why do you think these problems might have arisen?

What could be done to fix this sort of problem?

What additional problems might these "fixes" create?

6d Completing this small genome...

Can you see any way to link the contigs, using either the sequence data you have, or the clone information?
Look at the clone names: each should have both left and right reads, but for some clones only one is available. The left and right reads from one clone should assemble with the correct relative orientation in the final genome.

The map above ("contig2") suggests that by finding read 4_F I might be able to link this contig with another to the right, as 4_R and 4_F come from the same clone. Similarly, finding 9_R and / or 23_F would link the left end of this contig.
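The same reasoning can be sketched in code. Given a list of which contig each read ended up in, read pairs that are split between two contigs suggest that those contigs are neighbours, and clones with only one read tell you which extra reads would be worth obtaining. The function name link_contigs and the placements below are hypothetical; read names are assumed to end in _f or _r as in the data file.

```python
from collections import defaultdict


def link_contigs(placements):
    """Use read pairs to suggest which contigs lie next to each other.

    `placements` maps a read name (ending _f or _r) to the contig it was
    assembled into. Returns (links, unpaired): `links` maps a pair of contig
    names to the clones that bridge them, and `unpaired` lists the clones
    for which only one read is present.
    """
    by_clone = defaultdict(dict)
    for read, contig in placements.items():
        clone, _, end = read.rpartition("_")
        by_clone[clone][end] = contig

    links = defaultdict(list)
    unpaired = []
    for clone, ends in sorted(by_clone.items()):
        if "f" in ends and "r" in ends:
            if ends["f"] != ends["r"]:
                key = tuple(sorted((ends["f"], ends["r"])))
                links[key].append(clone)
        else:
            unpaired.append(clone)
    return dict(links), unpaired


# Hypothetical placements for illustration only.
placements = {
    "clone4_r": "Contig2",
    "clone4_f": "Contig3",  # pair split across two contigs
    "clone5_f": "Contig1",
    "clone5_r": "Contig1",  # both reads in the same contig
    "clone9_f": "Contig2",  # no R read available
}
links, unpaired = link_contigs(placements)
print("contig links:", links)
print("clones missing a read:", unpaired)
```

A link supported by several independent clones is more convincing than one supported by a single clone.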

 


Mark Blaxter original 14/01/2009 / edited 02/01/2011

here is a set of precomputed output files and answers to the questions



[nematodes.org v4.0] the content of these pages is copyright Mark Blaxter and colleagues. Contact the webmaster if there are problems.