BaNG - Blaxter Nematode and Neglected Genomics

Genomes & Genomics
Mark Blaxter's Teaching WebSite

  at the Institute of Evolutionary Biology, University of Edinburgh
Genomes and Genomics 3
Tech session 1
Assembling sequences

Mark Blaxter

this file is at http://www.nematodes.org/teaching/Tech1_Sequencing/index.shtml

The lecture presentation is available as a pdf here


General Notes for TechSessions

These Tech sessions are meant to give you experience of the tools used in Genome Biology, and to help you understand some of the issues and concepts involved. Their content is examinable, and some sessions (not this one) also have work to hand in. The sessions are generally available on the web and, after each session, some answers to the questions posed may be posted on the website.

It is good practice for you to keep a "notebook" of what you do in the Tech sessions. We suggest you make notes of your workings in, for example, a Word document or Notepad document. You can use this document to record the settings of the programs you use, the results (by copying and pasting from other program windows), and your conclusions and questions. If you keep these notes carefully, they will stand you in good stead for later Tech sessions, the practical, and for revision.


Assembling sequences

In this practical session you will take a set of sequence reads derived from a mitochondrial genome and use one of the programs used by the big genome sequencing centres to "assemble" these reads into a complete genome. The mitochondrial genome is circular (in animals) and ~14,000 base pairs in length. It encodes 2 ribosomal RNAs and 12 or 13 proteins (depending on which species is being examined). It also encodes transfer RNAs (typically 22 in animals).


The Brugia malayi mitochondrion

A circular molecule, with 12 protein-coding genes and two rRNAs, just under 13,600 base pairs in length.

The sequencing reads were generated by Jennifer Daub, a research scientist in IEB, University of Edinburgh, as part of the Brugia malayi genome project. See the Brugia malayi genome project (http://www.nematodes.org/fgn/index.shtml) home page for more information on this parasitic nematode.

The reads are each 300-600 bases long, and have been pre-trimmed of vector and "low quality" sequence.

You will use the program "CAP" (which stands for Contig Assembly Program), which is available through the World Wide Web. CAP was developed by Xiaoqiu Huang; the program website is http://deepc2.psi.iastate.edu/aat/cap/capdoc.html


1 IN A NEW BROWSER WINDOW
Open the data file sequences.html

In the file are 69 shotgun sequences, derived by performing sequencing reads on shotgun clones of the B. malayi mitochondrial genome.

Each clone is named "clX", where X is a number, and each sequence is labelled _forward or _reverse depending on whether it was derived using a primer directed to the left (forward) end or the right (reverse) end of the insert.

Thus "cl123_forward" and "cl123_reverse" form a "read pair" from each end of a single clone.

Each of the sequences is in "fasta" format.

Fasta format is a simple format for DNA and protein sequences. The first line starts with a ">" symbol, followed by a description of the sequence (a definition line) and then a carriage return/new line. The definition in this case includes the sequence name, eg "clone123_r.fa : TYPE = DNA". The next lines are the sequence, and the end of each sequence record is simply the next ">" symbol.

The same data are in the file BM_mt.fsa, but without the html web markup and other information.
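The fasta rules described above are simple enough to read with a few lines of code. The following is a minimal sketch in Python, not part of the practical itself: the function name parse_fasta is illustrative, and the two short records are made up for the example rather than taken from the real data file.

```python
def parse_fasta(text):
    """Return a list of (definition_line, sequence) tuples from fasta text."""
    records = []
    header = None   # definition line of the record being read
    chunks = []     # sequence lines collected for that record
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new ">" ends the previous record and starts the next one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Invented example records, laid out like the definition lines described above.
example = """>cl1_forward : TYPE = DNA
ATGCATGCAT
GGCCTTAA
>cl1_reverse : TYPE = DNA
TTAAGGCC
"""

records = parse_fasta(example)
for name, seq in records:
    print(name, len(seq))
```

To try it on the real data, the text of BM_mt.fsa (saved locally) could be read with open(...).read() and passed to the same function.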


2 IN A NEW BROWSER WINDOW
Open a link to the CAP3 analysis server at the Japanese Human Genome Center

Tokyo Human Genome Project

The Japanese Human Genome Center server offers bioinformatic tools to all to use. The Center is heavily involved in sequencing, annotating and understanding the human and other genomes.

3 Send your sequences to Japan

Select all of the sequence data (but ONLY sequence data) in sequences.html from the first browser window and paste it into the window "TEXT DATA , Please paste your sequences (FASTA format)". Alternatively, download BM_mt.fsa to your local user space and upload it using the "FILE UPLOAD(FASTA format)" interactive BROWSE button. [Don't put my name in the email address line... please!]


There are lots of additional options below the sequence entry part. The software at HGC allows the user to screen for known repeats, vector sequence contamination and other things. Skim down the page, UNCHECKING all of these, but leave the PERFORM SEQUENCE ASSEMBLY option ON. As the sequence you are assembling is (a) relatively simple and (b) does not come from one of the well-studied model organisms, the other options are not necessary.

You can always come back to this point, turn the different analysis options ON again, and see what changes...


4 Select "SUBMIT" by clicking on the button.

Your data will be sent to the remote computing facility which will carry out the alignment.

Depending on how many other people are using these computers, the assembly will take from 2-10 minutes. If nothing happens, or you get a 'time out' or other error, try the alternate CAP3 server at Iowa State University (http://deepc2.psi.iastate.edu/aat/cap/cap.html). If you use this server, the output files are more numerous than at HGC, but there will be ones called ...singlets, ...contigs and ...capout, which correspond to the singlets, contigs and assembly details files mentioned below.

Please be patient while this happens: DON'T keep clicking on the page while it is loading...



5 Examine the results of the assembly.



"Contigs" contains the assembled sequences in fasta format.

A "contig" is a set of contiguous sequences.

How many contigs did you get?

Each line is 60 bases long: how much sequence in total has been assembled? (hint - copy and paste the sequences into Word and use Tools->Word count, or just count lines and multiply by 60)
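The total can also be counted directly. The sketch below (Python, with invented contig names and made-up sequence, not real CAP3 output) adds up the bases in each contig and compares the result with the "lines times 60" shortcut, which overestimates slightly because the last line of each contig is usually shorter than 60 bases.

```python
def contig_lengths(fasta_text):
    """Return {contig_name: number_of_bases} for fasta-formatted contigs."""
    lengths = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # Use the first word of the definition line as the contig name.
            name = line[1:].split()[0]
            lengths[name] = 0
        elif line and name is not None:
            lengths[name] += len(line)
    return lengths


# Two invented contigs written 60 bases per line, as in the results page.
example = (
    ">Contig1\n" + "A" * 60 + "\n" + "C" * 60 + "\n" + "G" * 25 + "\n"
    ">Contig2\n" + "T" * 60 + "\n" + "A" * 10 + "\n"
)

lengths = contig_lengths(example)
total = sum(lengths.values())

# The shortcut from the hint: count sequence lines and multiply by 60.
sequence_lines = [l for l in example.splitlines() if l and not l.startswith(">")]
estimate = len(sequence_lines) * 60

print(lengths, total, estimate)
```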

To assemble the sequences, the program first compares each individual sequence to all the others, remembering which had significant matches. It then attempts to align all the ones with matches together.

We have used the default parameters for the assembly, but it is possible to change these: for example, what length and %identity is considered a "match".
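To make the "compare everything to everything" step concrete, here is a toy sketch in Python. It is NOT how CAP3 works internally: it only looks for ungapped overlaps between the end of one read and the start of another, ignores reverse-complement matches, and uses invented reads. The parameter names min_length and min_identity are illustrative stand-ins for the "length and %identity" settings mentioned above.

```python
def best_overlap(a, b, min_length=6, min_identity=0.9):
    """Longest ungapped overlap between the end of a and the start of b.

    Returns (overlap_length, identity) or None if nothing passes the cutoffs.
    """
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        suffix = a[-length:]
        prefix = b[:length]
        matches = sum(1 for x, y in zip(suffix, prefix) if x == y)
        identity = matches / length
        if identity >= min_identity:
            return length, identity
    return None


def find_overlaps(reads, min_length=6, min_identity=0.9):
    """Compare every read with every other read and remember the matches."""
    overlaps = {}
    for name_a, seq_a in reads.items():
        for name_b, seq_b in reads.items():
            if name_a == name_b:
                continue
            hit = best_overlap(seq_a, seq_b, min_length, min_identity)
            if hit is not None:
                overlaps[(name_a, name_b)] = hit
    return overlaps


# Invented reads: r1 ends with AGGCTA, which starts r2;
# r2 ends with CCGGAT, which starts r3.
reads = {
    "r1": "ACGTTGCAAGGCTA",
    "r2": "AGGCTATTCCGGAT",
    "r3": "CCGGATAACGTGCC",
}

overlaps = find_overlaps(reads)
for pair, (length, identity) in overlaps.items():
    print(pair, length, identity)
```

Raising min_length or min_identity makes the comparison stricter (fewer, more reliable matches); lowering them lets more reads join, at the risk of joining reads that do not really belong together.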

contigs


Single sequences is a list of the single sequences (called "singletons") that didn't assemble into contigs.

How many sequences didn't assemble?


Assembly details gives an overview of the way CAP3 put the sequence files together.

In the table describing the contigs, the read name is given, followed by a "+" or a "-": these symbols indicate the orientation of the read in the contig.

[Image: example assembly details table]

The reads are given in the order in which they appear in the assembled contig, reading from left to right. (So, in the above table, the contig starts with 20_f, which is on the + strand. 20_f overlaps with 19_r (on the - strand), then 3_r (+), then 26_f (+) joins the contig, etcetera.) The text "clone4_f- is in clone 3_f-" means that the sequence of the f read from clone 4 is completely contained within the f read from clone 3.

This is where folk often get confused! Each clone has a forward (f) and reverse (r) read.

The contig has a "+" and a "-" strand.

Each pair of clone reads therefore SHOULD appear in the assembly with the f and r reads on opposite strands. It may be that the F read is assembled on the + strand, in which case the R read should be on the - strand (and vice versa). Because the clones are random insertions of the DNA into the vector, and thus can be in either orientation, it is not the case that all F reads should be on the + strand, etc.

(Sometimes the two strands of DNA are called Watson and Crick, or W/C, or top and bottom.)
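The read-pair rule (f and r reads from one clone should lie on opposite strands) can be checked mechanically once you have copied the read names and +/- symbols out of the assembly table. A minimal sketch in Python follows; the first four entries use the read names quoted in the text above, while the entries for 3_f and 26_r are invented to show one consistent and one inconsistent pair.

```python
def check_read_pairs(contig_reads):
    """contig_reads: list of (read_name, strand) tuples, e.g. ("20_f", "+").

    Returns {clone: "ok" or "same strand"} for clones with both reads present.
    """
    strands = {}
    for read_name, strand in contig_reads:
        # "20_f" -> clone "20", end "f"
        clone, end = read_name.rsplit("_", 1)
        strands.setdefault(clone, {})[end] = strand

    report = {}
    for clone, ends in strands.items():
        if "f" in ends and "r" in ends:
            report[clone] = "ok" if ends["f"] != ends["r"] else "same strand"
    return report


contig = [
    ("20_f", "+"),
    ("19_r", "-"),
    ("3_r", "+"),
    ("26_f", "+"),
    ("3_f", "-"),   # invented: partner of 3_r on the opposite strand
    ("26_r", "+"),  # invented: partner of 26_f on the SAME strand (a problem)
]

report = check_read_pairs(contig)
print(report)
```

Clones with only one read in the contig (20 and 19 here) are left out of the report, because there is nothing to compare them with.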

Below the table are the alignments for each contig. Look at these.


6 Questions for you to answer....

a What did the program have to do?

How many sequence comparisons did it carry out?
The text "Number of segment pairs" means the number of matches of any quality between sequences that were found.

"Number of pairwise comparisons" indicates the number of final comparisons that were used in assembly.

Why so many? Why so few?

Why is it very difficult to assemble thousands or millions of reads into a genome?
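As a rough guide to scale, the number of possible read-against-read comparisons grows with the square of the number of reads: n reads give n(n-1)/2 unordered pairs. The short Python sketch below just does this arithmetic. It is an upper bound on naive all-against-all comparison, not a statement of what CAP3 reports, and it ignores the extra work of checking both orientations of each read.

```python
def count_pairs(n_reads):
    """Number of distinct unordered pairs among n_reads reads: n(n-1)/2."""
    return n_reads * (n_reads - 1) // 2


# 69 is the number of reads in this practical; the others are for comparison.
pair_counts = {n: count_pairs(n) for n in (69, 1_000, 1_000_000)}
for n, pairs in pair_counts.items():
    print(f"{n} reads -> {pairs} possible pairs")
```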

b Draw maps of the sequence reads and their place in the contigs.

For example, here is a map of a short contig showing left and right reads with different coloured arrows. I have aligned at the same horizontal level reads from the same clone.

 

c Is the assembly a good one? Why did you get more than one contig?

What does the program do when there are "N" residues (undetermined bases) in the sequences? (there are N's in cl48_r and cl35_f)

What does the program do when there are discrepancies between reads? (see, for example, the 5' end of cl24_f)

Are there any problem areas?

Why do you think these problems might have arisen?

What could be done to fix this sort of problem?

What additional problems might these "fixes" create?

d Completing this small genome...

Can you see any way to link the contigs, using either the sequence data you have, or the clone information?
Look at the clone names: each should have both left and right reads, but for some clones only one is available. The left and right reads from one clone should assemble with the correct relative orientation in the final genome.

The map above ("contig2") suggests that by finding read 4_f I might be able to link this contig with another to the right, as 4_r and 4_f come from the same clone. Similarly, finding 9_r and / or 23_f would link the left end of this contig.
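The same reasoning can be written down as a small bookkeeping exercise: list which contig each read ended up in, then look for clones whose two reads sit in different contigs (a possible link) and clones with only one read (a read worth finding). The Python sketch below assumes you have built the read-to-contig table by hand from the assembly details. The placements are illustrative: 4_r, 9_f and 23_r in "contig2" follow the map described above, while "contig3" and clone 7 are invented.

```python
def find_links(read_to_contig):
    """read_to_contig maps read names like "4_r" to a contig name.

    Returns (links, missing):
      links   - (clone, contig_of_f, contig_of_r) where the two reads of a
                clone lie in different contigs
      missing - names of reads whose partner is present but which are absent
    """
    clones = {}
    for read, contig in read_to_contig.items():
        clone, end = read.rsplit("_", 1)
        clones.setdefault(clone, {})[end] = contig

    links, missing = [], []
    for clone, ends in sorted(clones.items()):
        if "f" in ends and "r" in ends:
            if ends["f"] != ends["r"]:
                links.append((clone, ends["f"], ends["r"]))
        else:
            missing_end = "r" if "f" in ends else "f"
            missing.append(f"{clone}_{missing_end}")
    return links, missing


placement = {
    "4_r": "contig2",
    "9_f": "contig2",
    "23_r": "contig2",
    "4_f": "contig3",   # invented: suppose 4_f had assembled elsewhere
    "7_f": "contig2",   # invented clone with both reads in one contig
    "7_r": "contig2",
}

links, missing = find_links(placement)
print("links:", links)
print("reads to look for:", missing)
```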


Here is a precomputed set of CAP results for these files.

"Answers" to the questions asked (pdf) - this is a very old set of specific answers, but the general answers are still valid ;-)

Mark Blaxter


the content of these pages is copyright Mark Blaxter and colleagues. Contact the webmaster if there are problems.