
Mark Blaxter's teaching pages

@ The Blaxter Lab, Institute of Evolutionary Biology, School of Biological Sciences, The University of Edinburgh

Genomics Practical 2011


EST Sequencing: Analysis

In today's session you will take your Expressed Sequence Tag (EST) sequencing chromatograms and predict the sequence of the inserts of your selected clones.

READ THROUGH THIS DOCUMENT BEFORE STARTING THE DAY'S WORK...

The Lecture overheads are available as a PDF here.


1

Log in, launch a web browser, and direct it to
http://www.nematodes.org/teaching/genomics/Practical_2100/
(this page)

You should also launch MS Word or Notepad and open a file to act as your notebook for the practical.

Next week you will use bioinformatic resources on the world wide web to attempt to identify the encoded peptides, and investigate what their function might be in the nematode-bumblebee interaction.


2

About the library

The cDNA library from which you sequenced ESTs was constructed by Mark Blaxter in Edinburgh from double-stranded cDNA generated by Nicola Wrobel of the GenePool Genomics Facility. The original mRNA preparations were from Joseph Colgan, a PhD student of Trinity College Dublin (Joe is interested in the bee side of the bee-nematode interaction...).

The library was NOT directionally cloned, so the 5' end of each mRNA could be ligated to either side of the vector. The library was constructed in the plasmid vector pCR TopoII. The library was plated out, and individual colonies were picked into 200 µl of growth medium in microtitre plates (8 x 12 wells) and grown overnight. A portion of each culture was transferred to new tubes; that is what you were given. The remainder has been frozen at -80 °C as an archive.


3

Viewing your results

Your results are available as hyperlinks from here: GGIII2011 Results

*** It's probably best if you open a NEW BROWSER WINDOW (or tab) to view this table,
so it is easier to switch back and forth between this document and the table ***

The results are organised by the clone names as written on the sequencing submission sheets (sorry if I have mis-copied some of your handwriting). Somehow we had doubles of one clone set: the students who did these should be able to work out which is which by the pattern of R and F sequencing. Several individuals gave us their names but not clone names, and so these are listed under what I could decipher from the labels.

Have a look at all of your sequencing chromatograms. You are going to build up a "QC report" for each of your clones...

When you first select a sequencing chromatogram or trace file from the table, the browser will launch a Java applet called "TraceViewer" which will display the chromatogram, and the sequence.

(A Java applet is a small program that runs alongside (or inside) a browser, giving the web browser additional features that are not usually present.)

You should see something like this:

  good sequence

Sometimes the trace viewing application fails to reset itself between one sequence and the next. If you get a viewer without any content in the window, but can see the sliders, etc, 'reload' the web page.

Play with the three (left, right, bottom) sliders and the check boxes to see what they do. The blue-to-green histogram boxes above each base indicate the "quality" of the base call by the software. Higher and bluer is better.

If nothing appears in the trace view window, but lots of sequence appears below, it means your sequencing reaction failed and the base calling software was not able to find any bases of high enough quality to display a chromatogram for. It still predicted bases (up to a thousand of them!) but they are not "real".

The demonstrators should be able to comment on what they think went wrong that resulted in poor sequence quality.

For each of your sequences, view the sequencing chromatogram (or trace) file in TraceViewer (using the links in the table) and record where high quality sequence starts and ends.

In the text box, the sequence in COLOUR is the part of the sequence that is of high quality. Do you think that there are additional reasonable-quality base calls outside this high quality region (the software we use to do the base calling is very strict!)?

The sequence text may be longer than the display in the window, as the last 200-300 bases may have very low quality scores.

For many sequences we note that there are ~160 bases of high-signal but low-quality sequence. This is due to the presence in the PCR of a small amount of a contaminant band, which sequences very well. Your GOOD sequence is likely to start from after 160 bases.
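Recording where high-quality sequence starts and ends can also be sketched in code. The example below is only an illustration: it assumes you have the per-base quality scores as a list of integers (TraceViewer shows them only as histogram bars), and the threshold of 20 and the function name `high_quality_window` are assumptions, not the rule used by the base-calling software.

```python
def high_quality_window(quals, threshold=20):
    """Return (start, end) of the longest run of bases with quality >= threshold.

    Positions are 1-based and inclusive, like base numbers in a trace viewer.
    Returns None if no base reaches the threshold.
    """
    best = None        # longest run so far, as a 0-based half-open (start, end)
    run_start = None   # start of the run currently being extended
    # A sentinel value below any real score closes a run that reaches the end.
    for i, q in enumerate(list(quals) + [-1]):
        if q >= threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if best is None or i - run_start > best[1] - best[0]:
                best = (run_start, i)
            run_start = None
    if best is None:
        return None
    return best[0] + 1, best[1]


# Hypothetical scores: a noisy start, a short good stretch, a dip, a longer
# good stretch, then a decaying tail.
scores = [5, 8, 30, 35, 40, 38, 12, 33, 36, 41, 39, 37, 9, 4]
print(high_quality_window(scores))  # (8, 12)
```

A single low-quality base splits a run in this sketch, so it is stricter than judging by eye; you may well decide that reasonable base calls extend beyond the window it reports.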

Note that NOT every sequencing reaction will have worked! If you haven't got any good ones, try the spare sequences labelled EK##.


4

The structure of your sequence reads

(note that each of you should try to look at 4 good sequences: if you do not have 4, "borrow" good ones from a class mate...)

PCR to sequence

The sequence was derived from a PCR product, itself derived from a plasmid (see diagram above).

The plasmid we used is called pBluescriptII

pBluescript

 

 

You generated a PCR product using the M13 -20 Forward and M13 Reverse (F and R) primers.

Work out how big a PCR product you would get if there was NO insert in the clone.
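If you have the vector sequence as text, one way to check your answer is to locate the two primer sites and measure the distance between them. The sketch below shows the arithmetic on a made-up template with made-up primers (the sequences and the function name `amplicon_length` are purely illustrative, not the real vector or the real M13 primers).

```python
def revcomp(seq):
    """Reverse complement of a DNA sequence (A, C, G, T and N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def amplicon_length(template, fwd_primer, rev_primer):
    """Length of the PCR product, primers included, or None if a site is missing.

    The template is the top strand written 5' to 3'. The reverse primer anneals
    to the top strand, so its site appears there as its reverse complement.
    """
    template = template.upper()
    fwd = fwd_primer.upper()
    rev_site = revcomp(rev_primer)
    start = template.find(fwd)
    if start < 0:
        return None
    end = template.find(rev_site, start + len(fwd))
    if end < 0:
        return None
    return end + len(rev_site) - start


# Toy data: 7-base primers flanking 10 bases of "polylinker".
toy_fwd = "GATTACA"
toy_rev = "GGATCCA"
toy_template = "TTTT" + toy_fwd + "C" * 10 + revcomp(toy_rev) + "AAAA"
print(amplicon_length(toy_template, toy_fwd, toy_rev))  # 24 = 7 + 10 + 7
```

The product of a clone WITH an insert is simply this empty-vector length plus the insert length.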


5

Identifying the insert sequence

You should now attempt to find the cloning site and identify where each of your insert sequences starts. We used either the M13 F or the M13 R sequencing primer.

PCR Primer Forward ------i--n--s--e--r--t----------- PCR Primer Reverse

The sequence around the cloning site of the vector you need to identify is as follows (this is the reverse complement of the sequence after the primer promoter in the figure above)

M13 F or R : poly(T) first AACGGCCGCCAGTGTGCTGGAATTCGCCCTTAAGCAGTGGTATCAACGCAGAGTACTGGAG ------i--n--s--e--r--t-----------

M13 F or R : 5' end first AACGGCCGCCAGTGTGCTGGAATTCGCCCTTAAGCAGTGGTATCAACGCAGAGTACGC[GGGG] ------i--n--s--e--r--t-----------

The sequence at the very 5' end of the sequencing reaction is sometimes difficult to read, and thus the sequence may not be exactly AACGGCCGCCAGTG. Some of you will have sequenced all the way through your insert into the vector fragment on the other end. You should look for the characteristic GAATTC segment of the EcoR I site found in the vector 3' to the insert site. This is sometimes problematic, as low sequence quality after a poly(A) tail may mean that the exact sequence of the vector is not present. You can often see a CCCGGGG pattern, though, which is vector.
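Finding the cloning site by eye is the point of this exercise, but the search can be sketched in code as a cross-check. The example below looks for the last 20 bases of the flanking sequence shared by both orientations (taken from the two lines above), because the very start of the read is often unreliable. The function name `find_insert_start`, the 20-base anchor and the exact-match search are illustrative assumptions; a read with a miscalled base inside the anchor will not be found this way.

```python
# Flanking sequence common to both orientations, as given in this document.
FLANK = "AACGGCCGCCAGTGTGCTGGAATTCGCCCTTAAGCAGTGGTATCAACGCAGAGTAC"


def find_insert_start(read, flank=FLANK, anchor_len=20):
    """Return the 1-based position of the first base after the flank.

    Only the last `anchor_len` bases of the flank are searched for.
    Returns None if the anchor is not present exactly.
    """
    anchor = flank[-anchor_len:]
    hit = read.upper().find(anchor)
    if hit < 0:
        return None
    return hit + anchor_len + 1


# Made-up read: a ragged start, most of the flank, then a "poly(T) first" clone.
demo_read = "NNACNGG" + FLANK[10:] + "TGGAG" + "T" * 18 + "ACGTTGCA"
pos = find_insert_start(demo_read)
print(pos, demo_read[pos - 1:pos + 4])  # 54 TGGAG
```

The bases immediately after the reported position still need inspecting: TGGAG followed by poly(T), or GC followed by G's, tells you which end of the cDNA you are reading from.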

If the enzyme tries to read through a long run of a single nucleotide, it tends to stutter, making the sequence afterwards unreadable. This is common when you reach a poly(A) tail in cDNAs, but it's OK, as it tells you that you have got to the end of your insert. A trace file showing this phenomenon, where the sequence after the run is unreadable and of very low quality, is shown below:

The TraceViewer centred on a polyT stretch. Note how after the polyT there are "runs" of A's C's and G's... these are not likely to be real.
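Long single-base runs like this are easy to pick out automatically. As a minimal sketch, the regular expression below reports every run of one base that is at least `min_len` long; the cut-off of 12 and the function name `homopolymer_runs` are arbitrary choices for illustration.

```python
import re


def homopolymer_runs(seq, min_len=12):
    """Return a list of (base, start, end) for runs of a single base.

    Positions are 1-based and inclusive. Only A, C, G and T runs are reported.
    """
    # ([ACGT]) captures one base; \1{n,} requires at least n more copies of it.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.group(1), m.start() + 1, m.end())
            for m in pattern.finditer(seq.upper())]


# Made-up read with a 15-base poly(T) stretch starting at base 7.
demo = "ACGTAC" + "T" * 15 + "GGCA"
print(homopolymer_runs(demo))  # [('T', 7, 21)]
```

Sequence reported after the end of such a run should be treated with suspicion, whatever the base caller says.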

 


6

Quality checking of your sequences

Make a table in Word or Notepad and record the following information for each of your clones (a mockup of a table is available as a Word document here)

presence of 5' poly(T) or poly(G)

the base number where the insert sequence starts

the length of high-quality sequence for each of your sequencing reads

whether the sequence includes a 3' polyA tail and 3' vector

any comments about the sequence (such as "starts with poly(T)" or "It did not work - probably no DNA")

Copy the putative insert sequence from the TraceViewer page to your Word notefile by dragging across the relevant segment of the sequence and copying, remembering to give each its unique clone name. By the end of today you should have collected four good INSERT sequences, and collected quality control information for them.


Help Help Help!!! What do I do if I don't have 4 good sequences?

Do NOT panic.

There are 'spare' sequences available here... If you don't have 4 good ones, go to the set I did and choose a few to screen. Once you have 4 good ones, you can rest.


Next week

We will supply additional instructions for

performing similarity searches

identifying function of the encoded proteins

...but if you have got to here very quickly and want to zoom on, one thing to do is to try to identify if your sequence has an open reading frame in it.

There are several ways to find out:

(1) by eye using a table of codons (painful)

(2) using ARTEMIS as you did in TechSession4.

Save your sequence as a FASTA file (where the name of the sequence is on the first line, preceded by a '>', and the sequence is on the second and subsequent lines). This file MUST be in raw or plain text format, not in Word or another program-specific format.

e.g.

>SB00000959 my cool sequence
AGAGACAGTATCCGCGATAGAGCAGATCGGACAGCTAGGACA
GCTAGGACCGCTCGGACCGGTGTCGACAAGCTGACACAGCTA
GCACAGCT
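If you would rather not format the file by hand, the layout can be produced with a few lines of code. This is just a sketch: the function name `to_fasta` and the 60-bases-per-line width are assumptions (the format itself does not require a particular line width).

```python
def to_fasta(name, seq, width=60):
    """Format one sequence as a plain-text FASTA record."""
    # Drop spaces and line breaks that sneak in when copying from a web page.
    seq = "".join(seq.split()).upper()
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


record = to_fasta(
    "SB00000959 my cool sequence",
    "AGAGACAGTATCCGCGATAGAGCAGATCGGACAGCTAGGACA GCTAGGACCGCTCGGACC",
)
print(record, end="")
```

Writing `record` to a file opened in text mode, for example `open("my_clone.fasta", "w")`, gives a plain-text file of the kind described above; the file name here is made up.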

Open the fasta file in ARTEMIS (see HERE if you have forgotten how!) and look for ORFs.

(3) Using online translation programmes.

For these you just paste your sequence in a window, choose the genetic code, and press go. The DNA is translated into protein and you can then choose which you believe might be the correct open reading frame.

Try: EMBOSS Transeq at http://www.ebi.ac.uk/emboss/transeq/

or the EXPASY Translate Tool at http://expasy.org/tools/dna.htm
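To see what these tools do under the hood, here is a minimal sketch of a six-frame translation using the standard genetic code. It is not the code behind Transeq or the Translate Tool; the function names are made up for illustration, and any codon containing an ambiguous base is shown as X.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def revcomp(seq):
    """Reverse complement; characters other than A, C, G, T are left as they are."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate from the first base; '*' is a stop, 'X' an unknown codon."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frames(seq):
    """Return a dict mapping '+1', '+2', '+3', '-1', '-2', '-3' to protein strings."""
    frames = {}
    for strand, s in (("+", seq.upper()), ("-", revcomp(seq))):
        for offset in range(3):
            frames[strand + str(offset + 1)] = translate(s[offset:])
    return frames


def longest_open_stretch(protein):
    """Longest stretch of amino acids between stop codons."""
    return max(protein.split("*"), key=len)


for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein, longest_open_stretch(protein))
```

Because ESTs are often partial, the real reading frame may run off either end of your sequence without a start or stop codon, which is why this sketch looks for the longest stop-free stretch rather than insisting on an ATG.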


 

 



[nematodes.org v4.0] the content of these pages is copyright Mark Blaxter and colleagues. Contact the webmaster if there are problems.