Genomics Practical 2009

EST Sequencing: Analysis
In today's session you will take your Expressed Sequence Tag (EST) sequencing chromatograms and predict the sequence of the inserts of your selected clones.
READ THROUGH THIS DOCUMENT BEFORE STARTING THE DAY'S WORK...
The Lecture overheads are available as a PDF here.
1 
Log in, launch a web browser, and direct it to
http://www.nematodes.org/teaching/genomics/Practical_2010_Euperipatoides_kanagrensis/
(this page).
You should also launch MS Word or Notepad and open a file to act as your notebook for the practical.
Next week you will use bioinformatic resources on the World Wide Web to attempt to identify the encoded peptides, and to investigate whether Onychophora (and Arthropoda) are more closely related to Annelida (earthworms and allies) or to Nematoda (roundworms).
2

About the library
The cDNA library from which you sequenced ESTs was constructed by Joachim Eriksson in Cambridge from double-stranded cDNA supplied from Australian velvet worms.
The library was NOT directionally cloned, so the end of each cDNA that corresponds to the 5' end of the mRNA could be ligated to either side of the vector. The library was constructed in the plasmid vector pBluescriptII. The library was plated out in Edinburgh, and individual colonies were picked into 200 µl of growth medium in 96-well microtitre plates (8 x 12 wells) and grown overnight. A portion of each culture was transferred to a new plate, which is what you were given. The remainder was frozen at -80 °C as an archive.
3 
Viewing your results
Your results are available as hyperlinks from here: GGIII2010 Results
*** It's probably best if you open a NEW BROWSER WINDOW (or tab) to view this table, so that it is easier to switch back and forth between this document and the table ***
The results are organised by your surname as written on the sequencing submission sheets (sorry if I have mis-copied some of your handwriting). Several individuals gave us clone names but not their own names, and so these are listed as Mystery.
Have a look at all of your sequencing chromatograms. You are going to build up a "CV" for each of your clones...
When you first select a sequencing chromatogram or trace file from the table, the browser will launch a Java applet called "TraceViewer", which will display the chromatogram and the sequence.
(A Java applet is a small program that runs alongside (or inside) a browser, giving the web browser additional features that it does not usually have.)
You should see something like this:
Sometimes the trace viewing application fails to reset itself between one sequence and the next. If you get a viewer without any content in the window, but can see the sliders, etc, 'reload' the web page.
Play with the three (left, right, bottom) sliders and the check boxes to see what they do. The blue-to-green histogram boxes above each base indicate the "quality" of the base call by the software. Higher and bluer is better.
If nothing appears in the trace view window, but lots of sequence appears below, it means your sequencing reaction failed and the base calling software was not able to find any bases of high enough quality to display a chromatogram. It still predicted bases (up to a thousand of them!) but they are not "real".
The demonstrators should be able to comment on what they think went wrong to give poor sequence quality.
For each of your sequences, view the sequencing chromatogram (or trace) file in TraceViewer (using the links in the table) and record where high quality sequence starts and ends.
In the text box, the sequence in COLOUR is the part of the sequence that is of high quality. Do you think that there are additional reasonable-quality base calls outside this high quality region (the software we use to do the base calling is very strict!)?
The sequence text may be longer than the display in the window, as the last 200-300 bases may have very low quality scores.
For many sequences we note that there are ~160 bases of high-signal but low-quality sequence. This is due to the presence in the PCR of a small amount of a contaminant band, which sequences very well. Your GOOD sequence is likely to start from after 160 bases.
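If you ever have the per-base quality scores as numbers (TraceViewer only shows them as histogram bars), the "where does high quality start and end" question can be sketched in a few lines of Python. This is only an illustration: the function name, the example scores and the threshold of 20 are assumptions, and it is not the method used by the base calling software behind the results table.

```python
def high_quality_window(quals, threshold=20):
    """Return (start, end), 1-based and inclusive, of the longest run of
    bases whose quality score is >= threshold, or None if no base passes.

    `quals` is a list of per-base quality scores; the threshold of 20 is
    an assumed cut-off, not the one used for the practical's results.
    """
    best = None        # (start_index, end_index) of the longest run so far
    run_start = None   # index where the current passing run began
    for i, q in enumerate(quals):
        if q >= threshold:
            if run_start is None:
                run_start = i
            if best is None or i - run_start > best[1] - best[0]:
                best = (run_start, i)
        else:
            run_start = None
    if best is None:
        return None
    return best[0] + 1, best[1] + 1


# Made-up scores: a short good run, a dip, then a longer good run.
example_quals = [5, 8, 30, 35, 40, 10, 25, 30, 31, 32, 33, 4]
window = high_quality_window(example_quals)
```

Because the sketch keeps only the single longest run, a brief dip in quality splits a read in two; by eye you may reasonably decide to read through such a dip.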
Note that NOT every sequencing reaction will have worked! If you haven't got any good ones, try the spare sequences labelled EK##.
4 
Identifying the insert sequence
(Note that each of you should try to look at 4 good sequences: if you do not have 4, "borrow" good ones from a classmate...)

The sequence was derived from a PCR product, itself derived from a plasmid (see diagram above).
The plasmid we used is called pBluescriptII.


You generated a PCR product using the M13 -20 Forward and M13 Reverse (F and R) primers.
Work out how big a PCR product you would get if there was NO insert in the clone.
5 
You should now attempt to find the cloning site and identify where each of your insert sequences starts. We used a T3 sequencing primer.
PCR Primer Forward ------i--n--s--e--r--t----------- PCR Primer Reverse
The sequence around the cloning site of the vector that you need to identify is as follows (this is the reverse complement of the sequence after the T3 promoter in the figure above):
CGGCCGCTCTAGAACTAGTGGATCCCCCGGGCTGCAGGAATTCGGCACGAGG ------i--n--s--e--r--t-----------
The sequence 5' of the insert is sometimes difficult to read, and thus the sequence may not be exactly GAATTCGGCACGAGG.
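Because the junction may not read exactly as GAATTCGGCACGAGG, a search that tolerates a few wrong base calls is more useful than an exact text search. The Python sketch below illustrates the idea; the function names and the limit of two mismatches are assumptions made for this example, and it only tolerates substituted bases, not missing or extra ones.

```python
def find_motif(read, motif, max_mismatches=2):
    """Return the 0-based index of the first window of `read` that differs
    from `motif` at no more than `max_mismatches` positions, or -1."""
    read = read.upper()
    motif = motif.upper()
    for start in range(len(read) - len(motif) + 1):
        window = read[start:start + len(motif)]
        mismatches = sum(1 for a, b in zip(window, motif) if a != b)
        if mismatches <= max_mismatches:
            return start
    return -1


def insert_start(read, motif="GAATTCGGCACGAGG", max_mismatches=2):
    """Return the 1-based position of the first base after the cloning-site
    motif, or None if the motif is not found within the mismatch limit."""
    pos = find_motif(read, motif, max_mismatches)
    if pos < 0:
        return None
    return pos + len(motif) + 1


# Made-up read: a little vector, the junction with one miscalled base (N),
# then the first few bases of a pretend insert.
example_read = "CTGCAG" + "GAATTCGGCACNAGG" + "ATGGCT"
first_insert_base = insert_start(example_read)
```

The number returned is the base number to record as "where the insert sequence starts". Always check it against the trace by eye, since a relaxed match can occasionally land in the wrong place.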
Some of you will have sequenced all the way through your insert into the vector fragment on the other end. You should look for the characteristic GAATTC segment of the EcoRI site found in the vector 3' to the insert site. This is sometimes problematic, as low sequence quality after a poly(A) tail may mean that the exact sequence of the vector is not present. You can often see a GGGGGCCC pattern, though, which is very likely to be vector.
If the enzyme tries to read through a poly-nucleotide stretch, it tends to stutter, making the sequence afterwards unreadable. This is common when you get to a poly(A) tail in cDNAs, but it's OK, as it tells you that you have reached the end of your insert (see the picture below, where the sequence after the poly(A) is unreadable and of very low quality). A trace file with this phenomenon is shown below:

The TraceViewer centred on a polyT stretch. Note how after the polyT there are "runs" of A's, C's and G's... these are not likely to be real.
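Spotting the end of the insert can also be sketched in Python: look for a long run of a single base (poly(A), or poly(T) if the clone is in the other orientation) and then for GAATTC somewhere after it. The function names and the minimum run length of 12 are assumptions chosen for illustration, and because sequence after a homopolymer is often unreadable, a missing GAATTC does not prove that the vector is absent.

```python
import re


def find_homopolymer(read, base="A", min_len=12):
    """Return (start, end), 1-based and inclusive, of the first run of
    `base` that is at least `min_len` long, or None if there is none."""
    pattern = re.escape(base) + "{" + str(min_len) + ",}"
    match = re.search(pattern, read.upper())
    if match is None:
        return None
    return match.start() + 1, match.end()


def three_prime_landmarks(read, base="A", min_len=12, site="GAATTC"):
    """Return (run, site_position): the homopolymer run as above, and the
    1-based position of the first `site` found after that run (or None).
    If no run is found, both values are None."""
    run = find_homopolymer(read, base, min_len)
    if run is None:
        return None, None
    hit = read.upper().find(site, run[1])
    return run, (hit + 1 if hit >= 0 else None)


# Made-up read: some insert, a 15-base poly(A) tail, then GAATTC from the vector.
example_read = "ACGT" + "A" * 15 + "GAATTCGAT"
landmarks = three_prime_landmarks(example_read)
```

For a read that starts with poly(T), call the same functions with base="T".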
6 
Make a table in Word or Notepad and record the following information for each of your clones:
the length of high-quality sequence for each of your sequencing reads
the base number where the insert sequence starts
whether the sequence includes a 3' polyA tail and 3' vector
any comments about the sequence (such as "starts with poly(T)" or "It did not work - probably no DNA")
Copy the putative insert sequence from the TraceViewer page to your Word notefile by dragging across the relevant segment of the sequence and copying, remembering to give each its unique clone name. By the end of today you should have collected four good INSERT sequences, and collected information for their "curriculum vitae".
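If you would rather keep the clone "CV" in a form that a spreadsheet can open, the same columns can be written out as tab-separated text. This Python sketch is just one possible layout: the class name, the field names and the example clone are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CloneRecord:
    """One row of the clone "CV" table (field names are illustrative)."""
    clone: str
    hq_length: int                # length of high-quality sequence
    insert_start: Optional[int]   # base number where the insert starts
    has_polya: bool               # is there a 3' poly(A) tail?
    has_3prime_vector: bool       # does the read run into 3' vector?
    comments: str = ""


def as_tsv(records):
    """Return the records as tab-separated text with a header line."""
    header = ["clone", "hq_length", "insert_start",
              "polyA", "3prime_vector", "comments"]
    rows = ["\t".join(header)]
    for r in records:
        rows.append("\t".join([
            r.clone,
            str(r.hq_length),
            "" if r.insert_start is None else str(r.insert_start),
            "yes" if r.has_polya else "no",
            "yes" if r.has_3prime_vector else "no",
            r.comments,
        ]))
    return "\n".join(rows) + "\n"


# A made-up record; the numbers are not real results.
example = CloneRecord("EK01", 540, 161, True, False, "read ends in poly(A)")
table_text = as_tsv([example])
```

Paste the resulting text into Notepad and save it with a .txt or .tsv ending, or simply keep the same columns in a Word table.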
Next week
We will supply additional instructions for:
performing similarity searches
identifying the function of the encoded proteins
...but if you have got to here very quickly and want to zoom on, one thing to do is to try to identify if your sequence has an open reading frame in it.
There are several ways to find out:
(1) by eye using a table of codons (painful)
(2) using ARTEMIS as you did in TechSession4.
Save your sequence as a FASTA file (where the name of the sequence is on the first line, preceded by a '>', and the sequence is on the second and subsequent lines). This file MUST be in raw or plain text format, not in Word or another program-specific format.
e.g.
>HP_ADY_000A00 my cool sequence
AGAGACAGTATCCGCGATAGAGCAGATCGGACAGCTAGGACA
GCTAGGACCGCTCGGACCGGTGTCGACAAGCTGACACAGCTA
GCACAGCT
Open the fasta file in ARTEMIS (see HERE if you have forgotten how!) and look for ORFs.
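If you prefer to produce the FASTA file with a script rather than by hand, the layout described above can be generated as in this Python sketch. The function names and the line width of 60 are assumptions (any reasonable width is fine), and the file is written as plain text, as required.

```python
def to_fasta(name, sequence, description="", width=60):
    """Return one FASTA record as text: a '>' header line followed by the
    sequence split into lines of at most `width` characters."""
    # Drop any spaces or line breaks picked up when copying from the viewer.
    seq = "".join(sequence.split()).upper()
    header = ">" + name
    if description:
        header += " " + description
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join([header] + lines) + "\n"


def save_fasta(path, name, sequence, description="", width=60):
    """Write one FASTA record to `path` as a plain text file."""
    with open(path, "w") as handle:
        handle.write(to_fasta(name, sequence, description, width))


# The clone name and sequence are taken from the example in this document.
record = to_fasta(
    "HP_ADY_000A00",
    "AGAGACAGTATCCGCGATAGAGCAGATCGGACAGCTAGGACA"
    "GCTAGGACCGCTCGGACCGGTGTCGACAAGCTGACACAGCTA"
    "GCACAGCT",
    description="my cool sequence",
)
```

Calling save_fasta with a file name of your choice then gives a file that you can open in ARTEMIS.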
(3) Using online translation programmes.
For these you just paste your sequence in a window, choose the genetic code, and press go. The DNA is translated into protein and you can then choose which you believe might be the correct open reading frame.
Try: EMBOSS Transeq at http://www.ebi.ac.uk/emboss/transeq/ or the EXPASY Translate Tool at http://expasy.org/tools/dna.htm
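To see what these translation tools are doing, here is a minimal Python sketch of a six-frame translation using the standard genetic code. It is an illustration, not a copy of how Transeq or the Translate Tool work: the function names are made up, and picking the longest stop-free stretch is only one simple way of guessing the reading frame. An EST is often a partial cDNA, so the stretch is not required to begin with a start codon.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only;
    any other character, such as N, is left unchanged)."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate from the first base; codons containing anything other
    than A, C, G or T become 'X', and stop codons become '*'."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frames(seq):
    """Return a dict of the six translations, keyed '+1'..'+3', '-1'..'-3'."""
    frames = {}
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[strand + str(offset + 1)] = translate(s[offset:])
    return frames


def longest_open_stretch(seq):
    """Return (frame, peptide) for the longest stretch without a stop codon."""
    best = ("", "")
    for frame, protein in six_frames(seq).items():
        for piece in protein.split("*"):
            if len(piece) > len(best[1]):
                best = (frame, piece)
    return best
```

Paste your insert sequence in as a string and compare the six translations with what the online tools give you; they should agree if you chose the standard genetic code.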