Expressed sequence tag (EST) analysis
An expressed sequence tag (EST) is a single-pass sequence taken from a
randomly selected cDNA clone. ESTs are used to investigate the
diversity of genes expressed by an organism, tissue or cell. By
looking only at expressed sequences we can:
- avoid the expense of complete genome sequencing
(no introns or intergenic DNA are sequenced)
- allow the organism/tissue/cell to instruct us as
to what is "important" in terms of expression levels of
genes
- permit assessment of differential gene expression
by comparing stage or tissue specific datasets
- confirm splicing and coding predictions from
genomic DNA sequences
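
As a rough illustration of comparing two stage- or tissue-specific
datasets, the sketch below asks whether one gene's EST count differs
between two libraries more than random sampling would explain. It uses
Fisher's exact test, which is a choice made here for illustration and
is not a method named on this page; the function name and the example
counts are hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, n1, b, n2):
    """Two-sided Fisher's exact test p-value (illustrative sketch).

    a  = ESTs from the gene of interest in library 1 (n1 ESTs in total)
    b  = ESTs from the same gene in library 2 (n2 ESTs in total)
    """
    hits = a + b
    denom = comb(n1 + n2, hits)

    def prob(x):
        # Hypergeometric probability that x of the hits fall in library 1
        return comb(n1, x) * comb(n2, hits - x) / denom

    observed = prob(a)
    lo = max(0, hits - n2)
    hi = min(n1, hits)
    # Sum the probabilities of all splits at least as unlikely as the observed one
    p = sum(q for q in (prob(x) for x in range(lo, hi + 1))
            if q <= observed * (1 + 1e-9))
    return min(1.0, p)

# Hypothetical counts: 20 of 1000 ESTs in one stage, 1 of 1000 in another
p_value = fisher_exact_two_sided(20, 1000, 1, 1000)
print(f"p = {p_value:.2e}")
```

A small p-value only says that the counts differ more than sampling
noise would predict. It does not correct for library construction
biases, amplification or normalisation.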
The genes expressed in a tissue or stage contribute
to a pool of mRNAs. The relative levels of each mRNA in
this steady state pool reflect (but may not be equivalent to) the
levels of expression of the encoded proteins. In EST analysis, a cDNA
library is first constructed from mRNA.
mRNA steady state pool

converted to double stranded cDNA

and cloned into a bacterial vector (either plasmid or
bacteriophage)

The distribution of abundances of cDNAs in the cDNA
library, if it is carefully made, will reflect the distribution of
mRNAs present in the original tissue.
The cDNA library is then sampled at random. Each
clone is sequenced in one direction from either the 3' or 5' end
(libraries are usually made such that the cDNA is cloned
directionally; some EST projects perform 5' sequencing, others 3',
others both).

These randomly selected sequences are then grouped by
sequence identity into clusters, each of which should derive from a
single gene.
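
The grouping step can be sketched as follows. This is a minimal toy
version, not the clustering method of any particular EST project: it
joins two ESTs when they share a minimum number of exact k-mers and
merges the joined pairs with a union-find structure. The function
name, the k-mer length and the threshold are assumptions made for the
example, and all reads are assumed to be in the same orientation (no
reverse-complement matching, no tolerance for sequencing errors).

```python
from collections import defaultdict

def cluster_ests(ests, k=12, min_shared=3):
    """Group ESTs (dict: name -> sequence) by shared exact k-mers."""
    parent = {name: name for name in ests}

    def find(x):
        # Follow parent links to the cluster representative (with path halving)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Index every k-mer to the set of ESTs that contain it
    index = defaultdict(set)
    for name, seq in ests.items():
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)

    # Count distinct shared k-mers for each pair of ESTs
    shared = defaultdict(int)
    for names in index.values():
        ordered = sorted(names)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                shared[(a, b)] += 1

    # Merge pairs that share enough k-mers (single linkage)
    for (a, b), count in shared.items():
        if count >= min_shared:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = defaultdict(list)
    for name in ests:
        clusters[find(name)].append(name)
    return sorted(sorted(members) for members in clusters.values())
```

Because the linkage is single, two ESTs that do not overlap each other
still end up in one cluster if a third EST bridges them. Without such
a bridge they stay in separate clusters.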

The clusters can be used to
- derive a consensus sequence that may be better
(more reliable, due to the overlaps, and longer) than each
individual EST
- perform similarity and other database
analyses
- examine expression patterns in the
dataset.
It is thus possible to identify abundantly expressed
genes by their relative abundance in the EST dataset. Genes expressed
at low levels may not be represented in the ESTs, even though they
are expressed in the sample, because of the stochastic nature
of the selection process.

Thus the relative abundance of a gene transcript in
the EST dataset can be taken as a general, but not a definitive,
measure of its abundance in the initial mRNA pool.
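
The effect of random sampling on rare transcripts can be made concrete
with a small calculation. The sketch below assumes an idealised,
unbiased library in which each EST is an independent draw, so a
transcript making up a fraction f of the pool is missed by all N ESTs
with probability (1 - f)^N. The function names and the example numbers
are hypothetical.

```python
import math

def prob_missed(fraction, n_ests):
    """Chance that a transcript at this fraction of the pool yields no EST."""
    return (1.0 - fraction) ** n_ests

def ests_needed(fraction, confidence=0.95):
    """Smallest number of ESTs that samples the transcript at least once
    with the given probability (idealised independent sampling)."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - fraction))

# Hypothetical transcript present at 1 in 1000 mRNAs
print(prob_missed(0.001, 1000))   # chance of seeing no EST among 1000 reads
print(ests_needed(0.001))         # reads needed for a 95% chance of one hit
```

Under these assumptions a transcript at 1 in 1000 is missed in roughly
a third of 1000-EST datasets, which is why absence from an EST set is
weak evidence of absence from the tissue.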
Other cautions in interpreting EST data are:
- EST clusters deriving from nonoverlapping
portions of the same long transcript will not be identified as
coming from the same gene, simultaneously inflating the number of
genes apparently expressed, and decreasing the apparent abundance
of expression of the gene
- the mechanics of constructing cDNA libraries can
bias against very short cDNAs (as a size selection process is
usually used) and against very long transcripts (where the mRNA is
less stable)
- libraries that have been amplified before
sequencing may give biased estimates of the abundance of
expression of certain genes, as different clones do not always
grow equally well in E. coli
- for extensive EST analysis, various processes of
normalisation are often used. These attempt to even out the
relative levels of sequences in the library (either before
cloning, or after it has been constructed). This obviously affects
interpretation of the relative EST abundances.