BaNG - Blaxter Nematode and Neglected Genomics
  The C. elegans genome
     Introduction to the genome of a model nematode
       Mark Blaxter at the Institute of Evolutionary Biology, University of Edinburgh

Expressed sequence tag (EST) analysis

An expressed sequence tag (EST) is a single-pass sequence read taken from a randomly selected cDNA clone. ESTs are used to investigate the diversity of genes expressed by an organism, tissue or cell. By looking only at expressed sequences we can

  • avoid the expense of complete genome sequencing (no introns or intergenic DNA are sequenced)
  • allow the organism/tissue/cell to instruct us as to what is "important" in terms of expression levels of genes
  • permit assessment of differential gene expression by comparing stage or tissue specific datasets
  • confirm splicing and coding predictions from genomic DNA sequences

The genes expressed in a tissue or stage contribute to a pool of mRNAs. The relative level of each mRNA in this steady-state pool reflects (but may not be equivalent to) the level of expression of the encoded protein. In EST analysis, a cDNA library is first constructed from this mRNA.

The mRNA steady-state pool is converted to double-stranded cDNA and cloned into a bacterial vector (either plasmid or bacteriophage).

If the cDNA library is carefully made, the distribution of cDNA abundances in it will reflect the distribution of mRNAs present in the original tissue.

The cDNA library is then sampled at random. Each clone is sequenced in one direction from either the 3' or 5' end (libraries are usually made such that the cDNA is cloned directionally; some EST projects perform 5' sequencing, others 3', others both).

These randomly selected sequences are then grouped by sequence identity into clusters, each of which should derive from a single gene.
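
The grouping step can be illustrated with a minimal Python sketch. It assumes that two ESTs belong together if they share at least one exact k-mer, and merges them with a union-find structure. The function names, the value of k and the toy sequences are illustrative assumptions, not the method used by any particular EST project; real clustering tools also tolerate sequencing errors and compare both strands, which this sketch does not.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_ests(ests, k=12):
    """Group ESTs that share at least one exact k-mer (single linkage).

    ests: dict mapping EST name -> sequence (all on the same strand).
    Returns a sorted list of sorted name lists, one list per cluster.
    """
    parent = {name: name for name in ests}

    def find(x):
        # Follow parent links to the cluster representative,
        # shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # k-mer -> name of the first EST that contained it
    for name, seq in ests.items():
        for km in kmers(seq.upper(), k):
            if km in first_seen:
                # Shared k-mer: merge the two clusters.
                parent[find(name)] = find(first_seen[km])
            else:
                first_seen[km] = name

    clusters = {}
    for name in ests:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Toy data (made up): two overlapping reads from one transcript,
# and one read from an unrelated transcript.
gene_a = "ATGGCATTCGAAGCTTACGGATCCAAGTCTGATCGTAGCAATGGCTACCGTA"
gene_b = "TTGCCGTCGGCTTGCGCCTGTCCGGTTGCTCGGTCCTGCG"
ests = {"a1": gene_a[:30], "a2": gene_a[20:], "b1": gene_b[:25]}

print(cluster_ests(ests, k=8))  # [['a1', 'a2'], ['b1']]
```

Because the sketch links reads only through shared sequence, two reads taken from non-overlapping parts of the same transcript end up in separate clusters.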

The clusters can be used to

  • derive a consensus sequence that may be better (more reliable, due to the overlaps, and longer) than each individual EST
  • perform similarity and other database analyses
  • examine expression patterns in the dataset.
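
As a rough illustration of why a cluster consensus can be more reliable and longer than any single EST, the sketch below takes a majority vote at each position. It assumes the ESTs have already been placed at known offsets within the cluster (a real pipeline would obtain these from an alignment or assembly program); the function name and the example reads are hypothetical.

```python
from collections import Counter


def consensus(placed_ests):
    """Majority-vote consensus for one cluster.

    placed_ests: list of (offset, sequence) pairs, where offset is the
    0-based start of the read relative to the start of the cluster.
    Positions covered by no read are reported as 'N'.
    """
    length = max(offset + len(seq) for offset, seq in placed_ests)
    columns = [Counter() for _ in range(length)]
    for offset, seq in placed_ests:
        for i, base in enumerate(seq.upper()):
            columns[offset + i][base] += 1
    return "".join(
        col.most_common(1)[0][0] if col else "N" for col in columns
    )


# Three overlapping reads; the second carries a single-base error
# (T instead of A) that the other two reads outvote.
reads = [
    (0, "ACGTTGCA"),
    (4, "TGCTAGGC"),
    (6, "CAAGGCTA"),
]
print(consensus(reads))  # ACGTTGCAAGGCTA
```

The consensus here is 14 bases long, although no individual read is longer than 8, and the error in the second read does not survive the vote.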

It is thus possible to identify abundantly expressed genes by their relative abundance in the EST dataset. Genes expressed at low levels may not be represented in the ESTs at all, even though they are expressed in the sample, because clone selection is a stochastic process.

Thus the relative abundance of a gene transcript in the EST dataset can be taken as a general, but not a definitive, measure of its abundance in the original mRNA pool.
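
The effect of random sampling on rare transcripts can be put into numbers. The sketch below assumes that each sequenced clone is an independent draw from a library much larger than the sample, so that a transcript making up a fraction p of the clones is seen at least once among N ESTs with probability 1 - (1 - p)^N. The function names and the example figures are illustrative assumptions, not values from any real EST project.

```python
import math


def prob_detected(fraction, n_ests):
    """Probability that a transcript making up `fraction` of the library
    appears at least once among `n_ests` randomly sampled clones."""
    return 1.0 - (1.0 - fraction) ** n_ests


def ests_needed(fraction, confidence=0.95):
    """Smallest number of ESTs giving at least `confidence` probability
    of sampling the transcript once or more."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - fraction))


# A transcript present as 1 clone in 10,000, with 10,000 ESTs sequenced:
p = 1e-4
print(round(prob_detected(p, 10_000), 3))  # 0.632
print(ests_needed(p, 0.95))
```

Under these assumptions, a gene at one clone in 10,000 is missed more than a third of the time even when 10,000 ESTs are sequenced, which is why absence from an EST dataset is weak evidence of absence of expression.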

Other cautions in interpreting EST data are

  • EST clusters deriving from nonoverlapping portions of the same long transcript will not be identified as coming from the same gene, simultaneously inflating the number of genes apparently expressed, and decreasing the apparent abundance of expression of the gene
  • the mechanics of constructing cDNA libraries can bias against very short cDNAs (as a size selection step is usually used) and against very long transcripts (where the mRNA is less stable)
  • libraries that have been amplified before sequencing may give biased estimates of the abundance of expression of certain genes, as not all clones grow equally well in E. coli
  • for extensive EST analysis, various processes of normalisation are often used. These attempt to even out the relative levels of sequences in the library (either before cloning, or after it has been constructed). This obviously affects interpretation of the relative EST abundances.

     

These pages were written by Mark Blaxter and last updated in early 2007.