BaNG - Blaxter Nematode and Neglected Genomics
  The C. elegans genome
     Introduction to the genome of a model nematode
       Mark Blaxter at the Institute of Evolutionary Biology, University of Edinburgh


Overall patterns in the C. elegans genome

For patterns of sequence organisation by chromosome, see this table derived from Table 2 of the 1998 Science paper


1 Distribution of DNA in intergenic DNA, exons and introns

region            share of genome
intergenic DNA    47%
exonic DNA        27%
intronic DNA      26%

2 Distribution of repetitive DNAs

2.7% of the genome is simple, short tandem repeat

3.6% is simple, short inverted repeat

repeat type       tandem    inverted
intergenic DNA    49%       55%
exonic DNA        <1%       <1%
intronic DNA      51%       45%
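The figures in sections 1 and 2 can be combined to ask how densely each region is packed with simple repeats. The sketch below is illustrative only: it assumes the percentages in the repeat table give the share of each repeat class found in each region, and it treats the "<1%" exonic entries as zero. The names REGION_PCT, REPEAT_PCT, LOCATION_PCT and repeat_density are made up for this example.

```python
# Share of the genome in each region, in percent (section 1).
REGION_PCT = {"intergenic": 47.0, "exonic": 27.0, "intronic": 26.0}

# Share of the whole genome made up of each simple repeat class, in percent.
REPEAT_PCT = {"tandem": 2.7, "inverted": 3.6}

# Where each repeat class is found, in percent of that class.
# Assumption: the "<1%" exonic entries are approximated as 0.
LOCATION_PCT = {
    "tandem": {"intergenic": 49.0, "exonic": 0.0, "intronic": 51.0},
    "inverted": {"intergenic": 55.0, "exonic": 0.0, "intronic": 45.0},
}


def repeat_density(repeat_type, region):
    """Return the percent of a region's DNA that consists of one repeat class."""
    genome_pct_in_region = REPEAT_PCT[repeat_type] * LOCATION_PCT[repeat_type][region] / 100.0
    return 100.0 * genome_pct_in_region / REGION_PCT[region]


if __name__ == "__main__":
    for repeat_type in REPEAT_PCT:
        for region in REGION_PCT:
            density = repeat_density(repeat_type, region)
            print(f"{repeat_type:8s} {region:10s} {density:.2f}% of region")
```

On these figures, the repeats are split almost evenly between intergenic and intronic DNA, but because introns make up a smaller share of the genome, their repeat density comes out roughly twice as high for tandem repeats.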

There are 38 defined dispersed repeat families, many of which correspond to transposon-like elements. Many transposons (Tc elements) had already been defined in C. elegans as mutagenic elements. Many of the dispersed repeat families appear to be relics of transposon families that are no longer active, including four novel families in the Tc1/mariner group.

Some individual repeats have strikingly partitioned locations in the genome, the functional significance of which is unclear.

repeat family     features
CeRep26           telomeric repeat; not found in introns
CeRep27           not found in introns
CeRep11           712 copies, only one of which is on the X

There are many instances of duplication of segments of genomic DNA, some including several expressed genes. Some of these duplicated copies have diverged enough in sequence to confirm that both copies are expressed.


3 Distribution of genes across chromosomes

The autosomes had been divided genetically into central clusters (where recombination appeared to be suppressed) and arms (where recombination rates were significantly greater). The X chromosome has a uniform recombination rate.

Analysis of gene density over these genetically defined intervals reveals that the autosome arms:

have more repeats (particularly some families)

have fewer corresponding ESTs

have fewer genes that have a significant match to non-nematode proteins

have more clusters of closely related genes

This suggests that these regions may be rapidly evolving, and may be the birthplace of new genes and gene families.


4 Structure of the genes

The current protein-coding gene dataset contains 19,099 genes. There are over 900 RNA-coding genes.

The average C. elegans gene is ~3 kb long and has 5 introns. Most C. elegans genes have introns. Many introns are very small (37-70 bases). This class of small introns is distinct in sequence content from the remaining larger introns (100 bases to 10 kb).
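The two intron size classes described above can be expressed as a simple length-based lookup. This is only a sketch: the text says the classes also differ in sequence content, which length alone does not capture, and it gives no class for lengths between 71 and 99 bases, so those are reported as unclassified. The function name classify_intron is hypothetical.

```python
def classify_intron(length_bp):
    """Assign an intron to a size class using the length ranges quoted in the text."""
    if 37 <= length_bp <= 70:
        return "small"
    if 100 <= length_bp <= 10_000:
        return "large"
    # Lengths outside both quoted ranges are not assigned to either class here.
    return "unclassified"


if __name__ == "__main__":
    for length in (45, 85, 350, 12_000):
        print(length, classify_intron(length))
```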

Most C. elegans genes are trans-spliced at their 5' end to a small leader exon, called SL-1. Estimates of the actual proportion range from 60% to 80%.

About 25% of C. elegans genes are organised as operons, groups of cotranscribed genes (up to five genes in one putative operon have been observed). Polycistronic pre-mRNAs are resolved into individual mRNAs by a trans-splicing process that includes the addition of a trans-spliced leader exon at the 5' end of the downstream gene. The SL used for resolution of this downstream transcript can be SL-1 or one of a family of SL-2-like SLs. About 10% of C. elegans genes are expected to receive SL-2-like SLs.
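The resolution of a polycistronic pre-mRNA described above can be pictured with a toy model. The sketch below is a bookkeeping illustration, not a model of the splicing machinery: it assumes the first gene in the operon receives SL-1 and every downstream gene receives an SL-2-like leader by default, although the text notes that downstream genes can also receive SL-1. The class MatureMRNA, the function resolve_operon and the gene names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MatureMRNA:
    gene: str    # gene carried by this monocistronic mRNA
    leader: str  # spliced leader added at the 5' end


def resolve_operon(genes: List[str],
                   first_leader: str = "SL-1",
                   downstream_leader: str = "SL-2-like") -> List[MatureMRNA]:
    """Split an ordered list of cotranscribed genes into one mRNA per gene.

    Assumption: the leader depends only on position in the operon.
    """
    mrnas = []
    for position, gene in enumerate(genes):
        leader = first_leader if position == 0 else downstream_leader
        mrnas.append(MatureMRNA(gene=gene, leader=leader))
    return mrnas


if __name__ == "__main__":
    # Hypothetical three-gene operon.
    for mrna in resolve_operon(["geneA", "geneB", "geneC"]):
        print(f"{mrna.leader} + {mrna.gene}")
```

Passing downstream_leader="SL-1" covers the case where a downstream transcript is resolved with SL-1 instead.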

The C. elegans EST dataset (at the time of publication, 68,000 ESTs from 40,000 clones corresponding to 7,432 different genes; now over 100,000 ESTs and 8,500 genes) covers 15% of the predicted coding DNA. 40% of predicted genes have EST matches. 92% of splice predictions were exact, and 97% of introns predicted by GENEFINDER overlapped with those defined by the EST cDNA sequences. Many instances of alternative splicing have been defined. There are very few groups of unmapped cDNAs (<10), and thus the unsequenced regions are unlikely to contain many coding genes.
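The EST figures quoted above can be checked with a little arithmetic. The sketch below assumes that the 19,099 protein-coding genes from the start of this section is the right denominator for both the publication-time and the later EST gene counts; the constant and function names are made up for this example.

```python
PREDICTED_GENES = 19_099               # protein-coding gene dataset quoted in this section
ESTS_AT_PUBLICATION = 68_000
CLONES_AT_PUBLICATION = 40_000
GENES_WITH_EST_AT_PUBLICATION = 7_432
GENES_WITH_EST_LATER = 8_500           # "now" figure quoted in the text


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


def ests_per_clone(ests, clones):
    """Average number of EST reads obtained per cDNA clone."""
    return ests / clones


if __name__ == "__main__":
    at_publication = percent(GENES_WITH_EST_AT_PUBLICATION, PREDICTED_GENES)
    later = percent(GENES_WITH_EST_LATER, PREDICTED_GENES)
    reads = ests_per_clone(ESTS_AT_PUBLICATION, CLONES_AT_PUBLICATION)
    print(f"genes with EST matches at publication: {at_publication:.1f}%")
    print(f"genes with EST matches later:          {later:.1f}%")
    print(f"ESTs per clone at publication:         {reads:.1f}")
```

The publication-time value comes out just under 39%, which agrees with the rounded 40% quoted in the text.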

 

These pages were written by Mark Blaxter and last updated in early 2007.