Clustering Analysis of the EST Dataset

NEMBASE

John Parkinson, David Guiliano and Mark Blaxter

with help from Dr. Callaghan, Steve Jones and Martin Aslett

The success of the EST sequencing effort means that finding a gene of interest now requires searching through more than 23,000 individual sequences. To make this task easier, we have instituted a process of "clustering" the EST sequences by identity. These clusters define putative "genes" and allow us to start defining their function and examining the roles they play in the nematode's biology. The process is called CLOBB, and is described more fully on our informatics pages.

Each cluster contains the constituent ESTs and a consensus sequence derived from an alignment of these sequences. This simple format allows searching by sequence similarity (the consensus sequences are in general of much higher quality than any of the constituent ESTs) and by the stage specificity and abundance of expression of each cluster. We have performed BLAST searches of public databases and included the results in our local relational database, NEMBASE.
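To illustrate why a consensus tends to be of higher quality than any single EST, the sketch below builds a consensus by column-wise majority vote over sequences that have already been aligned. This is an assumption made for illustration only: how the consensus is actually derived is not specified here, and the function name `majority_consensus` and the gap character are hypothetical.

```python
from collections import Counter


def majority_consensus(aligned, gap="-"):
    """Column-wise majority vote over equal-length, pre-aligned sequences.

    Illustrative only: a real pipeline would rely on an assembler or
    alignment tool and would typically weigh base quality as well.
    """
    if not aligned:
        return ""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")

    consensus = []
    for column in zip(*aligned):
        # Count every symbol in the column, gaps included.
        symbol, _count = Counter(column).most_common(1)[0]
        if symbol == gap:
            continue  # most reads have no base here, so emit nothing
        consensus.append(symbol)
    return "".join(consensus)


if __name__ == "__main__":
    reads = ["ACGT-A", "ACGTTA", "ACCT-A"]
    print(majority_consensus(reads))
```

A sequencing error present in only one read (the `C` in the third read, or the extra `T` in the second) is outvoted by the other reads, which is the intuition behind the quality gain. Ties are resolved in favour of the symbol seen first in the column.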

We feel it is important that these clusters are "permanent". That is, if a researcher discovers a cluster one day, that cluster will continue to exist under the same identifier in the future. The CLOBB process therefore:

• allocates a unique identifier to each cluster

• has a mechanism for incremental updating of the database

• has a method for retiring clusters found to be in error, and for merging clusters later found to derive from the same gene.
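The three properties above can be sketched as a small in-memory registry. This is not the CLOBB code: the class and method names, the identifier format (a prefix plus a five-digit counter) and the use of Python are all assumptions made for illustration. The point it shows is that identifiers are never reused or deleted, so an old identifier can always be looked up, even after a merge.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional, Set


@dataclass
class Cluster:
    cluster_id: str
    ests: Set[str] = field(default_factory=set)
    retired: bool = False
    merged_into: Optional[str] = None  # set when absorbed by another cluster


class ClusterRegistry:
    """Hypothetical registry of clusters with permanent identifiers."""

    def __init__(self, prefix: str = "XXC"):
        self.prefix = prefix  # assumed identifier prefix, not a real one
        self._next = 1        # counter only ever increases
        self.clusters: Dict[str, Cluster] = {}

    def new_cluster(self, ests: Iterable[str]) -> str:
        """Allocate a fresh, never-reused identifier."""
        cluster_id = f"{self.prefix}{self._next:05d}"
        self._next += 1
        self.clusters[cluster_id] = Cluster(cluster_id, set(ests))
        return cluster_id

    def resolve(self, cluster_id: str) -> str:
        """Follow merge pointers from an old identifier to the live one."""
        cluster = self.clusters[cluster_id]
        while cluster.merged_into is not None:
            cluster = self.clusters[cluster.merged_into]
        return cluster.cluster_id

    def add_ests(self, cluster_id: str, ests: Iterable[str]) -> None:
        """Incremental update: add new ESTs to an existing cluster."""
        cluster = self.clusters[self.resolve(cluster_id)]
        if cluster.retired:
            raise ValueError(f"{cluster.cluster_id} has been retired")
        cluster.ests.update(ests)

    def retire(self, cluster_id: str) -> None:
        """Mark a cluster as erroneous; its record is kept, not deleted."""
        self.clusters[cluster_id].retired = True

    def merge(self, keep_id: str, absorb_id: str) -> None:
        """Fold one cluster into another, leaving a forwarding pointer."""
        keep = self.clusters[self.resolve(keep_id)]
        absorb = self.clusters[self.resolve(absorb_id)]
        if keep is absorb:
            return  # already the same cluster
        keep.ests |= absorb.ests
        absorb.retired = True
        absorb.merged_into = keep.cluster_id
```

In this sketch, a researcher who recorded the identifier of a cluster that was later merged would call `resolve` with the old identifier and be directed to the surviving cluster, rather than finding that the identifier had disappeared.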

See here for an example of the sorts of data generated.