Clustering Analysis of the EST Dataset
John Parkinson, David Guiliano and Mark Blaxter
with help from Dr. Callaghan, Steve Jones and Martin Aslett
The success of the EST sequencing effort means that finding a gene of interest now requires searching through more than 23,000 individual sequences. To make this task easier, we have instituted a process of "clustering" the EST sequences by identity. These clusters define putative "genes" and allow us to start defining their function and examining the roles they play in the nematode's biology. The process is called CLOBB, and is described more fully on our informatics pages.
Each cluster contains the constituent ESTs and a consensus sequence derived from an alignment of these sequences. This simple format allows searching by sequence similarity (the consensus sequences are in general of much higher quality than any of the constituent ESTs) and by the stage specificity and abundance of expression of each cluster. We have performed BLAST searches of public databases and included the results in our local relational database, NEMBASE.
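As an illustration only, a cluster record of this kind could be represented as in the sketch below. The class and field names (`Est`, `Cluster`, `stage`, `stage_abundance`) are hypothetical and are not taken from CLOBB or NEMBASE, and the identifier, accessions and sequences are made up. Counting member ESTs per stage is used here as a rough stand-in for abundance of expression.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Est:
    """One EST read. All field names are illustrative assumptions."""
    accession: str
    sequence: str
    stage: str  # life-cycle stage of the library the EST came from


@dataclass
class Cluster:
    """A putative gene: its member ESTs plus a consensus sequence."""
    cluster_id: str
    consensus: str
    ests: list = field(default_factory=list)

    def stage_abundance(self):
        """Count member ESTs per stage (a rough proxy for expression)."""
        return Counter(est.stage for est in self.ests)


# Made-up example data, not real cluster content.
example = Cluster(
    cluster_id="EX00001",
    consensus="ATGGCTAGCTAGGATCC",
    ests=[
        Est("est_a", "ATGGCTAGCTAGG", "L3"),
        Est("est_b", "GCTAGCTAGGATCC", "L3"),
        Est("est_c", "ATGGCTAGCTAGGATC", "adult"),
    ],
)
abundance = example.stage_abundance()
```

Here `abundance` maps each stage to the number of ESTs in the cluster that came from it, which is the kind of per-cluster summary a stage-specificity search would rely on.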
We feel it is important that these clusters are "permanent". That is, if a researcher discovers a cluster one day, that cluster will continue to exist under the same identifier in the future. The CLOBB process therefore:
allocates a unique identifier to each cluster;
has a mechanism for incremental updating of the database;
has a method for retiring clusters found to be in error, and for merging clusters later found to derive from the same gene.
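The three properties above can be sketched as a small registry in which identifiers are issued once and never reused. This is a minimal sketch under our own assumptions, not the CLOBB implementation: the class `ClusterRegistry`, its method names, the `EX` prefix and the five-digit numbering are all hypothetical. A retired or merged identifier stays on record, so an old identifier can still be looked up.

```python
class ClusterRegistry:
    """Toy registry: identifiers are never reused or deleted (assumed design)."""

    def __init__(self, prefix="EX"):
        self.prefix = prefix       # made-up identifier prefix
        self._next = 1             # counter for the next new identifier
        self.members = {}          # cluster id -> set of EST accessions
        self.replaced_by = {}      # retired id -> surviving id, or None

    def new_cluster(self, accessions):
        """Allocate a fresh, unique identifier for a new cluster."""
        cid = f"{self.prefix}{self._next:05d}"
        self._next += 1
        self.members[cid] = set(accessions)
        return cid

    def resolve(self, cid):
        """Return the current id for cid, or None if it was retired in error."""
        if cid not in self.members:
            raise KeyError(cid)
        while cid in self.replaced_by:
            cid = self.replaced_by[cid]
            if cid is None:
                return None
        return cid

    def _live(self, cid):
        current = self.resolve(cid)
        if current is None:
            raise ValueError(f"{cid} has been retired")
        return current

    def add_ests(self, cid, accessions):
        """Incremental update: add new ESTs to an existing cluster."""
        self.members[self._live(cid)].update(accessions)

    def retire(self, cid):
        """Retire a cluster found to be in error; its record is kept."""
        self.replaced_by[self._live(cid)] = None

    def merge(self, keep, absorb):
        """Merge two clusters found to derive from the same gene."""
        keep, absorb = self._live(keep), self._live(absorb)
        if keep != absorb:
            self.members[keep] |= self.members[absorb]
            self.replaced_by[absorb] = keep
        return keep


registry = ClusterRegistry()
a = registry.new_cluster(["est_a", "est_b"])
b = registry.new_cluster(["est_c"])
registry.add_ests(a, ["est_d"])   # incremental update
registry.merge(a, b)              # b now points at a
c = registry.new_cluster(["est_e"])
registry.retire(c)                # c stays on record but is no longer live
d = registry.new_cluster(["est_f"])
```

After these calls, looking up `b` leads to `a`, looking up `c` reports that it was retired, and `d` receives a new number rather than recycling that of `c`.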