Building NEMBASE
NEMBASE is constructed using publicly available DNA sequence data deposited in GenBank. These include sequences generated by our in-house Expressed Sequence Tag (EST) sequencing initiative, which aims to produce 100,000 sequences from 5 different parasitic nematodes.
Sequences generated by our in-house research initiative are processed using our software tool 'trace2dbest'.
This involves:
  1. Raw sequence reads (trace files) are initially processed using the software program 'PHRED'.
  2. Quality information provided by the '-trim' option of 'PHRED' is then used to trim the sequences to regions of good quality.
  3. These sequences are then trimmed for vector/leader sequences and polyA tails.
  4. Sequences greater than 150 bp are then submitted to GenBank.
PartiGene is used for the analysis of the sequence data.
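The trimming and filtering steps above can be sketched roughly as follows. This is a minimal illustration only, not the trace2dbest code: the function names, the quality cutoff of 20, the 10-base polyA threshold, and the "longest high-quality run" rule are all assumptions made for the example, and the quality values are supplied directly rather than read from 'PHRED' output.

```python
def trim_low_quality(seq, quals, min_q=20):
    """Return the longest run of bases whose quality is >= min_q (assumed rule)."""
    best_start, best_len = 0, 0
    start = None
    # A sentinel quality of -1 closes any run that reaches the end of the read.
    for i, q in enumerate(list(quals) + [-1]):
        if q >= min_q:
            if start is None:
                start = i
        elif start is not None:
            if i - start > best_len:
                best_start, best_len = start, i - start
            start = None
    return seq[best_start:best_start + best_len]


def strip_leader(seq, leader):
    """Remove a known vector/leader sequence from the start of the read."""
    if leader and seq.upper().startswith(leader.upper()):
        return seq[len(leader):]
    return seq


def strip_poly_a(seq, min_run=10):
    """Remove a trailing run of at least min_run A's (a simple polyA model)."""
    stripped = seq.rstrip("Aa")
    if len(seq) - len(stripped) >= min_run:
        return stripped
    return seq


def process_read(seq, quals, leader="", min_len=150):
    """Apply the three trimming steps; return None if the result is too short."""
    trimmed = trim_low_quality(seq, quals)
    trimmed = strip_leader(trimmed, leader)
    trimmed = strip_poly_a(trimmed)
    # Only sequences longer than min_len would go forward for submission.
    return trimmed if len(trimmed) > min_len else None
```

A read that is still longer than 150 bp after trimming is returned; anything shorter yields None and would not be submitted.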
This involves:
Clustering using CLOBB.pl
Sequence information is collated for each species. In many cases several sequences will derive from the same gene, so sequences are clustered on the basis of sequence similarity to create groups of sequences that potentially derive from the same gene. This clustering is undertaken using CLOBB.pl, a program developed in-house and based on the 'BLAST' algorithm. In brief:
  1. Each sequence is 'BLAST'd against a growing database (db) of sequences.
  2. If the sequence shows a significant match with another sequence in the db, it is assigned the same cluster number as the matching sequence.
  3. If the sequence does not show a significant match with any sequence in the db, it is assigned a new cluster number and added to the db.
  4. If the sequence matches two or more sequences in the db which do not share a common cluster number, the program will attempt to merge the clusters together.
  5. If merging is not possible, due to sequence conflicts, the sequence is assigned the cluster number of the sequence for which it had the highest scoring match and the clusters are noted for possible human annotation at a later date.
Our program has the advantage over other clustering algorithms that cluster numbers are preserved across subsequent builds, which allows novel sequence data to be incorporated as it arises.
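The incremental clustering logic described above can be sketched as follows. This is a simplified illustration and not CLOBB.pl itself: a count of shared k-mers stands in for a 'BLAST' score, the `can_merge` check is a hypothetical placeholder for the program's sequence-conflict test, merged clusters are assumed to keep the lowest cluster number, and every sequence is assumed to be added to the db once assigned.

```python
def shared_kmers(a, b, k=8):
    """Toy similarity score standing in for a BLAST score: shared k-mer count."""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    return len(kmers_a & kmers_b)


def cluster_sequences(new_seqs, db=None, cluster_of=None, score=shared_kmers,
                      min_score=1, can_merge=lambda c1, c2: True):
    """Assign cluster numbers to new_seqs (id -> sequence).

    db (id -> sequence) and cluster_of (id -> cluster number) may come from
    an earlier build, so existing cluster numbers are carried forward.
    """
    db = {} if db is None else db
    cluster_of = {} if cluster_of is None else cluster_of
    flagged = []  # conflicting cluster sets noted for later human annotation
    next_id = max(cluster_of.values(), default=0) + 1

    for seq_id, seq in new_seqs.items():
        # Step 1: compare the sequence against the growing db.
        hits = {}
        for other_id, other_seq in db.items():
            s = score(seq, other_seq)
            if s >= min_score:
                hits[other_id] = s
        clusters = {cluster_of[h] for h in hits}

        if not clusters:                      # step 3: no match, new cluster
            cluster_of[seq_id] = next_id
            next_id += 1
        elif len(clusters) == 1:              # step 2: join the matching cluster
            cluster_of[seq_id] = clusters.pop()
        else:                                 # step 4: try to merge the clusters
            target = min(clusters)
            if all(can_merge(target, c) for c in clusters if c != target):
                for other_id, c in cluster_of.items():
                    if c in clusters:
                        cluster_of[other_id] = target
                cluster_of[seq_id] = target
            else:                             # step 5: best hit wins, flag for review
                best = max(hits, key=hits.get)
                cluster_of[seq_id] = cluster_of[best]
                flagged.append(sorted(clusters))
        db[seq_id] = seq                      # assumption: every sequence joins the db

    return cluster_of, flagged
```

Passing the `db` and `cluster_of` dictionaries from one build into the next call is what keeps existing cluster numbers stable in this sketch; new numbers are only issued above the highest number already in use.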
Cluster Assembly
The next step in the build process is to assemble the clusters into consensus sequences. This is achieved using the program 'PHRAP'.
Functional Analyses
At present, functional analysis involves a simple BLAST search of each consensus sequence against the following three databases:
  1. BlastN vs. non-redundant DNA db
  2. BlastX vs. non-redundant protein db
  3. BlastN vs. dbEST
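The three searches listed above could be scripted along the following lines. This is only a sketch under assumptions: it builds command lines for the NCBI BLAST+ programs `blastn` and `blastx`, which may not be the BLAST version used for the NEMBASE build, and the database names (`nt`, `nr`, `est`), the e-value cutoff and the output file naming are illustrative choices rather than the settings actually used.

```python
def blast_commands(query_fasta, evalue="1e-5"):
    """Build one BLAST+ command line per search (program, database)."""
    searches = [
        ("blastn", "nt"),   # BlastN vs. non-redundant DNA db (assumed name)
        ("blastx", "nr"),   # BlastX vs. non-redundant protein db (assumed name)
        ("blastn", "est"),  # BlastN vs. dbEST (assumed name)
    ]
    commands = []
    for program, database in searches:
        out_file = f"{query_fasta}.{program}.{database}.out"
        commands.append([
            program,
            "-query", query_fasta,
            "-db", database,
            "-evalue", evalue,
            "-out", out_file,
        ])
    return commands
```

Each returned list can be handed to `subprocess.run(cmd, check=True)` on a machine where BLAST+ and the corresponding databases are installed.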
Importing the data into NEMBASE
NEMBASE resides in a PostgreSQL database on an Intel box running Linux. It exists in the form of a number of relational tables which may be queried using SQL. The data are imported using a number of simple Perl scripts which obtain sequence, library, expression and BLAST information from the previously generated data. PHP (a server-side, cross-platform, HTML-embedded scripting language) is then used to create web-based forms which enable comprehensive searching of these data.
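As an illustration of the kind of query such relational tables support, the sketch below joins cluster and BLAST tables to find clusters of one species whose hits mention a keyword. Everything in it is hypothetical: the table and column names are invented for the example and are not the NEMBASE schema, the rows are made-up placeholders, and Python's built-in `sqlite3` module stands in for PostgreSQL so that the example is self-contained.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE cluster (
            cluster_id TEXT PRIMARY KEY,
            species    TEXT,
            consensus  TEXT
        );
        CREATE TABLE blast_hit (
            cluster_id  TEXT REFERENCES cluster(cluster_id),
            program     TEXT,
            description TEXT,
            score       REAL
        );
    """)
    # Placeholder rows, invented purely for this example.
    conn.executemany("INSERT INTO cluster VALUES (?, ?, ?)", [
        ("C0001", "Species A", "ACGTACGT"),
        ("C0002", "Species A", "GGCCTTAA"),
        ("C0003", "Species B", "TTGACCAG"),
    ])
    conn.executemany("INSERT INTO blast_hit VALUES (?, ?, ?, ?)", [
        ("C0001", "blastx", "example protein kinase", 210.0),
        ("C0002", "blastx", "example kinase-like protein", 95.0),
        ("C0003", "blastx", "example protein kinase", 180.0),
    ])
    return conn


def clusters_with_hit(conn, species, keyword):
    """Return (cluster_id, description, score) rows, best score first."""
    return conn.execute(
        """
        SELECT c.cluster_id, h.description, h.score
        FROM cluster AS c
        JOIN blast_hit AS h ON h.cluster_id = c.cluster_id
        WHERE c.species = ? AND h.description LIKE ?
        ORDER BY h.score DESC
        """,
        (species, f"%{keyword}%"),
    ).fetchall()
```

A web form would collect the species and keyword from the user and pass them as query parameters, as done here, rather than building the SQL string by hand.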