
Building TardiBASE

TardiBASE is constructed from publicly available Hypsibius dujardini DNA sequence data deposited in GenBank. This currently includes only Expressed Sequence Tags (ESTs) generated by our in-house EST sequencing initiative.

Processing sequences in-house
Sequences generated by our in-house research initiative are processed as follows:
  1. Raw sequence reads (trace files) are initially processed using the program 'PHRED'.
  2. Quality information provided by the '-trim' option of 'PHRED' is then used to trim the sequences to regions of good quality.
  3. These sequences are then trimmed to remove vector/leader sequences and polyA tails.
  4. Sequences longer than 150 bp are then submitted to GenBank and fed into the TardiBASE processing stream.
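
The trimming and length filter in steps 2 to 4 can be sketched roughly as below. This is only an illustration in Python, not the in-house pipeline: the function names, the idea of passing the good-quality region as start/end coordinates, the exact-prefix vector match and the minimum polyA run of 10 bases are all assumptions. Only the 150 bp threshold comes from the text above.

```python
import re

MIN_LENGTH = 150  # from step 4: only sequences longer than 150 bp are kept


def trim_read(seq, good_start, good_end, vector_prefix="", min_poly_a=10):
    """Trim one read (hypothetical helper, not part of PHRED).

    good_start/good_end: assumed coordinates of the good-quality region.
    vector_prefix: assumed vector/leader sequence expected at the 5' end.
    min_poly_a: assumed minimum run of 'A' treated as a polyA tail.
    """
    # Keep only the good-quality region.
    trimmed = seq[good_start:good_end].upper()
    # Remove a leading vector/leader sequence if it is present verbatim.
    if vector_prefix and trimmed.startswith(vector_prefix.upper()):
        trimmed = trimmed[len(vector_prefix):]
    # Remove a trailing run of at least min_poly_a 'A' bases.
    trimmed = re.sub(r"A{%d,}$" % min_poly_a, "", trimmed)
    return trimmed


def passes_length_filter(seq):
    """True if the trimmed sequence is longer than MIN_LENGTH bases."""
    return len(seq) > MIN_LENGTH
```

A real pipeline would locate vector sequence by alignment rather than by an exact prefix match; the exact match is used here only to keep the sketch short.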

Clustering sequences
Sequence information is collated for each species; in many cases several sequences will derive from the same gene. The next step is therefore to cluster the sequences on the basis of homology, creating groups of sequences that potentially derive from the same gene. This clustering is performed by a program developed in-house that is based on the 'BLAST' algorithm. In brief:
  1. Each sequence is compared, using 'BLAST', against a growing database (db) of sequences.
  2. If the sequence shows a significant match with another sequence in the db, it is assigned the same cluster number as the matching sequence.
  3. If the sequence does not show a significant match with any sequence in the db, it is assigned a new cluster number and added to the db.
  4. If the sequence matches two or more sequences in the db that do not share a common cluster number, the program attempts to merge the clusters.
  5. If merging is not possible because of sequence conflicts, the sequence is assigned the cluster number of its highest-scoring match, and the clusters are flagged for possible manual annotation at a later date.
Our program has the advantage over other clustering algorithms that cluster numbers are preserved across subsequent builds, which allows novel sequence data to be incorporated as they arise.
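
The five steps above can be sketched as follows. This is a minimal Python illustration, not the in-house program: a shared k-mer count stands in for a 'BLAST' score, `threshold` stands in for "significant match", and `can_merge` is a hypothetical callback representing the sequence-conflict check. Two further assumptions: every sequence is added to the db after it is assigned, and a merged cluster keeps the lowest of the merged cluster numbers.

```python
def shared_kmers(a, b, k=8):
    """Toy similarity score (assumption): number of k-mers shared by a and b."""
    kmers = lambda s: {s[i:i + k] for i in range(len(s) - k + 1)}
    return len(kmers(a) & kmers(b))


def cluster_sequences(seqs, score, threshold, can_merge=None,
                      assignment=None, db=None):
    """Incrementally assign cluster numbers to seqs (dict: id -> sequence).

    assignment/db may be passed in from an earlier build, so that existing
    cluster numbers are preserved. Returns (assignment, db, flagged).
    """
    assignment = dict(assignment or {})  # sequence id -> cluster number
    db = dict(db or {})                  # the growing database
    flagged = []                         # cluster sets noted for manual review
    next_cluster = max(assignment.values(), default=0) + 1

    for seq_id, seq in seqs.items():
        # Step 1: compare the sequence against everything already in the db.
        hits = [(score(seq, other), other_id) for other_id, other in db.items()]
        hits = [hit for hit in hits if hit[0] >= threshold]
        if not hits:
            # Step 3: no significant match, so start a new cluster.
            cluster = next_cluster
            next_cluster += 1
        else:
            hit_clusters = sorted({assignment[other_id] for _, other_id in hits})
            if len(hit_clusters) == 1:
                # Step 2: all matches agree on one cluster.
                cluster = hit_clusters[0]
            elif can_merge is None or can_merge(hit_clusters):
                # Step 4: merge the matched clusters under the lowest number.
                cluster = hit_clusters[0]
                for other_id in list(assignment):
                    if assignment[other_id] in hit_clusters:
                        assignment[other_id] = cluster
            else:
                # Step 5: keep the best-scoring match's cluster and flag it.
                cluster = assignment[max(hits)[1]]
                flagged.append(tuple(hit_clusters))
        assignment[seq_id] = cluster
        db[seq_id] = seq
    return assignment, db, flagged
```

Passing the `assignment` and `db` returned by one call into the next call mimics a subsequent build: existing sequences keep their cluster numbers unless a new sequence causes two clusters to merge.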

Cluster Assembly
The next step in the build process is to assemble the clusters into consensus sequences. This is achieved using the program 'CAP3' or, when this fails, the program 'PHRAP'.

Functional Analyses
Functional analysis of the clusters consists of:
  • Comparing each consensus sequence, using 'BLAST', against 3 databases:
    1. BLASTN vs. the non-redundant DNA db
    2. BLASTX vs. the non-redundant protein db
    3. BLASTN vs. dbEST
  • Protein translation using DECODER
  • Domain analysis using InterProScan

Importing the data into TardiBASE
TardiBASE resides in a PostgreSQL database on an Intel machine running Linux. It consists of a number of relational tables that can be queried using SQL. The data are imported by a number of simple Perl scripts, which extract sequence, library, expression, BLAST, protein sequence and protein domain information from the previously generated data. PHP (a server-side, cross-platform, HTML-embedded scripting language) is then used to create web-based forms that enable comprehensive searching of these data.
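
To illustrate what querying such relational tables looks like, here is a small sketch. It uses Python with an in-memory SQLite database purely as a stand-in for PostgreSQL, Perl and PHP. The table names, column names and rows are all invented for the example and do not reflect the actual TardiBASE schema.

```python
import sqlite3


def build_demo_db():
    """Create a toy database with hypothetical tables and made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE cluster (
            cluster_id INTEGER PRIMARY KEY,
            consensus  TEXT
        );
        CREATE TABLE est (
            est_id     TEXT PRIMARY KEY,
            cluster_id INTEGER REFERENCES cluster(cluster_id),
            library    TEXT
        );
        CREATE TABLE blast_hit (
            cluster_id INTEGER REFERENCES cluster(cluster_id),
            program    TEXT,
            subject    TEXT,
            evalue     REAL
        );
    """)
    conn.executemany("INSERT INTO cluster VALUES (?, ?)",
                     [(1, "ACGTACGT"), (2, "TTGACCAT")])
    conn.executemany("INSERT INTO est VALUES (?, ?, ?)",
                     [("est_a", 1, "lib1"), ("est_b", 1, "lib1"),
                      ("est_c", 2, "lib2")])
    conn.executemany("INSERT INTO blast_hit VALUES (?, ?, ?, ?)",
                     [(1, "blastx", "toy kinase", 1e-30),
                      (2, "blastx", "toy actin", 1e-12)])
    return conn


def clusters_with_hit(conn, keyword):
    """Return (cluster_id, number of ESTs, best e-value) for every cluster
    with a BLAST hit whose description contains keyword."""
    return conn.execute("""
        SELECT c.cluster_id, COUNT(DISTINCT e.est_id), MIN(b.evalue)
        FROM cluster c
        JOIN blast_hit b ON b.cluster_id = c.cluster_id
        LEFT JOIN est e ON e.cluster_id = c.cluster_id
        WHERE b.subject LIKE ?
        GROUP BY c.cluster_id
        ORDER BY c.cluster_id
    """, ("%" + keyword + "%",)).fetchall()
```

A web search form would run a parameterised query of this kind and render the returned rows; passing the keyword as a bound parameter rather than pasting it into the SQL string avoids SQL injection.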