| Building TardiBASE |
TardiBASE is constructed using publicly available Hypsibius dujardini DNA sequence data
deposited in GenBank. This currently includes only Expressed Sequence Tags (ESTs) generated
by our in-house EST sequencing initiative.
|
Processing sequences in-house
Sequences generated by our in-house research initiative are processed in the following
manner:
- Raw sequence reads (trace files) are initially processed using the software program 'PHRED'.
- Quality information provided by the '-trim' option of 'PHRED' is then used
to trim the sequences to regions of good quality.
- These sequences are then trimmed for vector/leader sequences and
polyA tails.
- Sequences longer than 150 bp are then submitted to GenBank and
fed into the TardiBASE processing stream.
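
The trimming and length filter described above can be sketched in Python. This is a minimal illustration, not the in-house pipeline: the function names, the quality cutoff of 20, the minimum polyA run of 10 and the exact-match vector test are all assumptions made for the example, and the quality trim is a simplified stand-in for what 'PHRED' actually does. Only the 150 bp length threshold comes from the text.

```python
from typing import List, Optional


def quality_trim(seq: str, quals: List[int], min_q: int = 20) -> str:
    """Keep the longest contiguous run of bases with quality >= min_q.

    Simplified stand-in for PHRED-based trimming (assumption).
    """
    best_start, best_len = 0, 0
    start = None
    # A sentinel of -1 closes a run that reaches the end of the read.
    for i, q in enumerate(list(quals) + [-1]):
        if q >= min_q:
            if start is None:
                start = i
        elif start is not None:
            if i - start > best_len:
                best_start, best_len = start, i - start
            start = None
    return seq[best_start:best_start + best_len]


def strip_vector(seq: str, vector: str) -> str:
    """Remove a known leading vector/leader sequence (exact match only)."""
    if vector and seq.upper().startswith(vector.upper()):
        return seq[len(vector):]
    return seq


def strip_poly_a(seq: str, min_run: int = 10) -> str:
    """Remove a trailing run of at least min_run 'A' bases."""
    stripped = seq.rstrip("Aa")
    return stripped if len(seq) - len(stripped) >= min_run else seq


def process_read(seq: str, quals: List[int], vector: str = "",
                 min_len: int = 150) -> Optional[str]:
    """Quality trim, then vector and polyA trim, then apply the length filter.

    Returns the trimmed sequence, or None if it is not longer than min_len.
    """
    trimmed = quality_trim(seq, quals)
    trimmed = strip_vector(trimmed, vector)
    trimmed = strip_poly_a(trimmed)
    return trimmed if len(trimmed) > min_len else None
```

A read that comes back as None would be dropped here rather than submitted or passed on to clustering.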
|
Clustering sequences
Sequence information is collated for each species. In many cases
several sequences will derive from the same gene. The next step is therefore
to cluster the sequences on the basis of homology to create groups
of sequences which potentially represent the same gene. This clustering
process is undertaken using a program developed in-house based on
the 'BLAST' algorithm. In brief:
- Each sequence is 'BLAST'd against a growing database (db) of sequences.
- If the sequence shows a significant match with another sequence
in the db, it is assigned the same cluster number as the matching
sequence.
- If the sequence does not show a significant match with any sequence
in the db, it is assigned a new cluster number and added to the db.
- If the sequence matches two or more sequences in the db which do
not share a common cluster number, the program will attempt to merge
the two clusters together.
- If merging is not possible, due to sequence conflicts,
the sequence is assigned the cluster number of the sequence for which it
had the highest scoring match and the clusters are noted for possible
human annotation at a later date.
Our program has the advantage over other clustering algorithms
that cluster numbers are preserved over subsequent builds (this allows
any novel sequence data which may arise to be incorporated without
renumbering existing clusters).
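
The incremental clustering rules above can be sketched as follows. This is an illustration under stated assumptions, not the in-house program: a shared k-mer count (`kmer_score`) stands in for a BLAST score, the `min_score` threshold is arbitrary, `can_merge` is a hypothetical hook for the "sequence conflicts" check, and every sequence is added to the db so later sequences can match it. Keeping the lowest cluster number on a merge is also an assumption.

```python
from typing import Callable, Dict, List, Set, Tuple


def kmer_score(a: str, b: str, k: int = 8) -> int:
    """Stand-in for a BLAST score: number of distinct k-mers shared by a and b."""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    return len(kmers_a & kmers_b)


class Clusterer:
    def __init__(self, min_score: int = 5,
                 can_merge: Callable[[Set[int]], bool] = lambda clusters: True):
        self.min_score = min_score
        self.can_merge = can_merge            # hypothetical conflict check
        self.db: Dict[str, str] = {}          # sequence id -> sequence
        self.cluster_of: Dict[str, int] = {}  # sequence id -> cluster number
        self.next_cluster = 1                 # numbers are never reused
        self.flagged: List[Tuple[int, ...]] = []  # clusters needing manual review

    def add(self, seq_id: str, seq: str) -> int:
        """Assign seq to a cluster and return its cluster number."""
        hits = [(kmer_score(seq, s), sid) for sid, s in self.db.items()]
        hits = [(score, sid) for score, sid in hits if score >= self.min_score]
        if not hits:
            # No significant match: start a new cluster.
            cluster = self.next_cluster
            self.next_cluster += 1
        else:
            clusters = {self.cluster_of[sid] for _, sid in hits}
            best_cluster = self.cluster_of[max(hits)[1]]
            if len(clusters) == 1:
                cluster = best_cluster
            elif self.can_merge(clusters):
                # Merge: every member takes the lowest existing number.
                cluster = min(clusters)
                for sid, c in list(self.cluster_of.items()):
                    if c in clusters:
                        self.cluster_of[sid] = cluster
            else:
                # Conflict: follow the best hit and flag for annotation.
                cluster = best_cluster
                self.flagged.append(tuple(sorted(clusters)))
        self.db[seq_id] = seq
        self.cluster_of[seq_id] = cluster
        return cluster
```

Because `next_cluster` only ever increases and the `cluster_of` mapping can be kept between runs, adding new sequences in a later build leaves existing cluster numbers untouched unless a merge occurs.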
|
Cluster Assembly
The next step in the build process is to assemble the clusters into
consensus sequences. This is achieved using the program 'CAP3' or,
when this fails, the program 'PHRAP'.
|
Functional Analyses
Functional analysis of the clusters is done by:
- Blasting each consensus sequence against 3 databases:
  - BlastN vs. non-redundant DNA db
  - BlastX vs. non-redundant protein db
  - BlastN vs. dbEST
- Protein translation using DECODER
- Domain analysis using InterProScan
|
Importing the data into TardiBASE
TardiBASE resides in a PostgreSQL database on an Intel box running
Linux. It exists in the form of a number of relational tables which
may be queried using SQL. The data is imported using a
number of simple Perl scripts which obtain sequence, library,
expression, BLAST, protein sequence and protein domain information
from the previously generated data.
PHP (a server-side, cross-platform, HTML embedded scripting language)
is then used to create web-based forms which enable comprehensive
searching of this data.
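
To illustrate the kind of relational layout and import step described above, here is a minimal sketch. It is not the TardiBASE schema: the table names, column names and helper functions are hypothetical, only three of the data types are modelled, and Python's built-in sqlite3 module is used as a stand-in for PostgreSQL so that the example is self-contained.

```python
import sqlite3

# Hypothetical, heavily reduced schema (assumption, not the real tables).
# Note: SQLite does not enforce REFERENCES unless foreign keys are enabled.
SCHEMA = """
CREATE TABLE library (
    lib_id INTEGER PRIMARY KEY,
    name   TEXT NOT NULL
);
CREATE TABLE est (
    est_id     TEXT PRIMARY KEY,
    lib_id     INTEGER NOT NULL REFERENCES library(lib_id),
    cluster_id INTEGER NOT NULL,
    sequence   TEXT NOT NULL
);
CREATE TABLE blast_hit (
    cluster_id      INTEGER NOT NULL,
    program         TEXT NOT NULL,
    target_db       TEXT NOT NULL,
    hit_description TEXT,
    e_value         REAL
);
"""


def create_db() -> sqlite3.Connection:
    """Create an in-memory database holding the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def import_library(conn, lib_id, name):
    conn.execute("INSERT INTO library VALUES (?, ?)", (lib_id, name))


def import_est(conn, est_id, lib_id, cluster_id, sequence):
    conn.execute("INSERT INTO est VALUES (?, ?, ?, ?)",
                 (est_id, lib_id, cluster_id, sequence))


def import_blast_hit(conn, cluster_id, program, target_db, description, e_value):
    conn.execute("INSERT INTO blast_hit VALUES (?, ?, ?, ?, ?)",
                 (cluster_id, program, target_db, description, e_value))


def search_hits(conn, term):
    """Return cluster numbers whose BLAST hit descriptions contain term."""
    rows = conn.execute(
        "SELECT DISTINCT cluster_id FROM blast_hit "
        "WHERE hit_description LIKE ? ORDER BY cluster_id",
        (f"%{term}%",),
    )
    return [row[0] for row in rows]
```

A web search form would ultimately issue a parameterised query of the same shape as the one in `search_hits`, passing the user's search term as the bound value rather than pasting it into the SQL string.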
|