
Genome Mapping
"A map of the Brugia genome by 1999"
Discussion document prepared by Mark Blaxter
for the 1997 Filarial Genome Project meeting
The proposed
strategy adopted is here.
1 What is a genome map?
A genome map is an ordered set of (cloned) DNA fragments representing
a genome or genome segment (chromosome).
The ultimate genome map is the complete sequence. This is being done
for a number of organisms, including the nematode Caenorhabditis
elegans. In practice, a map of cloned DNA fragments is constructed
first before sequencing is started, though for smaller genomes (up to
3 Mb) it is possible to generate complete genome sequence from random
shotgun clones. This approach has not been tried for larger
genomes.
2 Why a genome map for Brugia?
- to expedite the cloning of genomic copies of genes-of-interest
- to investigate gene structure
- to investigate gene order
- to allow direct comparison with C. elegans
- conservation of gene order at a local level?
- operons? Are genes of related function physically linked?
- long-range synteny?
- cloning of genes-of-potential-interest by synteny
3 Ways of constructing genome maps I: Restriction mapping
This is achievable for small genomes: viruses (kb), bacteria (Mb),
yeasts (15 Mb) and protozoan chromosomes (0.2 to 7 Mb each, 35 Mb in
total).
For the upper end of this size range, it is necessary to use
rare-cutting enzymes - either 8-base pair recognition site enzymes or
6-base pair recognition site enzymes where the base compositional
bias of the genome makes the site rare.
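The effect of site length and base composition on cutting frequency can be estimated with a short calculation. The sketch below assumes that bases occur independently at frequencies set only by the overall GC content, which is a simplification; the function name and the 30% GC figure are illustrative assumptions, not values from this document. Not I (GCGGCCGC) and Sma I (CCCGGG) are used as examples of an 8-base and a GC-rich 6-base recognition site.

```python
def expected_site_spacing(site, gc_content=0.5):
    """Expected mean distance (bp) between occurrences of a recognition site.

    Assumes independent bases: G and C each occur with probability
    gc_content / 2, A and T each with probability (1 - gc_content) / 2.
    """
    p_gc = gc_content / 2.0
    p_at = (1.0 - gc_content) / 2.0
    prob = 1.0
    for base in site.upper():
        prob *= p_gc if base in "GC" else p_at
    return 1.0 / prob


# An 8-base site in a genome with no compositional bias
print(expected_site_spacing("GCGGCCGC"))
# A GC-rich 6-base site in a hypothetical AT-rich (30% GC) genome
print(expected_site_spacing("CCCGGG", gc_content=0.3))
```

Under this model a GC-rich 6-base site becomes much rarer as the genome becomes more AT-rich, which is the situation described above for 6-base enzymes in compositionally biased genomes.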
For the larger genomes, hybridisation with either random (sequence
tagged site, or STS) or selected gene probes is used to correlate
fragments produced by different enzymes.
Often a restriction map is constructed in conjunction with a map made
by other means, and the rare-cutting sites are then useful as genetic
markers. Linking clones (ones which contain a site for the enzyme
being used) can be isolated using a strong positive selection
procedure.
4 Ways of constructing genome maps II: Cloned DNA
Genomic DNA can be cloned in a number of different vector systems. In
practice four different types of vectors are of utility.
- lambda phage
- these can carry up to 25 kb of foreign DNA
- cosmids
- these can carry up to 40 kb of foreign DNA
- bacterial artificial chromosomes or BACs, P1 phage
- these can carry from 80 to 150 kb of foreign DNA
- yeast artificial chromosomes or YACs
- these can carry from 100 to >2000 kb of foreign DNA
- (Burke et al., 1987; Coulson et al., 1988)
A genomic clone library is characterised by two parameters:
- the mean insert size
- the "depth" of the library.
The depth is a measure of the number of "genome equivalents"
contained within the library. Thus one genome equivalent of the B.
malayi genome (100 Mb) in a 100 kb mean insert size BAC library is
1000 clones (this is referred to as a 1x library). Usually libraries
of a depth greater than 5x are used.
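The depth calculation can be written out as a small sketch. The function names are illustrative only; the figures are the ones used in this document (a 100 Mb genome and 100 kb BAC inserts).

```python
def library_depth(n_clones, mean_insert_kb, genome_mb):
    """Genome equivalents held in a library (1.0 means a 1x library)."""
    return n_clones * mean_insert_kb / (genome_mb * 1000.0)


def clones_for_depth(depth, mean_insert_kb, genome_mb):
    """Number of clones needed to reach a given depth."""
    return round(depth * genome_mb * 1000.0 / mean_insert_kb)


print(library_depth(1000, 100, 100))   # 1000 BACs of 100 kb on a 100 Mb genome
print(clones_for_depth(5, 100, 100))   # clones needed for a 5x BAC library
```

Depth is an average: because clones are sampled at random, a 1x library still leaves parts of the genome uncloned, which is one reason deeper libraries are preferred.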
Not all DNA is equally clonable, and different cloning systems have
different sensitivities to, for example, repetitive DNA or DNA of
marked composition bias. These difficult-to-clone segments of the
genome result in "holes" or gaps in the map. Some gaps may be
statistical artefacts: the DNA is clonable, but the clone has not yet
been isolated.
One significant gap in most types of library is that the telomeres
are not efficiently cloned. Specialised telomeric libraries must be
constructed for this purpose.
5 Ways of constructing genome maps III: Fingerprinting
This is valid for lambda, cosmid, BAC/P1 and smaller YAC clones. In
practice cosmids have been used, but the technology is applicable to
other clone types, particularly BACs.
The fingerprint is generated by recording the pattern of restriction
fragments produced by the insert of each clone. The fragments are
labelled with 32-P and separated on sequencing-type gels to visualise
the fingerprint. The fingerprint is thus based on the sequence of the
clone insert, but sampled at low density. The restriction and
labelling regime followed depends critically on the AT/GC content of
the genome.
Example: the C. elegans genome map (Alan Coulson et al.)
(Coulson et al., 1988; Sulston et al., 1992; Wilson et al., 1994)
The C. elegans genome project used cosmids and a Hind
III/Sau3A I fingerprinting method. 17,000 cosmids (a 6.8x library)
were miniprepped in 96 well format, and digested to completion with
Hind III. The Hind III sites were partially infilled with radioactive
32-P, and redigested with Sau3A I. The samples were denatured and
analysed on sequencing-type gels. The conditions used generate an
average of 30 bands per clone. Each clone lane was digitised and
stored in a database. Matching software was then used to find clones
which shared a significant number of bands (~ one third) and to
predict sets of putatively overlapping clones, or contigs. This procedure
resulted in ~95-98% of the genome being cloned and organised into 500
contigs. As C. elegans only has six chromosomes, these contigs
obviously do not represent "chromosomes". The map was completed (or
"closure was achieved") by (i) anchoring contigs to the chromosomes
by hybridising cloned genes whose chromosomal location was known;
(ii) actively searching for cosmid or lambda clones which bridged the
gaps; and, most successfully, (iii) the use of YAC clones to link the
contigs. Steps (ii) and (iii) were accomplished by hybridising whole
cosmids or cosmid end probes to grids of clones.
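The band-matching step can be illustrated with a minimal sketch. This is not the matching software used by the C. elegans project; it is a simple threshold test that assumes each fingerprint is a list of band positions in arbitrary gel units, that two bands match when they lie within a small tolerance, and that the shared fraction is measured against the smaller fingerprint. Only the one-third threshold comes from the text above; the function names, tolerance and example values are assumptions.

```python
def shared_bands(fp_a, fp_b, tolerance=1):
    """Count bands of fp_a that pair with a band of fp_b within `tolerance`.

    Both fingerprints are walked in sorted order, and each band is used
    in at most one pair.
    """
    a, b = sorted(fp_a), sorted(fp_b)
    i = j = shared = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tolerance:
            shared += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return shared


def probably_overlap(fp_a, fp_b, min_fraction=1 / 3, tolerance=1):
    """Flag two clones as putatively overlapping if enough bands are shared."""
    smaller = min(len(fp_a), len(fp_b))
    if smaller == 0:
        return False
    return shared_bands(fp_a, fp_b, tolerance) / smaller >= min_fraction


clone_a = [100, 150, 210, 300, 420, 515]
clone_b = [211, 299, 421, 600, 700, 810]
clone_c = [50, 900, 950]
print(probably_overlap(clone_a, clone_b))
print(probably_overlap(clone_a, clone_c))
```

A fixed threshold ignores the chance of coincidental matches, which grows with the number of bands per lane, so a real project would need a more careful scoring scheme than this.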
Example: the Leishmania major map (Al Ivens et al.)
(unpublished, but see Ivens and Little, 1995)
Leishmania have a genome of 35 Mb, one third the size of that of C.
elegans. 10,000 cosmids were fingerprinted (an 11x library) and used
to generate a contig dataset as for C. elegans. In the Leishmania
project, a Hinf I digestion fingerprinting method was used, and
samples were run on nondenaturing gels. The map generated is
incomplete: many regions of the genome appear to be poorly clonable,
and the gene-rich segments are severely overrepresented (40 clones in
some cases). This problem is being addressed by hybridisation of end
probes and EST cDNAs to the gridded library.
6 Ways of constructing genome maps IV: STS Content Mapping
Sequence tagged sites are regions of the genome which can be uniquely
identified through their sequence. Originally this meant that they
had been sequenced, and identification was by PCR using primers
specific for the sequence. It is theoretically possible to use random
primers and tag clones by their "RAPD" content, but this has
analytical problems similar to those encountered in fingerprinting.
The term now includes the identification of STS by hybridisation of a
defined probe (which does not have to be sequenced). Mapping using
STS involves identifying the clones which contain a particular
sequence: if the sequence is single copy these must overlap. It is
formally similar to fingerprinting, but the data analysis is more
plus-minus than the quantitative matching required for fingerprints.
The problems are (i) repetitive sequences (which will give
multifurcating contigs), (ii) false positive scoring (which will give
bifurcations) and (iii) false negative scoring (which will result in
failed linking).
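The plus-minus nature of STS content data makes contig building a grouping problem: any two clones that score positive for the same single-copy STS are placed in the same contig. The sketch below shows this with a union-find structure. It is an illustration under simple assumptions (every probe is single copy, every score is correct), and the function name and clone labels are hypothetical.

```python
def contigs_from_sts(hits):
    """Group clones into contigs from STS content data.

    `hits` maps each STS probe to the set of clones it detects. Clones
    that share a probe are assumed to overlap and are merged.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for clones in hits.values():
        clones = sorted(clones)
        for clone in clones:
            find(clone)                     # register every clone
        for other in clones[1:]:
            parent[find(other)] = find(clones[0])

    groups = {}
    for clone in parent:
        groups.setdefault(find(clone), set()).add(clone)
    return sorted(sorted(group) for group in groups.values())


hits = {
    "sts1": {"A", "B"},
    "sts2": {"B", "C"},
    "sts3": {"D", "E"},
    "sts4": {"F"},
}
print(contigs_from_sts(hits))
```

The same structure shows why scoring errors matter: a single false positive (for example a hypothetical probe scoring on both "C" and "D") would wrongly merge two contigs, while a false negative would leave a true overlap unlinked.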
For hybridisation probes one can use
- (i) ESTs or other cDNAs [these have the advantage in
intron-containing eukaryotes of painting a genome segment
significantly longer than the cloned probe]
- (ii) other cloned sites [either randomly selected clones,
clones selected on the basis of polymorphisms or available cloned
genes]
- (iii) end-probes derived from the mapping clones [and
generated by single-sided PCR or anchored PCR from vector sites]
- (iv) whole mapping clones [whole cosmids have been used; this
approach is very sensitive to the presence of repeats]
- (v) defined sequence oligomers. Hybridisation of defined
sequence primers (where the sequence (20 bp) is selected randomly)
is predicted to work as well as the other longer probes but has
not been applied in extensive mapping projects.
The STS content mapping approach is applicable to all clone types.
It has been used on the longer (YAC, BAC/P1) ones in general.
There are two main "flavours" of hybridisation probe selection.
Random selection involves using whatever probes are available. In the
case of end-probes, this means generating probes from all of the
clones of a library. If done exhaustively, this will mean that all
possible links between all the clones will have been assessed, and
the best possible map generated. For a clone library of respectable
size (say 5000 clones for Brugia) this is 10,000 hybridisations. The
workload is similar to that involved in fingerprinting.
Directed probe selection is theoretically and practically a much more
efficient way of proceeding. It comes in two flavours, the "genome
walk" (or even "... crawl") and sampling-without-replacement. In a
walk, end probes are generated from a seed clone, and hybridising
clones identified. The contig is assembled, and clones predicted to
be outliers identified. End probes are made from these (two per
clone) and the library rescreened. This process is repeated until a
chromosome end or a gap is encountered, when a new seed is selected.
Progress is slow because of the iterative nature of the procedure,
and the walk progresses by approximately 0.5 x clone length per step.
Thus for a genome of 100 Mb and 100 kb clones, this is 2000
hybridisations, if everything works.
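The two workload estimates given so far can be reproduced directly. The function names below are illustrative; the inputs are the figures quoted in the text (5000 clones with two end probes each, and a walk advancing by about half a clone length per step across a 100 Mb genome).

```python
def random_end_probe_hybridisations(n_clones, probes_per_clone=2):
    """Exhaustive random approach: every clone end is used as a probe."""
    return n_clones * probes_per_clone


def walk_hybridisations(genome_kb, clone_kb, advance_fraction=0.5):
    """Genome walk: one hybridisation per step, each step advancing by
    advance_fraction of a clone length."""
    return round(genome_kb / (clone_kb * advance_fraction))


print(random_end_probe_hybridisations(5000))
print(walk_hybridisations(genome_kb=100_000, clone_kb=100))
```

The walk needs fewer hybridisations in total, but each step depends on the result of the previous one, so the steps cannot be run in parallel.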
Sampling Without Replacement
(Hoheisel et al., 1993; Mizukami et al., 1993; Palazzolo et al., 1991)
Sampling without replacement describes a procedure where a seed clone
is used to generate end probes and the probes used to tag clones in
the library. A second seed is then selected from the subset of clones
which has not yet been tagged. This procedure is repeated until all
clones have been tagged. The process has been modelled on a 100 Mb
genome using a random 5x coverage library of 100 kb BACs. Initially,
a large number of short (~2.5 x clone length) contigs are generated
(reaching a maximum of ~250 contigs after 800 probings), but these
are then linked in the latter third of the effort into fewer (160)
larger (mean 600 kb or 6x clone size) contigs.
In simulations and in practice (the genome of Schizosaccharomyces
pombe) the sampling without replacement technique is powerful and
efficient. While the random approach requires testing of 5000 clones
(2 probes per clone) the sampling without replacement method will
need only 1500 such probings to generate a closed library map. At
this point, the s-w-r method will have generated (modelled on a 100
Mb genome and given a good 100 kb BAC library with no significant
bias in coverage and representation) 160 contigs of mean size 600 kb,
while the random approach will have generated 500 contigs of 210
kb.
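The procedure can be explored with a toy simulation. The sketch below is not the model behind the figures quoted above, and its output should not be expected to reproduce them. It assumes equal-length clones placed uniformly at random, perfect hybridisation with no repeats, and that the two end probes of a seed tag every clone overlapping the seed. All names and defaults are illustrative.

```python
import bisect
import random


def simulate_swr(genome_kb=100_000, clone_kb=100, depth=5, rng_seed=0):
    """Toy sampling-without-replacement run.

    Returns (number of seed clones probed, number of contigs found).
    """
    rng = random.Random(rng_seed)
    n_clones = depth * genome_kb // clone_kb
    starts = sorted(rng.uniform(0, genome_kb - clone_kb) for _ in range(n_clones))

    parent = list(range(n_clones))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    tagged = [False] * n_clones
    seeds_used = 0
    order = list(range(n_clones))
    rng.shuffle(order)   # visiting in random order picks seeds at random
    for seed in order:
        if tagged[seed]:
            continue     # tagged clones are never used as seeds
        seeds_used += 1
        # With equal-length clones, the two end probes hit every clone whose
        # start lies within one clone length of the seed's start.
        lo = bisect.bisect_left(starts, starts[seed] - clone_kb)
        hi = bisect.bisect_right(starts, starts[seed] + clone_kb)
        for hit in range(lo, hi):
            tagged[hit] = True
            parent[find(hit)] = find(seed)

    contigs = len({find(i) for i in range(n_clones)})
    return seeds_used, contigs


seeds, contigs = simulate_swr()
print(seeds * 2, "end-probe hybridisations,", contigs, "contigs")
```

Because every seed must lie more than one clone length from all earlier seeds, the number of seeds is bounded by the genome size divided by the clone length, whatever the library depth. This is the source of the saving over the exhaustive random approach.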
For the sample-without-replacement approach, contig gap closure can
be achieved using end-probes generated from clones predicted to be
outliers in the contigs. The S. pombe experience was that most of the
gaps they tried to close were "real" in the 5x library used (ie the
DNA was not cloned), and they recommend moving to larger insert size
libraries at this stage.
7 The Brugia genome: Current Status and Proposals
The Brugia genome is 100 Mb in size, arranged in 5 chromosome pairs.
If the contour length of the chromosomes reflects their DNA content,
the smallest chromosomes (a pair of small autosomes) are in the region
of 10 Mb and the largest (a putative X chromosome) is in the region of
40 Mb.
- The Hha I repeat (30,000 copies of 322 bases)
makes up 9.7 Mb of the genome (9.7%: for a 100 Mb genome, Mb and %
are the same number). The Hha I repeat is arranged in approximately
10 tandem arrays, or approximately one per telomere. By chromosomal
in situ hybridisation, all chromosomes carry one or more copies of
the array.
- The ribosomal RNA cistron (lsu and ssu) (300
copies[?] of an 8 kb repeat) makes up 2.4 Mb.
- The 5S RNA/SL1 RNA repeat (100 copies[?] of a
0.6 kb repeat) makes up 0.06 Mb.
We have 7500 EST sequences (at ~350 bases/EST this is
~2.6 Mb of determined sequence).
- The ESTs probably derive from 4000 different
transcripts (at ~400 bases/transcript this is 1.6 Mb of unique
sequence).
Genes are likely to be spaced at 5-6 kb intervals,
and will occupy one third of the genomic DNA (based on C. elegans
sequence data, limited Brugia genomic sequence and the calculation:
[100 Mb genome - 10 Mb repeats] / 15000 genes = 6
kb/gene).
- If our ESTs derive from genes with the expected
number of introns then the 4000 genes correspond to
(optimistically) 20 Mb of the genome.
- 4000 ESTs correspond to an average EST marker
spacing of one per 25 kb, or four per 100 kb BAC.
- Most BACs should hybridise to at least one EST.
In a 5x library each EST should hybridise to 5-15 BACs.
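The arithmetic behind these estimates can be checked in a few lines. All inputs are the figures quoted above (including the copy numbers marked as uncertain); the variable names are illustrative.

```python
GENOME_MB = 100.0

hha_mb = 30_000 * 322 / 1e6          # Hha I repeat
rrna_mb = 300 * 8_000 / 1e6          # rRNA cistron (copy number uncertain)
rna_5s_mb = 100 * 600 / 1e6          # 5S RNA/SL1 RNA repeat (copy number uncertain)
est_sequence_mb = 7_500 * 350 / 1e6  # total determined EST sequence
unique_est_mb = 4_000 * 400 / 1e6    # unique transcript sequence

kb_per_gene = (GENOME_MB - 10) * 1000 / 15_000   # 10 Mb of repeats removed
est_spacing_kb = GENOME_MB * 1000 / 4_000        # one EST marker per N kb
ests_per_bac = 100 / est_spacing_kb              # 100 kb BAC inserts

print(hha_mb, rrna_mb, rna_5s_mb)
print(est_sequence_mb, unique_est_mb)
print(kb_per_gene, est_spacing_kb, ests_per_bac)
```

The marker spacing is an average over the whole genome. The estimate that most BACs will carry at least one EST assumes the genes are spread fairly evenly rather than clustered.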
References:
Burke, D. T., Carle, G. F., and Olson, M. V. (1987). Cloning of large
segments of exogenous DNA into yeast by means of artificial chromosome
vectors. Science 236, 806-812.
Coulson, A., Waterston, R., Kiff, J., Sulston, J., and Kohara, Y.
(1988). Genome linking with yeast artificial chromosomes. Nature 335,
184-186.
Hoheisel, J. D., Maier, E., Mott, R., McCarthy, L., Grigoriev, A. V.,
Schalkwyk, L. C., Nizetic, D., Francis, F., and Lehrach, H. (1993).
High resolution cosmid and P1 maps spanning the 14 Mb genome of the
fission yeast S. pombe. Cell 73, 109-120.
Ivens, A. C., and Little, P. F. R. (1995). Cosmid clones and their
application to genome studies. In Genome Analysis: A Practical
Approach (IRL Press), pp. 1-47.
Kim, U.-J., Shizuya, H., Kang, H.-L., Choi, S.-S., Garrett, C. L.,
Smink, L. J., Birren, B. W., Korenberg, J. R., Dunham, I., and Simon,
M. I. (1996). A bacterial artificial chromosome-based framework
contig map of human chromosome 22q. Proceedings of the National
Academy of Sciences, USA 93, 6297-6301.
Mizukami, T., Chang, W. I., Garkavtsev, I., Kaplan, N., Lombardi, D.,
Matsumoto, T., Niwa, O., Kounosu, A., Yanagida, M., Marr, T. G., and
Beach, D. (1993). A 13 kb resolution cosmid map of the 14 Mb fission
yeast genome by nonrandom sequence-tagged site mapping. Cell 73,
121-132.
Palazzolo, M. J., Sawyer, S. A., Martin, C. H., Smoller, D. A., and
Hartl, D. L. (1991). Optimised strategies for sequence-tagged site
selection in genome mapping. Proceedings of the National Academy of
Sciences USA 88, 8034-8038.
Sulston, J., Du, Z., Thomas, K., Wilson, R., Hillier, L., Staden, R.,
Halloran, N., Green, P., Thierry-Mieg, J., Qiu, L., Dear, S.,
Coulson, A., Craxton, M., Durbin, R., Berks, M., Metzstein, M.,
Hawkins, T., Ainscough, R., and Waterston, R. (1992). The C. elegans
genome sequencing project: A beginning. Nature 356, 37-41.
Wilson, R., Ainscough, R., Anderson, K., Baynes, C., Berks, M.,
Bonfield, J., Burton, J., Connell, M., Copsey, T., Cooper, J.,
Coulson, A., Craxton, M., Dear, S., Du, Z., Durbin, R., Favello, A.,
Fraser, A., Fulton, L., Gardner, A., Green, P., Hawkins, T., Hillier,
L., Jier, M., Johnston, L., Jones, M., Kershaw, J., Kirsten, J.,
Laisster, N., Latrielle, P., Lightning, J., Lloyd, C., Mortimore, B.,
O'Callaghan, M., Parsons, J., Percy, C., Rifken, L., Roopra, A.,
Saunders, D., Shownkeen, R., Sims, M., Smaldon, N., Smith, A., Smith,
M., Sonnhammer, E., Staden, R., Sulston, J., Thierry-Mieg, J.,
Thomas, K., Vaudin, M., Vaughan, K., Waterston, R., Watson, A.,
Weinstock, L., Wilkinson-Sproat, J., and Wohldman, P. (1994). 2.2 Mb
of contiguous nucleotide sequence from chromosome III of C. elegans.
Nature 368, 32-38.