Genome Mapping

"A map of the
Brugia genome by 1999"

discussion document prepared by Mark Blaxter
for the 1997 Filarial Genome Project meeting

The proposed strategy adopted is here.




[ what is a map | why a map? | restriction maps | clone types | fingerprinting | STS content | sample-without-replacement | the genome | references ]


1 What is a genome map?

an ordered set of (cloned) DNA fragments representing a genome or genome segment (chromosome)

The ultimate genome map is the complete sequence. This is being determined for a number of organisms, including the nematode Caenorhabditis elegans. In practice, a map of cloned DNA fragments is constructed before sequencing is started, though for smaller genomes (up to 3 Mb) it is possible to generate the complete genome sequence from random shotgun clones. This approach has not been tried for larger genomes.


2 Why a genome map for Brugia?


3 Ways of constructing genome maps I: Restriction mapping.

This is achievable for small genomes: viruses (kb), bacteria (Mb), yeasts (15 Mb) and protozoan chromosomes (0.2 to 7 Mb, 35 Mb total).
For the upper end of this size range, it is necessary to use rare-cutting enzymes - either 8-base pair recognition site enzymes or 6-base pair recognition site enzymes where the base compositional bias of the genome makes the site rare.
For the larger genomes, hybridisation with either random (sequence tagged site or STS) or selected gene probes is used to correlate fragments produced by different enzymes.
Often a restriction map is constructed in conjunction with a map made by other means, and the rare-cutting sites are then useful as genetic markers. Linking clones (ones which contain a site for the enzyme being used) can be isolated using a strong positive selection procedure.
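The point about rare-cutting enzymes and base compositional bias can be illustrated with a rough calculation. The sketch below estimates the expected spacing between recognition sites, assuming that bases occur independently with frequencies set only by the GC content (a simplification that ignores dinucleotide bias). The function name is hypothetical; EcoRI (GAATTC) and NotI (GCGGCCGC) are used only as familiar examples of 6-base and 8-base sites.

```python
def expected_site_spacing(site: str, gc_content: float) -> float:
    """Expected mean distance (bp) between occurrences of a recognition site.

    Assumes independent bases: G and C each occur with frequency
    gc_content / 2, A and T each with (1 - gc_content) / 2.
    """
    probability = 1.0
    for base in site.upper():
        if base in "GC":
            probability *= gc_content / 2
        elif base in "AT":
            probability *= (1 - gc_content) / 2
        else:
            raise ValueError(f"unexpected base {base!r} in site")
    return 1 / probability


# A 6-base site versus an 8-base site in a genome with balanced composition.
six_cutter = expected_site_spacing("GAATTC", 0.5)
eight_cutter = expected_site_spacing("GCGGCCGC", 0.5)

# The same GC-only 8-base site becomes much rarer in an AT-rich genome.
eight_cutter_at_rich = expected_site_spacing("GCGGCCGC", 0.3)
```

Under these assumptions a 6-base site is expected about every 4 kb and an 8-base site about every 65 kb at 50% GC, and a GC-only site is rarer still in an AT-rich genome.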


4 Ways of constructing genome maps II: Cloned DNA

Genomic DNA can be cloned in a number of different vector systems. In practice four different types of vectors are of utility.

A genomic clone library is characterised by two parameters:

The depth is a measure of the number of "genome equivalents" contained within the library. Thus one genome equivalent of the B. malayi genome (100 Mb) in a 100 kb mean insert size BAC library is 1000 clones (this is referred to as a 1x library). Usually libraries of a depth greater than 5x are used.
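The depth arithmetic above can be written out as a short sketch. The function names are illustrative only; the figures are the ones quoted in the text (a 100 Mb genome and 100 kb mean insert size).

```python
MB = 1_000_000
KB = 1_000


def clones_per_genome_equivalent(genome_bp: int, mean_insert_bp: int) -> float:
    """Number of clones whose inserts sum to one genome length (a 1x library)."""
    return genome_bp / mean_insert_bp


def library_depth(n_clones: int, mean_insert_bp: int, genome_bp: int) -> float:
    """Depth of a library in genome equivalents."""
    return n_clones * mean_insert_bp / genome_bp


# B. malayi figures from the text: 100 Mb genome, 100 kb BAC inserts.
one_x = clones_per_genome_equivalent(100 * MB, 100 * KB)   # clones in a 1x library
depth_of_5000 = library_depth(5000, 100 * KB, 100 * MB)    # depth of a 5000-clone library
```

With these inputs a 1x library is 1000 clones, and 5000 clones give a 5x library.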

Not all DNA is equally clonable, and different cloning systems have different sensitivities to, for example, repetitive DNA or DNA of marked composition bias. These difficult-to-clone segments of the genome result in "holes" or gaps in the map. Some gaps may be statistical artefacts: the DNA is clonable, but the clone has not yet been isolated.
One significant gap in most types of library is that the telomeres are not efficiently cloned. Specialised telomeric libraries must be constructed for this purpose.
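The "statistical" gaps mentioned above can be estimated with a standard approximation for randomly sampled libraries. This estimate is not taken from the discussion document, and it assumes that every segment of the genome is equally clonable, which the text notes is not true in practice. The function names are hypothetical.

```python
import math


def fraction_uncloned(n_clones: int, insert_bp: int, genome_bp: int) -> float:
    """Probability that a given point in the genome is in none of the clones,
    assuming clones are drawn uniformly at random."""
    return (1 - insert_bp / genome_bp) ** n_clones


def clones_for_coverage(p_covered: float, insert_bp: int, genome_bp: int) -> int:
    """Clones needed for a given point to be cloned with probability p_covered,
    under the same random-sampling assumption."""
    return math.ceil(math.log(1 - p_covered) / math.log(1 - insert_bp / genome_bp))


# A 5x library of 100 kb clones for a 100 Mb genome (5000 clones).
missing_at_5x = fraction_uncloned(5000, 100_000, 100_000_000)

# Clones needed for a 99% chance that any given point is represented.
needed_for_99 = clones_for_coverage(0.99, 100_000, 100_000_000)
```

Under this model a little under 1% of the genome is expected to be absent from a 5x library by chance alone; gaps beyond that level are more likely to reflect DNA that is genuinely difficult to clone.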


5 Ways of constructing genome maps III: Fingerprinting

This is valid for lambda, cosmid, BAC/P1 and smaller YAC clones. In practice cosmids have been used, but the technology is applicable to other clone types, particularly BACs.
The fingerprint is generated by recording the pattern of restriction fragments produced by the insert of each clone. The fingerprint is visualised by labelling the restriction fragments with 32-P and analysing them on sequencing-type gels. The fingerprint is thus based on the sequence of the clone insert, but sampled at low density. The restriction and labelling regime followed depends critically on the AT/GC content of the genome.

Example: the C. elegans genome map (Alan Coulson et al)

(Coulson et al., 1988; Sulston et al., 1992; Wilson et al., 1994)

The C. elegans genome project used cosmids and a Hind III/Sau3A I fingerprinting method. 17,000 cosmids (a 6.8x library) were miniprepped in 96 well format, and digested to completion with Hind III. The Hind III sites were partially infilled with radioactive 32-P, and the samples were then redigested with Sau3A I, denatured and analysed on sequencing-type gels. The conditions used generate an average of 30 bands per clone. Each clone lane was digitised and stored in a database. Matching software was then used to find clones which shared a significant number of bands (~ one third) and to predict sets of putatively overlapping clones, or contigs. This procedure resulted in ~95-98% of the genome being cloned and organised into 500 contigs. As C. elegans has only six chromosomes, these contigs obviously do not represent "chromosomes". The map was completed (or "closure was achieved") by (i) anchoring contigs to the chromosomes by hybridising cloned genes whose chromosomal location was known; (ii) actively searching for cosmid or lambda clones which bridged the gaps; and, most successfully, (iii) the use of YAC clones to link the contigs. Steps (ii) and (iii) were accomplished by hybridising whole cosmids or cosmid end probes to grids of clones.
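The band-matching step can be sketched as follows. This is a deliberately simple illustration, not the matching software used by the C. elegans project: it treats each digitised lane as a list of band positions, counts bands that agree within a tolerance, and flags a pair of clones when about one third of the bands are shared. The tolerance, the threshold handling and all names are assumptions, and no allowance is made for matches arising by chance.

```python
def shared_bands(bands_a, bands_b, tolerance=1.0):
    """Count bands that match between two lanes within the given tolerance.

    Each band is used at most once (two-pointer sweep over sorted positions).
    """
    a, b = sorted(bands_a), sorted(bands_b)
    i = j = count = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tolerance:
            count += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return count


def putative_overlap(bands_a, bands_b, min_fraction=1 / 3, tolerance=1.0):
    """Flag two clones as putatively overlapping if the shared bands make up
    at least min_fraction of the smaller fingerprint."""
    smaller = min(len(bands_a), len(bands_b))
    if smaller == 0:
        return False
    return shared_bands(bands_a, bands_b, tolerance) / smaller >= min_fraction


# Toy lanes with six bands each (real lanes average about 30 bands).
clone_a = [12, 25, 40, 58, 71, 90]
clone_b = [25.4, 40.3, 58.2, 120, 133, 150]   # shares three bands with clone_a
clone_c = [5, 18, 33, 47, 64, 101]            # shares none with clone_a

overlap_ab = putative_overlap(clone_a, clone_b)
overlap_ac = putative_overlap(clone_a, clone_c)
```

In this toy data clone_a and clone_b are flagged as overlapping and clone_a and clone_c are not.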

Example: the Leishmania major Map (Al Ivens et al)

(unpublished, but see Ivens and Little, 1995)

Leishmania major has a genome of 35 Mb, about one third the size of that of C. elegans. 10,000 cosmids were fingerprinted (an 11x library) and used to generate a contig dataset as for C. elegans. In the Leishmania project, a Hinf I digestion fingerprinting method was used, and samples were run on nondenaturing gels. The map generated is incomplete: many regions of the genome appear to be poorly clonable, and the gene-rich segments are severely overrepresented (40 clones in some cases). This problem is being addressed by hybridisation of end probes and EST cDNAs to the gridded library.


6 Ways of constructing genome maps IV: STS Content Mapping

Sequence tagged sites are regions of the genome which can be uniquely identified through their sequence. Originally this meant that they had been sequenced, and identification was by PCR using primers specific for the sequence. It is theoretically possible to use random primers and tag clones by their "RAPD" content, but this has analytical problems similar to those encountered in fingerprinting. The term now includes the identification of STS by hybridisation of a defined probe (which does not have to be sequenced). Mapping using STS involves identifying the clones which contain a particular sequence: if the sequence is single copy these must overlap. It is formally similar to fingerprinting, but the data analysis is more plus-minus than the quantitative matching required for fingerprints. The problems are (i) repetitive sequences (which will give multifurcating contigs), (ii) false positive scoring (which will give bifurcations) and (iii) false negative scoring (which will result in failed linking).
For hybridisation probes one can use

The STS content mapping approach is applicable to all clone types. In general it has been used on the larger-insert clones (YAC, BAC/P1).
There are two main "flavours" of hybridisation probe selection. Random selection involves using whatever probes are available. In the case of end-probes, this means generating probes from all of the clones of a library. If done exhaustively, this will mean that all possible links between all the clones will have been assessed, and the best possible map generated. For a clone library of respectable size (say 5000 clones for Brugia) this is 10,000 hybridisations. The workload is similar to that involved in fingerprinting.
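The plus-minus logic of STS content mapping can be sketched as a grouping problem: clones that score positive for the same single-copy STS must overlap, so they belong to the same contig. The sketch below groups clones with a union-find structure. It assumes every STS is single copy and that scoring is error free, so it shows none of the bifurcation or failed-linking problems described above, and it does not order the clones within a contig. The data and names are invented for illustration.

```python
from collections import defaultdict


def contigs_from_sts_hits(hits):
    """Group clones into contigs.

    hits maps an STS (probe) name to the set of clone names scored positive.
    Clones sharing any STS are placed in the same contig.
    """
    parent = {}

    def find(clone):
        parent.setdefault(clone, clone)
        while parent[clone] != clone:
            parent[clone] = parent[parent[clone]]   # path compression
            clone = parent[clone]
        return clone

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for clones in hits.values():
        ordered = sorted(clones)
        for clone in ordered:
            find(clone)                             # register every clone
        for other in ordered[1:]:
            union(ordered[0], other)

    groups = defaultdict(set)
    for clone in list(parent):
        groups[find(clone)].add(clone)
    return sorted(sorted(group) for group in groups.values())


# Invented hybridisation results: probe name -> positive clones.
example_hits = {
    "sts1": {"A", "B"},
    "sts2": {"B", "C"},
    "sts3": {"D"},
    "sts4": {"E", "F"},
}
example_contigs = contigs_from_sts_hits(example_hits)
```

Here clones A, B and C form one contig through the shared clone B, D is a singleton, and E and F form a second contig.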

Directed probe selection is theoretically and practically a much more efficient way of proceeding. It comes in two flavours, the "genome walk" (or even "... crawl") and sampling-without-replacement. In a walk, end probes are generated from a seed clone, and hybridising clones identified. The contig is assembled, and clones predicted to be outliers identified. End probes are made from these (two per clone) and the library rescreened. This process is repeated until a chromosome end or a gap is encountered, when a new seed is selected. Progress is slow because of the iterative nature of the procedure, and the walk progresses by approximately 0.5 x clone length per step. Thus for a genome of 100 Mb and 100 kb clones, this is 2000 hybridisations, if everything works.
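The workload figures quoted for the exhaustive random approach and for the genome walk follow from simple arithmetic, sketched here. The function names are illustrative, and the walk figure assumes, as the text does, that every step succeeds.

```python
def random_end_probe_hybridisations(n_clones: int, probes_per_clone: int = 2) -> int:
    """Hybridisations needed if end probes are made from every clone."""
    return n_clones * probes_per_clone


def walk_hybridisations(genome_bp: int, clone_bp: int, advance_fraction: float = 0.5) -> int:
    """Steps needed if each step advances advance_fraction of a clone length."""
    return round(genome_bp / (advance_fraction * clone_bp))


# Figures from the text: 5000 clones, 100 Mb genome, 100 kb clones.
exhaustive = random_end_probe_hybridisations(5000)
walking = walk_hybridisations(100_000_000, 100_000)
```

These give 10,000 hybridisations for the exhaustive approach and 2000 for the walk, matching the figures in the text.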


Sampling Without Replacement

(Hoheisel et al., 1993; Mizukami et al., 1993; Palazzolo et al., 1991)

Sampling without replacement describes a procedure where a seed clone is used to generate end probes and the probes are used to tag clones in the library. A second seed is then selected from the subset of clones which has not yet been tagged. This procedure is repeated until all clones have been tagged. The process has been modelled on a 100 Mb genome using a random 5x coverage library of 100 kb BACs. Initially, a large number of short (~2.5 x clone length) contigs are generated (reaching a maximum of ~250 contigs after 800 probings), but these are then linked in the latter third of the effort into fewer (160) larger (mean 600 kb or 6x clone size) contigs.

In simulations and in practice (the genome of Schizosaccharomyces pombe) the sampling without replacement technique is powerful and efficient. While the random approach requires testing of 5000 clones (2 probes per clone) the sampling without replacement method will need only 1500 such probings to generate a closed library map. At this point, the s-w-r method will have generated (modelled on a 100 Mb genome and given a good 100 kb BAC library with no significant bias in coverage and representation) 160 contigs of mean size 600 kb, while the random approach will have generated 500 contigs of 210 kb.
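The sampling-without-replacement loop can be sketched as a small simulation. This is not the model referred to in the text: it places equal-length clones uniformly at random, assumes every hybridisation is scored perfectly, counts one "probing" per seed clone (two end probes), and does not track contigs. It is therefore not expected to reproduce the quoted figures exactly. All names and the scaled-down example parameters are assumptions.

```python
import random


def simulate_swr(genome_bp, clone_bp, depth, rng_seed=0):
    """Simulate sampling without replacement on a random clone library.

    Returns (number of clones, number of seed clones probed) once every
    clone in the library has been tagged by at least one end probe.
    """
    rng = random.Random(rng_seed)
    n_clones = int(depth * genome_bp / clone_bp)
    # Random start coordinate for each clone; all clones have equal length.
    starts = [rng.randrange(genome_bp - clone_bp + 1) for _ in range(n_clones)]

    untagged = set(range(n_clones))
    probings = 0
    while untagged:
        # Choose the next seed only from clones not yet tagged.
        seed_clone = rng.choice(sorted(untagged))
        probings += 1
        left_end = starts[seed_clone]
        right_end = left_end + clone_bp - 1
        for probe_pos in (left_end, right_end):
            # A clone is tagged if its insert contains the probe position.
            hit = {c for c in untagged
                   if starts[c] <= probe_pos < starts[c] + clone_bp}
            untagged -= hit
    return n_clones, probings


# Scaled-down example: 1 Mb genome, 10 kb clones, 5x library (500 clones).
n_clones, probings = simulate_swr(1_000_000, 10_000, 5)
print(f"{n_clones} clones tagged using {probings} seed clones")
```

Because each seed tags itself and every untagged clone that contains one of its ends, the number of seeds needed is always smaller than the number of clones in an overlapping library, which is the saving the method relies on.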

For the sample-without-replacement approach, contig gap closure can be achieved using end-probes generated from clones predicted to be outliers in the contigs. The S. pombe experience was that most of the gaps they tried to close were "real" in the 5x library used (ie the DNA was not cloned), and they recommend moving to larger insert size libraries at this stage.


7 The Brugia genome: Current Status and Proposals

The Brugia genome is 100 Mb in size arranged in 5 chromosome pairs.

If the contour length of the chromosomes reflects their DNA content, the smallest chromosomes (a pair of small autosomes) are in the region of 10 Mb and the largest (a putative X chromosome) is in the region of 40 Mb.

We have 7500 EST sequences (at ~350 bases/EST this is ~2.6 Mb of determined sequence).

Genes are likely to be spaced at 5-6 kb intervals, and will occupy one third of the genomic DNA (based on C. elegans sequence data, limited Brugia genomic sequence and the calculation: [100 Mb genome - 10 Mb repeats] / 15000 genes = 6 kb/gene).
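The two estimates above are simple enough to check directly. The sketch below reproduces the gene-spacing calculation and the amount of EST sequence quoted in the text; the variable names are illustrative only.

```python
# Gene spacing: (genome size - repeats) / number of genes.
genome_mb = 100
repeats_mb = 10
gene_count = 15000
kb_per_gene = (genome_mb - repeats_mb) * 1000 / gene_count

# Determined EST sequence: number of ESTs x mean read length.
est_count = 7500
bases_per_est = 350
est_sequence_mb = est_count * bases_per_est / 1_000_000
```

This gives 6 kb per gene and about 2.6 Mb of EST sequence, as stated.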


References:

Burke, D. T., Carle, G. F., and Olson, M. V. (1987). Cloning of large segments of exogenous DNA into yeast by means of artificial chromosome vectors. Science 236, 806-812.

Coulson, A., Waterston, R., Kiff, J., Sulston, J., and Kohara, Y. (1988). Genome linking with yeast artificial chromosomes. Nature 335, 184-186.

Hoheisel, J. D., Maier, E., Mott, R., McCarthy, L., Grigoriev, A. V., Schalkwyk, L. C., Nizetic, D., Francis, F., and Lehrach, H. (1993). High resolution cosmid and P1 maps spanning the 14 Mb genome of the fission yeast S. pombe. Cell 73, 109-120.

Ivens, A. C., and Little, P. F. R. (1995). Cosmid clones and their application to genome studies. In Genome Analysis: A Practical Approach (IRL Press), pp. 1-47.

Kim, U.-J., Shizuya, H., Kang, H.-L., Choi, S.-S., Garrett, C. L., Smink, L. J., Birren, B. W., Korenberg, J. R., Dunham, I., and Simon, M. I. (1996). A bacterial artificial chromosome-based framework contig map of human chromosome 22q. Proceedings of the National Academy of Sciences, USA 93, 6297-6301.

Mizukami, T., Chang, W. I., Garkavtsev, I., Kaplan, N., Lombardi, D., Matsumoto, T., Niwa, O., Kounosu, A., Yanagida, M., Marr, T. G., and Beach, D. (1993). A 13 kb resolution cosmid map of the 14 Mb fission yeast genome by nonrandom sequence-tagged site mapping. Cell 73, 121-132.

Palazzolo, M. J., Sawyer, S. A., Martin, C. H., Smoller, D. A., and Hartl, D. L. (1991). Optimised strategies for sequence-tagged site selection in genome mapping. Proceedings of the National Academy of Sciences USA 88, 8034-8038.

Sulston, J., Du, Z., Thomas, K., Wilson, R., Hillier, L., Staden, R., Halloran, N., Green, P., Thierry-Mieg, J., Qiu, L., Dear, S., Coulson, A., Craxton, M., Durbin, R., Berks, M., Metzstein, M., Hawkins, T., Ainscough, R., and Waterston, R. (1992). The C. elegans genome sequencing project: A beginning. Nature 356, 37-41.

Wilson, R., Ainscough, R., Anderson, K., Baynes, C., Berks, M., Bonfield, J., Burton, J., Connell, M., Copsey, T., Cooper, J., Coulson, A., Craxton, M., Dear, S., Du, Z., Durbin, R., Favello, A., Fraser, A., Fulton, L., Gardner, A., Green, P., Hawkins, T., Hillier, L., Jier, M., Johnston, L., Jones, M., Kershaw, J., Kirsten, J., Laisster, N., Latrielle, P., Lightning, J., Lloyd, C., Mortimore, B., O'Callaghan, M., Parsons, J., Percy, C., Rifken, L., Roopra, A., Saunders, D., Shownkeen, R., Sims, M., Smaldon, N., Smith, A., Smith, M., Sonnhammer, E., Staden, R., Sulston, J., Thierry-Mieg, J., Thomas, K., Vaudin, M., Vaughan, K., Waterston, R., Watson, A., Weinstock, L., Wilkinson-Sproat, J., and Wohldman, P. (1994). 2.2 Mb of contiguous nucleotide sequence from chromosome III of C. elegans. Nature 368, 32-38.