HGP in the Sanger Centre

The Sanger Centre is a genome research centre founded by the Wellcome Trust and the Medical Research Council. Its purpose is to further the knowledge of genomes, particularly through large-scale sequencing and analysis.

In the HGP, the Sanger Centre is funded to sequence one third of the human genome (1000 Mb). Its effort has been focussed on chromosomes 1, 6, 9, 10, 13, 20, 22 and X (some of which are shared with other centres). Particular emphasis has been placed on the construction of long-range landmark maps containing a high density of markers (~15 markers/Mb). These maps were constructed either from overlapping yeast artificial chromosome (YAC) clones or by whole-genome radiation hybrid (RH) mapping. As far as possible, the maps include landmarks obtained from external public databases, thus integrating the current data with information from other maps.

To increase the marker density where necessary, further sequence-tagged sites (STSs) have been generated by sequencing subclones of flow-sorted chromosomal DNA. P1 artificial chromosome (PAC), bacterial artificial chromosome (BAC), cosmid and fosmid clones are isolated from a range of libraries. Overlaps among them are determined from fingerprint data (generated in-house and, more recently, supplemented with data from the GSC database) and from STS-content data. The resulting contigs are extended by walking. Minimally overlapping sets of bacterial clones are then selected for sequencing.
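The selection of a minimally overlapping set of clones can be illustrated with a greedy interval-cover sketch. This is not the Sanger Centre's own software: it assumes, for illustration only, that each clone has already been placed at hypothetical start and end coordinates on its contig, whereas real fingerprint maps give only approximate positions. The names `Clone` and `minimal_tiling_path` are invented for this example.

```python
from typing import List, Tuple

# (name, start, end) in contig coordinates; coordinates are hypothetical.
Clone = Tuple[str, int, int]


def minimal_tiling_path(clones: List[Clone]) -> List[Clone]:
    """Greedily pick a small set of clones that covers the whole contig.

    At each step, among the clones that start at or before the position
    covered so far, choose the one that reaches furthest to the right.
    """
    if not clones:
        return []
    ordered = sorted(clones, key=lambda c: (c[1], -c[2]))
    contig_end = max(c[2] for c in ordered)
    covered_to = ordered[0][1]
    path: List[Clone] = []
    i = 0
    while covered_to < contig_end:
        best = None
        # Scan every not-yet-examined clone that overlaps the covered region.
        while i < len(ordered) and ordered[i][1] <= covered_to:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered_to:
            raise ValueError(f"gap in clone coverage at position {covered_to}")
        path.append(best)
        covered_to = best[2]
    return path


example = [
    ("A", 0, 100),
    ("B", 20, 150),
    ("C", 90, 220),
    ("D", 140, 300),
    ("E", 210, 310),
]
print([name for name, _, _ in minimal_tiling_path(example)])
```

In this toy contig, clones B and D are redundant and are left out of the path. A gap between clones raises an error, which corresponds to the situation where the contig must first be extended by walking.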

The sequencing strategy employed for the working draft is to sequence pUC subclones using dye terminator chemistry. Sequences are assembled using PHRAP and edited in a GAP4 database (R. Staden et al.). To complete the sequence, the working draft data are supplemented with additional sequence reads generated from pUC or M13 subclones using dye terminator or dye primer chemistry. Reactions to double-strand or contiguate the sequence are ordered using the FINISH programme prior to final problem solving and editing, which is undertaken by the finishers. Restriction digest data are used routinely to check assemblies and have proved particularly useful in repetitive regions. To deliver the working draft into the public domain early in 2000, the sequencing capacity has been increased fivefold through the introduction of PE3700 and MegaBACE capillary sequencers.
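The idea behind checking an assembly against restriction digest data can be sketched as follows: digest the assembled sequence in silico and compare the predicted fragment sizes with those observed on a gel. The text does not say which enzymes or tolerances are used, so the EcoRI site (GAATTC, cut after the first G), the linear-sequence treatment and the 5% size tolerance below are assumptions made for illustration, as are the function names.

```python
from typing import List


def digest_fragments(sequence: str, site: str, cut_offset: int) -> List[int]:
    """Return sorted fragment lengths from cutting a linear sequence at every site."""
    seq = sequence.upper()
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = seq.find(site, pos + 1)
    bounds = [0] + cuts + [len(seq)]
    return sorted(b - a for a, b in zip(bounds, bounds[1:]))


def fragments_agree(predicted: List[int], observed: List[int],
                    tolerance: float = 0.05) -> bool:
    """True if both digests have the same number of fragments and each
    predicted size is within the fractional tolerance of its observed partner."""
    if len(predicted) != len(observed):
        return False
    pairs = zip(sorted(predicted), sorted(observed))
    return all(abs(p - o) <= tolerance * p for p, o in pairs)


# Toy assembled sequence containing two EcoRI sites (assumed enzyme).
assembled = "AAAA" + "GAATTC" + "CCCCC" + "GAATTC" + "TT"
predicted = digest_fragments(assembled, site="GAATTC", cut_offset=1)
print(predicted)
print(fragments_agree(predicted, [5, 7, 11]))
```

A mismatch in fragment count or size suggests a misassembly, for example a collapsed repeat, which may be why such checks are most useful in repetitive regions.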

'Unfinished' sequence data are available via ftp within 24 hrs of assembly into contigs longer than 1 kb, according to the Bermuda rules for data release. They are also submitted to the EMBL public database and updated whenever they change. Once a clone has been 'finished', it is analysed by an integrated system that combines sequence search methods and ab initio gene prediction algorithms, with results imported into an Acedb database. An annotated version of the existing EMBL entry is then created from these results combined with manual annotation. Automatic functional annotation is provided by Pfam, the protein domain family database. Features are annotated in EMBL entries only if they are considered reliable, that is, supported by homology data. Ab initio gene prediction methods are currently considered too inaccurate when applied to human sequence to be used directly for annotation. Since sequence databases are incomplete and homology methods are limited, the annotation of genes is therefore incomplete.
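The release rule described above (contigs longer than 1 kb, made available within 24 hrs of assembly) can be written down as a small filter. This is a minimal sketch, not the centre's release pipeline; the `Contig` record, its field names and the helper functions are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

MIN_RELEASE_LENGTH_BP = 1000          # contigs must be longer than 1 kb
RELEASE_WINDOW = timedelta(hours=24)  # release within 24 hrs of assembly


@dataclass
class Contig:
    # Hypothetical record; field names are illustrative only.
    name: str
    length_bp: int
    assembled_at: datetime


def releasable(contigs: List[Contig]) -> List[Contig]:
    """Keep only the contigs long enough to fall under the release rule."""
    return [c for c in contigs if c.length_bp > MIN_RELEASE_LENGTH_BP]


def release_deadline(contig: Contig) -> datetime:
    """Latest time by which the contig should be publicly available."""
    return contig.assembled_at + RELEASE_WINDOW


assembled = datetime(2000, 1, 10, 9, 0)
contigs = [
    Contig("contig_1", 850, assembled),
    Contig("contig_2", 1000, assembled),
    Contig("contig_3", 4200, assembled),
]
for c in releasable(contigs):
    print(c.name, release_deadline(c).isoformat())
```

Note that a contig of exactly 1000 bp is excluded here, because the rule as stated applies to contigs longer than 1 kb.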

The human genome sequence is now available, but in 'unfinished' or 'working draft' form. Providing an annotated view of these fragmentary and evolving data is a major challenge. Groups at the Sanger Centre and the European Bioinformatics Institute are addressing this through the creation of a new systematic database of genome functional annotation called Ensembl. Ensembl's aim is to provide the reference view of data from the human genome sequence as a freely available resource based on an 'open source' software environment. Ensembl will form the core of a worldwide collaboration of scientists and programmers dedicated to interpreting and exploiting genome data.

Alongside the completion of the human genome sequence, several new initiatives aim to exploit the sequence information through large-scale approaches in the next era of human genetics. The Sanger Centre is one of the major contributors to The SNP Consortium (TSC), a consortium of laboratories funded by the Wellcome Trust and 10 pharmaceutical companies to use large-scale sequencing to identify 300,000 SNPs throughout the genome and to place them in the public domain (completion by April 2001). Within the Sanger Centre, a microarray facility to support large-scale studies of gene expression is being established, jointly funded by the Wellcome Trust, the Imperial Cancer Research Fund and the Ludwig Institutes of Cancer Research. The initiation of the Cancer Genome Project was announced in October 1999, with the aim of studying human sequence variation and its involvement in tumour formation. These and other projects will form the basis for human genome research in the Sanger Centre during the next few years, as the complete reference sequence becomes available to underpin the next phase of human biology and genetics.
