Shotgun sequencing comes of age
Finished Drosophila sequence helps settle debate
over shotgun versus clone-by-clone sequencing.
By Tabitha M Powledge
With little fanfare, the much-debated sequencing method known
as whole-genome shotgun (WGS) has become a conventional way
to sequence genomes. Two studies out this month help to confirm
its importance.
Early this month, the publicly funded mouse genome project
showed that WGS could yield a high-quality
draft sequence, one superior
to the first draft of the human genome. And in a paper
published December 23 in Genome Biology (a publication
of The Scientist's partner, BioMed Central), Susan
Celniker and colleagues report that the WGS method has produced
a Drosophila sequence approaching the standard that
the US National
Human Genome Research Institute (NHGRI) has set for finished
sequence: less than 1 error per
10,000 base pairs. This third, and "finished", version of
the Drosophila genome, which
was the first metazoan genome sequenced predominantly by the
WGS method, now averages 1.09 errors
per 10,000 base pairs.
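To see how close 1.09 errors per 10,000 base pairs is to the NHGRI target, the two figures can be put on a common scale. The sketch below is illustrative only. It assumes the Phred-style quality score widely used in sequencing, Q = -10 log10(error probability), which the article itself does not mention, and the function and constant names are hypothetical.

```python
import math

# Figures quoted in the article, in errors per 10,000 base pairs.
NHGRI_FINISHED_LIMIT = 1.0      # the standard is "less than" this value
DROSOPHILA_RELEASE_3 = 1.09

def errors_per_10kb_to_probability(errors_per_10kb: float) -> float:
    """Convert 'errors per 10,000 bp' into a per-base error probability."""
    return errors_per_10kb / 10_000

def phred_quality(error_probability: float) -> float:
    """Phred-style score (an assumption here, not from the article)."""
    return -10 * math.log10(error_probability)

for label, rate in [("NHGRI limit", NHGRI_FINISHED_LIMIT),
                    ("Drosophila Release 3", DROSOPHILA_RELEASE_3)]:
    p = errors_per_10kb_to_probability(rate)
    # 1 / p is the average number of bases between errors.
    print(f"{label}: one error per {1 / p:,.0f} bases, Q{phred_quality(p):.1f}")
```

On this scale the NHGRI limit corresponds to Q40 and the Drosophila figure to roughly Q39.6, or about one error every 9,174 bases instead of one every 10,000.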
"The study seems to answer one of the initial criticisms
of WGS, that the finishing stage would be more difficult.
Turns out it is not," S. Blair Hedges, who works on vertebrate
genome evolution at Pennsylvania
State University, told The Scientist.
WGS has been around for two decades, but became controversial
when Celera Genomics
announced it would use the method to produce a draft human
genome sequence faster than the publicly funded Human Genome
Project. The latter relied heavily on a different method,
usually known as clone-by-clone.
Eric
Lander of the Whitehead
Institute, one of the NHGRI-funded sequencing centers,
blames the controversy in part on journalists. "The WGS versus
clone-based sequence issue was so muddled by the press during
the Human Genome Project," he told The Scientist. Journalists
presented the debate as an argument over whether WGS would
work, although researchers, he said, always agreed that it
would work for the draft sequence. The issue was whether it
would be the best route for getting to finished sequence.
"It was a cost-benefit argument,"
he said, an argument that has
not yet been resolved. "The only way to know is to measure
the cost of finishing a mammalian genome both ways. The experiment
has been done for human by clone-by-clone, but not for WGS."
Draft sequences are useful for many purposes, but finished
sequences are essential for identifying the full set of genes
and regulatory regions and getting the correct sequence of
proteins, Lander pointed out. "Without this, you can't know
what you're missing, what apparent genes may be non-functional
pseudogenes. You also cannot study repeat sequences accurately.
And it is much harder to spot new mutations."
Finished sequence also permits verification and error correction,
and completes fragmented and fragmentary genes, according
to Mark Blaxter, of the Institute
of Cell, Animal and Population Biology in Edinburgh, UK.
The completed Caenorhabditis elegans genome, he told
The Scientist, was 100.3Mb in size and contained more
than 1,000 additional protein-coding genes compared with the
97Mb first draft of 1998.
Lander argues that it is possible to finish a shotgun sequence
in organisms with few repeats, like bacteria or even Drosophila,
with 3% repeats in euchromatic regions. "But the evidence
suggests that clone-by-clone sequencing is required for organisms
with major repeats." That is particularly true of the human
sequence, in which 50% of the genome is repeats and, more
importantly, 5% represents nearly exact duplication, he said.
WGS smashes a genome into millions of bits, sequences the
bits, and localizes each one to a specific spot in the genome
by matching genetic markers in the bit to the same markers
on chromosomes. The clone-by-clone method breaks a genome
into largish chunks, clones the chunks into bacterial artificial
chromosomes (BACs), breaks the BAC DNA into smaller chunks,
matches their end sequences via computer programs, and then
localizes them in the genome with markers. It takes longer
than WGS and means sequencing thousands of BACs many times
to map a genome, but it has been regarded as more accurate.
WGS requires millions of sequence reads, too, but is believed
to be less expensive.
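The "smash, sequence and reassemble" idea behind WGS can be illustrated with a toy example. This is a minimal sketch under loud assumptions, not the assembler used by any project named here. It merges reads purely by sequence overlap (a common textbook simplification) rather than by the marker matching described above, it cuts reads at a fixed stride where real shotgun reads start at random positions, and it ignores sequencing errors and the second DNA strand. All function names and the 20-base "genome" are hypothetical.

```python
def shotgun_reads(genome: str, read_len: int, step: int) -> list[str]:
    """Cut overlapping reads from the genome at a fixed stride."""
    last_start = len(genome) - read_len
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)  # make sure the end of the genome is covered
    return [genome[s:s + read_len] for s in starts]

def overlap(a: str, b: str, min_len: int) -> int:
    """Longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads: list[str], min_len: int) -> list[str]:
    """Repeatedly merge the pair of contigs with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]

# Hypothetical 20-base genome in which every 4-base substring is unique.
toy_genome = "ATGGCGTACCTGAAGTCTAG"
toy_reads = shotgun_reads(toy_genome, read_len=8, step=4)
print(toy_reads)
print(greedy_assemble(toy_reads, min_len=4))
```

The toy genome reassembles cleanly only because it contains no repeats. Once a repeated stretch is longer than a read, several merges look equally good and the overlaps alone cannot say which is right, which is the limitation Lander and Kazazian raise for mammalian genomes.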
After all the high-profile discord, researchers have quietly
come to a consensus on sequencing: they want the best of
both approaches. Today's genome projects tend to combine the
two into hybrid strategies that are shaped by the complexity
of the genome under study and the way researchers are likely
to use the sequence information.
"We think shotgun sequencing is enough, certainly for organisms
whose sequence will be used primarily for comparative genomics
studies," Susan Celniker told The Scientist. Celniker
is co-director of the Berkeley
Drosophila Genome Project Sequencing Center at the University
of California, Berkeley. "BAC end-sequence is a necessary
component of the whole-genome shotgun-sequencing strategy
to build large sequence scaffolds, and the BAC fingerprints
are essential to verify the assembly, but making thousands
of BAC subclone libraries is unnecessary." Both zebrafish
and rat are being sequenced using hybrid strategies with a
whole-genome shotgun component, she pointed out. But, she
added, "As the publicly available sequence-assembly algorithms
and software improve, I think we will see the end of the hybrid
strategies."
"I believe that shotgun sequencing is great, but it doesn't
give an accurate account of the 40% or so of the genome of
mammals that is repetitive. So, there is definitely a need
for the sequencing of BAC clones to finish the job," said
Haig Kazazian, who chairs the genetics department at the University
of Pennsylvania and studies retrotransposons. "Speciation,
aging and other key processes may be affected by repeats,"
he told The Scientist.
Celniker and colleagues reported that, for some repeats,
neither WGS nor clone-by-clone works particularly well. Large
tandem repeats such as the histone cluster in Drosophila
are not solved by any simple method, and even in smaller tandem
arrays the number of copies is being estimated based on sizing
fragments using Southern blots, she said.
Financial calculations continue to drive individual decisions
about which methods to use for which projects. "Everyone in
genomics agrees that more sequence and better sequence is
better," Blaxter said. But only a limited number of bases
can be sequenced in a year. The result, he said, is a continual
tug-of-war between those who would like to see exhaustive
completion of one genome and those who would rather have draft
sequences of several.
Links for this article
T. Powledge, "Public mouse sequence published,"
The Scientist, December 4, 2002.
http://www.biomedcentral.com/news/20021204/08/
S.E. Celniker et al., "Finishing a whole-genome
shotgun: Release 3 of the Drosophila melanogaster euchromatic
genome sequence," Genome Biology, 3(12), December
23, 2002.
http://genomebiology.com/2002/3/12/research/0079
National Human Genome Research Institute
http://www.genome.gov
Pennsylvania State University
http://www.psu.edu/
Celera Genomics
http://www.celera.com/
Eric S. Lander
http://www.wi.mit.edu/far/far_lander_bio.html
Whitehead Institute
http://www.wi.mit.edu/
Institute of Cell, Animal and Population Biology
http://www.icapb.ed.ac.uk/
Berkeley Drosophila Genome Project
http://www.fruitfly.org/index.html
Department of Genetics, University of Pennsylvania
http://www.uphs.upenn.edu/genetics/