The human genome holds an extraordinary trove of information about human development, physiology, medicine and evolution. Here we report the results of an international collaboration to produce and make freely available a draft sequence of the human genome. We also present an initial analysis of the data, describing some of the insights that can be gleaned from the sequence.
The rediscovery of Mendel's laws of heredity in the opening weeks of the 20th century 1–3 sparked a scientific quest to understand the nature and content of genetic information that has propelled biology for the last hundred years. The scientific progress made falls naturally into four main phases, corresponding roughly to the four quarters of the century. The first established the cellular basis of heredity: the chromosomes. The second defined the molecular basis of heredity: the DNA double helix. The third unlocked the informational basis of heredity, with the discovery of the biological mechanism by which cells read the information contained in genes and with the invention of the recombinant DNA technologies of cloning and sequencing by which scientists can do the same. The last quarter of a century has been marked by a relentless drive to decipher first genes and then entire genomes, spawning the field of genomics. The fruits of this work already include the genome sequences of 599 viruses and viroids, 205 naturally occurring plasmids, 185 organelles, 31 eubacteria, seven archaea, one fungus, two animals and one plant.
Here we report the results of a collaboration involving 20 groups from the United States, the United Kingdom, Japan, France, Germany and China to produce a draft sequence of the human genome. The draft genome sequence was generated from a physical map covering more than 96% of the euchromatic part of the human genome and, together with additional sequence in public databases, it covers about 94% of the human genome. The sequence was produced over a relatively short period, with coverage rising from about 10% to more than 90% over roughly fifteen months. The sequence data have been made available without restriction and updated daily throughout the project. The task ahead is to produce a finished sequence, by closing all gaps and resolving all ambiguities. Already about one billion bases are in final form and the task of bringing the vast majority of the sequence to this standard is now straightforward and should proceed rapidly.
The sequence of the human genome is of interest in several respects. It is the largest genome to be extensively sequenced so far, being 25 times as large as any previously sequenced genome and eight times as large as the sum of all such genomes. It is the first vertebrate genome to be extensively sequenced. And, uniquely, it is the genome of our own species.
Much work remains to be done to produce a complete finished sequence, but the vast trove of information that has become available through this collaborative effort allows a global perspective on the human genome. Although the details will change as the sequence is finished, many points are already clear.
In this paper, we start by presenting background information on the project and describing the generation, assembly and evaluation of the draft genome sequence. We then focus on an initial analysis of the sequence itself: the broad chromosomal landscape; the repeat elements and the rich palaeontological record of evolutionary and biological processes that they provide; the human genes and proteins and their differences and similarities with those of other organisms; and the history of genomic segments. (Comparisons are drawn throughout with the genomes of the budding yeast Saccharomyces cerevisiae, the nematode worm Caenorhabditis elegans, the fruit fly Drosophila melanogaster and the mustard weed Arabidopsis thaliana; we refer to these for convenience simply as yeast, worm, fly and mustard weed.) Finally, we discuss applications of the sequence to biology and medicine and describe next steps in the project. A full description of the methods is provided as Supplementary Information on Nature's web site (http://www.nature.com). We recognize that it is impossible to provide a comprehensive analysis of this vast dataset, and thus our goal is to illustrate the range of insights that can be gleaned from the human genome and thereby to sketch a research agenda for the future.
(4) The development of random shotgun sequencing of complementary DNA fragments for high-throughput gene discovery by Schimmel 12 and Schimmel and Sutcliffe 13, later dubbed expressed sequence tags (ESTs) and pursued with automated sequencing by Venter and others 14–20.
A puzzling observation in the early days of molecular biology was that genome size does not correlate well with organismal complexity. For example, Homo sapiens has a genome that is 200 times as large as that of the yeast S. cerevisiae, but 200 times as small as that of Amoeba dubia 139,140. This mystery (the C-value paradox) was largely resolved with the recognition that genomes can contain a large quantity of repetitive sequence, far in excess of that devoted to protein-coding genes (reviewed in refs 140, 141).
In the human, coding sequences comprise less than 5% of the genome (see below), whereas repeat sequences account for at least 50% and probably much more. Broadly, the repeats fall into five classes: (1) transposon-derived repeats, often referred to as interspersed repeats; (2) inactive (partially) retroposed copies of cellular genes (including protein-coding genes and small structural RNAs), usually referred to as processed pseudogenes; (3) simple sequence repeats, consisting of direct repetitions of relatively short k-mers such as (A)n, (CA)n or (CGG)n; (4) segmental duplications, consisting of blocks of around 10–300 kb that have been copied from one region of the genome into another region; and (5) blocks of tandemly repeated sequences, such as at centromeres, telomeres, the short arms of acrocentric chromosomes and ribosomal gene clusters. (These regions are intentionally under-represented in the draft genome sequence and are not discussed here.)
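As an illustration of class (3), simple sequence repeats such as (A)n or (CA)n can be located with a backreference pattern. The following is a minimal sketch, not the method used for the draft sequence; the function name `find_simple_repeats` and the thresholds `max_unit` and `min_copies` are assumptions chosen for the example.

```python
import re

def find_simple_repeats(seq, max_unit=3, min_copies=4):
    """Return (start, unit, copies) for tandem runs of a short unit.

    max_unit and min_copies are illustrative thresholds, not values
    taken from the paper.
    """
    # ([ACGT]{1,k}?) lazily captures the shortest candidate unit;
    # \1{n,} then requires at least n further tandem copies of it.
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits

if __name__ == "__main__":
    toy = "GGT" + "CA" * 6 + "TT" + "A" * 8 + "GC"
    print(find_simple_repeats(toy))
```

Because the unit is matched lazily, a run of a single base is reported with a one-base unit rather than as a repeat of a longer unit.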
Repeats are often described as `junk' and dismissed as uninteresting. However, they actually represent an extraordinary trove of information about biological processes. The repeats constitute a rich palaeontological record, holding crucial clues about evolutionary events and forces. As passive markers, they provide assays for studying processes of mutation and selection. It is possible to recognize cohorts of repeats `born' at the same time and to follow their fates in different regions of the genome or in different species. As active agents, repeats have reshaped the genome by causing ectopic rearrangements, creating entirely new genes, modifying and reshuffling existing genes, and modulating overall GC content. They also shed light on chromosome structure and dynamics, and provide tools for medical genetic and population genetic studies. The human is the first repeat-rich genome to be sequenced, and so we investigated what information could be gleaned from this majority component of the human genome. Although some of the general observations about repeats were suggested by previous studies, the draft genome sequence provides the first comprehensive view, allowing some questions to be resolved and new mysteries to emerge.
Most human repeat sequence is derived from transposable elements 142,143. We can currently recognize about 45% of the genome as belonging to this class. Much of the remaining `unique' DNA must also be derived from ancient transposable element copies that have diverged too far to be recognized as such. To describe our analyses of interspersed repeats, it is necessary briefly to review the relevant features of human transposable elements.
Classes of transposable elements. In mammals, almost all transposable elements fall into one of four types (Fig. 17), of which three transpose through RNA intermediates and one transposes directly as DNA. These are long interspersed elements (LINEs), short interspersed elements (SINEs), LTR retrotransposons and DNA transposons.
LINEs are one of the most ancient and successful inventions in eukaryotic genomes. In humans, these transposons are about 6 kb long, harbour an internal polymerase II promoter and encode two open reading frames (ORFs). Upon translation, a LINE RNA assembles with its own encoded proteins and moves to the nucleus, where an endonuclease activity makes a single-stranded nick and the reverse transcriptase uses the nicked DNA to prime reverse transcription from the 3′ end of the LINE RNA. Reverse transcription frequently fails to proceed to the 5′ end, resulting in many truncated, nonfunctional insertions. Indeed, most LINE-derived repeats are short, with an average size of 900 bp for all LINE1 copies, and a median size of 1,070 bp for copies of the currently active LINE1 element (L1Hs). New insertion sites are flanked by a small target site duplication of 7–20 bp. The LINE machinery is believed to be responsible for most reverse transcription in the genome, including the retrotransposition of the non-autonomous SINEs 144 and the creation of processed pseudogenes 145,146. Three distantly related LINE families are found in the human genome: LINE1, LINE2 and LINE3. Only LINE1 is still active.
SINEs are wildly successful freeloaders on the backs of LINE elements. They are short (about 100–400 bp), harbour an internal polymerase III promoter and encode no proteins. These nonautonomous transposons are thought to use the LINE machinery for transposition. Indeed, most SINEs `live' by sharing the 3′ end with a resident LINE element 144. The promoter regions of all known SINEs are derived from tRNA sequences, with the exception of a single monophyletic family of SINEs derived from the signal recognition particle component 7SL. This family, which also does not share its 3′ end with a LINE, includes the only active SINE in the human genome: the Alu element. By contrast, the mouse has both tRNA-derived and 7SL-derived SINEs.
The human genome contains three distinct monophyletic families of SINEs: the active Alu, and the inactive MIR and Ther2/MIR3.
LTR retroposons are flanked by long terminal direct repeats that contain all of the necessary transcriptional regulatory elements. The autonomous elements (retrotransposons) contain gag and pol genes, which encode a protease, reverse transcriptase, RNAse H and integrase. Exogenous retroviruses seem to have arisen from endogenous retrotransposons by acquisition of a cellular envelope gene (env) 147. Transposition occurs through the retroviral mechanism with reverse transcription occurring in a cytoplasmic virus-like particle, primed by a tRNA (in contrast to the nuclear location and chromosomal priming of LINEs). Although a variety of LTR retrotransposons exist, only the vertebrate-specific endogenous retroviruses (ERVs) appear to have been active in the mammalian genome. Mammalian retroviruses fall into three classes (I–III), each comprising many families with independent origins. Most (85%) of the LTR retroposon-derived `fossils' consist only of an isolated LTR, with the internal sequence having been lost by homologous recombination between the flanking LTRs.
DNA transposons resemble bacterial transposons, having terminal inverted repeats and encoding a transposase that binds near the inverted repeats and mediates mobility through a `cut-and-paste' mechanism. The human genome contains at least seven major classes of DNA transposon, which can be subdivided into many families with independent origins 148 (see RepBase, http://www.girinst.org/~server/repbase.html). DNA transposons tend to have short life spans within a species. This can be explained by contrasting the modes of transposition of DNA transposons and LINE elements. LINE transposition tends to involve only functional elements, owing to the cis-preference by which LINE proteins assemble with the RNA from which they were translated.
By contrast, DNA transposons cannot exercise a cis-preference: the encoded transposase is produced in the cytoplasm and, when it returns to the nucleus, it cannot distinguish active from inactive elements. As inactive copies accumulate in the genome, transposition becomes less efficient. This checks the expansion of any DNA transposon family and in due course causes it to die out. To survive, DNA transposons must eventually move by horizontal transfer to virgin genomes, and there is considerable evidence for such transfer 149–153.
Transposable elements employ different strategies to ensure their evolutionary survival. LINEs and SINEs rely almost exclusively on vertical transmission within the host genome 154 (but see refs 148, 155). DNA transposons are more promiscuous, requiring relatively frequent horizontal transfer. LTR retroposons use both strategies, with some being long-term active residents of the human genome (such as members of the ERVL family) and others having only short residence times.
Census of human repeats. We began by taking a census of the transposable elements in the draft genome sequence, using a recently updated version of the RepeatMasker program (version 09092000) run under sensitive settings (see http://repeatmasker.genome.washington.edu). This program scans sequences to identify full-length and partial members of all known repeat families represented in RepBase Update (version 5.08; see http://www.girinst.org/~server/repbase.html and ref. 156). Table 11 shows the number of copies and fraction of the draft genome sequence occupied by each of the four major classes and the main subclasses. The precise count of repeats is obviously underestimated because the genome sequence is not finished, but their density and other properties can be stated with reasonable confidence. Currently recognized SINEs, LINEs, LTR retroposons and DNA transposon copies comprise 13%, 20%, 8% and 3% of the sequence, respectively. We expect these densities to grow as more repeat families are recognized, among which will be lower copy number LTR elements and DNA transposons, and possibly high copy number ancient (highly diverged) repeats.
The age distribution of the repeats in the human genome provides a rich `fossil record' stretching over several hundred million years. The ancestry and approximate age of each fossil can be inferred by exploiting the fact that each copy is derived from, and therefore initially carried the sequence of, a then-active transposon and, being generally under no functional constraint, has accumulated mutations randomly and independently of other copies. We can infer the sequence of the ancestral active elements by clustering the modern derivatives into phylogenetic trees and building a consensus based on the multiple sequence alignment of a cluster of copies. Using available consensus sequences for known repeat subfamilies, we calculated the per cent divergence from the inferred ancestral active transposon for each of three million interspersed repeats in the draft genome sequence. The percentage of sequence divergence can be converted into an approximate age in millions of years (Myr) on the basis of evolutionary information. Care is required in calibrating the clock, because the rate of sequence divergence may not be constant over time or between lineages 139. The relative-rate test 157 can be used to calculate the sequence divergence that accumulated in a lineage after a given timepoint, on the basis of comparison with a sibling species that diverged at that time and an outgroup species. For example, the substitution rate over roughly the last 25 Myr in the human lineage can be calculated by using old world monkeys (which diverged about 25 Myr ago) as a sibling species and new world monkeys as an outgroup. We have used currently available calibrations for the human lineage, but the issue should be revisited as sequence information becomes available from different mammals. Figure 18a shows the representation of various classes of transposable elements in categories reflecting equal amounts of sequence divergence.
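The consensus-and-divergence step described above can be sketched in a few lines. This is a simplified illustration that assumes the copies are already aligned to equal length and uses a plain majority-rule consensus; it omits the phylogenetic clustering into subfamilies, and the function names `consensus` and `percent_divergence` are hypothetical.

```python
from collections import Counter

def consensus(aligned_copies):
    """Majority-rule consensus of equal-length aligned copies ('-' marks a gap)."""
    out = []
    for column in zip(*aligned_copies):
        counts = Counter(base for base in column if base != "-")
        # Take the most frequent base; keep a gap if the column holds only gaps.
        out.append(counts.most_common(1)[0][0] if counts else "-")
    return "".join(out)

def percent_divergence(copy, cons):
    """Per cent mismatches over positions where both sequences carry a base."""
    pairs = [(a, b) for a, b in zip(copy, cons) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable positions")
    mismatches = sum(a != b for a, b in pairs)
    return 100.0 * mismatches / len(pairs)

if __name__ == "__main__":
    copies = ["ACGTACGTAC", "ACGTACGTAC", "ACGAACGTAC", "ACGTACCTAC"]
    ancestor = consensus(copies)
    print(ancestor, [percent_divergence(c, ancestor) for c in copies])
```

Each copy's divergence from the inferred ancestor is the quantity that is subsequently converted into an approximate age once a substitution clock has been calibrated.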
Figure 18 Age distribution of interspersed repeats in the human and mouse genomes. Bases covered by interspersed repeats were sorted by their divergence from their consensus sequence (which approximates the repeat's original sequence at the time of insertion). The average number of substitutions per 100 bp (substitution level, K) was calculated from the mismatch level p assuming equal frequency of all substitutions (the one-parameter Jukes–Cantor model, $K = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$). This model tends to underestimate higher substitution levels. CpG dinucleotides in the consensus were excluded from the substitution level calculations because the C → T transition rate in CpG pairs is about tenfold higher than other transitions and causes distortions in comparing transposable elements with high and low CpG content. a, The distribution, for the human genome, in bins corresponding to 1% increments in substitution levels. b, The data grouped into bins representing roughly equal time periods of 25 Myr. c,d, Equivalent data for available mouse genomic sequence. There is a different correspondence between substitution levels and time periods owing to different rates of nucleotide substitution in the two species. The correspondence between substitution levels and time periods was largely derived from three-way species comparisons (relative rate test 139,157) with the age estimates based on fossil data. Human divergence from gibbon 20–30 Myr; old world monkey 25–35 Myr; prosimians 55–80 Myr; eutherian mammalian radiation ~100 Myr.
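The Jukes–Cantor correction named in the legend can be written out directly. The sketch below takes the mismatch level p as a fraction per site and returns K as substitutions per site (multiply by 100 for substitutions per 100 bp); the function name `jukes_cantor` is an assumption, and the exclusion of CpG sites described in the legend is not implemented here.

```python
import math

def jukes_cantor(p):
    """Substitutions per site K from an observed mismatch fraction p.

    Implements K = -(3/4) * ln(1 - (4/3) * p), which is defined
    only for 0 <= p < 0.75.
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("mismatch fraction must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)

if __name__ == "__main__":
    for p in (0.01, 0.10, 0.30):
        print(f"p = {p:.2f}  K = {jukes_cantor(p):.4f}")
```

The correction grows with p because multiple substitutions at the same site become more likely as copies diverge, so the observed mismatch level increasingly undercounts the true number of changes.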
In Fig. 18b the data are grouped into four bins corresponding to successive 25-Myr periods, on the basis of an approximate clock. Figure 19 shows the mean ages of various subfamilies of DNA transposons. Several facts are apparent from these graphs. First, most interspersed repeats in the human genome predate the eutherian radiation. This is a testament to the extremely slow rate with which nonfunctional sequences are cleared from vertebrate genomes (see below concerning comparison with the fly). Second, LINE and SINE elements have extremely long lives. The monophyletic LINE1 and Alu lineages are at least 150 and 80 Myr old, respectively. In earlier times, the reigning transposons were LINE2 and MIR 148,158. The SINE MIR was perfectly adapted for reverse transcription by LINE2, as it carried the same 50-base sequence at its 3′ end. When LINE2 became extinct 80–100 Myr ago, it spelled the doom of MIR. Third, there were two major peaks of DNA transposon activity (Fig. 19). The first involved Charlie elements and occurred long before the eutherian radiation; the second involved Tigger elements and occurred after this radiation. Because DNA transposons can produce large-scale chromosome rearrangements 159–162, it is possible that widespread activity could be involved in speciation events. Fourth, there is no evidence for DNA transposon activity in the past 50 Myr in the human genome. The youngest two DNA transposon families that we can identify in the draft genome sequence (MER75 and MER85) show 6–7% divergence from their respective consensus sequences representing the ancestral element (Fig. 19), indicating that they were active before the divergence of humans and new world monkeys. Moreover, these elements were relatively unsuccessful, together contributing just 125 kb to the draft genome sequence.
Finally, LTR retroposons appear to be teetering on the brink of extinction, if they have not already succumbed. For example, the most prolific elements (ERVL and MaLRs) flourished for more than 100 Myr but appear to have died out about 40 Myr ago 163,164. Only a single LTR retroposon family (HERVK10) is known to have transposed since our divergence from the chimpanzee 7 Myr ago, with only one known copy (in the HLA region) that is not shared between all humans 165. In the draft genome sequence, we can identify only three full-length copies with all ORFs intact (the final total may be slightly higher owing to the imperfect state of the draft genome sequence).
More generally, the overall activity of all transposons has declined markedly over the past 35–50 Myr, with the possible exception of LINE1 (Fig. 18). Indeed, apart from an exceptional burst of activity of Alus peaking around 40 Myr ago, there would appear to have been a fairly steady decline in activity in the hominid lineage since the mammalian radiation. The extent of the decline must be even greater than it appears because old repeats are gradually removed by random deletion and because old repeat families are harder to recognize and likely to be under-represented in the repeat databases. (We confirmed that the decline in transposition is not an artefact arising from errors in the draft genome sequence, which, in principle, could increase the divergence level in recent elements. First, the sequence error rate (Table 9) is far too low to have a significant effect on the apparent age of recent transposons; and second, the same result is seen if one considers only finished sequence.)
What explains the decline in transposon activity in the lineage leading to humans? We return to this question below, in the context of the observation that there is no similar decline in the mouse genome.
Comparison with other organisms.
We compared the complement of transposable elements in the human genome with those of the other sequenced eukaryotic genomes. We analysed the fly, worm and mustard weed genomes for the number and nature of repeats (Table 12) and the age distribution (Fig. 20). (For the fly, we analysed the 114 Mb of unfinished `large' contigs produced by the whole-genome shotgun assembly 166, which are reported to represent euchromatic sequence. Similar results were obtained by analysing 30 Mb of finished euchromatic sequence.) The human genome stands in stark contrast to the genomes of the other organisms.
(1) The euchromatic portion of the human genome has a much higher density of transposable element copies than the euchromatic DNA of the other three organisms. The repeats in the other organisms may have been slightly underestimated because the repeat databases for the other organisms are less complete than for the human, especially with regard to older elements; on the other hand, recent additions to these databases appear to increase the repeat content only marginally.
(2) The human genome is filled with copies of ancient transposons, whereas the transposons in the other genomes tend to be of more recent origin. The difference is most marked with the fly, but is clear for the other genomes as well. The accumulation of old repeats is likely to be determined by the rate at which organisms engage in `housecleaning' through genomic deletion. Studies of pseudogenes have suggested that small deletions occur at a rate that is 75-fold higher in flies than in mammals; the half-life of such nonfunctional DNA is estimated at 12 Myr for flies and 800 Myr for mammals 167. The rate of large deletions has not been systematically compared, but seems likely also to differ markedly.
(3) Whereas in the human two repeat families (LINE1 and Alu) account for 60% of all interspersed repeat sequence, the other organisms have no dominant families. Instead, the worm, fly and mustard weed genomes all contain many transposon families, each consisting of typically hundreds to thousands of elements. This difference may be explained by the observation that the vertically transmitted, long-term residential LINE and SINE elements represent 75% of interspersed repeats in the human genome, but only 5–25% in the other genomes. In contrast, the horizontally transmitted and shorter-lived DNA transposons represent only a small portion of all interspersed repeats in humans (6%) but a much larger fraction in fly, mustard weed and worm (25%, 49% and 87%, respectively).
These features of the human genome are probably general to all mammals. The relative lack of horizontally transmitted elements may have its origin in the well developed immune system of mammals, as horizontal transfer requires infectious vectors, such as viruses, against which the immune system guards. We also looked for differences among mammals, by comparing the transposons in the human and mouse genomes.
As with the human genome, care is required in calibrating the substitution clock for the mouse genome. There is considerable evidence that the rate of substitution per Myr is higher in rodent lineages than in the hominid lineages 139,168,169. In fact, we found clear evidence for different rates of substitution by examining families of transposable elements whose insertions predate the divergence of the human and mouse lineages. In an analysis of 22 such families, we found that the substitution level was an average of 1.7-fold higher in mouse than human (not shown). (This is likely to be an underestimate because of an ascertainment bias against the most diverged copies.) The faster clock in mouse is also evident from the fact that the ancient LINE2 and MIR elements, which transposed before the mammalian radiation and are readily detectable in the human genome, cannot be readily identified in available mouse genomic sequence (Fig. 18). We used the best available estimates to calibrate substitution levels and time 169. The ratio of substitution rates varied from about 1.7-fold higher over the past 100 Myr to about 2.6-fold higher over the past 25 Myr.
The analysis shows that, although the overall density of the four transposon types in human and mouse is similar, the age distribution is strikingly different (Fig. 18). Transposon activity in the mouse genome has not undergone the decline seen in humans and proceeds at a much higher rate. In contrast to their possible extinction in humans, LTR retroposons are alive and well in the mouse with such representatives as the active IAP family and putatively active members of the long-lived ERVL and MaLR families. LINE1 and a variety of SINEs are quite active. These evolutionary findings are consistent with the empirical observations that new spontaneous mutations are 30 times more likely to be caused by LINE insertions in mouse than in human (~3% versus 0.1%) 170 and 60 times more likely to be caused by transposable elements in general. It is estimated that around 1 in 600 mutations in human are due to transpositions, whereas 10% of mutations in mouse are due to transpositions (mostly IAP insertions).
The contrast between human and mouse suggests that the explanation for the decline of transposon activity in humans may lie in some fundamental difference between hominids and rodents. Population structure and dynamics would seem to be likely suspects. Rodents tend to have large populations, whereas hominid populations tend to be small and may undergo frequent bottlenecks. Evolutionary forces affected by such factors include inbreeding and genetic drift, which might affect the persistence of active transposable elements 171. Studies in additional mammalian lineages may shed light on the forces responsible for the differences in the activity of transposable elements 172.
Variation in the distribution of repeats. We next explored variation in the distribution of repeats across the draft genome sequence, by calculating the repeat density in windows of various sizes across the genome. There is striking variation at smaller scales.
Some regions of the genome are extraordinarily dense in repeats. The prizewinner appears to be a 525-kb region on chromosome Xp11, with an overall transposable element density of 89%. This region contains a 200-kb segment with 98% density, as well as a segment of 100 kb in which LINE1 sequences alone comprise 89% of the sequence. In addition, there are regions of more than 100 kb with extremely high densities of Alu (>56% at three loci, including one on 7q11 with a 50-kb stretch of >61% Alu) and the ancient transposons MIR (>15% on chromosome 1p36) and LINE2 (>18% on chromosome 22q12). In contrast, some genomic regions are nearly devoid of repeats.
The absence of repeats may be a sign of large-scale cis-regulatory elements that cannot tolerate being interrupted by insertions. The four regions with the lowest density of interspersed repeats in the human genome are the four homeobox gene clusters, HOXA, HOXB, HOXC and HOXD (Fig. 21). Each locus contains regions of around 100 kb containing less than 2% interspersed repeats. Ongoing sequence analysis of the four HOX clusters in mouse, rat and baboon shows a similar absence of transposable elements, and reveals a high density of conserved noncoding elements (K. Dewar and B. Birren, manuscript in preparation). The presence of a complex collection of regulatory regions may explain why individual HOX genes carried in transgenic mice fail to show proper regulation.
It may be worth investigating other repeat-poor regions, such as a region on chromosome 8q21 (1.5% repeat over 63 kb) containing a gene encoding a homeodomain zinc-finger protein (homologous to mouse pID 9663936), a region on chromosome 1p36 (5% repeat over 100 kb) with no obvious genes and a region on chromosome 18q22 (4% over 100 kb) containing three genes of unknown function (among which is KIAA0450). It will be interesting to see whether the homologous regions in the mouse genome have similarly resisted the insertion of transposable elements during rodent evolution.
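The window-based density calculation used in this survey can be sketched as follows. This is a toy illustration under stated assumptions: repeat annotations are taken to be half-open, 0-based (start, end) intervals, and the function name `repeat_density` is hypothetical. A per-base mask is simple to reason about but would be too memory-hungry for a whole chromosome, where an interval sweep would be the more likely choice.

```python
def repeat_density(intervals, seq_len, window):
    """Fraction of bases covered by repeats in consecutive windows.

    intervals: (start, end) pairs, half-open and 0-based; overlapping
    annotations are counted once per base.
    """
    covered = [False] * seq_len
    for start, end in intervals:
        for i in range(max(start, 0), min(end, seq_len)):
            covered[i] = True
    densities = []
    for left in range(0, seq_len, window):
        chunk = covered[left:left + window]  # the last window may be shorter
        densities.append(sum(chunk) / len(chunk))
    return densities

if __name__ == "__main__":
    # Two overlapping repeats in the first window, one in the second.
    print(repeat_density([(0, 50), (40, 60), (150, 200)], seq_len=200, window=100))
```

Scanning the resulting list for its extremes is one way to flag repeat-dense and repeat-poor regions of the kind described above.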
Distribution by GC content. We next focused on the correlation between the nature of the transposons in a region and its GC content. We calculated the density of each repeat type as a function of the GC content in 50-kb windows (Fig. 22). As has been reported 142,173±176, LINE sequences occur at much higher density in AT-rich regions (roughly fourfold enriched), whereas SINEs (MIR, Alu) show the opposite trend (for Alu, up to Ævefold lower in ATrich DNA). LTR retroposons and DNA transposons show a more uniform distribution, dipping only in the most GC-rich regions. The preference of LINEs for AT-rich DNA seems like a reasonable way for a genomic parasite to accommodate its host, by targeting gene-poor AT-rich DNA and thereby imposing a lower mutational burden. Mechanistically, selective targeting is nicely explained by the fact that the preferred cleavage site of the LINE endonuclease is TTTT/A (where the slash indicates the point of cleavage), which is used to prime reverse transcription from the poly(A) tail of LINE RNA 177.
The contrary behaviour of SINEs, however, is bafØing. How do SINEs accumulate in GC-rich DNA, particularly if they depend on the LINE transposition machinery178? Notably, the same pattern is seen for the Alu-like B1 and the tRNA-derived SINEs in mouse and for MIR in human142. One possibility is that SINEs somehow target GC-rich DNA for insertion. The alternative is that SINEs initially insert with the same proclivity for AT-rich DNA as LINEs, but that the distribution is subsequently reshaped by evolutionary forces142,179.
We used the draft genome sequence to investigate this mystery by comparing the proclivities of young, adolescent, middle-aged and old Alus (Fig. 23). Strikingly, recent Alus show a preference for ATrich DNA resembling that of LINEs, whereas progressively older Alus show a progressively stronger bias towards GC-rich DNA. These results indicate that the GC bias must result from strong pressure: Fig. 23 shows that a 13-fold enrichment of Alus in GC-rich DNA has occurred within the last 30 Myr, and possibly more recently.
These results raise a new mystery. What is the force that produces the great and rapid enrichment of Alus in GC-rich DNA? One explanation may be that deletions are more readily tolerated in gene-poor AT-rich regions than in gene-rich GC-rich regions, resulting in older elements being enriched in GC-rich regions. Such an enrichment is seen for transposable elements such as DNA transposons (Fig. 24). However, this effect seems too slow and too small to account for the observed remodelling of the Alu distribution. This can be seen by performing a similar analysis for LINE elements (Fig. 25). There is no signiÆcant change in the LINE distribution over the past 100 Myr, in contrast to the rapid change seen for Alu. There is an eventual shift after more than 100 Myr, although its magnitude is still smaller than seen for Alus. These observations indicate that there may be some force acting particularly on Alus. This could be a higher rate of random loss of Alus in AT-rich DNA, negative selection against Alus in AT-rich DNA or positive selection in favour of Alus in GC-rich DNA. The Ærst two possibilities seem unlikely because AT-rich DNA is gene-poor and tolerates the accumulation of other transposable elements. The third seems more feasible, in that it involves selecting in favour of the minority of Alus in GC-rich regions rather than against the majority that lie in AT-rich regions. But positive selection for Alus in GC-rich regions would imply that they benefit the organism.
Schmid 180 has proposed such a function for SINEs. This hypothesis is based on the observation that in many species SINEs are transcribed under conditions of stress, and the resulting RNAs specifically bind a particular protein kinase (PKR) and block its ability to inhibit protein translation 181–183. SINE RNAs would thus promote protein translation under stress. SINE RNA may be well suited to such a role in regulating protein translation, because it can be quickly transcribed in large quantities from thousands of elements and it can function without protein translation. Under this theory, there could be positive selection for SINEs in readily transcribed open chromatin such as is found near genes. This could explain the retention of Alus in gene-rich GC-rich regions. It is also consistent with the observation that SINE density in AT-rich DNA is higher near genes 142.
Further insight about Alus comes from the relationship between Alu density and GC content on individual chromosomes (Fig. 26). There are two outliers. Chromosome 19 is even richer in Alus than predicted by its (high) GC content; the chromosome comprises 2% of the genome, but contains 5% of Alus. On the other hand, chromosome Y shows the lowest density of Alus relative to its GC content, being higher than average for GC content less than 40% and lower than average for GC content over 40%. Even in AT-rich DNA, Alus are under-represented on chromosome Y compared with other young interspersed repeats (see below). These phenomena may be related to an unusually high gene density on chromosome 19 and an unusually low density of somatically active genes on chromosome Y (both relative to GC content). This would be consistent with the idea that Alu correlates not with GC content but with actively transcribed genes.
Our results may support the controversial idea that SINEs actually earn their keep in the genome. Clearly, much additional work will be needed to prove or disprove the hypothesis that SINEs are genomic symbionts.
Figure 25 Distribution of various LINE cohorts as a function of local GC content. The divergence levels and ages of the cohorts are shown in the key. (The divergence levels were measured for the 3′ UTR of the LINE1 element only, which is best characterized evolutionarily. This region contains almost no CpG sites, and thus a 1% divergence level corresponds to a much longer time than for CpG-rich Alu copies.)
Biases in human mutation.
Indirect studies have suggested that nucleotide substitution is not uniform across mammalian genomes 184–187. By studying sets of repeat elements belonging to a common cohort, one can directly measure nucleotide substitution rates in different regions of the genome. We find strong evidence that the pattern of neutral substitution differs as a function of local GC content (Fig. 27). Because the results are observed in repetitive elements throughout the genome, the variation in the pattern of nucleotide substitution seems likely to be due to differences in the underlying mutational process rather than to selection. The effect can be seen most clearly by focusing on the substitution process g ↔ a, where g denotes GC or CG base pairs and a denotes AT or TA base pairs. If K is the equilibrium constant in the direction of a base pairs (defined by the ratio of the forward and reverse rates), then the equilibrium GC content should be 1/(1 + K). Two observations emerge.
First, there is a regional bias in substitution patterns. The equilibrium constant varies as a function of local GC content: g base pairs are more likely to mutate towards a base pairs in AT-rich regions than in GC-rich regions. For the analysis in Fig. 27, the equilibrium constant K is 2.5, 1.9 and 1.2 when the draft genome sequence is partitioned into three bins with average GC content of 37, 43 and 50%, respectively. This bias could be due to a reported tendency for GC-rich regions to replicate earlier in the cell cycle than AT-rich regions and for guanine pools, which are limiting for [...] base pairs. [...] substitutions are due to differences in DNA repair mechanisms, possibly related to transcriptional activity and thereby to gene density and GC content 185,189,190.
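As a quick arithmetic check of the relation between K and equilibrium GC content, the following minimal Python sketch evaluates 1/(1 + K) for the three K values quoted above. The function name is illustrative and not taken from the paper, and because the quoted K values are rounded, the results need not agree with published percentages to the last digit.

```python
def equilibrium_gc(k: float) -> float:
    """Equilibrium GC fraction for an equilibrium constant K.

    K is the ratio of the g -> a substitution rate to the a -> g rate,
    so at equilibrium the GC fraction is 1 / (1 + K).
    """
    if k < 0:
        raise ValueError("K must be non-negative")
    return 1.0 / (1.0 + k)


# K values quoted in the text for bins with mean GC content of 37, 43 and 50%.
observed_gc = [0.37, 0.43, 0.50]
k_values = [2.5, 1.9, 1.2]

for gc, k in zip(observed_gc, k_values):
    print(f"observed GC {gc:.0%}: K = {k}, equilibrium GC = {equilibrium_gc(k):.1%}")
```

For all three bins the computed equilibrium value lies below the observed GC content of the bin.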
Second, there is an absolute bias in substitution patterns resulting in directional pressure towards lower GC content throughout the human genome. The genome is not at equilibrium with respect to the pattern of nucleotide substitution: the expected equilibrium GC content corresponding to the values of K above is 29, 35 and 44% for regions with average GC contents of 37, 43 and 50%, respectively. Recent observations on SNPs 190 confirm that the mutation pattern in GC-rich DNA is biased towards a base pairs; it should be possible to perform similar analyses throughout the genome with the availability of 1.4 million SNPs 97,191. On the basis solely of nucleotide substitution patterns, the GC content would be expected to be about 7% lower throughout the genome. What accounts for the higher GC content? One possible explanation is that in GC-rich regions, a considerable fraction of the nucleotides is likely to be under functional constraint owing to the high gene density. Selection on coding regions and regulatory CpG islands may maintain the higher-than-predicted GC content. Another is that throughout the rest of the genome, a constant influx of transposable elements tends to increase GC content (Fig. 28). Young repeat elements clearly have a higher GC content than their surrounding regions, except in extremely GC-rich regions. Moreover, repeat elements clearly shift with age towards a lower GC content, closer to that of the neighbourhood in which they reside. Much of the 'non-repeat' DNA in AT-rich regions probably consists of ancient repeats that are not detectable by current methods and that have had more time to approach the local equilibrium value. The repeats can also be used to study how the mutation process is affected by the immediately adjacent nucleotide. Such 'context effects' will be discussed elsewhere (A. Kas and A. F. A. Smit, unpublished results).
Fast living on chromosome Y.
The pattern of interspersed repeats can be used to shed light on the unusual evolutionary history of chromosome Y. Our analysis shows that the genetic material on chromosome Y is unusually young, probably owing to a high tolerance for gain of new material by insertion and loss of old material by deletion. Several lines of evidence support this picture. For example, LINE elements on chromosome Y are on average much younger than those on autosomes (not shown). Similarly, MaLR-family retroposons on chromosome Y are younger than those on autosomes, with the representation of subfamilies showing a strong inverse correlation with the age of the subfamily. Moreover, chromosome Y has a relative over-representation of the younger retroviral class II (ERVK) and a relative under-representation of the primarily older class III (ERVL) compared with other chromosomes. Overall, chromosome Y seems to maintain a youthful appearance by rapid turnover.
Interspersed repeats on chromosome Y can also be used to estimate the relative mutation rates, a_m and a_f, in the male and female germlines. Chromosome Y always resides in males, whereas chromosome X resides in females twice as often as in males. The substitution rates, m_Y and m_X, on these two chromosomes should thus be in the ratio m_Y:m_X = a_m:(a_m + 2a_f)/3, provided that one considers equivalent neutral sequences. Several authors have estimated the mutation rate in the male germline to be fivefold higher than in the female germline, by comparing the rates of evolution of X- and Y-linked genes in humans and primates. However, Page and colleagues 192 have challenged these estimates as too high. They studied a 39-kb region that is apparently devoid of genes and resides within a large segmental duplication from X to Y that occurred 3–4 Myr ago in the human lineage. On the basis of phylogenetic analysis of the sequence on human Y and human, chimp and gorilla X, they obtained a much lower estimate of m_Y:m_X = 1.36, corresponding to a_m:a_f = 1.7. They suggested that the other estimates may have been higher because they were based on much longer evolutionary periods or because the genes studied may have been under selection. Our database of human repeats provides a powerful resource for addressing this question. We identified the repeat elements from recent subfamilies (effectively, birth cohorts dating from the past 50 Myr) and measured the substitution rates for subfamily members on chromosomes X and Y (Fig. 29). There is a clear linear relationship with a slope of m_Y:m_X = 1.57, corresponding to a_m:a_f = 2.1. The estimate is in reasonable agreement with that of Page et al., although it is based on much more total sequence (360 kb on Y, 1.6 Mb on X) and a much longer time period. In particular, the discrepancy with earlier reports is not explained by recent changes in the human lineage.
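The conversion between the Y:X substitution ratio and the male:female mutation ratio can be made explicit. Writing alpha for the male:female ratio and r for the Y:X ratio, the relation in the text gives r = 3·alpha/(alpha + 2), and hence alpha = 2r/(3 − r). The short Python sketch below implements both directions; the function names are illustrative and not taken from the paper.

```python
def y_to_x_ratio(alpha: float) -> float:
    """Y:X substitution-rate ratio implied by a male:female mutation ratio alpha."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return 3.0 * alpha / (alpha + 2.0)


def male_to_female_ratio(y_to_x: float) -> float:
    """Invert the relation above: alpha = 2r / (3 - r)."""
    if not 0 < y_to_x < 3:
        raise ValueError("the Y:X ratio must lie strictly between 0 and 3")
    return 2.0 * y_to_x / (3.0 - y_to_x)


# Y:X ratios quoted in the text.
for r in (1.36, 1.57):
    print(f"Y:X = {r:.2f} -> male:female = {male_to_female_ratio(r):.2f}")

# A fivefold male excess, as in the earlier estimates, implies this Y:X ratio.
print(f"male:female = 5 -> Y:X = {y_to_x_ratio(5.0):.2f}")
```

With the rounded inputs, 1.36 gives roughly 1.7 as quoted, whereas 1.57 gives roughly 2.2 rather than the quoted 2.1; the small difference may simply reflect rounding of the reported slope.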
Various theories have been proposed for the higher mutation rate in the male germline, including the greater number of cell divisions in the formation of sperm than eggs and different repair mechanisms in sperm and eggs.
Active transposons.
We were interested in identifying the youngest retrotransposons in the draft genome sequence. This set should contain the currently active retrotransposons, as well as the insertion sites that are still polymorphic in the human population. The youngest branch in the phylogenetic tree of human LINE1 elements is called L1Hs (ref. 158); it differs in its 3′ untranslated region (UTR) by 12 diagnostic substitutions from the next oldest subfamily (L1PA2). Within the L1Hs family, there are two subsets referred to as Ta and pre-Ta, defined by a diagnostic trinucleotide 193,194. All active L1 elements are thought to belong to these two subsets, because they account for all 14 known cases of human disease arising from new L1 transposition (with 13 belonging to the Ta subset and one to the pre-Ta subset) 195,196. These subsets are also of great interest for population genetics because at least 50% are still segregating as polymorphisms in the human population 194,197; they provide powerful markers for tracing population history because they represent unique (non-recurrent and non-revertible) genetic events that can be used (along with similarly polymorphic Alus) for reconstructing human migrations. LINE1 elements that are retrotransposition-competent should consist of a full-length sequence and should have both ORFs intact. Eleven such elements from the Ta subset have been identified, including the likely progenitors of mutagenic insertions into the factor VIII and dystrophin genes 198–202. A cultured cell retrotransposition assay has revealed that eight of these elements remain retrotransposition-competent 200,202,203.
We searched the draft genome sequence and identified 535 LINEs belonging to the Ta subset and 415 belonging to the pre-Ta subset. These elements provide a large collection of tools for probing human population history. We also identified those consisting of full-length elements with intact ORFs, which are candidate active LINEs. We found 39 such elements belonging to the Ta subset and 22 belonging to the pre-Ta subset; this substantially increases the number in the first category and provides the first known examples in the second category. These elements can now be tested for retrotransposition competence in the cell culture assay. Preliminary analysis resulted in the identification of two of these elements as the likely progenitors of mutagenic insertions into the β-globin and RP2 genes (R. Badge and J. V. Moran, unpublished data). Similar analyses should allow the identification of the progenitors of most, if not all, other known mutagenic L1 insertions.
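The screening criterion described above (a full-length element with both ORFs intact) can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the procedure used for the analysis: an ORF is treated as intact if it starts with ATG, ends with a standard stop codon and contains no internal stop, and whether an element is full-length is taken as a given input. All names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard nuclear stop codons


def orf_is_intact(orf: str) -> bool:
    """Return True if the sequence reads as one uninterrupted ORF.

    Assumed definition: length is a multiple of three, the first codon
    is ATG, the last codon is a stop, and no stop occurs in between.
    """
    orf = orf.upper()
    if len(orf) < 6 or len(orf) % 3 != 0:
        return False
    codons = [orf[i:i + 3] for i in range(0, len(orf), 3)]
    if codons[0] != "ATG" or codons[-1] not in STOP_CODONS:
        return False
    return not any(codon in STOP_CODONS for codon in codons[1:-1])


def is_candidate_active(is_full_length: bool, orf1: str, orf2: str) -> bool:
    """Candidate active LINE: full-length with both ORFs intact."""
    return is_full_length and orf_is_intact(orf1) and orf_is_intact(orf2)


# Toy sequences, far shorter than real L1 ORFs.
print(is_candidate_active(True, "ATGAAACCCTAA", "ATGGGGTTTTGA"))
print(is_candidate_active(True, "ATGTAACCCTAA", "ATGGGGTTTTGA"))
```

Elements passing such a filter are only candidates; as the text notes, competence has to be tested in the cell culture assay.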
L1 elements can carry extra DNA if transcription extends through the native transcriptional termination site into flanking genomic DNA. This process, termed L1-mediated transduction, provides a means for the mobilization of DNA sequences around the genome and may be a mechanism for 'exon shuffling' 204. Twenty-one percent of the 71 full-length L1s analysed contained non-L1-derived sequences before the 3′ target-site duplication site, in cases in which the site was unambiguously recognizable. The length of the transduced sequence was 30–970 bp, supporting the suggestion that 0.5–1.0% of the human genome may have arisen by LINE-based transduction of 3′ flanking sequences 205,206. Our analysis also turned up two instances of 5′ transduction (145 bp and 215 bp). Although this possibility had been suggested on the basis of cell culture models 195,203, these are the first documented examples. Such events may arise from transcription initiating in a cellular promoter upstream of the L1 elements. L1 transcription is generally confined to the germline 207,208, but transcription from other promoters could explain a somatic L1 retrotransposition event that resulted in colon cancer 206.
Transposons as a creative force.
The primary force for the origin and expansion of most transposons has been selection for their ability to create progeny, and not a selective advantage for the host. However, these selfish pieces of DNA have been responsible for important innovations in many genomes, for example by contributing regulatory elements and even new genes. Twenty human genes have been recognized as probably derived from transposons 142,209. These include the RAG1 and RAG2 recombinases and the major centromere-binding protein CENPB. We scanned the draft genome sequence and identified another 27 cases, bringing the total to 47 (Table 13; refs 142, 209). All but four are derived from DNA transposons, which give rise to only a small proportion of the interspersed repeats in the genome.
Why there are so many DNA transposase-like genes, many of which still contain the critical residues for transposase activity, is a mystery. To illustrate this concept, we describe the discovery of one of the new examples. We searched the draft genome sequence to identify the autonomous DNA transposon responsible for the distribution of the non-autonomous MER85 element, one of the most recently (40–50 Myr ago) active DNA transposons. Most non-autonomous elements are internal deletion products of a DNA transposon. We identified one instance of a large (1,782 bp) ORF flanked by the 5′ and 3′ halves of a MER85 element. The ORF encodes a novel protein (partially published as pID 6453533) whose closest homologue is the transposase of the piggyBac DNA transposon, which is found in insects and has the same characteristic TTAA target-site duplications 210 as MER85. The ORF is actively transcribed in fetal brain and in cancer cells. That it has not been lost to mutation in 40–50 Myr of evolution (whereas the flanking, noncoding, MER85-like termini show the typical divergence level of such elements) and is actively transcribed provides strong evidence that it has been adopted by the human genome as a gene. Its function is unknown.
LINE1 activity clearly has also had fringe benefits. We mentioned above the possibility of exon reshuffling by cotranscription of neighbouring DNA. The LINE1 machinery can also cause reverse transcription of genic mRNAs, which typically results in nonfunctional processed pseudogenes but can, occasionally, give rise to functional processed genes. There are at least eight human and eight mouse genes for which evidence strongly supports such an origin 211 (see http://www-ifi.uni-muenster.de/exapted-retrogenes/tables.html). Many other intronless genes may have been created in the same way.
Transposons have made other creative contributions to the genome. A few hundred genes, for example, use transcriptional terminators donated by LTR retroposons (data not shown). Other genes employ regulatory elements derived from repeat elements 211.
Simple sequence repeats
Simple sequence repeats (SSRs) are a rather different type of repetitive structure that is common in the human genome: perfect or slightly imperfect tandem repeats of a particular k-mer. SSRs with a short repeat unit (n = 1–13 bases) are often termed microsatellites, whereas those with longer repeat units (n = 14–500 bases) are often termed minisatellites. With the exception of poly(A) tails from reverse transcribed messages, SSRs are thought to arise by slippage during DNA replication 212,213.
We compiled a catalogue of all SSRs over a given length in the human draft genome sequence, and studied their properties (Table 14). SSRs comprise about 3% of the human genome, with the greatest single contribution coming from dinucleotide repeats (0.5%). (The precise criteria for the number of repeat units and the extent of divergence allowed in an SSR affect the exact census, but not the qualitative conclusions.) There is approximately one SSR per 2 kb (the number of nonoverlapping tandem repeats is 437 per Mb). The catalogue confirms various properties of SSRs that have been inferred from sampling approaches (Table 15). The most frequent dinucleotide repeats are AC and AT (50 and 35% of dinucleotide repeats, respectively), whereas AG repeats (15%) are less frequent and GC repeats (0.1%) are greatly under-represented. The most frequent trinucleotides are AAT and AAC (33% and 21%, respectively), whereas ACC (4.0%), AGC (2.2%), ACT (1.4%) and ACG (0.1%) are relatively rare. Overall, trinucleotide SSRs are much less frequent than dinucleotide SSRs 214.
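To make the notion of an SSR census concrete, here is a minimal Python sketch that finds perfect tandem repeats with a regular expression and labels them using the unit-length ranges given above. It is not the method used to build the catalogue: the thresholds (`max_unit`, `min_copies`) are arbitrary assumptions, imperfect repeats are ignored, and the function names are hypothetical.

```python
import re


def find_perfect_ssrs(seq: str, max_unit: int = 6, min_copies: int = 4):
    """Return (start, unit, copies) for perfect tandem repeats in seq.

    The lazy quantifier makes the regex try the shortest repeat unit
    first at each position, so (CA)n is reported with unit 'CA' rather
    than 'CACA'. Matches do not overlap.
    """
    if min_copies < 2:
        raise ValueError("min_copies must be at least 2")
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits


def classify_ssr(unit_length: int) -> str:
    """Label a repeat by unit length, using the ranges quoted in the text."""
    if 1 <= unit_length <= 13:
        return "microsatellite"
    if 14 <= unit_length <= 500:
        return "minisatellite"
    return "unclassified"


# Toy sequence containing a (CA)6 and an (AAT)4 repeat.
toy = "TTG" + "CA" * 6 + "GTC" + "AAT" * 4 + "GC"
for start, unit, copies in find_perfect_ssrs(toy):
    print(start, unit, copies, classify_ssr(len(unit)))
```

As the text points out, the exact census depends on the criteria chosen for the minimum number of repeat units and the divergence allowed, so changing the assumed thresholds here changes the counts.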
SSRs have been extremely important in human genetic studies, because they show a high degree of length polymorphism in the human population owing to frequent slippage by DNA polymerase during replication. Genetic markers based on SSRs, particularly (CA)n repeats, have been the workhorse of most human disease-mapping studies 101,102. The availability of a comprehensive catalogue of SSRs is thus a boon for human genetic studies. The SSR catalogue also allowed us to resolve a mystery regarding mammalian genetic maps. Such genetic maps in rat, mouse and human have a deficit of polymorphic (CA)n repeats on chromosome X 30,101. There are two possible explanations for this deficit. There may simply be fewer (CA)n repeats on chromosome X; or (CA)n repeats may be as dense on chromosome X but less polymorphic [...]
[...] in donor and recipient regions of the genome are often not tandemly arranged, suggesting mechanisms other than unequal crossing-over for their origin. They are relatively recent, inasmuch as strong sequence identity is seen in both exons and introns (in contrast to regions that are considered to show evidence of ancient duplications, characterized by similarities only in coding regions). Indeed, many such duplications appear to have arisen in very recent evolutionary time, as judged by high sequence identity and by their absence in closely related species.
Segmental duplications can be divided into two categories. First, interchromosomal duplications are defined as segments that are duplicated among nonhomologous chromosomes. For example, a 9.5-kb genomic segment of the adrenoleukodystrophy locus from Xq28 has been duplicated to regions near the centromeres of chromosomes 2, 10, 16 and 22 (refs 218, 219). Anecdotal observations suggest that many interchromosomal duplications map near the centromeric and telomeric regions of human chromosomes 218–233.
The second category is intrachromosomal duplications, which occur within a particular chromosome or chromosomal arm. This category includes several duplicated segments, also known as low copy repeat sequences, that mediate recurrent chromosomal structural rearrangements associated with genetic disease 215,217. Examples on chromosome 17 include three copies of a roughly 200-kb repeat separated by around 5 Mb and two copies of a roughly 24-kb repeat separated by 1.5 Mb. The copies are so similar (99% identity) that paralogous recombination events can occur, giving rise to contiguous gene syndromes: Smith–Magenis syndrome and Charcot–Marie–Tooth syndrome 1A, respectively 34,234. Several other examples are known and are also suspected to be responsible for recurrent microdeletion syndromes (for example, Prader–Willi/Angelman, velocardiofacial/DiGeorge and Williams' syndromes 215,235–240).
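The two categories can be expressed as a simple rule on the chromosomes of the two copies. The Python sketch below is only an illustration of that rule: the record type and its field names are hypothetical, and the example entries use only the chromosomes and approximate sizes mentioned in the text.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class DuplicatedPair:
    """One pair of duplicated segments (hypothetical record layout)."""
    chrom_a: str
    chrom_b: str
    length_kb: float


def classify(pair: DuplicatedPair) -> str:
    """Intrachromosomal if both copies lie on the same chromosome."""
    if pair.chrom_a == pair.chrom_b:
        return "intrachromosomal"
    return "interchromosomal"


pairs = [
    DuplicatedPair("X", "2", 9.5),     # adrenoleukodystrophy segment, Xq28 to chr 2
    DuplicatedPair("X", "22", 9.5),    # same segment, copy near the chr 22 centromere
    DuplicatedPair("17", "17", 24.0),  # roughly 24-kb repeat on chromosome 17
]

print(Counter(classify(pair) for pair in pairs))
```

A real analysis would also need coordinates and sequence identity for each copy; those are left out here because the text does not give them for every example.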
Until now, the identification and characterization of segmental duplications have been based on anecdotal reports, for example finding that certain probes hybridize to multiple chromosomal sites or noticing duplicated sequence at certain recurrent chromosomal breakpoints. The availability of the entire genomic sequence will make it possible to explore the nature of segmental duplications more systematically. This analysis can begin with the current state of the draft genome sequence, although caution is required because some apparent duplications may arise from a failure to merge sequence contigs from overlapping clones. Alternatively, erroneous assembly of closely related sequences from nonoverlapping clones may underestimate the true frequency of such features, particularly among those segments with the highest sequence similarity. Accordingly, we adopted a conservative approach for estimating such duplication from the available draft genome sequence.
Pericentromeres and subtelomeres.
We began by re-evaluating the finished sequences of chromosomes 21 and 22. The initial papers on these chromosomes 93,94 noted some instances of interchromosomal duplication near each centromere. With the ability now to compare these chromosomes to the vast majority of the genome, it is apparent that the regions near the centromeres consist almost entirely of interchromosomal duplicated segments, with little or no unique sequence. Smaller regions of interchromosomal duplication are also observed near the telomeres. Chromosome 22 contains a region of 1.5 Mb adjacent to the centromere in which 90% of sequence can now be recognized to consist of interchromosomal duplication (Fig. 30). Conversely, 52% of the interchromosomal duplications on chromosome 22 were located in this region, which comprises only 5% of the chromosome. Also, the subtelomeric end consists of a 50-kb region consisting almost entirely of interchromosomal duplications. Chromosome 21 presents a similar landscape (Fig. 31). The first 1 Mb after the centromere is composed of interchromosomal repeats, as well as the largest (>200 kb) block of intrachromosomally duplicated material. Again, most interchromosomal duplications on the chromosome map to this region and the most subtelomeric region (30 kb) shows extensive duplication among nonhomologous chromosomes.