Welcome to the ScienceDuo blog by Chris Wallis and Rhiannon Morris. Screeds on science and sanity from two people who understand neither.
OK, so I’ve been meaning to get around to this for a long time. It’s a large and detailed subject and I didn’t want to skate over it superficially. There has been heated debate, both in the media and among experts, about the idea of junk DNA: whether it exists and what proportion of the human genome is made of it. A few weeks ago we posted a poll on Twitter to see what our science Twitter friends and the wider science Twitter community felt about the topic. The results showed that about 65% of people who answered think that the majority of the human genome is functional, that is, that there is very little or no junk. We also had an interesting and generally cordial discussion, and some big players even posted a few arguments.
I’m not totally sure why this topic elicits such strong emotions and heat. Maybe the idea that most of our DNA is junk is an affront to human exceptionalism, or maybe it’s just a useless, dated phrase that has been rendered moot by recent research in genomics and molecular biology, and thus should be expunged from the lexicon? Either way, the debate is still fascinating and ongoing, so I’d like to sketch out the arguments in favour of junk DNA and address some of the arguments based on ENCODE and other studies that claim to have written the eulogy for junk DNA. Most of this post is based on papers by Dan Graur, Alexander F. Palazzo and T. Ryan Gregory (linked at the end), which I encourage all to read.
What is junk DNA?
The phrase “junk DNA” is generally attributed to the great Susumu Ohno (1972), who pioneered work on molecular evolution. The term was used to describe a gene that had been duplicated and then destroyed or rendered inactive by mutation (the original can still function); these are also called pseudogenes. Since then, the term seems to have expanded to describe many other defective bits of DNA that once had a function but have since been degraded by mutation. These include viral DNA sequences and transposons, duplications, indels and long repetitive elements.
Advances in molecular biology and genomics, including the sequencing and annotation of the human genome, have revealed that a huge proportion of the human genome and of other vertebrate genomes is made up of these elements. Larry Moran at Sandwalk made a nice tentative breakdown of the human genome, which looks like this:
Junk in Your Genome
Transposable Elements: (44% junk)
Viruses (9% junk)
Pseudogenes (1.2% junk)
Ribosomal RNA genes:
Other RNA encoding genes
Protein-encoding genes: (9.6% junk)
Intron sequences account for about 30% of the genome. Most of these sequences qualify as junk but they are littered with defective transposable elements that are already included in the calculation of junk DNA.
α-satellite DNA (centromeres)
Intergenic DNA (not included above)
Total Essential/Functional (so far) = 8.7%
Total Junk (so far) = 65%
Unknown (probably mostly junk) = 26.3%
So now we have a vague idea of what we mean by junk DNA and roughly how much of it there is in the human genome. Note that functional non-coding DNA such as tRNA, rRNA and siRNA genes, as well as structural and regulatory non-coding DNA (promoters, enhancers, terminators, telomeres and centromeres), has been known to researchers since the 1950s and 1960s. That is something to keep in mind next time you read a press release saying that non-coding DNA was previously dismissed as junk.
Now I’ll give some arguments as to why we think that junk sequences really are junk and not just functional DNA we haven’t thought of yet.
Nothing in Biology Makes Sense Except in the Light of Evolution– Theodosius Dobzhansky
Nothing in Evolution Makes Sense Except in the Light of Population Genetics- Michael Lynch
As indicated above, like most things in biology, understanding the origin and structure of the human genome requires looking at evolutionary theory and population genetics. These describe the way genomes change with time and the forces that shape that change. One key insight into molecular evolution came from Kimura, Ohta, King and Jukes, who developed the neutral, or nearly neutral, theory of molecular evolution. Briefly, they showed both mathematically and empirically that alleles that are only slightly beneficial or slightly deleterious behave like neutral alleles: they arise at the background mutation rate and spread or disappear by genetic drift rather than by selection. This holds as long as the selection coefficient on these alleles is smaller than roughly the inverse of the effective population size. In other words, slightly harmful mutations are invisible to natural selection if the population is small enough, and they move through a population and accumulate according to random genetic drift.
The relevance of this to junk DNA is that most of the human genome (~90%) accumulates mutations in this way, and that the historic effective size of the human population is small, close to 10,000. This means much of the genome changes unnoticed by natural selection, including through viral insertions and other indels. These change the size of the human genome, usually by making it larger than it needs to be.
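The drift threshold described above can be sketched in a few lines. This is only an illustration: the function name is made up for this post, and it uses the simple 1/N rule as stated here (the exact constant differs between formulations of the theory).

```python
def is_effectively_neutral(selection_coefficient: float,
                           effective_population_size: int) -> bool:
    """True when |s| is below the 1/N_e drift threshold described in the post.

    Assumption: the plain 1/N_e rule of thumb; other formulations use a
    different constant in the denominator.
    """
    return abs(selection_coefficient) < 1.0 / effective_population_size


HUMAN_NE = 10_000  # historic effective population size quoted in the post

# A very slightly harmful mutation versus a more strongly harmful one.
for s in (-1e-5, -1e-3):
    verdict = "drift dominates" if is_effectively_neutral(s, HUMAN_NE) else "selection can act"
    print(f"s = {s}: {verdict}")
```

With an effective population size of 10,000 the threshold sits at 0.0001, so a mutation has to change fitness by more than about one part in ten thousand before selection can reliably see it.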
Other research on genome conservation has largely confirmed the predictions of neutral theory. Current estimates based on comparisons of many related mammalian genomes show that about 9% of the human genome is under some selective constraint, with 5% being highly conserved and another 4% being conserved in a lineage-dependent manner. The rest can be assaulted by random mutation with little effect.
Long before we were able to sequence genomes, it was well known to geneticists that there is a limit to the number of harmful mutations a species can sustain per generation before mutational meltdown drives it to extinction. This rather obvious notion is known as the genetic or mutational load. Experiments on Drosophila and other organisms have shown that if harmful mutations arise faster than selection can remove them, the genome deteriorates. The mutation rate thus puts an upper limit on the size of the functional genome: the more functional DNA you have, the more risk you have of suffering a harmful mutation. This limit was initially put at roughly 1 harmful mutation per generation. Estimates of the human mutation rate (total new mutations) are between 70 and 150 new mutations per child, which is really quite high. Using these numbers, researchers initially calculated that this would allow only about 1% of the human genome to have a sequence-specific function. However, newer and more sophisticated computational methods have shown that humans can in fact probably sustain between 2 and 10 deleterious mutations per generation. This brings the functional fraction of the human genome up to about 10%, which agrees very well with the estimates of function from genome conservation and neutral theory. New and still more powerful and detailed methods may push that number up in the future, but it would be truly astonishing, if not impossible, to push it up to ENCODE’s estimate of 80%.
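The back-of-the-envelope version of this argument can be sketched as below. It is a deliberately crude model, not the method used in the papers: I assume mutations land uniformly across the genome and that every mutation hitting functional DNA is harmful, and the function names and the round figure of 100 new mutations are mine.

```python
def max_functional_fraction(tolerable_harmful: float, new_mutations: float) -> float:
    """Upper bound on the functional fraction of the genome.

    Simplifying assumptions (mine, not the post's): mutations land uniformly
    across the genome, and every mutation that hits functional DNA is harmful.
    """
    return min(1.0, tolerable_harmful / new_mutations)


def implied_harmful_mutations(functional_fraction: float, new_mutations: float) -> float:
    """Harmful mutations per child implied by a claimed functional fraction."""
    return functional_fraction * new_mutations


NEW_MUTATIONS = 100  # illustrative round value inside the post's 70-150 range

for tolerable in (1, 2, 10):
    bound = max_functional_fraction(tolerable, NEW_MUTATIONS)
    print(f"{tolerable} tolerable harmful mutations -> at most {bound:.0%} functional")

# What an 80% functional genome would imply across the quoted mutation-rate range.
for total in (70, 150):
    harmful = implied_harmful_mutations(0.8, total)
    print(f"{total} new mutations -> about {harmful:.0f} harmful per child")
```

Under these assumptions, 1 tolerable harmful mutation gives the 1% figure and 10 gives the 10% figure, while an 80% functional genome would imply somewhere around 56 to 120 harmful mutations per child, far beyond the 2 to 10 that the newer estimates allow.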
Eukaryotic transcription is inherently noisy
Non-evolutionary evidence from molecular biology and biochemistry also gives us good reason to doubt that most of the human genome is functional. For example, it is well known that the majority of the genome is transcribed, which is often touted as evidence of function. However, this says nothing about the abundance of transcripts, which matters because abundance is known to correlate with sequence conservation, a good proxy for function. Well-conserved sequences tend to have much higher levels of corresponding transcript, whereas the majority of the genome is transcribed at very low levels, often less than one molecule per cell. For example, even by ENCODE’s own estimates, only 1,000 long non-coding RNAs (lncRNAs) out of a putative 5,400 to 53,000 are present at levels higher than one molecule per cell in the human cell lines tested. Palazzo (paper linked below) also notes that most of these transcripts are rapidly degraded and that many more are likely misannotated mRNAs. I should note that there are known exceptions to this: many enhancer RNAs and other scaffold-like RNAs perform their function at very low levels, but they seem to be in the minority.
Other databases, such as LNCipedia, have compiled a list of ~21,000 human lncRNAs with an average length of about 1 kb, which together represent <1% of the human genome, and only 166 of these have been experimentally verified to have a biological function. The FANTOM5 project also lists a further 43,000 so-called eRNAs (enhancer RNAs), which at an average of ~250 bp would comprise a further 0.34% of the human genome if all were functional. All very small numbers!
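For anyone who wants to check those percentages, here is the arithmetic as a small sketch. The counts and average lengths are the ones quoted above; the human genome size of about 3.2 billion base pairs is my assumption, since the post does not state which figure was used.

```python
HUMAN_GENOME_BP = 3.2e9  # assumed haploid genome size; not stated in the post


def genome_fraction(count: int, mean_length_bp: float,
                    genome_bp: float = HUMAN_GENOME_BP) -> float:
    """Fraction of the genome covered by `count` elements of a given mean length."""
    return count * mean_length_bp / genome_bp


lncrna_fraction = genome_fraction(21_000, 1_000)  # LNCipedia lncRNAs, ~1 kb each
erna_fraction = genome_fraction(43_000, 250)      # FANTOM5 eRNAs, ~250 bp each

print(f"lncRNAs: {lncrna_fraction:.2%} of the genome")
print(f"eRNAs:   {erna_fraction:.2%} of the genome")
print(f"Both:    {lncrna_fraction + erna_fraction:.2%} of the genome")
```

Even if every one of these transcripts turned out to be functional, together they would cover only about 1% of the genome under this assumed genome size.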
Finally, our understanding of the biochemistry of transcription shows that while RNA polymerase prefers to initiate transcription from designated start sites, it still has a low probability of initiating from any stretch of open DNA (not bound by histones). In fact, most nucleosome-free DNA is transcribed in vivo, and even random stretches of DNA are capable of initiating transcription by recruiting transcription factors.
The Onion Test
I think if there is one idea that intuitively makes the weird fact of junk DNA more palatable, it is the onion test. Firstly, it’s important to note that among eukaryotes there is an astonishingly huge range of genome sizes, with a 7,000-fold range among animals and even a 350-fold range within vertebrates. This massive difference also doesn’t scale well with any obvious notions of complexity or evolutionary rank (the C-value paradox). The classic example is that some amoebae have more than 10X more DNA than humans; other examples include pufferfish, whose genomes are 8X smaller than ours, and lungfish, whose genomes are 40X larger. Even closely related species with very similar biological properties and the same ploidy level can differ significantly in genome size. These facts about genome variation make us consider an important question: if most eukaryotic DNA is functional, why does an onion need 5X more of it than we do? Any explanation you find attractive, be it spacer DNA, 3D topography and tertiary structure, loops, long-range interactions, mutational buffering or anything else, needs to pass this simple test. The comparison holds across all eukaryotes, not just onions and humans. To quote Ryan Gregory:
“In summary, the notion that the majority of eukaryotic noncoding DNA is functional is very difficult to reconcile with the massive diversity in genome size observed among species, including among some closely related taxa. The onion test is merely a restatement of this issue, which has been well known to genome biologists for many decades”
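The arithmetic behind the onion test can be made explicit with a rough sketch. The relative genome sizes are the multipliers quoted above; the human genome size of 3.2 Gb and the idea of applying one fixed "functional fraction" to every species are my assumptions, used only to show what such a claim would imply.

```python
HUMAN_GENOME_GB = 3.2  # assumed human genome size in gigabases; not from the post

# Genome sizes relative to human, using the multipliers quoted in the post.
RELATIVE_GENOME_SIZE = {
    "pufferfish": 1 / 8,
    "human": 1.0,
    "onion": 5.0,
    "lungfish": 40.0,
}


def implied_functional_gb(relative_size: float, functional_fraction: float,
                          human_genome_gb: float = HUMAN_GENOME_GB) -> float:
    """Functional DNA (in Gb) a species would need if a fixed fraction were functional."""
    return relative_size * human_genome_gb * functional_fraction


# Hypothetical claim under test: 80% of every genome is functional.
for species, relative_size in RELATIVE_GENOME_SIZE.items():
    needed = implied_functional_gb(relative_size, 0.8)
    print(f"{species:>10}: {needed:6.2f} Gb of functional DNA")
```

Whatever fraction you plug in, the onion ends up needing five times as much functional DNA as a human and the pufferfish gets by on an eighth, which is exactly the puzzle any proposed function for the bulk of the genome has to explain.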
What about ENCODE?
In 2012 the ENCODE consortium published a number of papers, one of which, signed by all members, declared that their experiments had assigned a biochemical function to 80.4% of the human genome. This astonishing result has been trumpeted as the final blow to junk DNA by everyone from genome scientists in the ENCODE ranks and science and popular journalists to creationists and just about every other internet quack. I won’t spend too much time on it here, and I would direct the reader to Dan Graur’s paper for more detail.
Firstly, it’s important to discuss the word “function” in biology. ENCODE define a functional element as “a discrete genome segment that produces a protein or an RNA or displays a reproducible biochemical signature (e.g., protein binding.)” According to ENCODE, for a sequence to be considered functional it must be transcribed, be associated with modified histones, be located in an area of open chromatin, bind a transcription factor, or contain a CpG dinucleotide that can be methylated. In his great (and savage) critique of the ENCODE results, Dan Graur notes that “these properties of DNA do not describe a function; some describe a particular genomic location or a feature related to nucleotide composition.”
Graur also notes that in the biological sciences there are two ways to think about function. One is the “selected function”, which is the role a particular biological feature has been selected for and is maintained for by evolution. The second is the “causal function”, which basically states that the function of some feature is whatever it does or effects. For example, imagine two small sequences, ATATAAA and ATACAAA. The first binds a transcription factor (TF), and that binding results in transcription of a downstream gene. The second does not bind the TF, but then a mutation changes it to ATATAAA. Now the second sequence also binds the same TF, but this does not result in any change in gene expression. The first sequence is clearly functional: selection has maintained it over generations because it regulates transcription of some gene. The second has not been selected for anything; it just happens to bind a TF. Is the function of the second sequence to bind TFs just because it does? Under the causal definition, used by ENCODE, the answer, bizarrely, is yes. It’s like saying that the function of chewing gum is to stick to shoes. The point I’m trying to make is that genomic analysis divorced from its historical and evolutionary context is meaningless. Is the function of the human brain to fill the skull, or to use glucose? Obviously these are things the brain does, but I doubt many would argue that its selected function is to do those things, even if its causal function is.
Graur goes on to mention that detecting selection is not always easy (recently evolved genes may be under very weak positive selection, for example), but this is not a good reason to ignore selection altogether. So far, evidence of selection, while a conservative criterion, has been shown to be the best indicator of function.
So in conclusion, there are a number of positive arguments for most of the human genome being junk that are based not on ignorance but on decades of accumulated work in genetics, biochemistry, molecular biology and population genetics. These arguments, and other important evolutionary factors regarding population genetics, sequence conservation, selection and functionality, have not been addressed by ENCODE or its sycophants. Below are links to the three papers mentioned in the post. Please feel free to tweet Rhee or me, or leave a comment below.