You would probably not find a single answer among geneticists to this question, but different pieces of an answer. Most people would agree that a gene is a collection: an ensemble of genomic sequences made of DNA that have a coherent function and lie in a locus. A locus is a position in the genome that has been assigned and fixed on a map, like a GPS map. So, there is a map for a series of one to ten-hundred pieces of DNA that produce a functional item and that have some coherence in the production of this item, meaning that the regulation of the expression of the gene is coordinated across those sequences.
When geneticists say, ‘this gene does not function’ or ‘there is a mutation in this gene’, they usually refer to a strict and limited part of this broader definition; specifically, they refer to genes encoding proteins. For these genes, the coherent function is to encode a protein through the transcription of an RNA (which we call the messenger RNA) that will then be translated into a protein.
Genes encoding proteins represent 2% of our genome, which is both very little of our genome and a huge amount of DNA, because our genome is immense. The human genome contains 3.2 billion nucleotides, which are the elementary molecules that compose DNA. There are four nucleotides, A, T, G and C, which pair up in the double helix as A with T and G with C. So, the restricted definition of a gene is limited to genes encoding proteins. However, the product of a gene may not necessarily be a protein; it could simply be a non-coding RNA. The number of genes producing proteins and the number of genes producing only RNA are almost the same. We have about 21,000 genes encoding proteins and about 20,000 genes that produce RNA as a final product.
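To make the scale concrete, the short Python sketch below works out how many nucleotides 2% of the genome represents. It uses only the round figures quoted above; the variable names are illustrative, and the result is an order of magnitude rather than a measured value.

```python
# Round figures quoted in the text; real values vary by genome assembly.
GENOME_SIZE = 3_200_000_000   # nucleotides in the human genome
CODING_PERCENT = 2            # share of the genome that encodes proteins

# Integer arithmetic keeps the result exact.
coding_nucleotides = GENOME_SIZE * CODING_PERCENT // 100
non_coding_nucleotides = GENOME_SIZE - coding_nucleotides

print(f"Protein-coding DNA: {coding_nucleotides:,} nucleotides")
print(f"Everything else:    {non_coding_nucleotides:,} nucleotides")
```

So "only 2%" still means tens of millions of nucleotides in which a disease-causing change could sit.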
What do genes produce?
If we look at genes encoding proteins, those genes are in pieces or chapters, which we call exons. Exons are the pieces of DNA that are retained in the final messenger RNA, which is exported to the cytoplasm of the cell, where it is translated into a protein. This is the classical paradigm of molecular biology: a piece of DNA is transcribed into a matching RNA, which is exported from the nucleus of the cell to the cytoplasm and then translated into a protein that will be used. Proteins are extremely diverse: myoglobin, enzymes and structural proteins are just a few examples.
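The path from exons to protein described above can be sketched in a few lines of Python. This is a toy illustration, not a bioinformatics tool: the three-segment gene is invented for the example, the function names are hypothetical, and real transcription, splicing and translation involve far more machinery. The codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, listed in the conventional U, C, A, G codon order.
# "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def transcribe(dna):
    """Copy the coding strand of DNA into RNA (T becomes U)."""
    return dna.replace("T", "U")

def splice(segments):
    """Keep only the exons; introns are removed from the messenger RNA."""
    return "".join(seq for kind, seq in segments if kind == "exon")

def translate(mrna):
    """Read the messenger RNA three letters at a time until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

# Hypothetical toy gene: two exons separated by one intron.
toy_gene = [
    ("exon", "ATGGCC"),
    ("intron", "GTAAGTTTTCAG"),
    ("exon", "AAGTAA"),
]

pre_mrna = [(kind, transcribe(seq)) for kind, seq in toy_gene]
mrna = splice(pre_mrna)
protein = translate(mrna)
print(mrna)     # AUGGCCAAGUAA
print(protein)  # MAK
```

The intron never reaches the protein: only the two exons are joined, and the resulting messenger RNA is read codon by codon until the stop signal.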
This is the usual dogma of molecular biology: a piece of DNA is transcribed and translated, producing a protein. Now, if we go back in history, the gene is not defined by these production functions alone. A gene is also a unit of transmission. Our genes have this function of producing proteins and non-protein products such as RNAs, but also of transmitting our genomic inheritance to offspring. So, DNA has two functions: a horizontal one, production within the cell, and a vertical one, transmission from one generation to the next.
This brings to mind the closing remark of Watson and Crick’s seminal paper in 1953, where they wrote that it had not escaped their notice that the pairing they proposed immediately suggests a possible copying mechanism for the genetic material. In other words, the structure of DNA predicts its ability to be copied and transmitted to daughter cells, and from two parents to their offspring. This is why genes are so important as a unit of transmission, as a unit of function and production, and also, for medical geneticists, as a unit of mutation.
Reverse genetics
A protein encoded by a gene could be mutated or absent, thereby causing a disease-related phenotype. Because of this, it has been postulated that if you could find the mutation in the DNA, you could then predict what the defect would be, as far as the protein or proteins are concerned. A gene often has several protein products, not just one; on average, one gene has four protein products.
This role of the gene as a factory of proteins or RNAs has defined molecular genetics as the specialty that seeks to discover why somebody has a genetic condition or disease, using their DNA as a proxy to find the difference, or the genomic alteration, that would explain why this patient is suffering from a disease. This has been called reverse genetics, because it’s the other way around: you have the gene products (RNA and protein), and you use DNA in the first place to discover the function of that protein.
Variations in DNA
DNA is extremely variable from one individual to the other. On average, two individuals chosen at random anywhere in the world differ in 0.1% of their DNA. So, it seems that human beings are very alike in their DNA. But if we remember that DNA is composed of 3.2 billion nucleotides, then this difference of 0.1% represents a huge amount of DNA. This means that individuals are both very different and very similar.
This brings us to the idea that all DNA has accumulated variations. Most of these variations are neutral; they do not result in phenotypes or differences, as far as protein or RNA expression is concerned. A limited number of these differences have been used by evolution to build better-adapted human beings. We call them positive alterations, or positive mutations. When we speak of disease-causing alterations, we refer to the variations in human genomes that are deleterious, that could result in a phenotype or predispose you to a phenotype.
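The arithmetic behind this comparison is simple enough to check directly. The sketch below uses the two round figures from the text (3.2 billion nucleotides, a 0.1% difference); the function name is illustrative, and the result counts positions in a single copy of the genome.

```python
GENOME_SIZE = 3_200_000_000  # nucleotides, round figure from the text

def differing_nucleotides(genome_size, per_thousand=1):
    """Number of positions that differ, given a difference in parts per thousand.

    0.1% is one part per thousand; integer arithmetic keeps the result exact.
    """
    return genome_size * per_thousand // 1000

differences = differing_nucleotides(GENOME_SIZE)
print(f"{differences:,} nucleotides differ between two individuals")
```

A 0.1% difference therefore corresponds to roughly 3.2 million positions, which is why two people can be 99.9% identical and still carry millions of variants.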
Why do DNA mutations exist?
The answer is simply that DNA has to be copied when it passes from two parents to a child. We know this in detail now, having sequenced the whole genomes of two healthy parents and of their children. If you look at the difference between the parents and the offspring, there are not one or two mutations, but rather 70 to 80 new mutations for each healthy child, which means that all DNA has accumulated a huge number of variations.
Most likely, when our DNA is copied for our offspring, the repair system is not perfect: it’s a little loose and allows for some variation. From time to time, one of these variations might occur in an important location, where it will produce a disease-causing effect. So, it isn’t a deterministic mechanism introducing mutations into DNA, but rather the accumulation of variations that from time to time results in a phenotype.
Besides this general explanation of DNA variation, natural selection and the Darwinian hypothesis of evolution, there is another explanation: some regions of DNA may be more fragile than others. For example, you might find the same structural variations in a child affected with a disease in Japan and a child affected with the same disease in South America – from different parents, unrelated cases – because that particular region of DNA is prone to recombination, duplication or deletion. But most of the time, the genomic alteration resulting in diseases occurs randomly.
Watson and Crick’s seminal article in 1953 started a complete revolution in the way we see the genome, as both the factory of our cells and the mechanism of transmission of the human genome from two parents to offspring. It also started a revolution of progress. New methodological approaches have made it much easier to sequence genes, or exomes (meaning all the genes encoding proteins), and the genome.
To give you a sense of the incredible evolution of techniques around DNA sequencing, the Human Genome Project that gave rise to the human genome sequence in 2003 took around 13 years, research centres in six countries and close to 3 billion dollars to complete. Today, you can sequence the whole genome of an individual in a week, for 500 to 1,000 euros. Our ability to use DNA as a tool for diagnostics, but also for the treatment of genetic disorders, has changed completely because of these methodological advances.
The challenge of interpreting DNA data
A second turning point is the discovery of the unexpected and incredible variability of human DNA. If you consider humans as a species and not as individuals, variability is extremely important to adapt to and tolerate differences in the environment, such as infection, changes in climate and food, and metabolic changes. We function according to natural selection, not as clonal humans, but as very variable humans that can adapt.
Now, for the individual human, it’s a huge challenge. When you sequence somebody’s DNA looking for a mutation, you don’t find just that mutation. You find a huge amount of DNA variation. Finding out which mutation is causing the disease or the predisposition to the disease is an extremely difficult task, which has been created in part by the variability of human DNA in the face of technological progress.
This is where we are now in human genetics. Since 1953, the number of genomic variations for which we do not yet have the answer has kept growing. I have no doubt that progress will be fast. But still, it’s a strange time in human genetics. Getting DNA data is no longer the problem; the problem is interpreting the data. As far as medical genetics is concerned, the point is not to produce a sequence but to have the right prescription for a test and the right interpretation of a test.