Understanding the human genome to treat cancer

Cancer biology has really revolutionised the way we think about the cell as a complex mechanism because it was the first discipline in which a very large amount of data was collected from many different patients.
Andrea Califano

Founding Chair of the Department of Systems Biology

02 Sept 2025
Key Points
  • Understanding how the epigenetics of the cells were complemented by gene expression and mutation gave us a set of data that fostered the development of the field.
  • Biologists are using the term “big data” to indicate anything that doesn’t fit into a spreadsheet. It’s not big, it’s medium-sized, but manual inspection is no longer effective. We need very powerful computational algorithms to make sense of it.
  • One of the biggest problems in human biology today is understanding the concept of autoregulation – how cells are able to sense something and respond to it.
  • The models that we build today essentially try to model the entire complexity of the human proteome and the human genome.

 

Mapping patient data

Cancer biology has really revolutionised the way we think about the cell as a complex mechanism because it was the first discipline in which a very large amount of data was collected from many different patients. There is a landmark initiative called The Cancer Genome Atlas (TCGA), which has probably done more to help the discipline of systems biology develop and mature than any other initiative by the NIH. TCGA collected tumour tissue from more than 10,000 patients and then characterised those tissues using genomic-level information: the mutations that were present, and copy number information, meaning the pieces of the genome that were copied many times over or deleted in a particular cancer cell. It also characterised their epigenetics, which reflects modifications that sit on top of the DNA.

Biochemistry test. Photo by Roman Zaiets.

For instance, chromatin, the complex of DNA and proteins that forms a chromosome, can be tightly wrapped up in certain areas and open in others. And guess where the genes are being expressed? In the areas that are open. The genes in the closed areas cannot be expressed, and that’s how cells differentiate. Tumours, in a sense, recapitulate the whole of human development. So just understanding how the epigenetics of the cells were complemented by gene expression and mutation gave us a set of data that fostered the development of the field.

Developmental biologists are now starting to generate the same type of data from all sorts of different layers of lineage development in the human body or in the body of model animals. Many disciplines to do with neurobehavioural or neurodegenerative diseases are starting to generate the same wealth of information. But cancer was the first one. And thanks to those efforts, we were able to start developing the very first framework for the computational analysis of biological data.

Biology and “big data”

Biologists increasingly use the term “big data” when they refer to the data that they’re generating. It’s not big. It’s medium-sized. Big data is what is collected by the Large Hadron Collider, whose particle experiments collect more data in a week than biology has ever collected, or by exabyte-scale supercomputing efforts that collect information about potential extraterrestrial life. They are generating data at such a tremendous rate that you actually have to decide on the fly which data to throw away and which data to keep. By comparison, all the data that the first TCGA effort has generated to date amounts to around 2.5 to 3 petabytes. That’s a lot of data, but biologists are starting to use the term big data to indicate anything that doesn’t fit into an Excel spreadsheet. Even so, the data is now vast enough that manual inspection is no longer effective. We need to rely on very powerful computational algorithms to make sense of it.

Why did we change focus to move away from pure genetics?

There are a couple of moments that changed the focus from pure genetics to studying mechanisms of regulation that involve proteins, gene expression and gene regulation. The difference between those early studies and what is done today is that the early models were very simple and involved a handful of genes and proteins. The models that we build today essentially try to model the entire complexity of the human proteome and the human genome. It’s just a difference of scale. The principles remain unchanged. One of the biggest problems in human biology today is understanding this concept of autoregulation – how cells are able to sense something and, in response, change something else that is involved in processing that initial something.

So, if you have an excess of a nutrient, you want to make sure you start producing the enzymes that can metabolise that nutrient. If the nutrient stops, you want to make sure that your entire system reacts to reduce the production of those enzymes because they’re no longer needed. So, it is an incredibly optimised system and it’s in this optimisation that we can find the foundational rules that are helping us to understand biology at another level. We call it a cancer paradox, but it’s in pretty much every human disease and, in fact, in every human phenotype.
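The nutrient example above can be sketched as a toy simulation. This is only an illustration under strong assumptions, not a model from the lab: the function name `simulate_enzyme`, the production and decay rates, and the discrete time steps are all hypothetical. Enzyme is produced in proportion to the nutrient present and degraded at a constant fractional rate, so its level rises while the nutrient is available and falls away once it is gone.

```python
def simulate_enzyme(nutrient_levels, production_rate=0.5, decay_rate=0.2):
    """Track a hypothetical enzyme level over discrete time steps.

    Each step, enzyme is produced in proportion to the nutrient present
    and a fixed fraction of the existing enzyme is degraded.
    All parameter values are illustrative assumptions.
    """
    enzyme = 0.0
    history = []
    for nutrient in nutrient_levels:
        enzyme += production_rate * nutrient - decay_rate * enzyme
        history.append(enzyme)
    return history


# Nutrient is present for 50 steps, then removed for 50 steps.
nutrient_levels = [1.0] * 50 + [0.0] * 50
history = simulate_enzyme(nutrient_levels)
print(f"enzyme while nutrient is present: {history[49]:.2f}")
print(f"enzyme after nutrient is removed: {history[99]:.2f}")
```

In this sketch the enzyme level settles near `production_rate / decay_rate` while the nutrient is present and decays towards zero afterwards. Real regulatory circuits involve many more components and feedback loops than this single equation.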

How signals come together

For a long time, people studying height in populations have looked for the gene that gives you differential height. But such a gene doesn’t exist. If it did exist, there would be tremendous potential for disaster because it would no longer work in the context of a very highly regulated framework. Instead, there are up to a thousand genes whose variants in the human genome each give a millimetre advantage. When you put them all together, you can have a huge variation in the height of the individual.
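The height argument can be illustrated with a small additive simulation. It is a sketch under stated assumptions rather than a real genetic model: the function `simulate_heights`, the 1 mm effect per allele copy, the allele frequency of 0.5, the 170 cm average and the independence of variants are all hypothetical choices made for illustration.

```python
import random
import statistics


def simulate_heights(n_people=300, n_variants=1000, effect_mm=1.0,
                     allele_freq=0.5, mean_cm=170.0, seed=0):
    """Simulate heights (cm) under a purely additive, hypothetical model.

    Every person carries 0, 1 or 2 copies of a height-increasing allele at
    each variant, and each copy adds `effect_mm` millimetres.
    """
    rng = random.Random(seed)
    expected_copies = 2 * n_variants * allele_freq
    heights = []
    for _ in range(n_people):
        copies = sum(
            int(rng.random() < allele_freq) + int(rng.random() < allele_freq)
            for _ in range(n_variants)
        )
        # Centre on the population mean; convert millimetres to centimetres.
        heights.append(mean_cm + (copies - expected_copies) * effect_mm / 10.0)
    return heights


heights = simulate_heights()
print(f"mean height: {statistics.mean(heights):.1f} cm")
print(f"shortest to tallest: {min(heights):.1f} to {max(heights):.1f} cm")
```

No single variant in this sketch can shift height by more than 2 mm, yet the simulated population spans many centimetres, because the small contributions add up.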

So, this idea of genetics contributing to a phenotype is very important, because you need to explain how it is possible that the effect of the thousands of variants we observe in the human genome in complex human disease is mediated by hundreds or even thousands of genes. We need to understand how the signals coming from all these genes come together, because what you’re observing in the end is a single change, such as a change in height. So, somewhere, the contributions of all these genes have to converge. That has been the foundational question that my lab has been asking for a long time.

We’ve done it in the context of cancer, where within a single cancer you can have a billion different configurations of mutational patterns giving exactly the same phenotype – the same transcriptional state of the cell. At some point, they have to come together. Understanding how the signals are integrated is the foundational gestalt of what we do in my lab.

DNA: the space of what could be and not what is

DNA is like a big dictionary where you store all the English words. The question is how to use the English words to put a sentence together. There are rules of syntax. You may also find them to some extent in the DNA: certain proteins interact because they have compatible structures, and to a large extent you can study that at the level of DNA. But then there’s the creativity of the writer who puts the words together. You can say “I am”. Those two words go together very frequently, but then you can use “I am” in the context of a very complicated sentence, and that sentence is not written in the DNA.

False colour TEM micrograph of a B-lymphocyte at the onset of its activation. RER cisterns (red) begin to appear in a cytoplasm (brown) full of ribosomes. Chromatin (blue), nucleolus (green). Photo by Jose Luis Calvo.

So, the idea is that it’s really the state of the chromatin, whether it’s open or closed. It’s the state of all the signals that the cell receives from the outside environment. And it’s the state of all the different variations and effects that occur within the cell that in the end determine what the cell is doing. When the cell responds very quickly to a change in nutrients, in the end it will have to change the ratio of expression of the genes that encode the proteins that can deal with those nutrients. You cannot see that effect written inside the DNA. You have to go and look at how the things that are encoded in the DNA are then turned into proteins and how those proteins interact.

The reason we say that the DNA is the space of what could be is because all of the cells in our body have exactly the same DNA and yet they do very different things. The thing that has been most closely linked to the actual state of cells, in terms of this functional state – what the cell is actually doing – is its expression. The repertoire of genes that are expressed or not expressed is called the transcriptional state of the cell.

Modelling cells as an information exchange

Transcription factors are a special class of proteins. There’s only a relatively small number of them – about 1,500 – and there’s another class of protein called cofactors that help the transcription factor to do its job. But this is a very restricted class of proteins whose job is to bind the DNA, literally attach themselves to the DNA, and regulate whether the information that is stored in the DNA is now turned into a messenger RNA and then into a protein. So, these are, if you want, the intermediates.

Think of it as an army. You would have the generals and then you would have the colonels, and these are the sergeant-level proteins that tell the DNA which proteins to go make and which proteins not to go make. So, they’re very important. One of the things my lab has discovered is that in a space where everything seems to change, for instance in the case of cancer patients that have a diverse range of mutations, the actual transcriptional state of the cells can be very similar. Why don’t they give us very different transcriptional states?

Treat the block, not the individual protein

These transcription factors integrate the information that comes from the mutations and turn that into very precise programmes that are either turned on or off. They don’t work in isolation. They do a lot of different things. One transcription factor can regulate up to 16,000 genes – there are only 21,000 genes in the cell. The transcription factor becomes specific because, in many cases, it works as a module with other transcription factors, and these modules are very tightly auto-regulated. They are like the guts of an air conditioner, where all the information is processed. All of these proteins regulate one another such that they act as a single unit. They don’t act independently; they act as a single block, which is essentially turned either on or off. So, you don’t have to work on every single transcription factor in that block to turn it off – sometimes it’s enough to get one or two of them for the entire module to switch off.

This is very important because otherwise biology would become intractable. It would become so complicated that we would no longer be able to figure out how things work. But now that we understand these modules of transcription factors, we can focus on understanding the programmes they regulate. And then we can go upstream of these modules to figure out the rest of the biology that induced their aberrant activation or their normal physiological activation.

The user manual of the cell

DNA sequence by Gio.tto.

The way in which biologists have long tried to understand cellular behaviour has been to take the inventory of all the genes that comprise our genome and then ask whether, in a particular set of patients affected by a disease, those genes had more mutations than in another group.

Think about having a lot of broken watches and a lot that function. I’m talking about old-fashioned watches that have pieces working mechanically inside them. It would be like opening up all these watches, taking them apart, throwing all the gears, pulleys and components into a big bin, and then rummaging through that bin and asking: is this gear broken? Do I find a particular type of gear broken more frequently in the watches that don’t work than in those that do? You can then restrict yourself and say, I want to look only at the watches that are running slow and losing time, or only at the watches that don’t work at all. But you still don’t know how the gears work together. You need thousands of broken watches to statistically identify the differences between the gears in the watches that don’t work and the gears in the watches that do.
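The gear-counting approach in the watch analogy amounts to a statistical association test. The sketch below uses a one-sided Fisher’s exact test as one plausible way to ask whether a faulty gear turns up more often in broken watches; the choice of test, the function name `fisher_one_sided` and all the counts are assumptions made for illustration, not the method described in the text.

```python
from math import comb


def fisher_one_sided(broken_with, broken_without, working_with, working_without):
    """One-sided Fisher's exact test on a 2x2 table of watch counts.

    Returns the probability of seeing at least `broken_with` faulty gears
    among the broken watches if the gear had nothing to do with the failure.
    """
    n_broken = broken_with + broken_without
    n_with = broken_with + working_with
    total = n_broken + working_with + working_without
    # Sum the hypergeometric probabilities of tables at least as extreme.
    favourable = sum(
        comb(n_with, k) * comb(total - n_with, n_broken - k)
        for k in range(broken_with, min(n_broken, n_with) + 1)
    )
    return favourable / comb(total, n_broken)


# Hypothetical counts: the same trend in a handful of watches and in thousands.
few = fisher_one_sided(3, 1, 1, 3)
many = fisher_one_sided(30, 970, 10, 990)
print(f"8 watches:    p = {few:.3f}")
print(f"2000 watches: p = {many:.5f}")
```

With only eight watches the difference could easily be chance, while the same kind of difference across two thousand watches is much harder to explain away. Even then, the test says nothing about how the gears work together.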

What can the watchmaker teach us?

The watchmaker knows the blueprint and can follow the trail of all the things that make the watch work to figure out exactly what piece is broken and how it needs to be repaired or replaced. This is only possible if you have a user manual of the watch, where you can actually trace the function of the watch to the individual pieces that work together to make it happen. This didn’t exist when we started in this field.

Now, we have built assembly manuals for literally hundreds of different cell types, and we now do it increasingly on the single-cell level so we can even build these assembly manuals for the cells of an individual patient. This is helping us to understand genetics on an entirely new level, because instead of having to look for things that could occur, we can actually look for things in a single patient that are broken in the specific set of mechanisms that are responsible for the biology of that particular tumour or that particular diabetes or that particular neurodegenerative disease.

Discover more about

cancer cell modelling

Obradovic, A., Vlahos, L., Laise, P., et al. (2021). PISCES: A pipeline for the Systematic, Protein Activity-based Analysis of Single Cell RNA Sequencing Data. Preprint from , 22 May 2021.

Forrest, A., Kawaji, H., et al. (2014). The FANTOM Consortium and the RIKEN PMI and CLST (DGT). A promoter-level mammalian expression atlas. Nature, 507, 462–470.

Carro, M., Lim, W., Alvarez, M., et al. (2010). The transcriptional network for mesenchymal transformation of brain tumours. Nature, 463, 318–325.
