What do we need to know in the Life Sciences?

Understanding how life works has been a goal of science since its inception. Many scientists pursue this out of intellectual curiosity – the desire simply to understand and know the natural world around us. Other scientists are driven by the application of knowledge to different areas – human health, agriculture, and the care of the environment.

Life is based ultimately on the chemistry of molecules, but shaped by evolution over billions of years into such staggering feats of organisation that the chemistry of these molecules can have these thoughts, transcribe them and share them with you via the chemistry of your molecules. It is quite remarkable. Necessity requires us to catalog these molecules and their chemistry, and to conceptualise them into the key parts of these complex systems we call organisms – and so we have genomes, RNAs, proteins, organelles, cells, tissues, organs, physiology and individual organisms. These organisms interact to gain energy and matter, to reproduce, and to live in ecosystems. Although life is “just” chemistry, the organisation of this chemistry is ultimately about the control of information over time – control such that one can reproduce similar organisms in the future and, in humans and a few other species, control of ideas which can be transmitted between individuals.

There is a rich vein of philosophy to mine here – What is life? What are the key features of life that distinguish it from other types of chemistry? (I know my colleague Alvis Brazma is writing an excellent book on this.) But I want to focus on a different thread – what do we need to know? How big is the body of knowledge we need in order to understand life?

Catalogs and Mechanisms

I will take a simplifying and undoubtedly only partially correct viewpoint. There are two broad types of knowledge we need in the life sciences – catalogs and mechanisms. We need catalogs of things – ultimately these are catalogs of atoms in specific (though sometimes hard to define) configurations, but we nearly always form higher-level, reasonably robust concepts that are many times larger in scale; the concept of a “cell” is one such thing to catalog. Cells – membrane-bound collections of biological molecules – are clearly a useful and robust object in much of life – immune cells, hepatocytes, fibroblasts and keratinocytes are useful, recognisable pieces of organisation and themselves building blocks for further catalogs. The concepts of life, though, are not as neat and tidy as the periodic table of elements – how does the early Drosophila embryo, with its syncytium and gradients, map to the concept of a cell? Is the membrane-bound Sendai virus “a cell” and, if it is not, how does it differ from the red blood cell? We should be comfortable that in biology there are no tidy edges to our useful concepts – biology is allowed to leverage any aspect of chemistry to its own end – and these frayed edges in the conceptualisation of biology are part of our science – to be celebrated, not hidden, and certainly not invalidating the core concepts.

But catalogs by themselves are nowhere near complete; the catalogs do not by themselves tell us how life works. For that we need mechanisms. Mechanisms are how things interact and potentially change. Ultimately this is via chemistry – rearrangements in the configuration of atoms which are thermodynamically favoured. But just as with the catalogs, the mechanisms we are talking about require higher-level conceptualisation just for us to manage the complexity of life. Consider the action of the ribosome, tRNAs, aminoacyl-tRNA synthetases and mRNAs. This amazing, elegant collection of molecules follows thermodynamic rules (mainly due to the release of free phosphate driving the thermodynamics), which means that the codons in the mRNA produce a specific protein at near 99.99% accuracy. It is remarkable. In theory it is one massive chemical reaction which one can describe as multi-step chemistry (indeed, some noble people have attempted to get to reasonably complete chemical descriptions). But we have to have a concept for this which abbreviates it – “translation” – and this concept is robust enough that we code the logical consequence of this chemical reaction (mRNA translated to the protein). This logical, conceptual mechanism is so well understood that it is instantiated in thousands of pieces of programming code around the world without an explicit link to the underlying chemistry. Just as with the catalog concept, mechanism has frayed ends – selenocysteines and frameshift read-through are two such key oddities in translation, and non-ribosomal synthesis is another – but these oddities do not somehow invalidate the core concept. And mechanism can be very large scale – the migration of cells during development, or the actions of cells and neurons to achieve homeostasis in circulation in a vertebrate, or the interplay between the commensal gut bacteria and the host cells in digestion, or the social behaviour of groups of individuals. All this I place in mechanism.
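
To make the point about “translation” as a purely logical mechanism concrete, here is a minimal sketch in Python of the kind of code alluded to above. It assumes the standard genetic code only and deliberately ignores the frayed ends (selenocysteine, frameshifts, alternative start codons); the names CODON_TABLE and translate are hypothetical, chosen for this illustration, and do not come from any particular library.

```python
# Standard genetic code, built from the conventional U, C, A, G ordering.
# "*" marks a stop codon. This is an illustrative sketch, not a full model.
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"  # codons starting with U
    "LLLLPPPPHHQQRRRR"  # codons starting with C
    "IIIMTTTTNNKKSSRR"  # codons starting with A
    "VVVVAAAADDEEGGGG"  # codons starting with G
)
CODON_TABLE = {
    a + b + c: aa
    for (a, b, c), aa in zip(
        ((x, y, z) for x in BASES for y in BASES for z in BASES),
        AMINO_ACIDS,
    )
}


def translate(mrna: str) -> str:
    """Translate an mRNA string into a one-letter protein sequence.

    Assumption: translation starts at the first AUG and stops at the
    first in-frame stop codon (or when fewer than three bases remain).
    """
    mrna = mrna.upper()
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, so no protein in this simple model
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("AUGGCCAUUGUAAUGGGCCGCUGA"))  # prints MAIVMGR
```

Nothing in this sketch refers to ribosomes, tRNAs or free energy; the chemistry has been compressed entirely into a lookup table, which is exactly the sense in which the concept is robust.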

Having produced this top level taxonomy of knowledge for biology, we can now list out our needed catalogs and needed mechanisms to have mastery of life. I do not claim this list is complete; I do claim this list is necessary.

(final editorial note; this sort of list is … hubris to attempt to write! In some of these fields I am a genuine expert; in some I am an onlooker with a professional interest; in some I am nothing more than an armchair amateur. I look forward to the inevitable comments which will help improve both the list and the phrasing)

Catalogs


Species. We need a catalog of all living species.

Genomes. We need to know at least one instance of the genome of every species on Earth.

Genome Products. We need to know, or have the ability to accurately predict, all the products of each genome, including RNA molecules and protein molecules. The catalog should contain all potential post-transcriptional (RNA) and post-translational (protein) modifications.

Genome Regulation. We need to know, or have the ability to accurately predict, all the points where other molecules (often protein or RNA) interact with the genome.

Protein and RNA structure. We need to know, or have the ability to accurately predict, the atomic configurations of all proteins and RNAs which have relatively stable configurations. For unstable configurations we need a useful description of the feasible configurations. These structures should include all assemblies and complexes.

Non-genome encoded molecules. We need to know all the chemicals present in the cell and their modes of production.

Subcellular structures. We need to have a catalog of subcellular structures and ways of understanding the distribution of all types of molecules between them.

Cells and tissues. For every species (ideally; more realistically, every species of high interest) we need to know every cell type and at least one feasible configuration of cells into tissues in a living organism (for C. elegans there is only one configuration, remarkably; for many other species one has to have at least one feasible configuration).

Organs and anatomy. For every species (ideally; more realistically, every species of high interest) we need to know how the tissues with their constituent cells form organs and anatomic structures to make an organism.

Neuronal anatomy. Neuronal and brain anatomy, with its axons, dendrites, spines and connections, is different enough to deserve its own set of concepts (those just listed) and its own catalog of these concepts and their interactions.

Idealised Ecosystems. For every ecosystem of interacting species we need to know the types of species, their numbers and idealised position in a manner which is useful to understand the ecosystem (for example, the presence of symbiosis or conflict, of predator/prey relationships, of location relative to each other).

Global ecosystem. For the entire planet we need a catalog of ecosystems and their locations, including human created ecosystems, with appropriate models of transitions.

Mechanisms


DNA to RNA. Not merely transcription (well described) but when and where transcription happens. For every RNA product, we need the mechanism that determines which conditions cause its production.

RNA to Protein. Translation. We need to have the mechanism for the production of proteins from each RNA product.

Protein and RNA to 3D structure. The classic “folding problem”. This is looking more tractable than it has done in a while.

Transformation of other molecules. All the transformations of the other molecules in the cell, and when they happen. Basically, metabolism.

Subcellular trafficking and structure. This is everything from organelle management to the 3D structure in the nucleus.

Cellular decision making. How and why do cells make decisions? Which molecules have to be present and in which configurations for different decisions?

Development. How does each cell come to its final destination and configuration from the fertilized zygote?

Tissue decision making. How and why do collections of cells make decisions?

Organ function, decision making and homeostasis. How does each organ operate? How is its function kept in an appropriate stable or responsive manner?

Neuronal behaviour. How do collections of neurons behave to result in decisions? This is a big topic, and I am tempted to split it into low-level circuits and larger emergent properties.

Individual behaviour. How do individuals behave in isolated interactions (from commensal bacteria to host interactions to conspecific interactions)?

Ecosystem behaviour. How do collections of individuals across many different species behave?

For both catalogs and mechanisms, ultimately we will not be able to describe these and just use our brains to remember them – we will need to publish them, share them and, above all, store them appropriately in databases. For the catalogs this is an obvious necessity – humans do not do well at this scale of enumeration and there is little point in trying to know all these things individually (though some of these things are more countable and memorable than one realises – dedicated curators will know a surprisingly large number of genes for example).

The concept of databasing mechanisms is in its infancy. Schemes such as BioModels have the ability to store some of these. Others are published and transmitted in an almost oral history. Some of these are held in specialised structures in model organism databases (for example, the development of C. elegans), but this, too, is in its infancy.

The sheer complexity of the above list, and its ultimate destination in databases (as well as publications to explain the concepts in the databases), shows the task we need to be prepared for over the coming centuries. I started to annotate each one with its level of completeness, but realised that this in itself was a complex task, and a task that often can be broken out into a matrix of catalog vs species, and mechanism vs species – just to enumerate this task we need a database! It shows also how key the life science databases are to this endeavour; they are the ultimate point of knowledge and how we will transmit information between researchers and over time – the narrative in papers will augment, educate and explain – but the data and knowledge will be stored, maintained and used from electronic, online, openly accessible databases.