12 Genes of Christmas

This year I will speak about 12 of my favourite human genes. From tomorrow to Christmas Eve I will publish one video a day, explaining why these genes are important. Stay tuned!

Below you can find the full list of the 12 genes of Christmas for 2019.

What do we need to know in the Life Sciences?

Understanding how life works has been a goal of science since its inception. Many scientists do this for intellectual curiosity – the desire to simply understand and know the natural world around us. Other scientists are driven by the application of knowledge to different areas – applications of human health, agriculture, and the care of the environment.

Life is based ultimately on the chemistry of molecules, but shaped by evolution over billions of years into such staggering feats of organisation that the chemistry of these molecules can have these thoughts, transcribe them and share them with you via the chemistry of your molecules. It is quite remarkable. Necessity requires us to catalog these molecules and their chemistry, and to conceptualise them into the key parts of these complex systems we call organisms – and so we have genomes, RNAs, proteins, organelles, cells, tissues, organs, physiology and individual organisms. These organisms interact to gain energy and matter, to reproduce, and to live in ecosystems. Although life is “just” chemistry, the organisation of this chemistry is ultimately about the control of information over time – control such that one can reproduce similar organisms in the future, and furthermore, in humans and other selected species, control of ideas which we can transmit between individuals.

There is a rich vein of philosophy to mine here – what is life? What are the key features of life that distinguish it from other types of chemistry? (I know my colleague Alvis Brazma is writing an excellent book on this.) But I want to focus on a different thread – what do we need to know? How big is the knowledge we need to have to understand life?

Catalogs and Mechanism

I will take a simplifying and undoubtedly only partially correct viewpoint. There are two broad types of knowledge we need in the life sciences – catalogs and mechanisms. We need catalogs of things – ultimately these are catalogs of atoms in specific (though sometimes hard to define) configurations, but we nearly always form higher level, reasonably robust concepts that are many times larger in scale; the concept of a “cell” is one such thing to catalog. Cells – membrane bound collections of biological molecules – are clearly a useful and robust object in much of life – immune cells, hepatocytes, fibroblasts and keratinocytes are useful, recognisable pieces of organisation and themselves building blocks for further catalogs. The concepts of life, though, are not as neat and tidy as the periodic table of elements – how does the early Drosophila embryo with its syncytium and gradients map to the concept of a cell? Is the membrane bound Sendai virus “a cell” and, if it is not, what is the difference from the red blood cell? We should be comfortable that in biology there are not tidy edges to our useful concepts – biology is allowed to leverage any aspect of chemistry to its own end – and these frayed edges in the conceptualisation of biology are part of our science – to be celebrated, not hidden, and certainly not invalidating the core concepts.

But catalogs by themselves are nowhere near complete; the catalogs do not by themselves tell us how life works. For that we need mechanisms. Mechanisms are how things interact and potentially change. Ultimately this is via chemistry – rearrangements in the configuration of atoms which are thermodynamically favoured. But just as with the catalogs, the mechanisms we are talking about require higher level conceptualisation just for us to manage the complexity of life. Consider the action of the ribosome, tRNAs, tRNA transferases and mRNAs. This amazing, elegant collection of molecules follows thermodynamic rules (mainly due to the release of free phosphate driving the thermodynamics), which means that the codons in the mRNA produce a specific protein at near 99.99% accuracy. It is remarkable. In theory it is one massive chemical reaction which one can describe as multi-step chemistry (indeed, some noble people have attempted to get to reasonably complete chemical descriptions). But we have to have a concept for this which abbreviates it – “translation” – and this concept is robust enough that we code the logical consequence of this chemical reaction (mRNA translated to the protein). This logical, conceptual mechanism is so well understood that it is instantiated in thousands of pieces of programming code around the world without an explicit link to the underlying chemistry. Just as with the catalog concept, mechanism has frayed ends – selenocysteines and frame-shift read through are two such key oddities in translation, and non-ribosome based synthesis is another – but these oddities do not somehow invalidate the core concept. And mechanism can be very large scale – the migration of cells during development, or the actions of cells and neurons to achieve homeostasis in circulation in a vertebrate, or the interplay between the commensal gut bacteria and the host cells in digestion, or the social behaviour of groups of individuals. All this I place in mechanism.
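As an illustration of how compactly the logical rule of translation can be written down, here is a minimal Python sketch. It is not any particular library's implementation: it assumes the standard genetic code, starts at the first AUG, stops at the first in-frame stop codon, and ignores the frayed ends of the concept (selenocysteine, frame shifts, non-ribosome based synthesis). The names `CODON_TABLE` and `translate` are illustrative.

```python
from itertools import product

# Standard genetic code, written in the conventional U, C, A, G codon order.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): residue
    for codon, residue in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Translate an mRNA string into one-letter amino acid codes.

    Reading starts at the first AUG and stops at the first in-frame stop
    codon (marked '*' in the table) or when the sequence runs out.
    A KeyError is raised for characters other than A, C, G, U (or T).
    """
    mrna = mrna.upper().replace("T", "U")
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, so nothing is translated
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[i:i + 3]]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("AUGGCCUAA"))  # prints MA
```

Nothing in this code refers to ribosomes, phosphate or thermodynamics; it encodes only the logical consequence of the chemistry.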

Having produced this top level taxonomy of knowledge for biology, we can now list out our needed catalogs and needed mechanisms to have mastery of life. I do not claim this list is complete; I do claim this list is necessary.

(final editorial note: this sort of list is … hubris to attempt to write! In some of these fields I am a genuine expert; in some I am an onlooker with a professional interest; in some I am nothing more than an armchair amateur. I look forward to the inevitable comments which will help improve both the list and the phrasing.)


Catalogs

Species. We need a catalog of all living species.

Genomes. We need to know at least one instance of the genome of every species on earth.

Genome Products. We need to know, or have the ability to accurately predict, all the products of each genome, including RNA molecules and protein molecules. The catalog should contain all potential post-transcriptional (RNA) and post-translational (protein) modifications.

Genome Regulation. We need to know, or have the ability to accurately predict, all the points where other molecules (often protein or RNA) interact with the genome.

Protein and RNA structure. We need to know, or have the ability to accurately predict, the atomic configurations of all proteins and RNAs which have relatively stable configurations. For unstable configurations we need a useful description of the feasible configurations. These structures should include all assemblies and complexes.

Non-genome encoded molecules. We need to know all the chemicals present in the cell and their modes of production.

Sub cellular structures. We need to have a catalog of sub cellular structures and ways of understanding the distribution of all types of molecules between them.

Cells and tissues. For every species (ideally; more realistically, every species of high interest) we need to know every cell type and at least one feasible configuration of cells into tissues in a living organism (for C. elegans there is only one configuration, remarkably; for many other species one has to have at least one feasible configuration).

Organs and anatomy. For every species (ideally; more realistically, every species of high interest) we need to know how the tissues with their constituent cells form organs and anatomic structures to make an organism.

Neuronal anatomy. Neuronal and brain anatomy is different enough, with its axons, dendrites, spines and connections, to deserve its own set of concepts (listed above) and its own catalog of the set and interaction of these concepts.

Idealised Ecosystems. For every ecosystem of interacting species we need to know the types of species, their numbers and idealised position in a manner which is useful to understand the ecosystem (for example, the presence of symbiosis or conflict, of prey/predation, of location relative to each other).

Global ecosystem. For the entire planet we need a catalog of ecosystems and their locations, including human created ecosystems, with appropriate models of transitions.


Mechanisms

DNA to RNA. Not merely transcription (well described) but when and where transcription happens. For every RNA product, we need the mechanism describing which conditions cause its production.

RNA to Protein. Translation. We need to have the mechanism for the production of proteins from each RNA product.

Protein and RNA to 3D structure. The classic “folding problem”. This is looking more tractable than it has done in a while.

Transformation of other molecules. All the transformations of the other molecules in the cell, and when they happen. Basically, metabolism.

Sub cellular trafficking and structure. This is everything from organelle management to the 3D structure in the nucleus.

Cellular decision making. How and why do cells make decisions? Which molecules have to be present and in which configurations for different decisions?

Development. How does each cell come to its final destination and configuration from the fertilized zygote?

Tissue decision making. How and why do collections of cells make decisions?

Organ function, decision making and homeostasis. How does each organ operate? How is its function kept in an appropriate stable or responsive manner?

Neuronal behaviour. How do collections of neurons behave to result in decisions? This is a big topic, and I am tempted to split it into low level circuits and larger emergent properties.

Individual behaviour. How do individuals behave in isolated interactions (from commensal bacteria to host interactions to conspecific interactions)?

Ecosystem behaviour. How do collections of individuals across many different species behave?

For both catalogs and mechanisms, ultimately we will not be able to describe these and just use our brains to remember them – we will need to publish them, share them and, above all, store them appropriately in databases. For the catalogs this is an obvious necessity – humans do not do well at this scale of enumeration and there is little point in trying to know all these things individually (though some of these things are more countable and memorable than one realises – dedicated curators will know a surprisingly large number of genes for example).

The concept of databasing mechanisms is in its infancy. Schemes such as BioModels have the ability to store some of these. Others are published and transmitted in an almost oral history. Some of these are held in specialised structures in model organism databases (for example, the development of C. elegans), but this practice is at an even earlier stage.

The sheer complexity of the above list, and its ultimate destination in databases (as well as publications to explain the concepts in the databases), shows the task we need to be prepared for over the coming centuries. I started to annotate each one about the level of completeness, but realised that in itself was a complex task, and a task that often can be broken out into a matrix of catalog vs species, and mechanism vs species – just to enumerate this task we need a database! It shows also how key the life science databases are to this endeavour; they are the ultimate point of knowledge and how we will transmit information between researchers and over time – the narrative in papers will augment, educate and explain – but the data and knowledge will be stored, maintained and used from electronic, online, openly accessible databases.
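To give a feel for what a matrix of catalog or mechanism vs species might look like as a data structure, here is a small Python sketch. Everything in it is an assumption made for illustration: the item and species lists are tiny subsets, the status vocabulary is invented, and every cell starts as "unknown" because no real assessment of completeness is implied.

```python
from itertools import product

# Illustrative subsets only; a real matrix would cover every catalog,
# mechanism and species of interest.
ITEMS = ["Genomes", "Genome Products", "Cells and tissues", "Development"]
SPECIES = ["Homo sapiens", "Caenorhabditis elegans", "Drosophila melanogaster"]
STATUSES = ("unknown", "partial", "complete")  # hypothetical vocabulary


def empty_matrix(items, species):
    """Create one cell per (item, species) pair, all initially 'unknown'."""
    return {(item, sp): "unknown" for item, sp in product(items, species)}


def set_status(matrix, item, species, status):
    """Record the completeness of one cell, rejecting unknown keys or statuses."""
    if (item, species) not in matrix:
        raise KeyError(f"no cell for {item!r} in {species!r}")
    if status not in STATUSES:
        raise ValueError(f"status must be one of {STATUSES}")
    matrix[(item, species)] = status


def summarise(matrix):
    """Count how many cells sit at each status."""
    counts = {status: 0 for status in STATUSES}
    for status in matrix.values():
        counts[status] += 1
    return counts


if __name__ == "__main__":
    matrix = empty_matrix(ITEMS, SPECIES)
    print(summarise(matrix))
```

Even this toy version has a cell for every pair, which is why the full enumeration quickly becomes a database problem in its own right.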

Why embryo selection for polygenic traits is wrong.

This week (May 20th 2019) has seen yet another splash by an American company offering a polygenic trait score on embryos, including for intelligence. This is wrong on a number of levels: ethically it is wrong to make this decision as an independent laboratory without broad societal buy-in; scientifically it is wrong to imagine that the ways we assess polygenic traits will translate into safe and effective embryo selection; and for the specific case of the IQ/educational attainment trait, the trait is so complex that selection is additionally unwise, over and above these other concerns.

I would not recommend it either as a member of society or as a genomic scientist. This blog aims to unpack this more.


First off it is important to realise that as science progresses in biology – and in particular reproductive biology – we develop the possibility of performing actions that as a society we consider wrong. There is nothing new in this; for example, ultrasound scanning allows one to reliably sex a foetus early on in pregnancy; however, parental choice of the sex of the child is either explicitly illegal or implicitly prohibited in most locations. As we learn more about genetics, we will be able to make more sophisticated choices of what we could do, but it is important that we make the decision about what we should do as responsible members of society.

This decision has to be made using processes set up inside each society; in practice this means under national legislation. I am both most familiar with and very comfortable with the UK’s Human Fertilisation and Embryology Authority (HFEA) scheme. This is a statutory body set up by the UK Parliament, with a variety of lay and religious members, as well as ethicists and scientists. The UK Parliament has made some possible schemes illegal (for example, reproductive cloning) but otherwise provides considerable latitude for the HFEA to make decisions. It is important that this body has a majority of non-scientists, and when the HFEA licenses a procedure the UK can be very confident it is medically safe, scientifically sound and ethically has broad support.

Each country has to arrange its own affairs, but I think there are some principles of best practice. One is that the scientists and clinicians are not self-regulating here – it needs societal buy-in. The second is that it is near impossible to handle this via national laws alone – laws are complex to change and near impossible to write with foresight for future science.

Science of polygenic traits

I am a longstanding genomic scientist, and have broad interest across many topics in genomics and genetics. Despite my cautious enthusiasm for using the genetics of polygenic traits in other medical spheres, in particular to potentially augment our understanding of risk of common diseases in adults, I do not think it is appropriate for embryo selection or assessment, certainly not without more research and potentially not for a long time. The main reason is that we have a high potential to cause harm, and only a small potential to mitigate bad outcomes. Stepping back – polygenic traits are traits where multiple places in the genome contribute to a trait (poly meaning many, and genic meaning genes). This has been well established genetic theory and practice since the 1930s (pre-dating the discovery of the structure of DNA). However, not only are these methods inexact but we simply do not know what other features are linked to the traits – the most sophisticated models deliberately do not attempt to localise the precise genomic locations, in order to gain more predictive power. In a situation where one is doing something quite novel (selecting embryos from in vitro fertilized embryos) one simply doesn’t know what would happen. For example, it might be that for some traits there are strong developmental aspects which mean the polygenic score we select on also contributes to developmental defects or to other features in an adult we did not anticipate. There is a big difference between scoring an adult who is alive and well, and selecting embryos for implantation. You might think I am being paranoid, but the history of animal breeding has shown many unforeseen consequences of mating strategies; for example, selection of fast growing chickens led at first to socially inept chickens who bullied / fought with each other when grown in flocks. This was recognised and eventually a multi-variate breeding scheme was put in place, but it could only be recognised by actually trying it.
Selection of embryos on polygenic scores would be an experiment, and one in which we would have true unknowns; some of those unknowns having the potential to cause serious harm.
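To make the term concrete: in its simplest additive form, a polygenic score is a weighted sum over many positions in the genome. The Python sketch below assumes that simple additive model; the function name and the effect sizes are invented for illustration and are not taken from any real study or from any company's method.

```python
def polygenic_score(dosages, effect_sizes):
    """Weighted sum of allele counts under a simple additive model.

    dosages: copies (0, 1 or 2) of the scored allele at each site.
    effect_sizes: per-allele weight for each site (hypothetical values here).
    """
    if len(dosages) != len(effect_sizes):
        raise ValueError("dosages and effect_sizes must have the same length")
    return sum(beta * d for beta, d in zip(effect_sizes, dosages))


# Made-up example: four sites, each contributing a small amount.
example_dosages = [0, 1, 2, 1]
example_effects = [0.10, -0.05, 0.20, 0.00]
example_score = polygenic_score(example_dosages, example_effects)

if __name__ == "__main__":
    print(round(example_score, 2))
```

Real scores sum over very many sites, and each weight attaches to a marker rather than to a known causal change, which is part of why the score says nothing about what else is linked to the trait.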

Some commentators cite the success of animal breeding schemes using genetics as supporting polygenic trait selection of embryos. This is misguided. Despite these schemes employing genetics (and similar machinery to polygenic risk scores, known as “Breeding Values” in the breeding community), the schemes are not the same as embryo selection from a random cross; animal breeding genetics gets its main benefit from selecting mates for breeding, not from selecting embryos of a random mating. I know of no animal breeding scheme which involves embryo selection for breeding traits (although embryo selection is used in the production of transgenic animals or selection of sex in elite breeding lines, and via this blog post I have learnt more about its use in plant and animal breeding). Furthermore, as discussed above, animal genetic breeding schemes are a cautionary tale of how things can go wrong as well as go right. The difference is that “failed” breeding choices in plant and animal breeding are simply discarded – this is not acceptable for humans. Anyone who is using plant or animal breeding as justification for the success of genetic intervention in humans simply does not understand animal breeding.

A further point feels so obvious that it is almost not worth making, but on reading some articles it does seem necessary: the amazing ability to directly edit genomes is of no relevance to this discussion. Polygenic trait “prediction” should perhaps be better stated as “interpolation”, as what is happening is that we take an individual’s genome and try to estimate its phenotype using the phenotypes and genotypes of many, many previous individuals. The most powerful methods to do this deliberately do not model any specific base pair changes (this ends up being statistically more advantageous, as our genome moves around broadly in blocks rather than as specific bases); even when we try to estimate the precise bases involved at a particular location, the “blocky” nature of human genetics prevents us from ever being sure. So, although we can steadily improve our ability to use genetics for prediction, this does not come from knowing the precise changes to make, and if we ever tried to make such changes it would, again, be an explicit experiment for polygenic traits (for monogenic or digenic traits with high penetrance alleles there is a different argument; in those cases it is extremely hard to imagine a scenario where currently licensed pre-implantation diagnosis would not work but genome editing would work).

Intelligence and Educational Attainment as polygenic trait

The genetics of intelligence and of educational attainment (how well people do at school) is a very complex topic; nevertheless some real progress has been made, in particular over the last 10 years. This blog post is not the place to unpack the complexity of this trait (unsurprisingly … it is complex) nor the validity of the genetics – I recommend work from Stuart Ritchie, Paige Harden, Alex Young, Ian Deary and Robert Plomin as a selection of researchers in this field. My summary for the purposes of this blog post is that IQ and educational attainment are real polygenic traits, but they are the sorts of traits one should be particularly careful about considering for embryo selection, over and above my generic concern above, and even when one is trying to focus only on the “severe intellectual disability” end of the spectrum.

There are a number of scientific reasons why. The first is that these traits are hard to estimate, and the non-random environment (the fact that schooling is different in different places even in relatively homogeneous environments) coupled with localised genetics means it is hard to know whether one has “scrubbed” out this effect (cryptic population stratification). Again, the potential for selecting against perfectly reasonable embryos (and performing a procedure with risks) for no gain is present. The second is that around one third to a half of the genetic signal for intelligence / educational attainment (depending somewhat on how you construct the statistics) is due to parental environment; because each person’s genetics is also a reasonably good estimator of their parents’ genetics, genetic variants which influence parenting and, via this, the child’s IQ/education show up strongly. This is fascinating research (note: the same techniques that find this do not show a strong effect of parental environment on other traits, e.g. blood lipids or height) but means that estimating the tails has an additional major complication in trying to isolate the true “within individual effect”. Finally, there are complex interactions between some deleterious traits (e.g. autism) and educational attainment (there is a weak positive correlation), meaning that this trait in particular is complex to understand.

This is before one gets into the ethical considerations about how one should handle this trait, though if one is focused on the most severe disability end, this is at least justified. However, the obvious (and wrong) snake oil position is to imagine one can rank or significantly select for the top end of a continuous scale. All the problems with the trait are valid at each end; more importantly, a naive view is that one can select from the top or bottom of the population distribution, whereas the main determinant in embryo selection will be the genetics of the father and the mother – one is bounded by these genetics (formally, embryos show a small variation around the mean of father and mother in the models used; in practice an embryo rarely falls outside this expectation).
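The bound set by the parents can be illustrated with a toy simulation in Python. This is a minimal sketch under strong assumptions that are mine, not the models used in practice: a purely additive score, independent sites (no linkage blocks), an allele frequency of 0.5 at every site, and randomly drawn effect sizes. All function names are hypothetical.

```python
import random
import statistics


def make_person(n_loci, rng, allele_freq=0.5):
    """A genotype is one pair of 0/1 alleles per site."""
    return [(int(rng.random() < allele_freq), int(rng.random() < allele_freq))
            for _ in range(n_loci)]


def score(genotype, effects):
    """Additive score: effect size times allele count, summed over sites."""
    return sum(beta * (a + b) for beta, (a, b) in zip(effects, genotype))


def make_embryo(mother, father, rng):
    """Each parent passes on one of their two alleles at random, per site."""
    return [(rng.choice(m), rng.choice(f)) for m, f in zip(mother, father)]


def simulate(n_loci=200, n_embryos=2000, n_population=2000, seed=1):
    rng = random.Random(seed)
    effects = [rng.gauss(0.0, 1.0) for _ in range(n_loci)]  # made-up effects
    mother = make_person(n_loci, rng)
    father = make_person(n_loci, rng)
    mid_parent = (score(mother, effects) + score(father, effects)) / 2
    embryos = [score(make_embryo(mother, father, rng), effects)
               for _ in range(n_embryos)]
    population = [score(make_person(n_loci, rng), effects)
                  for _ in range(n_population)]
    return mid_parent, embryos, population


if __name__ == "__main__":
    mid_parent, embryos, population = simulate()
    print("mid-parent score:", round(mid_parent, 2))
    print("embryo mean and spread:",
          round(statistics.mean(embryos), 2),
          round(statistics.pstdev(embryos), 2))
    print("population mean and spread:",
          round(statistics.mean(population), 2),
          round(statistics.pstdev(population), 2))
```

Under these assumptions the embryo scores cluster around the mid-parent value, with a narrower spread than unrelated individuals drawn from the same toy population, so choosing among one couple's embryos is not the same as choosing from the tails of the population.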


It is my position as a citizen in society (in my case, the UK) that one should not use embryo selection for complex behavioural traits, and it is my position as a genome scientist that it would be scientifically unsound to do so for any trait, in particular for IQ or educational attainment traits. It is worth considering what would be the closest thing to this that I could endorse. In terms of the science I can imagine (and I believe it is licensed now) digenic (two locus) selection, and in the future I could imagine oligogenic selection. Furthermore I could imagine a case of one or a small number of loci and a polygenic background, where there is differential advising and options to parents with “bad” polygenic backgrounds for a particular disease coupled with some higher effect size loci which could be treated (in effect) as monogenic diseases in their cases. Finally I can see severe behavioural difficulties with a strong mono- or oligogenic genetic basis as candidates for licensing. But these grow the scope from the existing practice, and are a very, very long way away from generic polygenic trait scoring for behaviours in embryos.