Information and Biology

This is an idle muse on information and biology as I wait for my SFO to Burbank plane (also an experiment in “fast blogging”).

Biology is truly an information science – what are biological systems? They are way more than the atoms that make them up; they are far more than just the molecules that make them up; ultimately they are remarkable systems which can harness the inevitable flow of energy towards heat for their own persistence and, in many animals’ case, for information capture and decision making.

And to study biology we absolutely need information science. Our theories are useful (e.g., evolution) and true, but not predictive in the same way as (say) gravity or QCD in physics. To understand biology there is a huge amount of measurement, i.e., data gathering, but to make sense of that data we need computers to enhance our own capabilities in owning that data. There are practical aspects – humans are not good at the sort of immense bookkeeping one needs to trawl through datasets – but there are more profound aspects; human pattern recognition is rather cranky, often visualisation-centric, and over-eager to find a pattern; computers (when programmed right) are far more level-headed.

I remember distinctly the first time I realised this; in a small, cold room in Cold Spring Harbor, aged 19, in 1993, I had written a frameshift-tolerant profile-HMM to DNA sequence program – this would later become “Pairwise and Searchwise”. The first profile I ran was my favourite (and still favourite!) protein domain, the RRM. As the results came back I was amazed at the number of hits, but I rapidly started to worry that my program was just identifying garbage – many of the hits did not have the canonical RRM motif of [F/Y]x[F/Y]xxF in amino acid code.

Thankfully I didn’t entirely trust my own opinion; I looked carefully at the resulting hits and it dawned on me that the profile HMM had taken the data and found something deeper: that in fact the motif was xUxVxF, with U being I/L/V (sometimes F), and V being mainly V. The previous two FxF motifs were in fact a subset of this family, interleaved with the U position between the two Fs. This was in fact the central beta sheet, with an even pattern of hydrophobic residues. This led to my second, large paper a year later (I was a precocious undergraduate student).
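
As a rough illustration of the difference between the two motifs, here is a minimal Python sketch. It uses plain regular expressions, not a profile HMM, so it only captures the strict letter patterns written above and none of the probabilistic scoring. The valine position is treated as strictly V although the text says “mainly V”, and the six-residue segments are invented for illustration; they are not real RRM sequences.

```python
import re
from typing import Tuple

# The two motifs as written in the post, translated to regular expressions.
# "x" is any residue; U is I/L/V (sometimes F); the V position is taken
# literally as valine, which is stricter than "mainly V".
CANONICAL = re.compile(r"[FY].[FY]..F")  # [F/Y]x[F/Y]xxF
BROADER = re.compile(r".[ILVF].V.F")     # xUxVxF


def classify(segment: str) -> Tuple[bool, bool]:
    """Return (matches canonical motif, matches broader motif)."""
    return (
        CANONICAL.search(segment) is not None,
        BROADER.search(segment) is not None,
    )


# Hypothetical segments, made up purely to exercise the two patterns.
toy_segments = {
    "both": "FIFVAF",
    "broader_only": "KLAVNF",
    "neither": "GSGSGS",
}

for label, segment in toy_segments.items():
    print(label, segment, classify(segment))
```

A segment such as the hypothetical “KLAVNF” lacks the aromatic positions, so a search for the canonical pattern alone would discard it, even though it carries the alternating hydrophobic pattern.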

What I learnt was that I could trust the computer – or rather trust it at least as much as my own opinion; the computer had “crunched the numbers” on the input profile HMM and found a different pattern to me; by iterating between the computer and myself “we” had made a discovery. Of course, I had programmed the computer, so the discovery somehow did lead back to me, but there was simply no way I could have done this unaided.

This motif of extending the human mental processes to tackle unfeasible-for-a-human tasks is now routine in molecular biology – DNA/genome assembly was painstakingly done by hand in the 1980s and early 1990s (I remember one colleague sliding pieces of cut-out sequence past each other) – you’d be bonkers to do this “by hand” today (it’s an amusing thought!); decoding a single-cell experiment is unfeasible outside of a computer; modelling X-ray diffraction patterns or fitting/rotating EM to make structural models has always needed the extension of the human mind by computers. Given the huge numbers of possibilities being “crunched”, statistical methods have to go hand in hand with these computational schemes – it is strong statistics which allows us to “trust” the output.
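
The “sliding pieces of sequence past each other” step can be sketched in a few lines of Python. This is a toy under strong assumptions: exact matches only, no sequencing errors, no reverse complements and no repeats, all of which real assemblers have to handle. The function names, the `min_len` threshold and the two reads are hypothetical.

```python
from typing import Optional


def longest_overlap(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if no exact overlap of at least `min_len` bases exists.
    """
    # Try the largest possible overlap first and shrink it one base at a time,
    # like sliding one paper strip along the other.
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge(left: str, right: str, min_len: int = 3) -> Optional[str]:
    """Join two reads on their best exact overlap, or return None if none is found."""
    size = longest_overlap(left, right, min_len)
    return left + right[size:] if size else None


# Two made-up reads sharing the four bases "TGCA".
print(merge("ACGTTGCA", "TGCAGGTC"))
```

Even this toy hints at why the statistics matter: with short overlaps and many reads, chance matches become common, so the choice of `min_len` is really a statement about how much evidence is needed before trusting a join.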

That said, computer programs have their limitations, mainly to do with things outside of the assumptions needed to make the method and statistics work; computers are still surprisingly bad at reconciling contradictory data. The number of oddities, exceptions and weirdness in biological systems plagues any systematic roll-out of computational models. For a long time I thought that gene prediction should be an entirely solved problem by computer-based methods building off other evidence (transcripts; proteins). After close to a decade of watching this first hand in my Ensembl years I’ve realised that not only is there a series of “standard model doesn’t work here” cases (e.g., the Ig locus, or protocadherins), but there is also a substantial and annoying amount of complex evidence reconciliation for which, if you want every gene structure to be as good as possible, you have to put a human in the loop. My old collaborator and co-founder of Ensembl, Michele Clamp, would rage against people “hand knitting” genes – it was never going to scale – and yet our best laid plans, algorithms and large compute had a frustrating set of cases where you just had to declare X the answer and be done with it. You needed both computers and humans, working together, and of course the computers had to have humans programming and running them (in the early days we used to describe this as “babysitting” the pipeline: watching it crunch data until it occasionally fell over, sometimes due to infrastructure failure, sometimes due to weird data; these days the infrastructure is far, far better but the data issues are perennial in biology).

There is something interesting in the air though with these new neural networks – CNNs, RNNs, adversarial training. They “just” need data, even unlabelled data, and they very clearly work empirically; image recognition, nanopore base calling and chromatin state prediction are all happy users of these. But it is frustrating that without a formal model inside these beasts, it is hard to know what precisely they are picking up. That is unsatisfying, and it also has hidden dangers about generalisation – can we trust these things when they are let loose on all the complex data outside of the training sets?

People are making progress here. The rather wonderful process of reversing these neural networks can provide the “archetype” of what they are looking at; with multiple composite layers one can start to have at least some sense of what composite parts are being used. And I am sure smart CS/stats people are going to continue to break this down. But ultimately I am less worried than other people about these beasts; again, humans have to find the datasets and define the objective functions of success; humans have to work out how to shape and present data in the first place, and then how to use the output. Humans still have the responsibility, end to end, for the process, just as I did in my RRM discovery, and ultimately this is just another step in extending human abilities via computers.



Is Science right, and how do we know it?

Reflections on reproducibility, digital communication and open science

Is science sound? There has been a sustained discussion about this over the past five years – ever-present in the background, and punctuated by intense public debates, both in the scientific press and more broadly. There is a host of concerns – from reproducibility of science to incentive structures – all focused ultimately on how we know what is true and what is not. The answer is not always straightforward.


The big reveal: Beta galactosidase and cryo-EM

My final Structure of Christmas may look like an unremarkable enzyme, but it heralded the arrival of a game-changing method in structural biology.

My ninth (and final) Structure of Christmas is beta-galactosidase: a pretty run-of-the-mill enzyme that turns compound sugars into monosaccharides. When you put a special dye on it, it turns the dye blue (whee!). It’s a mainstay of molecular biology and millions of students have used it in countless experiments, both fascinating and mundane. It doesn’t have much of a ‘wow’ factor – it’s a solid member of a respectable family of sugar-cleaving enzymes.

What is so special about it is the way its structure was determined.


Tropomyosin and actin: Move!

My penultimate structure of Christmas is actually two molecular partners, which work together to make muscle move.

Most of my Christmas structures have been separable units – some large, some small – that float around in cells or cell membranes. But to move physically, organisms need to have more at their disposal than some things floating in solution. For most life forms, movement is managed by proteins working together. A perfect example of this is the beautiful partnership between actin and tropomyosin.


Antibodies: Defend!

Once a parasite makes it past our outer defences, it encounters some seriously sophisticated weaponry. One of these is the ever-shifting antibody, my seventh structure of Christmas.

Every large organism – you included – is just a feast laid out for any parasite (bacteria, virus or beastie) clever enough to break in and access its carefully amassed energy. Throughout the Billion-Years’ Evolutionary War between hosts and parasites, the host has always been on the defensive, endlessly innovating to fend off invaders.


Vibrio cholerae: Attack!

My sixth structure of Christmas is out to kill human gut cells, with help from a human protein. But has it simply shown up (drunk) at the wrong party?

Interactions between two living organisms nearly always involve proteins. All proteins fold into precise, beautiful shapes, tweaked and perfected by evolution over millions of years to perform very specific tasks. In a successful interaction, two of these shapes will fit together perfectly – like a plug and socket – to make things happen.


The twilight world between chemistry and life

Viruses live in a twilight zone, somewhere between life and its ingredients. My fifth structure of Christmas emerges from that zone to wreak havoc on cattle: the foot-and-mouth-disease virus.

Consider the virus: a beautifully crafted set of molecules perfectly arranged to do one thing, and one thing only: subvert life forms to make more of itself. But what is it? Is it ‘alive’, in the conventional sense?


RuBisCO: the lazy, needy carbon fixer

The CO2-fixing RuBisCO, a respectable representative of life on Earth, is my fourth structure of Christmas.

If a Martian visited Earth and was asked to report back on the most important protein in our biosphere, quite possibly it would choose RuBisCO. As enzymes go it isn’t the biggest, but it is a very big deal. It is extremely common – every single plant and photosynthetic cyanobacterium is stuffed full of it – and it performs one of the most crucial reactions for all of life: “fixing” gaseous carbon dioxide into sugars and amino acids.


Seeing the light: opsin

The second structure of Christmas is the membrane protein opsin, which allows us to perceive light.

Proteins that control the information going in and out of our cells are harder to crystallise than run-of-the-mill globular proteins, as they have both water-loving and fat-loving parts and are tricky to mass produce. Opsin, our second structure of Christmas, is one such molecule. It is situated on a special membrane in a specialised cell at the back of our eyes, and senses light.
