Information and Biology

This is an idle muse on information and biology as I wait for my SFO to Burbank plane (also an experiment in “fast blogging”).

Biology is truly an information science – what are biological systems? They are way more than the atoms that make them up; they are far more than just the molecules that make them up; ultimately they are remarkable systems which can harness the inevitable flow of energy towards heat to their own persistence and, in many animals’ case, information capture and decision making.

And to study biology we absolutely need information science. Our theories are useful (eg, evolution) and true but not predictive in the same way as (say) gravity or QCD in physics. To understand biology there is a huge amount of measurement, ie, data gathering, but to make sense of that data, we need computers to basically enhance our own capabilities in owning that data. There are practical aspects – humans are not good at the sort of immense bookkeeping one needs to trawl through datasets – but there are more profound aspects; human pattern recognition is rather cranky, often visualisation-centric and over-eager to find a pattern; computers (when programmed right) are far more level-headed.

I remember distinctly the first time I realised this; in a small, cold room in Cold Spring Harbor, aged 19, in 1993, I had written a frameshift-tolerant profile-HMM to DNA sequence program – this would later become “Pairwise and Searchwise”. The first profile I ran was my favourite (and still favourite!) protein domain, the RRM. And as the results came back I was amazed at the number of hits, but rapidly started to worry that my program was just identifying garbage – many of the hits did not have the canonical RRM motif of [F/Y]x[F/Y]xxF in amino acid code.

Thankfully I didn’t entirely trust my own opinion; I looked carefully at the resulting hits and it dawned on me that the profile HMM had taken the data and found something deeper: that in fact the motif was xUxVxF, with U being I/L/V (sometimes F), and V being mainly V. The previous two FxF motifs were in fact a subset of this family, interleaved with the U position between the two Fs. This was in fact the central beta sheet, with an even pattern of hydrophobic residues. This led to my second, large paper a year later (I was a precocious undergraduate student).
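As a rough illustration of the difference between the two motifs, here is a minimal Python sketch that writes each one as a regular expression. This is not how a profile HMM works (a profile HMM scores every position probabilistically rather than matching yes/no), and it is not the original program; the toy 6-mers are invented for illustration and are not real RRM sequences, and “V being mainly V” is simplified to strictly V.

```python
import re

# Canonical RRM motif as written in the text: [F/Y]x[F/Y]xxF
CANONICAL = r"[FY].[FY]..F"
# Broader pattern from the text: xUxVxF with U = I/L/V (sometimes F).
# Simplifying assumption: "V being mainly V" is treated as strictly V.
BROADER = r".[ILVF].V.F"


def find_motif(pattern, seq):
    """Return start positions of all matches, including overlapping ones."""
    # A zero-width lookahead lets finditer report overlapping occurrences.
    return [m.start() for m in re.finditer(f"(?=(?:{pattern}))", seq)]


# Hypothetical toy 6-mers, invented for illustration (not real RRM sequences).
toy = {
    "canonical_only": "FGFAAF",
    "broader_only": "AIAVAF",
    "both": "FVFVAF",
}

for name, seq in toy.items():
    print(name, find_motif(CANONICAL, seq), find_motif(BROADER, seq))
```

The “broader_only” case is the kind of hit that looks like garbage if you only check for the canonical motif by eye, which is roughly the situation described above.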

What I learnt was that I could trust the computer – or rather trust it at least as much as my own opinion; the computer had “crunched the numbers” on the input profile HMM and found a different pattern to me; by iterating between the computer and myself “we” had made a discovery. Of course, I had programmed the computer, so the discovery somehow did lead back to me, but there was simply no way I could have done this unaided.

This motif of extending the human mental processes to tackle unfeasible-for-a-human tasks is now routine in molecular biology – DNA/genome assembly was painstakingly done by hand in the 1980s / early 1990s (I remember one colleague sliding pieces of cut out sequence past each other) – you’d be bonkers to do this “by hand” today (it’s an amusing thought!); decoding single-cell experiments is unfeasible outside of a computer; modelling X-ray diffraction patterns or fitting/rotating EM to make structural models has always needed the extension of the human mind by computers. Given the huge numbers of possibilities being “crunched”, statistical methods have to go hand in hand with these computational schemes – it is strong statistics which allows us to “trust” the output.

That said, computer programs have their limitations; mainly to do with things outside of the assumptions needed to make the method and statistics work; computers are still surprisingly bad at reconciling contradictory data. The number of oddities, exceptions and weirdness in biological systems plagues systematically rolling out computational models. For a long time I thought that gene prediction should be an entirely solved problem by computer-based methods building off other evidence (transcripts; proteins). After close to a decade of watching this first hand in my Ensembl years I’ve realised that not only is there a series of “standard model doesn’t work here” cases (eg: Ig locus, or protocadherins) but there is also a substantial and annoying amount of complex evidence reconciliation for which, if you want every gene structure to be as good as possible, you have to put a human in the loop. My old collaborator and co-founder of Ensembl, Michele Clamp, would rage against people “hand knitting” genes – it was never going to scale – and yet our best-laid plans, algorithms and large compute had a frustrating set of cases where you just had to declare X the answer and be done with it. You needed both computers and humans, working together, and of course the computers had to have humans programming and running them (in the early days we used to describe this as “babysitting” the pipeline; watching it crunch data and then it would occasionally fall over, sometimes due to infrastructure failure, sometimes due to weird data; these days the infrastructure is far, far better but the data issues are perennial in biology).

There is something interesting in the air though with these new neural networks – CNNs, RNNs, adversarial training. They “just” need data, even unlabelled data, and they very clearly empirically work; image recognition, nanopore base calling, chromatin states are all happy users of these. But it is frustrating that without a formal model inside these beasts, it is hard to know what precisely they are picking up. That is both unsatisfying and carries hidden dangers about generalisation – can we trust these things when they are let loose on all the complex data outside of the training sets?

People are making progress here. The rather wonderful process of reversing these neural networks can provide the “archetype” of what they are looking at (check out https://www.popsci.com/these-are-what-google-artificial-intelligences-dreams-look); with multiple composite layers one can start to have at least some sense of what composite parts are being used. And I am sure smart CS/stats people are going to continue to break this down. But ultimately I am less worried than other people about these beasts; again, humans have to find the datasets and define the objective functions of success; humans have to work out how to shape and present data in the first place, and then how to use the output. Humans still have the responsibility, end to end, of the process, just as I did in my RRM discovery, and ultimately this is just another step of extending human abilities via computers.