A long time ago, in the late 1990s, I was on the edge of the debate about SNPs: whether there should be an investment in first discovering, then characterising and then using many, many biallelic markers to track down diseases. As is now obvious, this decision was taken (first the SNP Consortium, then the HapMap project and its successor, 1000 Genomes, and then many genome-wide association studies). I was quite young at the time (in my mid to late twenties; I even had an earring at the start of this, as I was a young, rebellious man) and came from a background of sequence analysis – so I remember it being quite confusing getting my head around all the different terminology and subtleties of the argument. I think it was Lon Cardon who patiently explained the concepts to me yet again, and he finished by saying that the real headache was that there were just so many low frequency alleles that were going to be hidden, and that was going to be a missed opportunity. I nodded at the time, adding yet one more confused concept to the discussions about r² behaviours, population structure, effect size and recombination hotspots, none of which sat totally comfortably in my head at the time.
The publication of ENCODE data raised substantial discussions. Clear, open, rational debate with access to data is the cornerstone of science. For the scientific details the ENCODE papers are totally open, and we have aimed for a high level of transparency e.g. a virtual machine to provide complete access to data and code.
5 September 2012 – Today sees the embargo lift on the second phase of the ENCODE project and the simultaneous publication of 30 coordinated, open-access papers in Nature, Genome Research and Genome Biology as well as publications in Science, Cell, JBC and others. The Nature publication has a number of firsts: cross-publication topic threads, a dedicated iPad/eBook App and web site and a virtual machine.
Last week a positioning paper by Guy Cochrane, Chuck Cook and myself finally came out in Gigascience. Its premise is rather simple: we are all going to have to get used to lossy compression of DNA sequence, and as lossy compression is variable (you can set it at a variety of levels), we will have to have a community consensus on how much one compresses different data. This is really part of our 2 year process of making efficient compression a practical reality for DNA sequencing, which I’ve blogged on before, numerous times.
I encourage you to read the paper, but in this blog I want to explore more the analogies between imaging and DNA sequencing – which are now numerous. I believe that at the core of biology (of all sorts) the majority of data gathering will be either optical or “nucleic” (the third machine type probably being mass spectrometry, if you are interested). As a colleague once said about molecular biology – the game in town is now to get your method to output a set of DNA molecules. If you can do that – you can scale.
The first question to ask is why these two technologies are so dominant. The first similarity, and a large part of the answer, is that one is fundamentally trying to capture information – information about distributions of photons or information about the make up of molecules. One is not trying to do something large and physical. This means that the mechanisms for detection can be excruciatingly sensitive. In the case of imaging, single photon sensitivity is almost routine on modern instruments, and then there are things like super-resolution imaging, which is a whole bunch of tricks to (in effect) convert multiple time-separated images of a static sample into better spatial resolution (I saw a deeply impressive talk by a Post Doc in Jan Ellenberg’s group showing a remarkable resolution of the nuclear pore). But one does not need to have fancy tricks to make great use of imaging – rather mundane, cheap imaging is the mainstay of all sorts of molecular biology – drosophila embryo development would be a different world without it. In the case of DNA, the ability to sequence at scale has been the focus for the last 20 years, and will probably remain so until a human genome is in the ~$1,000 or $100 zone – as expensive as a serious X-ray or MRI. But at the same time the other shift is towards more real time systems (the “latency” of the big sequencers is probably the biggest drag – 3 weeks is your best case on the old HiSeqs) and to single molecule systems. People talk about real time as critical to the clinic, and certainly the difference between even 12 hours and 2 weeks is day and night (at 12 hours or less one can do the cycle within a 1 day stay, and start to impact in-time diagnosis), but faster cycle times will really change research as well. Going back to the information aspect of these two technologies: as one is only trying to get information out of these things, the physical limits of the technologies are remarkably far away.
Imaging is hitting some of these limits (though there is still plenty of space for innovation); 3rd generation DNA sequencers will get closer to some limits in DNA processing as well, but again, we have some way to go. The future is bright for both of these technologies.
The second similarity is just the mundane business of storing the output of these technologies – they are high density information streams, and therefore have a lot of inherent entropy – some of that entropy one wants to utilise – that’s the whole point – but there is also quite a bit of extra “field” (in imaging terms) or “other bits of the genome” (in DNA terms) which one often knows is going to be there, but is less interested in. Imaging has long led the area of data-specific compression, using at first a variety of techniques to transform the data from a straightforward x,y layout of pixel intensities to forms which inherently capture the correlation between pixels, allowing for efficient lossless compression. But the real breakthroughs came with lossy compression, understanding that for a lot of the pixels, a transformation which lost some information for a large gain in compressibility was appropriate for many uses. Although the tendency is to think about lossy compression in terms of “visual” display-to-user uses, in fact many technical groups use a variety of lossy forms for their storage, choosing mainly the amount of loss appropriately (I’d be interested in experiences on this, and in particular whether people deliberately choose other lossy algorithms away from the JPEG family). But video compression has really taken lossy compression in new directions, with complex between-frame transformations and then lossy applications, in particular adaptive modelling.
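To make the point about transformations concrete, here is a minimal sketch of my own (not taken from any real image format): a toy row of pixel intensities where neighbouring pixels are correlated, compressed with a generic byte compressor before and after a simple delta transform. The random-walk "image", the function names and the use of zlib are all assumptions for illustration only.

```python
import random
import zlib


def delta_encode(pixels):
    """Replace each pixel by its difference from the previous pixel (mod 256)."""
    prev = 0
    out = bytearray()
    for p in pixels:
        out.append((p - prev) % 256)
        prev = p
    return bytes(out)


def delta_decode(deltas):
    """Invert delta_encode with a running sum (mod 256); no information is lost."""
    prev = 0
    out = bytearray()
    for d in deltas:
        prev = (prev + d) % 256
        out.append(prev)
    return bytes(out)


# Toy data: intensities drift slowly, so each pixel is close to its neighbour.
rng = random.Random(0)
value = 128
row = bytearray()
for _ in range(4096):
    value = max(0, min(255, value + rng.choice((-1, 0, 1))))
    row.append(value)
pixels = bytes(row)

deltas = delta_encode(pixels)
raw_size = len(zlib.compress(pixels, 9))
delta_size = len(zlib.compress(deltas, 9))
print("raw bytes -> zlib:  ", raw_size)
print("delta bytes -> zlib:", delta_size)
```

The generic compressor is the same in both cases; the only change is that the transform makes the correlation between neighbouring pixels explicit, and the round trip through `delta_decode` gives back the original bytes exactly.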
When we started in DNA compression many people critiqued it, saying that we “couldn’t beat established generic compression” or that certain compression forms were “already optimal”. This totally misses the point – generic and optimal compression schemes are only generic and optimal for a particular data model, and to be generic, that data model involves a byte-stream. One doesn’t hear people saying about video compression “oh well, that problem has been solved generically and there are optimal compression methods” – putting a set of raw TIFFs straight into a byte-based compressor would not do very well. The key thing is first a data transformation that makes the correlation in the data explicit for standard generic methods to compress (in the case of DNA, reference based alignment provides a sensible realisation of the redundancy between bases in a high coverage sample, and for a low coverage sample realises the redundancy with respect to previous samples). The second thing is the insight that not all the precision of the data is needed for interpretation. Interestingly, lossy compression makes you think about the problem as the inverse of the normal thought process – often you ask “what information am I looking for” for some biological process – SNPs, or structural variants. Lossy compression methods invert the problem to ask “what information are you pretty sure you don’t need”. For example, when you know your photon detector will generate some random noise in particular patterns, having a lossy compression remove that entropy is highly unlikely to affect downstream analysis. Similarly, when we can confidently recognise an isolated sequencing error, degrading the entropy of the quality score of the base is unlikely to change downstream analysis.
I’ve enjoyed learning more about image compression, and I think we’ve only started in DNA compression – at the moment we can achieve 2 to 4 fold compression compared to standard methods with a clearly acceptable lossy mode (acceptable because the machine manufacturers sort of know that they are generating a little too much precision in their quality scores). But with more aggressive techniques we can already think about 50 to 100 fold compression – though this is getting quite lossy. And this is not the end of the road – I reckon we could be at 1,000 fold more compressed in the future.
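As a rough illustration of those two ideas, here is a toy sketch; it is not the cramtools implementation, and every name and number in it is an assumption. The first part stores an aligned read as its differences from a reference. The second part uses a deliberately crude lossy step, uniform binning of every quality score, as a stand-in for the more selective degrading described above, and the simulated scores are uniformly random, which real quality scores are not.

```python
import random
import zlib


def diff_against_reference(read, reference, pos):
    """Keep only the (offset, base) pairs where an aligned read differs."""
    return [(i, b) for i, b in enumerate(read) if reference[pos + i] != b]


def rebuild_read(diffs, reference, pos, length):
    """Recover the read from the reference plus the stored differences."""
    bases = list(reference[pos:pos + length])
    for i, b in diffs:
        bases[i] = b
    return "".join(bases)


def bin_quality(q, width=8):
    """Lossy step: collapse a Phred-style score to the midpoint of its bin.

    The bin width is an arbitrary illustrative choice.
    """
    return (q // width) * width + width // 2


rng = random.Random(0)

# Reference-based transformation: a 50 base read with one simulated difference.
reference = "".join(rng.choice("ACGT") for _ in range(1000))
bases = list(reference[100:150])
bases[7] = "A" if bases[7] != "A" else "C"
read = "".join(bases)
diffs = diff_against_reference(read, reference, 100)
print("stored differences:", diffs)

# Lossy quality scores: fewer distinct values means less entropy to store.
qualities = bytes(rng.randint(2, 40) for _ in range(10000))
binned = bytes(bin_quality(q) for q in qualities)
full_size = len(zlib.compress(qualities, 9))
binned_size = len(zlib.compress(binned, 9))
print("full precision ->", full_size, "bytes; binned ->", binned_size, "bytes")
```

The read is recovered exactly by `rebuild_read(diffs, reference, 100, len(read))`, so the reference step is lossless; only the quality binning throws information away, and how much to throw away is exactly the choice one has to make per data set.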
The third similarity is how informatics-intensive the processing is. Both for image analysis and DNA analysis there are some standard tools (segmentation, hull finding, texture characterisation in imaging; alignment, assembly, variant calling in DNA sequence analysis) but how these tools are deployed is very experiment specific. There is not some “generic image analysis pipeline” any more than there is a “generic DNA analysis pipeline”. One has to choose particular analysis routes, driven mainly by the experiment that was performed, and then to some extent by the output you want to see. This means that the bioinformatician must have a good mastery of the techniques. I have to admit, although I live and breathe DNA analysis, often developing new tools, I am pretty naive about image analysis – not that that’s stopping me diving in with my students in using (but not developing…) image analysis. I think we’re not making image analysis enough of a mainstream skill set in bioinformatics, and this needs to change.
Finally, the cheapness and ubiquity of imaging have meant that from the start image based techniques had to think carefully about which images one would store and at what compression. Clearly DNA sequencing is heading the same way, and this is the paper that Guy and I put forward. As with imaging, the key question is the overall cost of replacing the experiment, not the details of how much the image itself cost. So – for a rare sample (such as a Neanderthal bone) it is very hard to repeat the experiment – you need to store that information at high fidelity. But a routine mouse ChIP-seq experiment is far more reproducible and one can be far more aggressive on compression. I actually think it has been to the detriment of biological imaging that there has not been a good reference archive – probably because of this problem of knowing which things are worth archiving, coupled with the awesome diversity of uses for imaging – but projects like EuroBioImaging I think will provide the first (in this case federated) archiving process.
Over the next decade, then, I see ‘imaging’ and ‘DNA sequencing’ converging more and more. Time to learn some image analysis basics (does anyone know a good book on the topic that is geeky and detailed but starts at the basics?).
It is not often anyone will hear the phrases “Galois field” and “DNA” together, but this paper from my colleagues, Tim Massingham and Nick Goldman, provides a great link between these topics. Some other authors have used Galois fields in DNA analysis, but this is the first time I have seen a practical application of this level of mathematics in bioinformatics. It’s a tour de force by Tim, and although only in a lowly BMC Bioinformatics journal I think it should be celebrated for its sheer chutzpah in crossing scientific – indeed academic – domains.
I’ve just finished reading the excellent book “Thinking, Fast and Slow” by Daniel Kahneman, who is a psychologist who had a profound impact on economics; he won the Nobel Prize in economics in 2002 for “Prospect theory”, which basically tries to provide a reliable model of observed human behaviour of choices, for example, up weighting low probability events, and in particular distinguishing scenarios which are gains vs losses – we are all loss-averse, and so put more negative weight on losing something than positive weight on gaining something.
Biology – like all sciences – is an observation based science, but perhaps more so than many others. Life is so diverse that the first task of any investigation is simply looking and recording biological phenomena. Very often even the simple process of observation will lead to a profound understanding of a biological component or system – perhaps more importantly it is observation and measurement which form the raw material for coming up with new hypotheses of how things work; usually these are then tested by perturbing the system in some experimental way, and repeating the observation or measurement (rarely one relies only on observation – I’ve blogged about this earlier). Many of the advances in biology came from the process of observing and cataloging, and then asking how to explain the catalog.
Around a year ago I wrote three blog posts about DNA sequence compression, based on the paper my student and I published early in 2011 and the start of cramtools by Guy Cochrane’s ENA team under the engineering lead of Rasko Leinonen, which is the practical implementation of these ideas. The previous blog posts first describe the thought process of compression, in particular the shift from understanding what information you wish to keep to what information you know you are happy to discard, then the practicalities of compression and the balancing between the different communities in the tool chain that need to work with it, and then finally whether compression is a credible answer for a DNA sequencing technology that doubles its cost effectiveness somewhere between every 6 and 12 months.
As more and more research into human health switches to human subject research, including a considerable amount of molecular work, molecular biologists and bioinformaticians are going to have to get far more comfortable with epidemiology than ever before. There are some big elephant-trap-like mistakes you can make – and indeed I see people making them right now (and, embarrassingly, I’ve stepped in one or two of these traps myself; one lives and learns). One reason why there are so many traps for the unwary is that our first collective foothold in epidemiology is (in effect) genome wide association studies. These have the rare property that one of the key variables being measured (genotypes) both (a) does not change over time and (b) is relatively uniformly distributed relative to other things we can measure about people. I’ll return to this below.
I had the pleasure of reading this email from the mercurial Claudio Stern on a chicken list about trying to source a new contrast agent for chicken embryo research. This was once a particular type of ink (note the detail – totally understandable – of both the previous and suggested new system).