EBI as a data refinery

In describing what the EBI does, it is sometimes hard to provide a feel for the complexity and detail of our processes. Recently I have been using an analogy of EBI as a “data refinery”: it takes in raw data streams (“feedstocks”), combines and refines them, and transforms them into multi-use outputs (“products”). I see it as a pristine, state-of-the-art refinery with large, imposing central tanks (perhaps with steam venting here and there for effect) from which massive pipes emerge, covered in reflective cladding and connected in refreshingly sensible ways. In the foreground are arrayed a series of smaller tanks and systems, interconnected in more complex ways. Surrounding the whole are workshops, and if you look close enough you can see a group of workers decommissioning one part of the system whilst another builds a new one.

Continue reading “EBI as a data refinery”

West meets East

I’ve just come back from around 10 days in China, visiting Nanjing, Shanghai and Hong Kong, and have a whole new perspective on this part of the world. I was not able to work Beijing into my trip this time, which was frustrating because I know there is a lot of good science happening there.

What was really different about this trip was that I came away feeling much more of a connection to China. It was great to meet new people and to renew more longstanding scientific contacts – but I also had more time (and, perhaps more importantly, more confidence) to travel between cities, have breakfast in local cafes rather than hotels, and generally get to know each place a little better. Previous trips (this was my fourth) required such a packed schedule that jetlag and the whole novelty of China completely dominated my experience.

Now that I’m sitting down to write about the experience, the first thing I’m inclined to do is draw some analogies with western countries. But analogies only go so far – even when they fit relatively well, they break down in the face of China’s distinct character. I do feel more knowledgeable than I have after previous visits to China, but I fully expect that future visits will reveal further dimensions and facets to this immense and complex country.

On some level, China reminds me of the US: it’s a huge country with vast distances to travel between locations, and has a tremendously strong sense of a single nation. Everyone I met considered themselves “Chinese”, and there is a strong sense of a binding history and cultural underpinning. Also, similar to the US, China (and Chinese…) is aware of its size and economic power, and is conscious of having a strong voice on the world stage. Hong Kong, Shanghai and Beijing are cosmopolitan cities, with a sometimes exuberant celebration of the past 20 years’ economic growth. I won’t stray into geopolitics – it’s not my field of expertise at all – but a country of this size with sophisticated metropolitan areas will almost certainly make a big impact on science over the next couple of decades.

China shares some features with Europe – notably a diversity of language and culture across many provinces. Chinese provinces are often larger than European countries, and often have similar overall GDP. The many Chinese “dialects” are better described as different spoken languages, but importantly they share a set of written characters (with some modifications). The implications of having a universally comprehensible written language for such a range of linguistic groups are profound.

My initial impression was that China had two major languages – Cantonese (used around Guangdong and Hong Kong) and Mandarin – with various dialects, but this trip really impressed upon me just how diverse the linguistic landscape of Mainland China is. For example, Shanghainese is a dialect of Wu, which is a language family predominant in the eastern central area. When I was out for dinner in Shanghai with a Mandarin speaker, the waiter spoke to us in a lilting tone (Shanghainese, as it turned out) and I turned to my companion for translation; she smiled, shrugged her shoulders and shifted the conversation to Mandarin. It was like dining with an Italian colleague in Finland and thinking she would know Finnish.

I’m much more aware now of the distinctive character and cultures of China’s provinces, which, along with the importance of personal networks, resonates with Europe.

While it’s fun to draw familiar parallels, China is clearly nothing like a mixture of the US and Europe. It is hard enough to completely understand the historical perspectives and cultures of one’s neighbours – it is going to be a long time before I will completely grasp the fundamental complexities of China. What I can say now is that its diversity is more and more fascinating to me, and something to be celebrated.

I wrote some time ago about scientific collaboration with China (see East meets West), focusing on the positive aspects of openness and collaboration in engaging with this and other emerging economies (e.g. Brazil, India, Russia and Vietnam). As scientists, we have the good fortune of being expected to share scientific advances, discuss collaborations and discover new things jointly because these are the right things to do – socially and strategically.

China already has some leading scientists and excellent scientific institutions, and I am sure this will only grow in the future. But communication is an essential component of community, and social media has been highly beneficial in keeping information flowing in much of the global scientific community. It’s frustrating that news platforms like Twitter are blocked in China. The EBI has set up a Weibo account (www.weibo.com/emblebi) where we will be posting (in English!) news items from the EBI. Hopefully this will help keep scientists in China up to date with developments at the EBI – so please do distribute to your Chinese colleagues.

On a more personal note, I’ve discovered that my first name (Ewan) is pronounced (in some dialects) almost identically to Yuan (a Chinese word for money). In Wikipedia, one of the pronunciation descriptions of Yuan is written identically to one of Ewan (what more proof do you need!) but I am not clear (a) if this is a variation in pronouncing Yuan in Mandarin or a dialect shift and (b) what tonal form it has. I’d be delighted to get some sort of linguistic survey of Yuan forms geo-tagged across China. People who have read my name sometimes get confused because they have a pre-formed idea of how to pronounce it (often “Evan” or “Ee-Wan” – one to save for my next Star Wars role).

So it’s useful to know that I can say, “Ewan, like Money, Yuan,” and this will provide some relief to my new acquaintance, who can file the name alongside a well-known phrase. (And before you say it, I know that I am just as bad when it comes to pronouncing some names – Chinese or not – in other languages!)

So – I’m “Money” Birney. I can’t quite work out whether I should be proud or a bit worried about this moniker.

Many, many thanks to my hosts and the new people I met on this journey: Ying, Philipp, Jing, Jun, Huaming, Hong, Laurie, Scott and many others. I look forward to seeing you again, and learning more on my next trips to China.

Human genetics; a tale of allele frequencies, effect sizes and phenotypes

A long time ago, in the late 1990s, I was on the edge of the debate about SNPs: whether there should be an investment in first discovering, then characterising and then using many, many biallelic markers to track down diseases. As is now obvious, this decision was taken (first the SNP Consortium, then the HapMap project and its successor, 1000 Genomes, and then many genome-wide association studies). I was quite young at the time (in my mid to late twenties; I even had an earring at the start of this as I was a young, rebellious man) and came from a background of sequence analysis – so I remember it being quite confusing to get my head around all the different terminology and subtleties of the argument. I think it was Lon Cardon who patiently explained the concepts to me yet again, and he finished by saying that the real headache was that there were just so many low frequency alleles that were going to be hidden, and that this was going to be a missed opportunity. I nodded at the time, adding yet one more confused concept to the discussions about r2 behaviours, population structure, effect size and recombination hotspots, all of which didn’t sit totally comfortably in my head at the time.

Continue reading “Human genetics; a tale of allele frequencies, effect sizes and phenotypes”

ENCODE: My own thoughts

5 September 2012 – Today sees the embargo lift on the second phase of the ENCODE project and the simultaneous publication of 30 coordinated, open-access papers in Nature, Genome Research and Genome Biology as well as publications in Science, Cell, JBC and others. The Nature publication has a number of firsts: cross-publication topic threads, a dedicated iPad/eBook App and web site and a virtual machine.

Continue reading “ENCODE: My own thoughts”

Optical and Nucleic capture. The future of high density information capture in biology

Last week a positioning paper by Guy Cochrane, Chuck Cook and myself finally came out in GigaScience. Its premise is rather simple: we are all going to have to get used to lossy compression of DNA sequence, and as lossy compression is variable (you can set it at a variety of levels), we will have to have a community consensus on how much one compresses different data. This is really part of our 2 year process of making efficient compression a practical reality for DNA sequencing, which I’ve blogged on before, numerous times.

I encourage you to read the paper, but in this blog I want to explore more the analogies between imaging and DNA sequencing – which now are numerous. I believe that at the core of biology (of all sorts) the majority of data gathering will be either optical or “nucleic” (the third machine type probably being mass spectrometry, if you are interested). As a colleague once said about molecular biology – the game in town is now to get your method to output a set of DNA molecules. If you can do that – you can scale.

The first question to ask is why these two technologies are so dominant. The first reason is that one is fundamentally trying to capture information – information about distributions of photons or information about the make-up of molecules. One is not trying to do something large and physical. This means that the mechanisms for detection can be excruciatingly sensitive. In the case of imaging, single photon sensitivity is almost routine on modern instruments, and then there are things like super-resolution imaging, which is a whole bunch of tricks to (in effect) convert multiple time-separated images of a static object into better spatial resolution (I saw a deeply impressive talk by a Post Doc in Jan Ellenberg’s group showing a remarkable resolution of the nuclear pore). But one does not need to have fancy tricks to make great use of imaging – rather mundane, cheap imaging is the mainstay of all sorts of molecular biology – Drosophila embryo development would be a different world without it. In the case of DNA, the ability to sequence at scale has been the focus for the last 20 years, and will probably remain so until a human genome is in the ~$1,000 or $100 zone – as expensive as a serious X-ray or MRI. But at the same time the other shift is towards more real time systems (the “latency” of the big sequencers is probably the biggest drag – 3 weeks is your best case on the old HiSeqs) and towards single molecule systems. People talk about real time as critical to the clinic, and certainly the difference between even 12 hours and 2 weeks is day and night (at 12 hours or less one can do the cycle within a 1 day stay, and start to impact in-time diagnosis), but faster cycle times will really change research as well. Going back to the information aspect of these two technologies, as one is only trying to get information out of these things, the physical limits of the technologies are remarkably far away. Imaging is hitting some of these limits (though there is still plenty of space for innovation); 3rd generation DNA sequencers will get closer to some limits in DNA processing as well, but again, we have some way to go. The future is bright for both of these technologies.

The second similarity is just the mundane business of storing the output of these technologies – they are high density information streams, and therefore have a lot of inherent entropy. Some of that entropy one wants to utilise – that’s the whole point – but there is also quite a bit of extra “field” (in imaging terms) or “other bits of the genome” (in DNA terms) which one often knows is going to be there, but is less interested in. Imaging has long led the area of data-specific compression, at first using a variety of techniques to transform the data from a straightforward x,y layout of pixel intensities into forms which inherently capture the correlation between pixels, allowing for efficient lossless compression. But the real breakthroughs came with lossy compression: the understanding that, for a lot of the pixels, a transformation which lost some information for a large gain in compressibility was appropriate for many uses. Although the tendency is to think about lossy compression in terms of “visual” display-to-user uses, in fact many technical groups use a variety of lossy forms for their storage, choosing mainly the amount of loss appropriately (I’d be interested in experiences on this, and in particular whether people deliberately choose other lossy algorithms away from the JPEG family). But video compression has really taken lossy compression in new directions, with complex between-frame transformations followed by lossy applications, in particular adaptive modelling.
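
To make the “transform first, then compress” idea concrete, here is a minimal sketch in Python. It is not taken from any real image format: the scan line is synthetic (a seeded random walk standing in for smoothly varying pixel intensities), and the helper names `delta_encode`, `delta_decode` and `make_scanline` are invented for illustration. The transform stores each pixel as the difference from its left neighbour, which makes the correlation between pixels explicit before a generic byte-based compressor (zlib) is applied.

```python
import random
import zlib


def delta_encode(pixels: bytes) -> bytes:
    """Store each pixel as the difference from its left neighbour (mod 256)."""
    out = bytearray(len(pixels))
    prev = 0
    for i, value in enumerate(pixels):
        out[i] = (value - prev) % 256
        prev = value
    return bytes(out)


def delta_decode(deltas: bytes) -> bytes:
    """Invert delta_encode by accumulating the differences (mod 256)."""
    out = bytearray(len(deltas))
    prev = 0
    for i, delta in enumerate(deltas):
        prev = (prev + delta) % 256
        out[i] = prev
    return bytes(out)


def make_scanline(n: int = 20000, seed: int = 0) -> bytes:
    """Synthetic scan line: neighbouring pixels differ by at most 1 (an assumption)."""
    rng = random.Random(seed)
    value = 128
    line = bytearray()
    for _ in range(n):
        value = (value + rng.choice((-1, 0, 1))) % 256
        line.append(value)
    return bytes(line)


scanline = make_scanline()

# Same generic compressor, with and without the data-specific transform.
raw_size = len(zlib.compress(scanline, 9))
delta_size = len(zlib.compress(delta_encode(scanline), 9))

print("raw + zlib:  ", raw_size, "bytes")
print("delta + zlib:", delta_size, "bytes")
```

The transform is fully reversible, so this is still lossless compression; the gain comes only from handing the generic compressor a representation in which the redundancy is visible. Real image formats use more elaborate predictors and transforms than this one.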

When we started in DNA compression many people critiqued it, saying that we “couldn’t beat established generic compression” or that certain compression forms were “already optimal”. This totally misses the point – generic and optimal compression schemes are only generic and optimal for a particular data model, and to be generic, that data model involves a byte-stream. One doesn’t hear people saying about video compression “oh well, that problem has been solved generically and there are optimal compression methods” – putting a set of raw TIFFs straight into a byte-based compressor would not do very well. The key thing is first a data transformation that makes the correlation in the data explicit for standard generic methods to compress (in the case of DNA, reference based alignment provides a sensible realisation of the redundancy between bases in a high coverage sample, and for a low coverage sample realises the redundancy with respect to previous samples). The second thing is the insight that not all the precision of the data is needed for interpretation. Interestingly, lossy compression makes you think about the problem as the inverse of the normal thought process – often you ask “what information am I looking for” for some biological process – SNPs, or structural variants. Lossy compression inverts the problem to ask “what information are you pretty sure you don’t need”. For example, when you know your photon detector will generate some random noise in particular patterns, having a lossy compression remove that entropy is highly unlikely to affect downstream analysis. Similarly, when we can confidently recognise an isolated sequencing error, degrading the entropy of the quality score of the base is unlikely to change downstream analysis.
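
As a rough illustration of the reference-based transformation, the sketch below stores a read as a position on the reference plus the bases that differ, rather than as the bases themselves. It is a toy under loud assumptions: the alignment position is taken as given, only substitutions are handled (no insertions, deletions or clipping), and the names `EncodedRead`, `encode_read` and `decode_read` are hypothetical rather than taken from any real tool or format.

```python
from typing import List, NamedTuple, Tuple


class EncodedRead(NamedTuple):
    position: int                      # 0-based start of the read on the reference
    length: int                        # number of bases in the read
    mismatches: List[Tuple[int, str]]  # (offset within read, base seen in the read)


def encode_read(reference: str, read: str, position: int) -> EncodedRead:
    """Keep only where the read sits and where it disagrees with the reference."""
    window = reference[position:position + len(read)]
    if len(window) != len(read):
        raise ValueError("read runs off the end of the reference")
    mismatches = [
        (offset, read_base)
        for offset, (ref_base, read_base) in enumerate(zip(window, read))
        if ref_base != read_base
    ]
    return EncodedRead(position, len(read), mismatches)


def decode_read(reference: str, encoded: EncodedRead) -> str:
    """Rebuild the read from the reference plus the stored differences."""
    start = encoded.position
    bases = list(reference[start:start + encoded.length])
    for offset, base in encoded.mismatches:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTTAGCCGATAGGCTA"
read = "CGTTAGCAGATA"  # matches the reference from position 5, with one substitution

encoded = encode_read(reference, read, 5)
print(encoded)
```

In a high coverage sample most reads agree with the reference almost everywhere, so the list of differences is usually short or empty, and it is this compact representation that then goes to a standard generic compressor.
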
I’ve enjoyed learning more about image compression, and I think we’ve only started in DNA compression – at the moment we can achieve 2 to 4 fold compression compared to standard methods with a clearly acceptable lossy mode (acceptable because the machine manufacturers sort of know that they are generating a little too much precision in their quality scores). But with more aggressive techniques we can already think about 50 to 100 fold compression – though this is getting quite lossy. But this is not the end of the road here – I reckon we could be at 1,000 fold more compressed in the future.
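
The effect of dropping precision from quality scores can be sketched in a few lines of Python. This is plain uniform binning, chosen for simplicity; it is not the scheme from our paper nor any manufacturer’s, the bin boundaries in `BINS` are made up, and the scores are synthetic and uniformly random (real quality scores are far more structured, so the sizes here say nothing about real-world ratios). The only real convention used is the Phred+33 ASCII encoding found in FASTQ files.

```python
import random
import zlib
from typing import Iterable

# Hypothetical bins: (exclusive upper bound, representative score for the bin).
BINS = ((10, 6), (20, 15), (30, 25), (42, 37))


def bin_quality(score: int) -> int:
    """Replace a Phred score by the representative of its bin (lossy)."""
    for upper, representative in BINS:
        if score < upper:
            return representative
    return BINS[-1][1]


def to_phred33(scores: Iterable[int]) -> bytes:
    """Encode scores as Phred+33 ASCII bytes, as in a FASTQ quality line."""
    return bytes(score + 33 for score in scores)


rng = random.Random(1)
scores = [rng.randint(2, 41) for _ in range(50000)]  # synthetic, uniformly random
binned = [bin_quality(score) for score in scores]

original_size = len(zlib.compress(to_phred33(scores), 9))
binned_size = len(zlib.compress(to_phred33(binned), 9))

print("full precision:", original_size, "bytes")
print("binned:        ", binned_size, "bytes")
```

Fewer distinct values means less entropy for the generic compressor to carry. The community question raised in the paper is how far this can be pushed for a given kind of data before downstream analysis changes.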

The third similarity is the intensity of informatics in the processing. Both for image analysis and DNA analysis there are some standard tools (segmentation, hull finding, texture characterisation in imaging; alignment, assembly, variant calling in DNA sequence analysis) but how these tools are deployed is very experiment specific. There is not some “generic image analysis pipeline” any more than there is a “generic DNA analysis pipeline”. One has to choose particular analysis routes mainly driven by the experiment that was performed, and then to some extent by the output you want to see. This means that the bioinformatician must have a good mastery of the techniques. I have to admit, although I live and breathe DNA analysis, often developing new tools, I am pretty naive about image analysis – not that that’s stopping me diving in with my students in using (but not developing…) image analysis. I think we’re not making image analysis enough of a mainstream skill set in bioinformatics, and this needs to change.

Finally the cheapness and ubiquity of imaging has meant that from the start image based techniques had to think carefully about which images one would store and at what compression. Clearly DNA sequencing is heading the same way, and this is what the paper that Guy and myself put forward is about. Similarly to imaging, the key question is what is the overall cost of replacing the experiment, not the details of how much the image itself cost. So – for a rare sample (such as a Neanderthal bone) it is very hard to repeat the experiment – you need to store that information at high fidelity. But a routine mouse ChIP-seq experiment is far more reproducible and one can be far more aggressive on compression. I actually think it has been to the detriment of biological imaging that there has not been a good reference archive – probably because of this problem of knowing which things are worth archiving, coupled with the awesome diversity of uses for imaging – but projects like EuroBioImaging I think will provide the first (in this case federated) archiving process.

Over the next decade, then, I see ‘imaging’ and ‘DNA sequencing’ converging more and more. Time to learn some image analysis basics (does anyone know a good book on the topic that is geeky and detailed but starts at the basics?).

Galois and Sequencing

It is not often anyone will hear the phrases “Galois field” and “DNA” together, but this paper from my colleagues Tim Massingham and Nick Goldman provides a great link between these topics. Some other authors have used Galois fields in DNA analysis, but this is the first time I have seen a practical application of this level of mathematics in bioinformatics. It’s a tour de force by Tim, and although only in a lowly BMC Bioinformatics journal I think it should be celebrated for its sheer chutzpah in crossing scientific – indeed academic – domains.

Continue reading “Galois and Sequencing”

Thinking, Fast and Slow – Scientists are human too

I’ve just finished reading the excellent book “Thinking, Fast and Slow” by Daniel Kahneman, a psychologist who has had a profound impact on economics; he won the Nobel Prize in economics in 2002 for “Prospect theory”, which basically tries to provide a reliable model of observed human choices, for example up-weighting low probability events, and in particular distinguishing scenarios which are gains vs losses – we are all loss-averse, and so put more negative weight on losing something than positive weight on gaining something.

Continue reading “Thinking, Fast and Slow – Scientists are human too”

Data curation; the power of observation.

Biology – like all sciences – is an observation based science, but perhaps more so than many others. Life is so diverse that the first task of any investigation is simply looking and recording biological phenomena. Very often even the simple process of observation will lead to a profound understanding of a biological component or system – perhaps more importantly it is observation and measurement which form the raw material for coming up with new hypotheses of how things work; usually these are then tested by perturbing the system in some experimental way, and repeating the observation or measurement (rarely one relies only on observation – I’ve blogged about this earlier). Many of the advances in biology came from the process of observing and cataloging, and then asking how to explain the catalog.

Continue reading “Data curation; the power of observation.”