The symbiosis of engineering and research

Symbiosis is the biological process where two organisms cooperate to such an extent that they become so co-dependent it is best to think of them as one. The specific orchids flowering at the right time for a specific butterfly, or the interweaving of algae and fungi that make lichens, or all eukaryotes with one of the most successful symbiotic deals done on this planet – between our bacterial mitochondrial ancestors and our nucleated ancestors. At the core of any successful symbiosis is complementarity of functionality, meaning the partnership is far more successful than each player on their own.

Champagne for Vadim from EBI (left) and James from Sanger (right)
and the two teams

In bioinformatics we need creative and dedicated work – of different sorts – from researchers and engineers. I was reminded of this in the latest saga of sequence compression. Last month, the first CRAM-format submission to the European Nucleotide Archive (ENA) by a team at the Wellcome Trust Sanger Institute sparked a small celebration here on the Genome Campus (i.e., my handing over a bottle of champagne to the Sanger sequence core team, as promised),  and our novel compressed-sequence format formally entered full production mode with the CRAM 2.0 specification released this week.

A long time ago…

Three years ago, my student Markus Hsi-yang Fritz and I were kicking around ideas about DNA storage and rapid retrieval. EMBL-EBI as a whole was confronting the sharp rise in DNA sequencing production rates, largely because of next-generation sequencing. A common refrain was, “I paid more for my hard drive than for my sequencing.” Another was, “the rate of improvement in DNA sequencing technology is easily out-pacing disk drives”. People had started to despair that archiving DNA sequence was a hopeless case – seemingly obviously in the long term but also, quite possibly, in the short term.

Necessity is the mother of invention, and we knew we had to find a practical alternative. Markus and I talked about representing DNA sequence in an abbreviated form, in relation to a reference sequence (i.e., storing only the differences from a known sequence). This would obviously be far, far more compact than raw sequence, particularly once one accepts that the order in which reads are represented in a file is not relevant.
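To make the idea concrete, here is a minimal sketch in Python of storing a read as "where it aligns, plus what differs". It is only an illustration of the principle, not the CRAM encoding: the names `EncodedRead`, `encode_read` and `decode_read` are invented for this example, and it assumes the read is already aligned and differs from the reference only by substitutions (no insertions, deletions or quality values).

```python
from typing import List, NamedTuple, Tuple


class EncodedRead(NamedTuple):
    """A read stored as its position plus its differences from the reference."""
    pos: int                      # 0-based start of the read on the reference
    length: int                   # number of bases in the read
    diffs: List[Tuple[int, str]]  # (offset within read, base seen in the read)


def encode_read(reference: str, pos: int, read: str) -> EncodedRead:
    """Keep only the bases where the read disagrees with the reference."""
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read runs off the end of the reference")
    diffs = [(i, base) for i, (ref_base, base) in enumerate(zip(window, read))
             if ref_base != base]
    return EncodedRead(pos, len(read), diffs)


def decode_read(reference: str, enc: EncodedRead) -> str:
    """Rebuild the read by copying the reference and re-applying the differences."""
    bases = list(reference[enc.pos:enc.pos + enc.length])
    for offset, base in enc.diffs:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTACGTACGT"
read = "TACGAACG"  # matches the reference at position 3 except for one base
encoded = encode_read(reference, 3, read)
restored = decode_read(reference, encoded)
```

A read that matches the reference exactly costs only a position and a length, which is where most of the saving comes from. Because the order of reads in the file does not matter, they can also be sorted by position, so that neighbouring positions are close together and cheap to store.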

From Research to Service, and back again.

But this only takes care of compression on a single and fairly superficial level. The far harder problem was how to handle quality information, and we started to chew away at that. This led to a Genome Research paper in 2011, outlining the CRAM concept. (I have blogged about the issue a few times since then.)

Now that the ENA is accepting CRAM-format submissions, the original idea has moved from the arena of research and proof-of-principle into the domain of production service. Looking back at the original Genome Research paper, there has been both a lot of work at the EBI and a lot of input from the community to make the CRAM format and toolkit viable. The final specification is tightly defined – and yet flexible – and the associated infrastructure is firmly in place to make CRAM work. The ‘proof of principle’ implementation described in the paper was written in Python, and we did not provide a separate definition for a compression format. It worked, but in no sense could it be used in any production environment, not least due to speed issues. We now have a separate, detailed specification and two different code bases: one in Java (from EMBL-EBI) and one in C (from the Sanger Institute).

Deeply geeky

The fact that the Sanger Institute committed to writing the C code was a critical step in the development of CRAM. James Bonfield, the lead developer, is one of the true “deep geeks” of sequencing informatics. He started his career using the Staden package as a sequence-handling package for the MRC Laboratory of Molecular Biology (LMB), where Fred Sanger originally developed his breakthrough sequencing technique. James moved to Sanger and has been chipping away at the ‘coal face’ of sequencing ever since: Sanger-style, BAC finishing, Next Generation Sequencing (NGS) and beyond. He has probably grappled with every ‘edge case’ of sequence processing at least once, and most are old friends to him. When he won the Sequence Squeeze competition last year, he nobly – and sensibly – said that the best use of his code was to feed ideas into frameworks like CRAM.

Having James write a C read/write layer into CRAM tightened up the specification considerably. It also subtly shifted CRAM towards being more of a framework in which multiple compression routines can co-exist, rather than being fixated on one compression routine. This makes CRAM more similar to the video codecs, where the format (e.g. H.264) acts as a container that can have a variety of different compression schemes. It was a pleasure watching James and the ENA’s Vadim Zalunin discuss the finer details of byte definitions, indexing and coding trade-offs as the CRAM 2.0 specification took shape.
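The "framework, not a single routine" idea can be sketched in a few lines of Python. This is not the CRAM block layout: the one-byte codec tag, the four-byte length field and the function names are all assumptions made for illustration. The point is only that each block announces which compression routine was used, so different routines can sit side by side in one file.

```python
import bz2
import lzma
import zlib
from typing import Callable, Dict, Tuple

# Hypothetical codec table: id -> (compress, decompress). Id 0 stores bytes as-is.
CODECS: Dict[int, Tuple[Callable[[bytes], bytes], Callable[[bytes], bytes]]] = {
    0: (lambda data: data, lambda data: data),
    1: (zlib.compress, zlib.decompress),
    2: (bz2.compress, bz2.decompress),
    3: (lzma.compress, lzma.decompress),
}


def write_block(payload: bytes, codec_id: int) -> bytes:
    """Compress the payload and prefix it with the codec id and body length."""
    compress, _ = CODECS[codec_id]
    body = compress(payload)
    return bytes([codec_id]) + len(body).to_bytes(4, "big") + body


def read_block(block: bytes) -> bytes:
    """Read the codec id from the header and decompress with the matching routine."""
    codec_id = block[0]
    size = int.from_bytes(block[1:5], "big")
    _, decompress = CODECS[codec_id]
    return decompress(block[5:5 + size])


def write_smallest(payload: bytes) -> bytes:
    """Try every registered codec and keep whichever block comes out shortest."""
    return min((write_block(payload, codec_id) for codec_id in CODECS), key=len)


payload = b"ACGT" * 1000
block = write_smallest(payload)
```

A reader written this way never needs to know in advance how a block was compressed, and a new routine can be added by registering one more entry in the table.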

Making it work

The commitment of the Sanger Institute to using CRAM means that the entire ecosystem of using reference-based compression must work for many use cases of the format. We need to have lightweight tools for users to specify and register references, and acknowledge that sometimes the reference will originate in-house. To that end, Guy Cochrane’s ENA team worked with a Sanger team led by Tony Cox to develop a pragmatic ‘hash-lookup’ scheme that we believe will scale appropriately, as it is very compatible with local caching of information.
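A rough sketch of what a hash-lookup with local caching could look like is given below, in Python. The details are assumptions rather than a description of the ENA/Sanger service: the key is taken to be an MD5 digest of the upper-cased sequence, and `ReferenceCache`, `reference_key` and the `fetch_remote` callback are hypothetical names standing in for whatever registry is actually queried.

```python
import hashlib
from typing import Callable, Dict


def reference_key(sequence: str) -> str:
    """Derive the lookup key from the sequence content (MD5 is an assumption here)."""
    return hashlib.md5(sequence.upper().encode("ascii")).hexdigest()


class ReferenceCache:
    """Look references up by key, consulting the remote registry only on a miss."""

    def __init__(self, fetch_remote: Callable[[str], str]) -> None:
        self._fetch_remote = fetch_remote  # hypothetical remote lookup, key -> sequence
        self._local: Dict[str, str] = {}
        self.remote_calls = 0

    def register(self, sequence: str) -> str:
        """Add an in-house reference locally and return the key that identifies it."""
        key = reference_key(sequence)
        self._local[key] = sequence
        return key

    def get(self, key: str) -> str:
        """Return the sequence for a key, fetching and verifying it if not cached."""
        if key not in self._local:
            self.remote_calls += 1
            sequence = self._fetch_remote(key)
            if reference_key(sequence) != key:
                raise ValueError("fetched sequence does not match the requested key")
            self._local[key] = sequence
        return self._local[key]


# A dictionary stands in for the remote registry in this example.
remote_registry = {reference_key("ACGTACGT"): "ACGTACGT"}
cache = ReferenceCache(remote_registry.__getitem__)
key = reference_key("acgtacgt")  # case does not change the key
first = cache.get(key)           # goes to the registry
second = cache.get(key)          # served from the local cache
in_house_key = cache.register("GGCCGGCC")
```

Because the key is computed from the sequence itself, a cached copy can always be checked against its key, and an in-house reference gets a usable identifier without any central naming step.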

Markus actually came back for a reprise – and provided the (rather unglamorous but much-needed) test suites used by the Java (cramtools) and C (scramble – soon to be samtools) codebases. Good test suites that explicitly try to re-create all the annoying edge cases are critical for robust engineering – so a big thanks to Markus.

The invisible format

The Sanger Institute has an on-going commitment to developing sequence-level tools. In taking on the development leadership for samtools (originally developed by Li Heng at the Sanger), they are planning to put CRAM read/writing in as a backend. The Java-based cramtools is already compatible with Picard, and we worked with the Broad Institute so that there was no show-stopper for integration into GATK – we’re hopeful that CRAM read/writing will also be integrated into GATK (I have a promise of beer or chocolate for the GATK team). So, using CRAM will be as simple as upgrading samtools, or in the future, other toolkits. The vast majority of users will never have to know about the details of the compression format – just as we casually throw around video files between the Internet, laptops and mobile phones without worrying about formats.

Fit for purpose

The trajectory from research to production-level service has been (relatively) smooth but steep. The reference-based compression scheme in CRAM is what Markus and I published in Genome Research, but there is a world of difference between the paper and the specification, code and ecosystem of CRAM. Vadim and James, two skilled programmers, have spent between them over four years working on specification and code bases. After being parsed by two different brains and going through independent implementations, CRAM has arrived at a robust and practical specification. The CRAM format is extensible, and some of the niggly implementation quirks of SAM/BAM have been cleared up (e.g. the requirement for reference sequences to be smaller than 512 Megabases, even though we know of a number of larger sequences).

Research and service – a great symbiosis

It would be simplistic to see the original research as the ‘breakthrough’ and the next steps as ‘implementation’. If anything, the engineering details are more complex, more involved and require more nous than the research. All these components, working together, have been critical. If we took out any one of the people in this chain – Markus and me at the start; Vadim, Rasko and James engineering; Guy (EMBL-EBI) and Tony (Sanger) taking decisions and dedicating resource – it may never have happened.

There is a big difference between how research functions and how infrastructures operate. Sometimes the engineering hits a problem that cannot be simply “engineered around” using existing tools. Good, applied, computer-science research might find an in-theory solution, but that solution needs to be folded back into the engineering. All of this of course is to support biological research with a minimum of technical fuss.

Good research infrastructure pushes technical boundaries – and CRAM does just that. I am, needless to say, really proud to have been a part of it – but James and Vadim really earned those bottles of champagne.

Structural Biology – the business end of life.

As part of my Biochemistry degree at Oxford, I had to spend a year focusing on a single research project. My obsession with bioinformatics was already firmly established when Iain Campbell, a leading NMR spectroscopist and structural biologist, took me under his wing. At the time, structural biology was definitely the most computational area of molecular biology, so I was looking forward to getting stuck into a computational project.

The 10,000 year archive

The task: store a substantial amount of digital information for a future civilization to access
DNA has a good chance of lasting 10 000 (or more) years so long as it is kept cold, dark and dry. And of course, DNA is incredibly dense: at least 1 petabyte can be stored in 1 gram of DNA, and that includes a lot of built-in redundancy. It’s a very good information storage molecule, and Nature has been pretty clever in choosing it.

Using DNA as a digital archive media

Today sees the publication in Nature of “Toward practical high-capacity low-maintenance storage of digital information in synthesised DNA,” a paper spearheaded by my colleague Nick Goldman and in which I played a major part, in particular in the germination of the idea.

This is one of the moments in science that I love: an idea over a beer that ends up as a letter to Nature.

EBI as a data refinery

In describing what the EBI does, it is sometimes hard to provide a feel for the complexity and detail of our processes. Recently I have been using an analogy of EBI as a “data refinery”: it takes in raw data streams (“feedstocks”), combines and refines them, and transforms them into multi-use outputs (“products”). I see it as a pristine, state-of-the-art refinery with large, imposing central tanks (perhaps with steam venting here and there for effect) from which massive pipes emerge, covered in reflective cladding and connected in refreshingly sensible ways. In the foreground are arrayed a series of smaller tanks and systems, interconnected in more complex ways. Surrounding the whole are workshops, and if you look close enough you can see a group of workers decommissioning one part of the system whilst another builds a new one.

West meets East

I’ve just come back from around 10 days in China, visiting Nanjing, Shanghai and Hong Kong, and have a whole new perspective on this part of the world. I was not able to work Beijing into my trip this time, which was frustrating because I know there is a lot of good science happening there.

What was really different about this trip was that I came away feeling much more of a connection to China. It was great to meet new people and to renew more longstanding scientific contacts – but I also had more time (and, perhaps more importantly, more confidence) to travel between cities, have breakfast in local cafes rather than hotels, and generally get to know each place a little better. Previous trips (this was my fourth) required such a packed schedule that jetlag and the whole novelty of China completely dominated my experience.

Now that I’m sitting down to write about the experience, the first thing I’m inclined to do is draw some analogies with western countries. But analogies only go so far – even when they fit relatively well, they break down in the face of China’s distinct character. I do feel more knowledgeable than I have after previous visits to China, but I fully expect that future visits will reveal further dimensions and facets to this immense and complex country.

On some level, China reminds me of the US: it’s a huge country with vast distances to travel between locations, and has a tremendously strong sense of a single nation. Everyone I met considered themselves “Chinese”, and there is a strong sense of a binding history and cultural underpinning. Also, similar to the US, China (and Chinese…) is aware of its size and economic power, and is conscious of having a strong voice on the world stage. Hong Kong, Shanghai and Beijing are cosmopolitan cities, with a sometimes exuberant celebration of the past 20 years’ economic growth. I won’t stray into geopolitics – it’s not my field of expertise at all – but a country of this size with sophisticated metropolitan areas will almost certainly make a big impact on science over the next couple of decades.

China shares some features with Europe – notably a diversity of language and culture across many provinces. Chinese provinces are often larger than European countries, and often have similar overall GDP. The many Chinese “dialects” are better described as different spoken languages, but importantly they share a set of written characters (with some modifications).  The implications of having a universally comprehensible written language for such a range of linguistic groups are profound.

My initial impression was that China had two major languages – Cantonese (used around Guangdong and Hong Kong) and Mandarin – with various dialects, but this trip really impressed upon me just how diverse the linguistic landscape of Mainland China is. For example, Shanghainese is a dialect of Wu, which is a language family predominant in the eastern central area. When I was out for dinner in Shanghai with a Mandarin speaker, the waiter spoke to us in this lilting tone (Shanghainese, as it turned out) and I turned to my companion for translation; she smiled, shrugged her shoulders and shifted the conversation to Mandarin. It was like dining with an Italian colleague in Finland and thinking she would know Finnish.

I’m much more aware now of the distinctive character and cultures of China’s provinces, which, along with the importance of personal networks, resonates with Europe.

While it’s fun to draw familiar parallels, China is clearly nothing like a mixture of the US and Europe. It is hard enough to completely understand the historical perspectives and cultures of one’s neighbours – it is going to be a long time before I will completely grasp the fundamental complexities of China. What I can say now is that its diversity is more and more fascinating to me, and something to be celebrated.

I wrote some time ago about scientific collaboration with China (see East meets West), focusing on the positive aspects of openness and collaboration in engaging with this and other emerging economies (i.e. Brazil, India, Russia and Vietnam). As scientists, we have the good fortune of being expected to share scientific advances, discuss collaborations and discover new things jointly, because these are the right things to do – socially and strategically.

China already has some leading scientists and excellent scientific institutions, and I am sure this will only grow in the future. But communication is an essential component of community, and social media has been highly beneficial in keeping information flowing in much of the global scientific community. It’s frustrating that news platforms like Twitter are blocked in China. The EBI has set up a Weibo account (www.weibo.com/emblebi) where we will be posting (in English!) news items from the EBI. Hopefully this will help keep scientists in China up to date with developments at the EBI – so please do distribute to your Chinese colleagues.

On a more personal note, I’ve discovered that my first name (Ewan) is pronounced (in some dialects) almost identically to Yuan (a Chinese word for money). In Wikipedia, one of the pronunciation descriptions of Yuan is written identically to one of Ewan (what more proof do you need!) but I am not clear (a) if this is a variation in pronouncing Yuan in Mandarin or a dialect shift and (b) what tonal form it has. I’d be delighted to get some sort of linguistic survey of Yuan forms geo-tagged across China. People who have read my name sometimes get confused because they have a pre-formed idea of how to pronounce it (often “Evan” or “Ee-Wan” – one to save for my next Star Wars role).

So it’s useful to know that I can say, “Ewan, like Money, Yuan,” and this will provide some relief to my new acquaintance, who can file the name alongside a well-known phrase. (And before you say it, I know that I am just as bad when it comes to pronouncing some names – Chinese or not – in other languages!)

So – I’m “Money” Birney. I can’t quite work out whether I should be proud or a bit worried about this moniker.

Many, many thanks to my hosts and the new people I met on this journey: Ying, Philipp, Jing, Jun, Huaming, Hong, Laurie, Scott and many others. I look forward to seeing you again, and learning more on my next trips to China.

Human genetics; a tale of allele frequencies, effect sizes and phenotypes

A long time ago, in the late 1990s, I was on the edge of the debate about SNPs: whether there should be an investment in first discovering, then characterising and then using many, many biallelic markers to track down diseases. As is now obvious, this decision was taken (first the SNP Consortium, then the HapMap project and its successor, 1000 Genomes, and then many genome-wide association studies). I was quite young at the time (in my mid to late twenties; I even had an earring at the start of this as I was a young, rebellious man) and came from a background of sequence analysis – so I remember it being quite confusing getting my head around all the different terminology and subtleties of the argument. I think it was Lon Cardon who patiently explained the concepts to me yet again, and he finished by saying that the real headache was that there were just so many low-frequency alleles that were going to be hidden, and that was going to be a missed opportunity. I nodded at the time, adding yet one more confused concept to the discussions about r2 behaviours, population structure, effect size and recombination hotspots, none of which sat totally comfortably in my head at the time.
