Ewan's Blog: Bioinformatician at large

November 7, 2013 | June 26, 2017

Heterogeneity in Cancer genomics

I’ve just come back from a great meeting on Cancer Genomics, held at EMBL Heidelberg (full disclosure: I was an organiser, so no surprise I enjoyed the talks!)

The application of genomics to cancer has been progressing for a long time, but we are now in the era where “cheap enough” exome sequencing (and increasingly whole genome sequencing) is present for both fundamental cancer research and clinical research – and there is really a sense of starting to “mainstream” sequencing into clinical care (clinical care and clinical research seem closer in the Cancer field than some other areas of medicine).


October 14, 2013 | June 26, 2017

CERN for molecular biologists

This September I visited CERN again, this time with a rather technical delegation from the EBI to meet with their ‘big physics data’ counterparts. Our generous hosts Ian Bird, Bob Jones and several experimental scientists showed us a great day, and gave us an extended opportunity to understand their data flow in detail. We also got a tour of the anti-matter experiments, which was very cool (though, sadly, it did not include a trip down to the main tunnels of the LHC).

CERN is a marvellous place, and it triggers some latent physics I learnt as an undergraduate. Sometimes the data challenges at CERN are used as a template for the future data challenges across all of the sciences; I have come to learn that these analogies – unsurprisingly – are sometimes useful and sometimes not. To understand the data flow behind CERN, though, one needs to understand CERN itself in more detail.


October 7, 2013 | June 26, 2017

Freedom of Expression

Multicellular organisms are beautifully precise in the way their different component cells operate in different ways, despite each cell having the same genome. The main difference between cell types is what genes are expressed, and much of this is due to differential expression of RNA transcripts. The ability to measure all the transcripts simultaneously in a cell population with RNAseq has been very informative – but often complex. But this complexity is anchored: for most genes, there is one dominant transcript – and within an organism that major transcript is the same between tissues over half the time. This “glass-half-full” view of mRNA complexity was recently described by Alvis Brazma and his colleagues, and integrated into a new value added resource at the EBI, the Expression Atlas.


September 5, 2013 | June 26, 2017

California dreamin’

I can visit the east coast of the US every month without too much of a strain on my circadian rhythm, but I’ve learned to pack in as many visits as possible on the west coast. Last month I had the pleasure of catching up with old friends and meeting new people in the San Francisco Bay area. It is no doubt a great place for science, with commercial enterprise operating in a rich academic environment: UC Davis, with its agricultural science programme broadening into all life sciences; UC Berkeley, with its sublime blend of maths, statistics and molecular biology; UCSF’s new medical campus, with new and established PIs in molecular biology, development and medicine; the technology powerhouse of Stanford, right in the middle of Silicon Valley; the chilled UC Santa Cruz, with its world-leading computational biology; and the innovative, energy-focused Lawrence Berkeley National Laboratory and Joint Genome Institute.

August 17, 2013 | June 26, 2017

5 reasons to love logarithms

I was discussing with a maths-minded friend the difference between “quantitative” and “non-quantitative” science, mainly how biology had to get its quantitative mojo back, and I said that a good proxy for whether someone is “quantitative” or not is whether they are at home with logarithms – do they use them, are they comfortable with logs between 0 and 1, can they read log plots?


June 14, 2013 | June 26, 2017

The symbiosis of engineering and research

Symbiosis is the biological process where two organisms cooperate to such an extent that they become so co-dependent it is best to think of them as one. The specific orchids flowering at the right time for a specific butterfly, or the interweaving of algae and fungi that make lichens, or all eukaryotes with one of the most successful symbiotic deals done on this planet – between our bacterial mitochondrial ancestors and our nucleated ancestors. At the core of any successful symbiosis is complementarity of functionality, meaning the partnership is far more successful than each player on their own.

Champagne for Vadim from EBI (left) and James from Sanger (right)
and the two teams

In bioinformatics we need creative and dedicated work – of different sorts – from researchers and engineers. I was reminded of this in the latest saga of sequence compression. Last month, the first CRAM-format submission to the European Nucleotide Archive (ENA) by a team at the Wellcome Trust Sanger Institute sparked a small celebration here on the Genome Campus (i.e., my handing over a bottle of champagne to the Sanger sequence core team, as promised), and our novel compressed-sequence format formally entered full production mode with the CRAM 2.0 specification released this week.

A long time ago…

Three years ago, my student Markus Hsi-yang Fritz and I were kicking around ideas about DNA storage and rapid retrieval. EMBL-EBI as a whole was confronting the sharp rise in DNA sequencing production rates, largely because of next-generation sequencing. A common refrain was, “I paid more for my hard drive than for my sequencing.” Another was, “the rate of improvement in DNA sequencing technology is easily out-pacing disk drives”. People had started to despair that archiving DNA sequence was a hopeless case – seemingly obviously in the long term but also, quite possibly, in the short term.

Necessity is the mother of invention, and we knew we had to find a practical alternative. Markus and I talked about representing DNA sequence in an abbreviated form, in relation to a reference sequence (i.e., storing only the differences from a known sequence). This would obviously be far, far more compact than raw sequence, particularly once one accepts that the order in which reads are represented in a file is not relevant.
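The idea of storing only the differences from a known sequence can be sketched in a few lines of Python. This is a toy illustration, not the CRAM encoding: it assumes each read is already aligned to a position on the reference and differs from it only by substitutions (no insertions or deletions), and the function names and record layout are made up for the example.

```python
def encode_read(reference, read, pos):
    """Encode an aligned read as (pos, length, [(offset, base), ...])."""
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read runs past the end of the reference")
    # Keep only the positions where the read disagrees with the reference.
    diffs = [(i, b) for i, (r, b) in enumerate(zip(window, read)) if r != b]
    return pos, len(read), diffs


def decode_read(reference, record):
    """Rebuild the read from the reference plus the stored differences."""
    pos, length, diffs = record
    bases = list(reference[pos:pos + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTTAGCCGATAGGCTA"
reads = [(8, "TAGCCGAT"), (2, "GTACCTTA"), (0, "ACGTAC")]

# The order of reads in the file is not relevant, so we are free to sort
# the records by position, which makes them easier to compress further.
records = sorted(
    (encode_read(reference, read, pos) for pos, read in reads),
    key=lambda record: record[0],
)
decoded = [decode_read(reference, record) for record in records]
```

A read that matches the reference exactly costs just a position and a length; only the one read with a mismatch carries any base information at all.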

From Research to Service, and back again.

But this only takes care of compression on a single and fairly superficial level. The far harder problem was how to handle quality information, and we started to chew away at that. This led to a Genome Research paper in 2011, outlining the CRAM concept. (I have blogged about the issue a few times since then.)

Now that the ENA is accepting CRAM-format submissions, the original idea has moved from the arena of research and proof-of-principle into the domain of production service. Compared with the original Genome Research paper, a lot of work at the EBI and a lot of input from the community have gone into making the CRAM format and toolkit viable. The final specification is tightly defined – and yet flexible – and the associated infrastructure is firmly in place to make CRAM work. The ‘proof of principle’ implementation described in the paper was written in Python, and we did not provide a separate definition for a compression format. It worked, but in no sense could it be used in any production environment, not least due to speed issues. We now have a separate, detailed specification and two different code bases: one in Java (from EMBL-EBI) and one in C (from the Sanger Institute).

Deeply geeky

The fact that the Sanger Institute committed to writing the C code was a critical step in the development of CRAM. James Bonfield, the lead developer, is one of the true “deep geeks” of sequencing informatics. He started his career using the Staden package as a sequence-handling package for the MRC Laboratory of Molecular Biology (LMB), where Fred Sanger originally developed his breakthrough sequencing technique. James moved to Sanger and has been chipping away at the ‘coal face’ of sequencing ever since: Sanger-style, BAC finishing, Next Generation Sequencing (NGS) and beyond. He has probably grappled with every ‘edge case’ of sequence processing at least once, and most are old friends to him. When he won the Sequence Squeeze competition last year, he nobly – and sensibly – said that the best use of his code was to feed ideas into frameworks like CRAM.

Having James write a C read/write layer into CRAM tightened up the specification considerably. It also subtly shifted CRAM towards being more of a framework in which multiple compression routines can co-exist, rather than being fixated on one compression routine. This makes CRAM more similar to the video codecs, where the format (e.g. H.264) acts as a container that can have a variety of different compression schemes. It was a pleasure watching James and the ENA’s Vadim Zalunin discuss the finer details of byte definitions, indexing and coding trade-offs as the CRAM 2.0 specification took shape.
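To make the “framework in which multiple compression routines can co-exist” idea concrete, here is a minimal Python sketch. It is not the CRAM 2.0 byte layout: the one-byte codec id, the four-byte length field and the codec numbering are all assumptions made up for illustration, and zlib and bz2 stand in for whatever routines a real format would register.

```python
import bz2
import zlib

# Hypothetical codec table: id -> (compress, decompress).
CODECS = {
    0: (lambda data: data, lambda data: data),  # raw, no compression
    1: (zlib.compress, zlib.decompress),
    2: (bz2.compress, bz2.decompress),
}


def write_block(data: bytes, codec_id: int) -> bytes:
    """Compress data and prefix it with the codec id and payload length."""
    compress, _ = CODECS[codec_id]
    payload = compress(data)
    return bytes([codec_id]) + len(payload).to_bytes(4, "big") + payload


def read_block(block: bytes) -> bytes:
    """Read the header, then pick the matching decompressor."""
    codec_id = block[0]
    size = int.from_bytes(block[1:5], "big")
    _, decompress = CODECS[codec_id]
    return decompress(block[5:5 + size])


qualities = b"IIIIIIIIIIHHHHHHGGGGFFFF" * 10
blocks = [write_block(qualities, codec_id) for codec_id in CODECS]
restored = [read_block(block) for block in blocks]
```

The reader never needs to know in advance which routine was used: each block says so itself, which is what lets a new compression scheme be added without changing the surrounding format.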

Making it work

The commitment of the Sanger Institute to using CRAM means that the entire ecosystem of using reference-based compression must work for many use cases of the format. We need to have lightweight tools for users to specify and register references, and acknowledge that sometimes the reference will originate in-house. To that end, Guy Cochrane’s ENA team worked with a Sanger team led by Tony Cox to develop a pragmatic ‘hash-lookup’ scheme that we believe will scale appropriately, as it is very compatible with local caching of information.
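A rough Python sketch of why a hash lookup plays well with local caching. Everything here is an assumption for illustration: the choice of an MD5 checksum over the upper-cased sequence as the key, the `ReferenceCache` class, and the `fetch` callback standing in for a remote reference registry; none of it is taken from the ENA or Sanger implementation.

```python
import hashlib
import tempfile
from pathlib import Path


def sequence_key(sequence: str) -> str:
    """Content-derived key: the same sequence always gets the same name."""
    return hashlib.md5(sequence.upper().encode("ascii")).hexdigest()


class ReferenceCache:
    """Look references up by key, on local disk first, remotely second."""

    def __init__(self, cache_dir, fetch_remote):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.fetch_remote = fetch_remote  # hypothetical remote lookup

    def get(self, key: str) -> str:
        path = self.cache_dir / key
        if path.exists():
            return path.read_text()
        sequence = self.fetch_remote(key).upper()
        # The key doubles as an integrity check on whatever was fetched.
        if sequence_key(sequence) != key:
            raise ValueError("fetched sequence does not match its key")
        path.write_text(sequence)
        return sequence


remote_store = {}  # stand-in for a remote reference registry
in_house = "acgtacgtnn"
key = sequence_key(in_house)
remote_store[key] = in_house

calls = []


def fetch(k):
    calls.append(k)
    return remote_store[k]


with tempfile.TemporaryDirectory() as tmp:
    cache = ReferenceCache(tmp, fetch)
    first = cache.get(key)   # cache miss: fetched once, stored locally
    second = cache.get(key)  # cache hit: served from local disk
```

Because the key is derived from the content, a cached copy can never go stale, and an in-house reference needs no central naming authority before it can be used.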

Markus actually came back for a reprise – and provided the (rather unglamorous but much-needed) test suites used by the Java (cramtools) and C (scramble – soon to be samtools) codebases. Good test suites that explicitly try to re-create all the annoying edge cases are critical for robust engineering – so a big thanks to Markus.

The invisible format

The Sanger Institute has an on-going commitment to developing sequence-level tools. In taking on the development leadership for samtools (originally developed by Li Heng at the Sanger), they are planning to put CRAM read/writing in as a backend. The Java-based cramtools is already compatible with Picard, and we worked with the Broad Institute to make sure there was no show-stopper for integration into GATK – we’re hopeful that CRAM read/writing will also be integrated into GATK (I have a promise of beer or chocolate for the GATK team). So, using CRAM will be as simple as upgrading samtools, or in the future, other toolkits. The vast majority of users will never have to know about the details of the compression format – just as we casually throw around video files between the Internet, laptops and mobile phones without worrying about formats.

Fit for purpose

The trajectory from research to production-level service has been (relatively) smooth but steep. The reference-based compression scheme in CRAM is what Markus and I published in Genome Research, but there is a world of difference between the paper and the specification, code and ecosystem of CRAM. Vadim and James, two skilled programmers, have spent between them over four years working on the specification and code bases. After being parsed by two different brains and going through independent implementations, CRAM has arrived at a robust and practical specification. The CRAM format is extensible, and some of the niggly implementation quirks of SAM/BAM have been cleared up (e.g. the requirement for reference sequences to be smaller than 512 Megabases, even though we know of a number of larger sequences).

Research and service – a great symbiosis

It would be simplistic to see the original research as the ‘breakthrough’ and the next steps as ‘implementation’. If anything, the engineering details are more complex, more involved and require more nous than the research. All these components, working together, have been critical. If we took out any one of the people in this chain – Markus and me at the start; Vadim, Rasko and James engineering; Guy (EMBL-EBI) and Tony (Sanger) taking decisions and dedicating resource – it may never have happened.

There is a big difference between how research functions and how infrastructures operate. Sometimes the engineering hits a problem that cannot be simply “engineered around” using existing tools. Good, applied, computer-science research might find an in-theory solution, but that solution needs to be folded back into the engineering. All of this of course is to support biological research with a minimum of technical fuss.

Good research infrastructure pushes technical boundaries – and CRAM does just that. I am, needless to say, really proud to have been a part of it – but James and Vadim really earned those bottles of champagne.

April 9, 2013 | June 26, 2017

Structural Biology – the business end of life.

As part of my Biochemistry degree at Oxford, I had to spend a year focusing on a single research project. My obsession with bioinformatics was already firmly established when Iain Campbell, a leading NMR spectroscopist and structural biologist, took me under his wing. At the time, structural biology was definitely the most computational area of molecular biology, so I was looking forward to getting stuck into a computational project.


March 4, 2013 | June 26, 2017

The EBI’s new websites…

Two years ago a revolution started within the EBI. We took a good, hard look at all of our interfaces, and decided to put our users at the centre of our design. The first fruits of this revolution are out today, as our fleet of websites is re-launched with a unified look and feel and consistent navigation. The “top” page is at www.ebi.ac.uk


January 23, 2013 | June 26, 2017

The 10,000 year archive

The task: store a substantial amount of digital information for a future civilization to access

DNA has a good chance of lasting 10 000 (or more) years as long as it is kept cold, dark and dry. And of course, DNA is incredibly dense: at least 1 petabyte can be stored in 1 gram of DNA, and that includes a lot of built-in redundancy. It’s a very good information storage molecule, and Nature has been pretty clever in choosing it.


January 23, 2013 | June 26, 2017

Using DNA as a digital archive media

Today sees the publication in Nature of “Toward practical high-capacity low-maintenance storage of digital information in synthesised DNA,” a paper spearheaded by my colleague Nick Goldman and in which I played a major part, in particular in the germination of the idea.

This is one of the moments in science that I love: an idea over a beer that ends up as a letter to Nature.
