I’ve just passed my 10,000th follower on Twitter, and, as when I went past 5,000 followers, this feels like a good point to reflect on the open, ‘blog-and-tweet’ world evolving around me. Many of the comments I made two years ago have stood the test of time: Twitter is still fundamentally a conversation, broadcast not just to your lunch queue but worldwide, and blogs remain lightweight, informal platforms for review and commentary. And as with any conversation you have to consider your audience first, and as with all public writing everyone still need and editor [sic].
Anatomy of a mainstream science piece
Last week, the Guardian published a Comment by me entitled ‘Why I’m sceptical about the idea of genetically inherited trauma’. In this blog post, I’d like to go through what happened behind the scenes when someone from the mainstream press asked for my views, what my thought process was before I started drafting a response, and why I believe we should all participate more in public discourse on science.
Human as a model organism
Model organisms have provided the foundation for building our understanding of life, including human disease. Homo sapiens has joined this select group, adding knowledge we can apply to our myriad companion species. But to resolve even one small part of the moving, shifting puzzle of life, we need them all.
Untangling Big Data
“Big Data” is a trendy, catch-all phrase for handling large datasets in all sorts of domains: finance, advertising, food distribution, physics, astronomy and molecular biology – notably genomics. It means different things to different people, and has inspired any number of conferences, meetings and new companies. Amidst the general hype, some outstanding examples shine forth and today sees an exceptional Big Data analysis paper by a trio of EMBL-EBI research labs – Oliver Stegle, John Marioni and Sarah Teichmann – that shows why all this attention is more than just hype.
Moving 20 Petabytes
EMBL-EBI’s data resources are built on a constantly running compute and storage infrastructure. Over the past decade that infrastructure has grown exponentially, keeping pace with the rapid growth of molecular data and the corresponding need for computation. Terabytes of data flow every day on and off our storage systems, making up the hidden life-blood of data and knowledge that permeates much of modern molecular biology. There is a somewhat bewildering complexity to all of this. We have 57 key resources: everything from low-level, raw DNA storage (ENA) through genome analysis (Ensembl and Ensembl Genomes), complex knowledge systems (UniProt) and 3D protein structures (PDBe). At minimum, over half a million users visit at least one of the EMBL-EBI websites each month, making 12 million web hits and downloading 35 Terabytes each day. Each resource has its own release cycle, with different international collaborations (e.g. INSDC, wwPDB, ProteomeXchange) handling the worldwide data flow.
Another step into Quantitative Genetics…
Today sees the publication of the paper by Zhihao Ding, YunYun Ni, Sander Timmer and colleagues (including myself) on local sequence effects and different modes of X-chromosome association as revealed by the quantitative genetics of CTCF binding. This paper represents the joint work of three main groups: Richard Durbin’s at the Sanger Institute, Vishy Iyer’s at U. Texas, Austin and my own at EMBL-EBI. I’m delighted that this work from Zhihao, YunYun and Sander (the three co-first authors) has finally come out, and want to share some aspects of the work that were particularly interesting to me.
RNA is now a first class bioinformatics molecule.
RNA research is expanding very quickly, and a public resource for these extremely valuable datasets has been long overdue.
A cheat’s guide to histone modifications
I was recently having lunch with Sandro, a charming Neapolitan computer science graduate doing a postdoc in my research group, who has a passion for great food and clean C code. We were discussing some recent aggregation results of histone modifications, and Sandro was bemoaning (verbally and non-verbally) the fact that all the histone modifications sounded “just the same”. I could relate to the sentiment, recalling my own journey into this world some seven years ago during the start of the ENCODE project when I first faced this bamboozling list of modifications.
CRAM goes mainline
Two weeks ago John Marshall from Sanger announced SAMtools 1.0 – one of the two most widely used Next Generation Sequencing (NGS) variant-calling tools, embedded in hundreds if not thousands of bioinformatics pipelines worldwide. (The majority of germline variant calling happens either through SAMtools or the Broad’s GATK toolkit.) SAMtools was started at the Sanger Institute by Li Heng when he was in Richard Durbin’s group, and has stayed at Sanger, now under the watchful eye of Thomas Keane.
Scaling up bioinformatics training online
Bioinformatics has grown very quickly since the EBI opened 20 years ago, and I think it’s fair to say that it will grow even faster over the next 20 years. Biology is being transformed into a fundamentally information-centric science, and a key part of this has been the aggregation of knowledge in large-scale databases. When you put all the hard-won information about living systems together – their genome sequences, variation, proteins, interactions with small molecules – it is, potentially, incredibly useful. I say “potentially” because even the most pristine, large, interconnected data collection in the world isn’t worth much if people don’t know how to use it.